One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue that many lessons in this evolving field were learned in the early years of synthetic data generation and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD, the German Longitudinal Business Database (GLBD), which is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
    
    - 
        
        IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE
        
 February 2014
             
                Working Paper Number:CES-14-12
            
            In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version (now available for public use) of the U.S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.
 
- 
        
        Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database
        
 February 2011
             
                Working Paper Number:CES-11-04
            
            In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
 
- 
        
        LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA
        
 February 2014
             
                Working Paper Number:CES-14-11
            
            Distributions of business data are typically much more skewed than those for household or individual data, and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies, in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University, made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
 
- 
        
        Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics
        
 February 2016
             
                Working Paper Number:CES-16-10
            
            We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
 
- 
        
        Distribution Preserving Statistical Disclosure Limitation
        
 September 2006
             
                Working Paper Number:tp-2006-04
            
            One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed,
 partially synthetic data sets. These are data on actual respondents, but with confidential data
 replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate
 inferences because the distribution of synthetic data is completely determined by the model used
 to generate them. We present two practical methods of generating synthetic values when the imputer 
 has only limited information about the true data generating process. One is applicable when
 the true likelihood is known up to a monotone transformation. The second requires only limited
 knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential 
 data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility
 and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and
 sampling error in the estimated transformation. We validate the approach with a simulation and
 application to a large linked employer-employee database.
 
- 
        
        Redesigning the Longitudinal Business Database
        
 May 2021
             
                Working Paper Number:CES-21-08
            
            In this paper we describe the U.S. Census Bureau's redesign and production implementation of the Longitudinal Business Database (LBD) first introduced by Jarmin and Miranda (2002). The LBD is used to create the Business Dynamics Statistics (BDS), tabulations describing the entry, exit, expansion, and contraction of businesses. The new LBD and BDS also incorporate information formerly provided by the Statistics of U.S. Businesses program, which produced similar year-to-year measures of employment and establishment flows. We describe in detail how the LBD is created from curation of the input administrative data, longitudinal matching, retiming of economic census-year births and deaths, creation of vintage consistent industry codes and noise factors, and the creation and cleaning of each year of LBD data. This documentation is intended to facilitate the proper use and understanding of the data by both researchers with approved projects accessing the LBD microdata and those using the BDS tabulations.
 
- 
        
        Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time Series
        
 July 2012
             
                Working Paper Number:CES-12-13
            
            The Census Bureau's Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by worker characteristics in various combinations. The data are released for several levels of NAICS industries and geography, the lowest aggregation of the latter being counties. Disclosure avoidance methods are required to protect the information about individuals and businesses that contribute to the underlying data. The QWI disclosure avoidance mechanism we describe here relies heavily on the use of noise infusion through a permanent multiplicative noise distortion factor, used for magnitudes, counts, differences and ratios. There is minimal suppression and no complementary suppressions. To our knowledge, the release in 2003 of the QWI was the first large-scale use of noise infusion in any official statistical product. We show that the released statistics are analytically valid along several critical dimensions: measures are unbiased and time series properties are preserved. We provide an analysis of the degree to which confidentiality is protected. Furthermore, we show how the judicious use of synthetic data, injected into the tabulation process, can completely eliminate suppressions, maintain analytical validity, and increase the protection of the underlying confidential data.
 
- 
        
        LEHD Infrastructure files in the Census RDC - Overview
        
 June 2014
             
                Working Paper Number:CES-14-26
            
            The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, maintains a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2011 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau's secure and restricted-access Research Data Center network. The document attempts to provide a comprehensive description of all researcher-accessible files, of their creation, and of any modifications made to the files to facilitate researcher access.
 
- 
        
        Disclosure Limitation and Confidentiality Protection in Linked Data
        
 January 2018
             
                Working Paper Number:CES-18-07
            
            Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.
 
- 
        
        LEHD Data Documentation LEHD-OVERVIEW-S2008-rev1
        
 December 2011
             
                Working Paper Number:CES-11-43
            
            