CREAT - Census Bureau

Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database

February 2011

Written by: Arnold P Reznek, Ron Jarmin, Javier Miranda, John M. Abowd, Jerome P. Reiter, Satkartar K. Kinney

Working Paper Number:

CES-11-04

Abstract

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

Document Tags and Keywords

Keywords:

data, statistical, enterprise, database, report, microdata, survey, statistical agencies, disclosure, agency, confidentiality, information, financial, incorporated, employment data, privacy, business data, discrepancy, establishments data, record, census business, datasets, store, statistical disclosure, publicly

Tags:

Internal Revenue Service, Standard Industrial Classification, Longitudinal Research Database, National Science Foundation, County Business Patterns, Company Organization Survey, Establishment Micro Properties, Organization for Economic Cooperation and Development, Longitudinal Business Database, Bureau of Labor, COMPUSTAT, Chicago Census Research Data Center, Survey of Income and Program Participation, Economic Census, Research Data Center, North American Industry Classification System, Business Register, Special Sworn Status, Local Employment Dynamics, Business Dynamics Statistics

Similar Working Papers

The 10 most similar working papers to the working paper 'Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database' are listed below in order of similarity.

Working Paper

IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE

February 2014

Authors: Javier Miranda, Jerome P. Reiter, Satkartar K. Kinney

Working Paper Number:

CES-14-12

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version (now available for public use) of the U.S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.
Working Paper

LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA

February 2014

Authors: Lars Vilhuber, Javier Miranda

Working Paper Number:

CES-14-11

Distributions of business data are typically much more skewed than those for household or individual data, and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies, in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University, made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
Working Paper

Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics

February 2016

Authors: Javier Miranda, Lars Vilhuber

Working Paper Number:

CES-16-10

We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
Working Paper

A FIRST STEP TOWARDS A GERMAN SYNLBD: CONSTRUCTING A GERMAN LONGITUDINAL BUSINESS DATABASE

February 2014

Authors: Lars Vilhuber, Jorg Drechsler

Working Paper Number:

CES-14-13

One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
Working Paper

Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time Series

July 2012

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Bryce Stephens, Simon Woodcock, Kaj Gittings

Working Paper Number:

CES-12-13

The Census Bureau's Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by worker characteristics in various combinations. The data are released for several levels of NAICS industries and geography, the lowest aggregation of the latter being counties. Disclosure avoidance methods are required to protect the information about individuals and businesses that contribute to the underlying data. The QWI disclosure avoidance mechanism we describe here relies heavily on the use of noise infusion through a permanent multiplicative noise distortion factor, used for magnitudes, counts, differences and ratios. There is minimal suppression and no complementary suppressions. To our knowledge, the release in 2003 of the QWI was the first large-scale use of noise infusion in any official statistical product. We show that the released statistics are analytically valid along several critical dimensions: measures are unbiased and time series properties are preserved. We provide an analysis of the degree to which confidentiality is protected. Furthermore, we show how the judicious use of synthetic data, injected into the tabulation process, can completely eliminate suppressions, maintain analytical validity, and increase the protection of the underlying confidential data.
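As a rough illustration of the noise infusion idea in this abstract, the Python sketch below assigns each establishment a permanent multiplicative factor and reuses it in every period. The bounds `a` and `b`, the uniform draw, and all function and variable names are assumptions made for illustration; they are not taken from the QWI production system.

```python
import random


def draw_noise_factor(rng, a=0.05, b=0.15):
    """Draw a factor in [1-b, 1-a] or [1+a, 1+b].

    a and b are hypothetical bounds; the factor never equals 1,
    so every establishment's value is distorted by at least a.
    """
    magnitude = rng.uniform(a, b)
    sign = rng.choice((-1.0, 1.0))
    return 1.0 + sign * magnitude


def assign_permanent_factors(establishment_ids, seed=0):
    """Assign one factor per establishment, fixed across all periods."""
    rng = random.Random(seed)
    return {est_id: draw_noise_factor(rng) for est_id in establishment_ids}


def noisy_total(values_by_establishment, factors):
    """Tabulate a cell total from distorted establishment values."""
    return sum(factors[est] * value
               for est, value in values_by_establishment.items())


# Illustrative (made-up) employment counts for two quarters.
factors = assign_permanent_factors(["A", "B", "C"])
q1 = {"A": 120, "B": 45, "C": 300}
q2 = {"A": 125, "B": 40, "C": 310}
noisy_q1 = noisy_total(q1, factors)
noisy_q2 = noisy_total(q2, factors)
print(noisy_q1, noisy_q2)
```

Because the same factor multiplies an establishment's value in every quarter, the distortion for that establishment does not change from one period to the next in this sketch.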
Working Paper

Distribution Preserving Statistical Disclosure Limitation

September 2006

Authors: Simon Woodcock, Gary Benedetto

Working Paper Number:

tp-2006-04

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
Working Paper

Redesigning the Longitudinal Business Database

May 2021

Authors: Martha Stinson, T. Kirk White, Teresa C. Fort, Christopher Goetz, Nathan Goldschlag, Melissa Chow, Elisabeth Ruth Perlman, James Lawrence

Working Paper Number:

CES-21-08

In this paper we describe the U.S. Census Bureau's redesign and production implementation of the Longitudinal Business Database (LBD) first introduced by Jarmin and Miranda (2002). The LBD is used to create the Business Dynamics Statistics (BDS), tabulations describing the entry, exit, expansion, and contraction of businesses. The new LBD and BDS also incorporate information formerly provided by the Statistics of U.S. Businesses program, which produced similar year-to-year measures of employment and establishment flows. We describe in detail how the LBD is created from curation of the input administrative data, longitudinal matching, retiming of economic census-year births and deaths, creation of vintage consistent industry codes and noise factors, and the creation and cleaning of each year of LBD data. This documentation is intended to facilitate the proper use and understanding of the data by both researchers with approved projects accessing the LBD microdata and those using the BDS tabulations.
Working Paper

Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map

January 2017

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Andrew S. Green

Working Paper Number:

CES-17-71

We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
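For readers unfamiliar with the Rubin (1987) framework mentioned in this abstract, the Python sketch below applies the standard multiple-imputation combining rules to a set of implicates. It is only the textbook version: the special formulas the paper derives for the QWI and LODES disclosure avoidance system are not reproduced here, and the function name and example numbers are hypothetical.

```python
from statistics import mean, variance


def rubin_combine(estimates, within_variances):
    """Combine m implicate estimates with the standard Rubin rules.

    Returns (point estimate, within variance, between variance,
    total variance), where total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two implicates with matching variances")
    q_bar = mean(estimates)              # average of the implicate estimates
    w_bar = mean(within_variances)       # average within-implicate variance
    b = variance(estimates)              # between-implicate variance (m - 1 denominator)
    total = w_bar + (1.0 + 1.0 / m) * b  # total variability
    return q_bar, w_bar, b, total


# Made-up estimates from three implicates of the same tabulation cell.
q_bar, w_bar, b, total = rubin_combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(q_bar, w_bar, b, total)
```

The between-implicate term is what captures the extra uncertainty from the edit and imputation models; with a single implicate it cannot be estimated, which is why the sketch requires at least two.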
Working Paper

R&D, Attrition and Multiple Imputation in BRDIS

January 2017

Authors: Juana Sanchez, Sydney Noelle Kahmann

Working Paper Number:

CES-17-13

Multiple imputation in business establishment surveys like BRDIS, an annual business survey in which some companies are sampled every year or multiple years, may enhance the estimates of total R&D in addition to helping researchers estimate models with subpopulations of small sample size. Considering a panel of BRDIS companies throughout the years 2008 to 2013 linked to LBD data, this paper uses the conclusions obtained with missing data visualization and other explorations to come up with a strategy to conduct multiple imputation appropriate to address the item nonresponse in R&D expenditures. Because survey design characteristics are behind much of the item and unit nonresponse, multiple imputation of missing data in BRDIS changes the estimates of total R&D significantly and alters the conclusions reached by models of the determinants of R&D investment obtained with complete case analysis.
Working Paper

Releasing Earnings Distributions using Differential Privacy: Disclosure Avoidance System For Post Secondary Employment Outcomes (PSEO)

April 2019

Authors: Kevin L. McKinney, Andrew Foote, Ashwin Machanavajjhala

Working Paper Number:

CES-19-13

The U.S. Census Bureau recently released data on earnings percentiles of graduates from post secondary institutions. This paper describes and evaluates the disclosure avoidance system developed for these statistics. We propose a differentially private algorithm for releasing these data based on standard differentially private building blocks: constructing a histogram of earnings and applying the Laplace mechanism to recover a differentially private CDF of earnings. We demonstrate that our algorithm can release earnings distributions with low error, and that it outperforms prior work based on the concept of smooth sensitivity from Nissim, Raskhodnikova and Smith (2007).
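The building blocks named in this abstract can be sketched in a few lines of Python: count earnings into fixed bins, add Laplace noise to each count, and accumulate the noisy counts into a CDF. This is a minimal sketch, not the PSEO system itself; the bin edges, the example data, the function names, and the choice of sensitivity 1 (neighboring datasets differ by adding or removing one person) are all assumptions.

```python
import random


def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_cdf(earnings, bin_edges, epsilon, seed=None):
    """Return a noisy CDF evaluated at the upper edge of each bin."""
    rng = random.Random(seed)
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for x in earnings:
        for i in range(n_bins):
            if bin_edges[i] <= x < bin_edges[i + 1]:
                counts[i] += 1
                break  # values outside all bins are ignored in this sketch

    # Assumed sensitivity 1, so the Laplace scale is 1 / epsilon.
    # Clamping at zero is post-processing and does not use the raw data.
    noisy = [max(0.0, c + laplace_noise(rng, 1.0 / epsilon)) for c in counts]

    cumulative, running = [], 0.0
    for c in noisy:
        running += c
        cumulative.append(running)
    if running == 0.0:
        return [0.0] * n_bins  # degenerate case: every noisy count was clamped
    return [c / running for c in cumulative]


# Made-up earnings, evenly spaced from 20,000 to 119,500.
earnings = [20000 + 500 * i for i in range(200)]
edges = [0, 30000, 60000, 90000, 120000, 150000]
cdf = dp_cdf(earnings, edges, epsilon=1.0, seed=42)
print(cdf)
```

Percentiles can then be read off the noisy CDF by finding the first bin whose cumulative share reaches the target, at a resolution limited by the bin width.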
