This paper describes ongoing research to protect confidentiality in longitudinal linked
data through creation of multiply-imputed, partially synthetic data. We present two enhancements to the methods
of [2]. The first is designed to preserve marginal distributions in the partially synthetic data. The second is
designed to protect confidential links between sampling frames.
-
Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed,
partially synthetic data sets. These are data on actual respondents, but with confidential data
replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate
inferences because the distribution of synthetic data is completely determined by the model used
to generate them. We present two practical methods of generating synthetic values when the imputer
has only limited information about the true data generating process. One is applicable when
the true likelihood is known up to a monotone transformation. The second requires only limited
knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential
data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility
and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and
sampling error in the estimated transformation. We validate the approach with a simulation and
application to a large linked employer-employee database.
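The first idea in this abstract, synthesis when the likelihood is known only up to a monotone transformation, can be illustrated with a small sketch. The code below is not the paper's algorithm. It assumes (hypothetically) that the confidential variable becomes roughly normal after a rank-based normal-scores transform, fits a plug-in linear regression on that scale, and maps the draws back through the empirical quantile function of the confidential data. The function names `to_normal_scores` and `synthesize` are made up for this example, and posterior uncertainty in the regression parameters is deliberately ignored.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the score transform


def to_normal_scores(y):
    """Map y to normal scores through its empirical CDF (a monotone transform)."""
    y = np.asarray(y, dtype=float)
    ranks = np.argsort(np.argsort(y)) + 1      # ranks 1..n, ties broken by position
    u = ranks / (len(y) + 1.0)                 # keeps probabilities strictly in (0, 1)
    return np.array([_STD_NORMAL.inv_cdf(p) for p in u])


def synthesize(y, X, rng):
    """Draw one synthetic replacement for y given covariates X (plug-in sketch)."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    z = to_normal_scores(y)
    beta, *_ = np.linalg.lstsq(design, z, rcond=None)
    resid = z - design @ beta
    sigma = resid.std(ddof=design.shape[1])
    # Simplification: beta and sigma are treated as fixed, not drawn from a posterior.
    z_syn = design @ beta + rng.normal(0.0, sigma, size=len(y))
    # Back-transform: normal CDF, then the empirical quantile function of y.
    u_syn = np.array([_STD_NORMAL.cdf(v) for v in z_syn])
    return np.quantile(y, u_syn)
```

Because the back-transform uses the empirical quantiles of the confidential variable, synthetic values stay inside its observed range and their marginal distribution tracks the original one. Repeating the draw with different random states would give the multiple implicates.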
-
Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map
January 2017
Working Paper Number:
CES-17-71
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
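For orientation, the standard Rubin (1987) combining rules that the abstract refers to can be sketched as follows. This shows only the textbook rules for m completed-data estimates, not the special formulas the paper derives for the QWI and LODES disclosure avoidance system. The function name `rubin_combine` is an assumption made for this example.

```python
import numpy as np


def rubin_combine(estimates, variances):
    """Combine m implicate estimates with the standard Rubin (1987) rules.

    estimates: point estimate from each implicate (thread)
    variances: estimated sampling variance of each of those estimates
    Returns (pooled estimate, within variance, between variance, total variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    within = u.mean()                     # average within-implicate variance
    between = q.var(ddof=1)               # variance of estimates across implicates
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, within, between, total
```

The between-implicate term captures the variability contributed by the edit and imputation models, while the within-implicate term carries the design-based sampling variability.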
-
Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in OnTheMap
September 2020
Working Paper Number:
CES-20-30
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level nonresponse; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.
-
Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics
February 2016
Working Paper Number:
CES-16-10
We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
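The blending idea can be illustrated with a minimal sketch. It assumes a hypothetical sensitivity rule (a cell is sensitive when it contains fewer than `min_count` observed records) and a simple one-for-one swap of the synthetic cell total. The paper describes several algorithms, and neither the rule, the threshold, nor the function names `tabulate` and `blended_table` come from it.

```python
from collections import defaultdict


def tabulate(records):
    """Sum values and count records per cell; records are (cell, value) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cell, value in records:
        totals[cell] += value
        counts[cell] += 1
    return totals, counts


def blended_table(observed, synthetic, min_count=3):
    """Publish observed totals, except in sensitive cells, which use synthetic totals.

    Hypothetical rule: a cell is sensitive if it has fewer than min_count
    observed records.
    """
    obs_totals, obs_counts = tabulate(observed)
    syn_totals, _ = tabulate(synthetic)
    table = {}
    for cell in obs_totals:
        if obs_counts[cell] < min_count:
            table[cell] = (syn_totals.get(cell, 0.0), "synthetic")
        else:
            table[cell] = (obs_totals[cell], "observed")
    return table
```

With this design no cell has to be suppressed: every cell in the observed tabulation receives a published value, and the source flag records where synthetic data were used.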
-
IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE
February 2014
Working Paper Number:
CES-14-12
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version (now available for public use) of the U.S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.
-
Synthetic Data and Confidentiality Protection
September 2003
Working Paper Number:
tp-2003-10
-
LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA
February 2014
Working Paper Number:
CES-14-11
Distributions of business data are typically much more skewed than those for household or individual data, and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies, in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University, made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
-
Firm Market Power and the Earnings Distribution
December 2011
Working Paper Number:
CES-11-41
Using the Longitudinal Employer Household Dynamics (LEHD) data from the United States Census Bureau, I compute firm-level measures of labor market (monopsony) power. To generate these measures, I extend the dynamic model proposed by Manning (2003) and estimate the labor supply elasticity facing each private non-farm firm in the US. While a link between monopsony power and earnings has traditionally been assumed, I provide the first direct evidence of the positive relationship between a firm's labor supply elasticity and the earnings of its workers. I also contrast the semistructural method with the more traditional use of concentration ratios to measure a firm's labor market power. In addition, I provide several alternative measures of labor market power which account for potential threats to identification such as endogenous mobility. Finally, I construct a counterfactual earnings distribution which allows the effects of firm market power to vary across the earnings distribution. I estimate the average firm's labor supply elasticity to be 1.08; however, my findings suggest that there is significant variability in the distribution of firm market power across US firms, and that dynamic monopsony models are superior to the use of concentration ratios in evaluating a firm's labor market power. I find that a one-unit increase in the labor supply elasticity to the firm is associated with wage gains of between 5 and 18 percent. While nontrivial, these estimates imply that firms do not fully exercise their labor market power over their workers. Furthermore, I find that the negative earnings impact of a firm's market power is strongest in the lower half of the earnings distribution, and that a one standard deviation increase in firms' labor supply elasticities reduces the variance of the earnings distribution by 9 percent.
-
A FIRST STEP TOWARDS A GERMAN SYNLBD: CONSTRUCTING A GERMAN LONGITUDINAL BUSINESS DATABASE
February 2014
Working Paper Number:
CES-14-13
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
-
New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers
June 2004
Working Paper Number:
tp-2004-03