CREAT - Census Bureau

Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data

June 2004

Written by: John M. Abowd, Simon Woodcock

Working Paper Number:

tp-2004-04

Abstract

This paper describes ongoing research to protect confidentiality in longitudinal linked data through creation of multiply-imputed, partially synthetic data. We present two enhancements to the methods of [2]. The first is designed to preserve marginal distributions in the partially synthetic data. The second is designed to protect confidential links between sampling frames.

Document Tags and Keywords

Tags:

National Bureau of Economic Research, Longitudinal Employer Household Dynamics

Similar Working Papers

The 10 most similar working papers to the working paper 'Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data' are listed below in order of similarity.

Working Paper

Distribution Preserving Statistical Disclosure Limitation

September 2006

Authors: Simon Woodcock, Gary Benedetto

Working Paper Number:

tp-2006-04

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.

Working Paper

Synthetic Data and Confidentiality Protection

September 2003

Authors: Julia I. Lane, John M. Abowd

Working Paper Number:

tp-2003-10

Working Paper

Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map

January 2017

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Andrew S. Green

Working Paper Number:

CES-17-71

We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.

Working Paper

LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA

February 2014

Authors: Lars Vilhuber, Javier Miranda

Working Paper Number:

CES-14-11

Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.

Working Paper

Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in OnTheMap

September 2020

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Andrew S. Green

Working Paper Number:

CES-20-30

We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level nonresponse; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.

Working Paper

New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers

June 2004

Authors: Julia I. Lane, John M. Abowd

Working Paper Number:

tp-2004-03

Working Paper

Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics

February 2016

Authors: Javier Miranda, Lars Vilhuber

Working Paper Number:

CES-16-10

We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

Working Paper

IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE

February 2014

Authors: Javier Miranda, Jerome P. Reiter, Satkartar K. Kinney

Working Paper Number:

CES-14-12

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version (now available for public use) of the U.S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

Working Paper

Firm Market Power and the Earnings Distribution

December 2011

Authors: Douglas Webber

Working Paper Number:

CES-11-41

Using the Longitudinal Employer Household Dynamics (LEHD) data from the United States Census Bureau, I compute firm-level measures of labor market (monopsony) power. To generate these measures, I extend the dynamic model proposed by Manning (2003) and estimate the labor supply elasticity facing each private non-farm firm in the US. While a link between monopsony power and earnings has traditionally been assumed, I provide the first direct evidence of the positive relationship between a firm's labor supply elasticity and the earnings of its workers. I also contrast the semistructural method with the more traditional use of concentration ratios to measure a firm's labor market power. In addition, I provide several alternative measures of labor market power which account for potential threats to identification such as endogenous mobility. Finally, I construct a counterfactual earnings distribution which allows the effects of firm market power to vary across the earnings distribution. I estimate the average firm's labor supply elasticity to be 1.08; however, my findings suggest there to be significant variability in the distribution of firm market power across US firms, and that dynamic monopsony models are superior to the use of concentration ratios in evaluating a firm's labor market power. I find that a one-unit increase in the labor supply elasticity to the firm is associated with wage gains of between 5 and 18 percent. While nontrivial, these estimates imply that firms do not fully exercise their labor market power over their workers. Furthermore, I find that the negative earnings impact of a firm's market power is strongest in the lower half of the earnings distribution, and that a one standard deviation increase in firms' labor supply elasticities reduces the variance of the earnings distribution by 9 percent.

Working Paper

A Comparison of Training Modules for Administrative Records Use in Nonresponse Followup Operations: The 2010 Census and the American Community Survey

January 2017

Authors: Mark J. Kutzbach, Hubert P. Janicki, Melissa Chow, Lawrence Warren, Moises Yi

Working Paper Number:

CES-17-47

While modeling work in preparation for the 2020 Census has shown that administrative records can be predictive of Nonresponse Followup (NRFU) enumeration outcomes, there is scope to examine the robustness of the models by using more recent training data. The models deployed for workload removal from the 2015 and 2016 Census Tests were based on associations of the 2010 Census with administrative records. Training the same models with more recent data from the American Community Survey (ACS) can identify any changes in parameter associations over time that might reduce the accuracy of model predictions. Furthermore, more recent training data would allow for the incorporation of new administrative record sources not available in 2010. However, differences in ACS methodology and the smaller sample size may limit its applicability. This paper replicates earlier results and examines model predictions based on the ACS in comparison with NRFU outcomes. The evaluation consists of a comparison of predicted counts and household compositions with actual 2015 NRFU outcomes. The main findings are an overall validation of the methodology using independent data.