CREAT - Census Bureau

Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time Series

July 2012

Written by: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Bryce Stephens, Simon Woodcock, Kaj Gittings

Working Paper Number:

CES-12-13

Abstract

The Census Bureau's Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by worker characteristics in various combinations. The data are released for several levels of NAICS industries and geography, the lowest aggregation of the latter being counties. Disclosure avoidance methods are required to protect the information about individuals and businesses that contribute to the underlying data. The QWI disclosure avoidance mechanism we describe here relies heavily on the use of noise infusion through a permanent multiplicative noise distortion factor, used for magnitudes, counts, differences and ratios. There is minimal suppression and no complementary suppressions. To our knowledge, the release in 2003 of the QWI was the first large-scale use of noise infusion in any official statistical product. We show that the released statistics are analytically valid along several critical dimensions: measures are unbiased and time series properties are preserved. We provide an analysis of the degree to which confidentiality is protected. Furthermore, we show how the judicious use of synthetic data, injected into the tabulation process, can completely eliminate suppressions, maintain analytical validity, and increase the protection of the underlying confidential data.
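
The idea of a permanent multiplicative distortion factor can be illustrated with a minimal sketch. It is not the production QWI algorithm: the assumptions here are that each employer receives one factor, drawn once and reused in every quarter, and that the factor is uniform on two intervals bounded away from 1. The names `draw_permanent_factor` and `distort_cell` and the parameters `a` and `b` are hypothetical and chosen only for illustration.

```python
import random

def draw_permanent_factor(rng, a=0.05, b=0.15):
    """Draw one multiplicative distortion factor that is bounded away from 1.

    Assumption: the distortion magnitude is uniform on [a, b] and its sign is
    equally likely to be positive or negative. The values of a and b are
    illustrative only.
    """
    magnitude = rng.uniform(a, b)
    sign = rng.choice((-1.0, 1.0))
    return 1.0 + sign * magnitude

def distort_cell(series_by_employer, factors):
    """Sum employer-level series after scaling each by its permanent factor."""
    n_quarters = len(next(iter(series_by_employer.values())))
    return [
        sum(factors[emp] * series[t] for emp, series in series_by_employer.items())
        for t in range(n_quarters)
    ]

rng = random.Random(2012)
# Hypothetical quarterly employment counts for two employers in one cell.
employment = {"emp_A": [10, 12, 15, 15], "emp_B": [200, 210, 190, 205]}
# Each factor is drawn once and applied to every quarter of that employer.
factors = {emp: draw_permanent_factor(rng) for emp in employment}
published = distort_cell(employment, factors)
```

Because the factor does not change over time, the quarter-to-quarter growth ratio of a single employer's series is unchanged by the distortion, which is one way to see why a permanent factor is compatible with preserving time series properties.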

Document Tags and Keywords

Keywords:

analysis, economist, data, payroll, statistical, report, quarterly, statistical agencies, disclosure, confidentiality, estimator, earnings, employee, statistician, measure, workforce, labor statistics, employment count, measures employment, employment measures, census employment, workforce indicators, statistical disclosure

Tags:

Bureau of Labor Statistics, National Science Foundation, Center for Economic Studies, County Business Patterns, Employer Identification Numbers, National Research Council, Cornell University, Journal of Economic Literature, Research Data Center, North American Industry Classification System, Alfred P Sloan Foundation, Longitudinal Employer Household Dynamics, Cornell Institute for Social and Economic Research, LEHD Program, Quarterly Workforce Indicators, Quarterly Census of Employment and Wages, Census Bureau Disclosure Review Board, Commodity Flow Survey, Multiple Worksite Report

Similar Working Papers

The 10 most similar working papers to the working paper 'Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time Series' are listed below in order of similarity.

Working Paper

Confidentiality Protection in the Census Bureau Quarterly Workforce Indicators

February 2006

Authors: Lars Vilhuber, John M. Abowd, Bryce Stephens

Working Paper Number:

tp-2006-02

The Quarterly Workforce Indicators are new estimates developed by the Census Bureau's Longitudinal Employer-Household Dynamics Program as a part of its Local Employment Dynamics partnership with 37 state Labor Market Information offices. These data provide detailed quarterly statistics on employment, accessions, layoffs, hires, separations, full-quarter employment (and related flows), job creations, job destructions, and earnings (for flow and stock categories of workers). The data are released for NAICS industries (and 4-digit SICs) at the county, workforce investment board, and metropolitan area levels of geography. The confidential microdata (unemployment insurance wage records, ES-202 establishment employment, and Title 13 demographic and economic information) are protected using a permanent multiplicative noise distortion factor. This factor distorts all input sums, counts, differences and ratios. The released statistics are analytically valid: measures are unbiased and time series properties are preserved. The confidentiality protection is manifested in the release of some statistics that are flagged as "significantly distorted to preserve confidentiality." These statistics differ from the undistorted statistics by a significant proportion. Even for the significantly distorted statistics, the data remain analytically valid for time series properties. The released data can be aggregated; however, published aggregates are less distorted than custom post-release aggregates. In addition to the multiplicative noise distortion, confidentiality protection is provided by the estimation process for the QWIs, which multiply imputes all missing data (including missing establishment, given UI account, in the UI wage record data) and dynamically re-weights the establishment data to provide state-level comparability with the BLS's Quarterly Census of Employment and Wages.
Working Paper

Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map

January 2017

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Andrew S. Green

Working Paper Number:

CES-17-71

We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
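
For orientation, the standard Rubin (1987) combining rules for multiply imputed estimates can be sketched as follows. This shows only the textbook decomposition into within-implicate and between-implicate variance; it is not the special formulas the abstract refers to, which additionally account for the disclosure avoidance system. The function name `rubin_combine` and the example numbers are hypothetical.

```python
from statistics import mean, variance

def rubin_combine(estimates, within_variances):
    """Combine M implicate-level estimates with the standard Rubin rules.

    estimates: point estimate from each implicate (thread).
    within_variances: estimated sampling variance from each implicate.
    Returns (q_bar, u_bar, b, total), where
      q_bar = average point estimate,
      u_bar = average within-implicate variance,
      b     = between-implicate variance (divisor M - 1),
      total = u_bar + (1 + 1/M) * b.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two implicates with matching variances")
    q_bar = mean(estimates)
    u_bar = mean(within_variances)
    b = variance(estimates)  # sample variance across implicates
    total = u_bar + (1 + 1 / m) * b
    return q_bar, u_bar, b, total

# Hypothetical example: four implicates of one tabulation cell.
q_bar, u_bar, b, total = rubin_combine([100, 104, 96, 102], [4, 4, 4, 4])
```

The ratio of the between-implicate term to the total gives a rough sense of how much of the variability is attributable to edit and imputation uncertainty rather than sampling.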
Working Paper

Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics

February 2016

Authors: Javier Miranda, Lars Vilhuber

Working Paper Number:

CES-16-10

We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
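
One very simple way to picture the use of synthetic data only in potentially sensitive cells is the cell-level rule sketched below. It is an illustrative simplification, not one of the algorithms from the paper, which blends records at the microdata level. The threshold `min_count`, the function name `blend_tabulation`, and all cell values are hypothetical.

```python
def blend_tabulation(observed, synthetic, min_count=3):
    """Build a table that uses synthetic values only in sensitive cells.

    observed: maps cell -> (establishment_count, tabulated_value) from the
              confidential data.
    synthetic: maps cell -> tabulated_value computed from synthetic microdata.
    A cell is treated as sensitive when its establishment count is below
    min_count (an assumed rule, used here only for illustration).
    """
    table = {}
    for cell, (count, value) in observed.items():
        if count < min_count:
            table[cell] = {"value": synthetic[cell], "source": "synthetic"}
        else:
            table[cell] = {"value": value, "source": "observed"}
    return table

# Hypothetical cells: county_B has too few establishments to publish as observed.
observed_cells = {"county_A": (25, 1840.0), "county_B": (2, 95.0)}
synthetic_cells = {"county_A": 1795.0, "county_B": 110.0}
table = blend_tabulation(observed_cells, synthetic_cells)
```

Under this rule every cell receives a published value, so no suppressions are needed, while non-sensitive cells keep their observed values.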
Working Paper

Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in OnTheMap

September 2020

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Andrew S. Green

Working Paper Number:

CES-20-30

We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level nonresponse; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.
Working Paper

The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators

January 2006

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Bryce Stephens, Fredrik Andersson, Marc Roemer, Simon Woodcock

Working Paper Number:

tp-2006-01

The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, has built a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. Beginning in 2003 and building on this infrastructure, the Census Bureau has published the Quarterly Workforce Indicators (QWI), a new collection of data series that offers unprecedented detail on the local dynamics of labor markets. Despite the fine detail, confidentiality is maintained due to the application of state-of-the-art confidentiality protection methods. This article describes how the input files are compiled and combined to create the infrastructure files. We describe the multiple imputation methods used to impute missing data and the statistical matching techniques used to combine and edit data when a direct identifier match requires improvement. Both of these innovations are crucial to the success of the final product. Finally, we pay special attention to the details of the confidentiality protection system used to protect the identity and micro data values of the underlying entities used to form the published estimates. We provide a brief description of public-use and restricted-access data files with pointers to further documentation for researchers interested in using these data.
Working Paper

The Creation of the Employment Dynamics Estimates

July 2002

Authors: Lars Vilhuber, John M. Abowd, Paul A. Lengermann

Working Paper Number:

tp-2002-13

Working Paper

Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database

February 2011

Authors: Arnold P Reznek, Ron Jarmin, Javier Miranda, John M. Abowd, Jerome P. Reiter, Satkartar K. Kinney

Working Paper Number:

CES-11-04

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
Working Paper

Disclosure Limitation and Confidentiality Protection in Linked Data

January 2018

Authors: Lars Vilhuber, John M. Abowd, Ian M. Schmutte

Working Paper Number:

CES-18-07

Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.
Working Paper

Distribution Preserving Statistical Disclosure Limitation

September 2006

Authors: Simon Woodcock, Gary Benedetto

Working Paper Number:

tp-2006-04

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
Working Paper

The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers

October 2002

Authors: Lars Vilhuber, John M. Abowd

Working Paper Number:

tp-2002-17

In this paper, we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators (QWI) before and after correcting for such errors in SSN-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from 0.25 percent up to 15 percent for flow statistics, and up to 5 percent for payroll aggregates.
