-
NOISE INFUSION AS A CONFIDENTIALITY PROTECTION MEASURE FOR GRAPH-BASED STATISTICS
September 2014
Working Paper Number:
CES-14-30
We use the bipartite graph representation of longitudinally linked em-ployer-employee data, and the associated projections onto the employer and em-ployee nodes, respectively, to characterize the set of potential statistical summar-ies that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightfor-ward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.
View Full
Paper PDF
-
IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE
February 2014
Working Paper Number:
CES-14-12
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models de- signed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version'now available for public use'of the U. S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal cen- sus of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.
View Full
Paper PDF
-
EXPANDING THE ROLE OF SYNTHETIC DATA AT THE U.S. CENSUS BUREAU
February 2014
Working Paper Number:
CES-14-10
National Statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public- use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss re- cent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.
View Full
Paper PDF
-
Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time Series
July 2012
Working Paper Number:
CES-12-13
The Census Bureau's Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by worker characteristics in various combinations. The data are released for several levels of NAICS industries and geography, the lowest aggregation of the latter being counties. Disclosure avoidance methods are required to protect the information about individuals and businesses that contribute to the underlying data. The QWI disclosure avoidance mechanism we describe here relies heavily on the use of noise infusion through a permanent multiplicative noise distortion factor, used for magnitudes, counts, differences and ratios. There is minimal suppression and no complementary suppressions. To our knowledge, the release in 2003 of the QWI was the first large-scale use of noise infusion in any official statistical product. We show that the released statistics are analytically valid along several critical dimensions { measures are unbiased and time series properties are preserved. We provide an analysis of the degree to which confidentiality is protected. Furthermore, we show how the judicious use of synthetic data, injected into the tabulation process, can completely eliminate suppressions, maintain analytical validity, and increase the protection of the underlying confidential data.
View Full
Paper PDF
-
Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database
February 2011
Working Paper Number:
CES-11-04
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments\' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
View Full
Paper PDF
-
Access Methods for United States Microdata
August 2007
Working Paper Number:
CES-07-25
Beyond the traditional methods of tabulations and public-use microdata samples, statistical agencies have developed four key alternatives for providing non-government researchers with access to confidential microdata to improve statistical modeling. The first, licensing, allows qualified researchers access to confidential microdata at their own facilities, provided certain security requirements are met. The second, statistical data enclaves, offer qualified researchers restricted access to confidential economic and demographic data at specific agency-controlled locations. Third, statistical agencies can offer remote access, through a computer interface, to the confidential data under automated or manual controls. Fourth, synthetic data developed from the original data but retaining the correlations in the original data have the potential for allowing a wide range of analyses.
View Full
Paper PDF
-
Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed,
partially synthetic data sets. These are data on actual respondents, but with confidential data
replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate
inferences because the distribution of synthetic data is completely determined by the model used
to generate them. We present two practical methods of generating synthetic values when the imputer
has only limited information about the true data generating process. One is applicable when
the true likelihood is known up to a monotone transformation. The second requires only limited
knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential
data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility
and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and
sampling error in the estimated transformation. We validate the approach with a simulation and
application to a large linked employer-employee database.
View Full
Paper PDF
-
New Approaches to Confidentiality Protection Synthetic Data, Remote Access and Research Data Centers
June 2004
Working Paper Number:
tp-2004-03
View Full
Paper PDF
-
The Importance of Establishment Data in Economic Research
August 1993
Working Paper Number:
CES-93-10
The importance and usefulness of establishment microdata for economic research and policy analysis is outlined and contrasted with traditional products of statistical agencies -- aggregate cross-section tabulations. It is argued that statistical agencies must begin to seriously rethink the way they view establishment data products.
View Full
Paper PDF
-
Analytic Use Of Economic Microdata; A Model For Researcher Access With Confidentiality Protection
August 1992
Working Paper Number:
CES-92-08
A primary responsibility of the Center for Economic Studies (CES) of the U.S. Bureau of the Census is to facilitate researcher access to confidential economic microdata files. Benefits from this program accrue not only to policy makers--there is a growing awareness of the importance of microdata for analyzing both the descriptive and welfare implications of regulatory and environmental changes--but also and importantly to the statistical agencies themselves. In fact, there is substantial recent literature arguing for the proposition that the largest single improvement that the U.S. statistical system could make is to improve its analytic capabilities. In this paper I briefly discuss these benefits to greater access for analytical work and ways to achieve them. Due to the nature of business data, public use databases and masking technologies are not available as vehicles for releasing useful microdata files. I conclude that a combination of outside and inside research programs, carefully coordinated and integrated is the best model for ensuring that statistical agencies reap the gains from analytic data users. For the United States, at least, this is fortuitous with respect to justifying access since any direct research with confidential data by outsiders must have a "statistical purpose". Until the advent of CES, it was virtually impossible for researchers to work with the economic microdata collected by the various economic censuses. While the CES program is quite large, as it now stands, researchers, or their representatives, must come to the Census Bureau in Washington, D.C. to access the data. The success of the program has led to increasing demands for data access in facilities outside of the Washington, D.C. area. Two options are considered: 1) Establish Census Bureau facilities in various universities or similar nonprofit research facilities and 2) Develop CES regional operations in existing Census Bureau regional offices.
View Full
Paper PDF