Statistical agencies frequently publish microdata that have been altered to protect confidentiality. Such data retain utility for many types of broad analyses but can yield biased or insufficiently precise results in others. Research access to de-identified versions of the restricted-use data with little or no alteration is often possible, albeit costly and time-consuming. We investigate the advantages and disadvantages of public-use and restricted-use data from the American Community Survey (ACS) in constructing a wage index. The public-use data used were Public Use Microdata Samples, while the restricted-use data were accessed via a Federal Statistical Research Data Center. We discuss the advantages and disadvantages of each data source and compare estimated CWIs and standard errors at the state and labor market levels.
-
SYNTHETIC DATA FOR SMALL AREA ESTIMATION IN THE AMERICAN COMMUNITY SURVEY
April 2013
Working Paper Number:
CES-13-19
Small area estimates provide a critical source of information used to study local populations. Statistical agencies regularly collect data from small areas but are prevented from releasing detailed geographical identifiers in public-use data sets due to disclosure concerns. Alternative data dissemination methods used in practice include releasing summary/aggregate tables, suppressing detailed geographic information in public-use data sets, and accessing restricted data via Research Data Centers. This research examines an alternative method for disseminating microdata that contains more geographical details than are currently being released in public-use data files. Specifically, the method replaces the observed survey values with imputed, or synthetic, values simulated from a hierarchical Bayesian model. Confidentiality protection is enhanced because no actual values are released. The method is demonstrated using restricted data from the 2005-2009 American Community Survey. The analytic validity of the synthetic data is assessed by comparing small area estimates obtained from the synthetic data with those obtained from the observed data.
-
LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA
February 2014
Working Paper Number:
CES-14-11
Distributions of business data are typically much more skewed than those for household or individual data, and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
-
Using Small-Area Estimation (SAE) to Estimate Prevalence of Child Health Outcomes at the Census Regional-, State-, and County-Levels
November 2022
Working Paper Number:
CES-22-48
In this study, we implement small-area estimation to assess the prevalence of child health outcomes at the county, state, and regional levels, using national survey data.
-
Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map
January 2017
Working Paper Number:
CES-17-71
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one, or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability, but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
-
Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed,
partially synthetic data sets. These are data on actual respondents, but with confidential data
replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate
inferences because the distribution of synthetic data is completely determined by the model used
to generate them. We present two practical methods of generating synthetic values when the imputer
has only limited information about the true data generating process. One is applicable when
the true likelihood is known up to a monotone transformation. The second requires only limited
knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential
data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility
and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and
sampling error in the estimated transformation. We validate the approach with a simulation and
application to a large linked employer-employee database.
-
Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database
February 2011
Working Paper Number:
CES-11-04
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
-
School Equalization in the Shadow of Jim Crow: Causes and Consequences of Resource Disparity in Mississippi circa 1940
May 2024
Working Paper Number:
CES-24-25
A school finance equalization program established in Mississippi in 1920 failed to help many of the state's Black students, an outcome that was typical in the segregated U.S. South (Horace Mann Bond, 1934). In majority-Black school districts, local decision-makers overwhelmingly favored white schools when allotting funds from the state's preexisting per capita fund, and the resulting high expenditures on white students rendered these districts ineligible for the equalization program. Thus, while Black students residing in majority-white districts benefitted from increased spending and standards for Black schools, those in majority-Black districts continued to experience extremely low (and even worsening) school funding. We model the processes that led the so-called equalization policy to create disparities in schooling resources for Black students, and estimate effects on Black children using both a neighboring-counties design and an IV strategy. We find that local educational spending had large impacts on Black enrollment rates, as reported in the 1940 census, with Black educational attainment increasing in marginal spending. Finally, we link the 1940 and 2000 censuses to show that Black children exposed to higher levels of school expenditures had significantly more completed schooling and higher income late in life.
-
County-Level Estimates of the Employment Prospects of Low-Skill Workers
July 2000
Working Paper Number:
CES-00-11
This study examines low-skill wage and employment opportunities for men and women at the county level over the period 1989-96. Currently, reliable direct measures of wages and employment rates for different demographic and skill groups are only available for large geographic areas such as regions and populous states or at infrequent intervals (e.g., from the Decennial Census) for some smaller areas. This study constructs indirect annual measures for all counties from 1989-96 by combining skill-specific information on earnings and employment from the Sample Edited Detail File (SEDF) of the 1990 Decennial Census and the 1990-97 Annual Demographic files of the Current Population Survey (CPS) with annual industry-specific information from the Regional Economic Information System (REIS). Special versions of the SEDF and CPS files that identify county of residence are used. The study regresses the low-skill wage and employment data from the SEDF and CPS files on a set of personal variables from the combined files and local employment measures derived from the REIS. The wage regressions are corrected for selectivity from the employment decision and account for county-specific effects as well as general time effects. Estimates from the regressions are then combined with the available employment data from the REIS to impute wage and employment rates for low-skill adults across counties.
-
Design Comparison of LODES and ACS Commuting Data Products
October 2014
Working Paper Number:
CES-14-38
The Census Bureau produces two complementary data products, the American Community Survey (ACS) commuting and workplace data and the Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES), which can be used to answer spatial, economic, and demographic questions relating to workplaces and home-to-work flows. The products are complementary in the sense that they measure similar activities, but each has important unique characteristics that provide information that the other cannot. As a result of questions from data users, the Census Bureau has created this document to highlight the major design differences between these two data products. This report guides users on the relative advantages of each data product for various analyses and helps explain differences that may arise when using the products.
As an overview, these two data products are sourced from different inputs, cover different populations and time periods, are subject to different sets of edits and imputations, are released under different confidentiality protection mechanisms, and are tabulated at different geographic and characteristic levels. As a general rule, the two data products should not be expected to match exactly for arbitrary queries and may differ substantially for some queries.
Within this document, we compare the two data products by the design elements that were deemed most likely to contribute to differences in tabulated data. These elements are: Collection, Coverage, Geographic and Longitudinal Scope, Job Definition and Reference Period, Job and Worker Characteristics, Location Definitions (Workplace and Residence), Completeness of Geographic Information and Edits/Imputations, Geographic Tabulation Levels, Control Totals, Confidentiality Protection and Suppression, and Related Public-Use Data Products.
An in-depth data analysis (in aggregate or with the microdata) between the two data products will be the subject of a future technical report. The Census Bureau has begun a pilot project to integrate ACS microdata with LEHD administrative data to develop an enhanced frame of employment status, place of work, and commuting. The Census Bureau will publish quality metrics for person match rates, residence and workplace match rates, and commute distance comparisons.
-
LEHD Data Documentation LEHD-OVERVIEW-S2008-rev1
December 2011
Working Paper Number:
CES-11-43