CREAT - Census Bureau

A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census

August 2025

Written by: Lars Vilhuber, John M. Abowd, Ethan Lewis, Nathan Goldschlag, Michael B. Hawes, Robert Ashmead, Daniel Kifer, Philip Leclerc, Rolando A. Rodríguez, Tamara Adams, David Darais, Sourya Dey, Simson L. Garfinkel, Scott Moore, Ramy N. Tadros

Working Paper Number:

CES-25-57

Abstract

For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level'individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy. Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics' utility for the primary statutory use case: redrawing the boundaries of all of the nation's legislative and voting districts in compliance with the 1965 Voting Rights Act.

Document Tags and Keywords

Keywords:

statistical, report, census data, survey, respondent, minority, ethnicity, hispanic, economic census, percentile, population, census bureau, decade, use census, assessed, census responses, census disclosure

Tags:

Center for Economic Studies, 1940 Census, Cornell University, Social Security Number, United States Census Bureau, Census Bureau Disclosure Review Board, 2010 Census, MAFID, Census Edited File, Some Other Race

Similar Working Papers

The 10 most similar working papers to the working paper 'A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census' are listed below in order of similarity.

Working Paper
🔥

A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census: Full Technical Report

December 2023

Authors: Lars Vilhuber, John M. Abowd, Ethan Lewis, Nathan Goldschlag, Robert Ashmead, Daniel Kifer, Philip Leclerc, Rolando A. Rodríguez, Tamara Adams, David Darais, Sourya Dey, Simson L. Garfinkel, Scott Moore, Ramy N. Tadros

Working Paper Number:

CES-23-63R

For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level'individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy. Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics' utility for the primary statutory use case: redrawing the boundaries of all of the nation's legislative and voting districts in compliance with the 1965 Voting Rights Act. You are reading the full technical report. For the summary paper see https://doi.org/10.1162/99608f92.4a1ebf70.
View Full Paper PDF
Working Paper
🔥

An In-Depth Examination of Requirements for Disclosure Risk Assessment

October 2023

Authors: Ron Jarmin, John M. Abowd, Ian M. Schmutte, Jerome P. Reiter, Nathan Goldschlag, Victoria A. Velkoff, Michael B. Hawes, Robert Ashmead, Ryan Cumings-Menon, Sallie Ann Keller, Daniel Kifer, Philip Leclerc, Rolando A. Rodríguez, Pavel Zhuravlev

Working Paper Number:

CES-23-49

The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. Following long-established precedent in economics and statistics, we argue that any proposal for quantifying disclosure risk should be based on pre-specified, objective criteria. Such criteria should be used to compare methodologies to identify those with the most desirable properties. We illustrate this approach, using simple desiderata, to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. Thus, more research is needed, but in the near-term, the counterfactual approach appears best-suited for privacy-utility analysis.
View Full Paper PDF
Working Paper
🔥

Why the Economics Profession Must Actively Participate in the Privacy Protection Debate

March 2019

Authors: Lars Vilhuber, John M. Abowd, Ian M. Schmutte, William N. Sexton

Working Paper Number:

CES-19-09

When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality; a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
View Full Paper PDF
Working Paper

Disclosure Avoidance Techniques Used for the 1970 through 2010 Decennial Censuses of Population and Housing

November 2018

Authors: Laura McKenna

Working Paper Number:

CES-18-47

The U.S. Census Bureau conducts the decennial censuses under Title 13 of the U. S. Code with the Section 9 mandate to not 'use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports (13 U.S.C. ' 9 (2007)).' The Census Bureau applies disclosure avoidance techniques to its publicly released statistical products in order to protect the confidentiality of its respondents and their data.
View Full Paper PDF
Working Paper

An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices

August 2018

Authors: John M. Abowd, Ian M. Schmutte

Working Paper Number:

CES-18-35

Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.
View Full Paper PDF
Working Paper

Examining Multi-Level Correlates of Suicide by Merging NVDRS and ACS Data

January 2017

Authors: David A Boulifard, Bernice A Pescosolido

Working Paper Number:

CES-17-25

This paper describes a novel database and an associated suicide event prediction model that surmount longstanding barriers in suicide risk factor research. The database comingles person-level records from the National Violent Death Reporting System (NVDRS) and the American Community Survey (ACS) to establish a case-control study sample that includes all identified suicide cases, while faithfully reflecting general population sociodemographics, in sixteen USA states during the years 2005 2011. It supports a statistical model of individual suicide risk that accommodates person-level factors and the moderation of these factors by their community rates. Named the United States Multi-Level Suicide Data Set (US-MSDS), the database was developed outside the RDC laboratory using publicly available ACS microdata, and reconstructed inside the laboratory using restricted access ACS microdata. Analyses of the latter version yielded findings that largely amplified but also extended those obtained from analyses of the former. This experience shows that the analytic precision achievable using restricted access ACS data can play an important role in conducting social research, although it also indicates that publicly available ACS data have considerable value in conducting preliminary analyses and preparing to use an RDC laboratory. The database development strategy may interest scientists investigating sociodemographic risk factors for other types of low-frequency mortality.
View Full Paper PDF
Working Paper

Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets

June 2024

Authors: Narayan Sastry, Todd Gardner, Matthew Cefalu, John Sullivan, Elizabeth Fussell

Working Paper Number:

CES-24-27

This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file and as such not all records will be assigned an identifier. This article is a tutorial for using the twangRDC to generate nonresponse weights to account for non-linkage of person records across US Census Bureau datasets.
View Full Paper PDF
Working Paper

Access Methods for United States Microdata

August 2007

Authors: John M. Abowd, Daniel Weinberg, Sandra Rowland, Philip Steel, Laura Zayatz

Working Paper Number:

CES-07-25

Beyond the traditional methods of tabulations and public-use microdata samples, statistical agencies have developed four key alternatives for providing non-government researchers with access to confidential microdata to improve statistical modeling. The first, licensing, allows qualified researchers access to confidential microdata at their own facilities, provided certain security requirements are met. The second, statistical data enclaves, offer qualified researchers restricted access to confidential economic and demographic data at specific agency-controlled locations. Third, statistical agencies can offer remote access, through a computer interface, to the confidential data under automated or manual controls. Fourth, synthetic data developed from the original data but retaining the correlations in the original data have the potential for allowing a wide range of analyses.
View Full Paper PDF
Working Paper

Person Matching in Historical Files using the Census Bureau's Person Validation System

September 2014

Authors: Amy B. O'Hara, Catherine G. Massey, Amy OHara

Working Paper Number:

carra-2014-11

The recent release of the 1940 Census manuscripts enables the creation of longitudinal data spanning the whole of the twentieth century. Linked historical and contemporary data would allow unprecedented analyses of the causes and consequences of health, demographic, and economic change. The Census Bureau is uniquely equipped to provide high quality linkages of person records across datasets. This paper summarizes the linkage techniques employed by the Census Bureau and discusses utilization of these techniques to append protected identification keys to the 1940 Census.
View Full Paper PDF
Working Paper

Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods

January 2017

Authors: John M. Abowd, Ian M. Schmutte

Working Paper Number:

CES-17-37

We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner's problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.
View Full Paper PDF

A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census

August 2025

Working Paper Number:

CES-25-57

Abstract

Document Tags and Keywords

The 10 most similar working papers to the working paper 'A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census' are listed below in order of similarity.

December 2023

Working Paper Number:

CES-23-63R

October 2023

Working Paper Number:

CES-23-49

March 2019

Working Paper Number:

CES-19-09

November 2018

Working Paper Number:

CES-18-47

August 2018

Working Paper Number:

CES-18-35

January 2017

Working Paper Number:

CES-17-25

June 2024

Working Paper Number:

CES-24-27

August 2007

Working Paper Number:

CES-07-25

September 2014

Working Paper Number:

carra-2014-11

January 2017

Working Paper Number:

CES-17-37