-
A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census: Full Technical Report
December 2023
Authors:
Lars Vilhuber,
John M. Abowd,
Ethan Lewis,
Nathan Goldschlag,
Robert Ashmead,
Daniel Kifer,
Philip Leclerc,
Rolando A. Rodríguez,
Tamara Adams,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Scott Moore,
Ramy N. Tadros
Working Paper Number:
CES-23-63R
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level: individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy.
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics' utility for the primary statutory use case: redrawing the boundaries of all of the nation's legislative and voting districts in compliance with the 1965 Voting Rights Act. You are reading the full technical report. For the summary paper see https://doi.org/10.1162/99608f92.4a1ebf70.
-
The 2010 Census Confidentiality Protections Failed, Here's How and Why
December 2023
Authors:
Lars Vilhuber,
John M. Abowd,
Ethan Lewis,
Nathan Goldschlag,
Robert Ashmead,
Daniel Kifer,
Philip Leclerc,
Rolando A. Rodríguez,
Tamara Adams,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Scott Moore,
Ramy N. Tadros
Working Paper Number:
CES-23-63
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.
-
Coverage of Children in the American Community Survey Based on California Birth Records
September 2023
Working Paper Number:
CES-23-46
The U.S. Census Bureau's American Community Survey (ACS) collects information on individuals and households. The ACS provides survey-based estimates of children drawn from a sample of the U.S. population. However, survey responses may not match administrative records, such as birth records. Birth records should provide a complete account of all births, along with child-parent relationships and demographic characteristics. California is a state that has both a large population of children and a high undercount for young children. This paper uses California as a case study to examine differences between reported and unreported children in the ACS based on state birth records. Child reporting rates were lower for more recent data years, for younger children, for Black and Hispanic mothers, and for more complex households. Child reporting rates were higher for more educated mothers and for households above the poverty line. Using mother's race and Hispanic ethnicity from the birth records combined with poverty indices from the ACS, this analysis also finds that child reporting does not vary uniformly with poverty status across all race and ethnicity groups. This research builds support for the utility of state birth records in analyzing the undercount of children.
-
Using Small-Area Estimation (SAE) to Estimate Prevalence of Child Health Outcomes at the Census Regional-, State-, and County-Levels
November 2022
Working Paper Number:
CES-22-48
In this study, we implement small-area estimation to assess the prevalence of child health outcomes at the county, state, and regional levels, using national survey data.
-
The Impact of Household Surveys on 2020 Census Self-Response
July 2022
Working Paper Number:
CES-22-24
Households that were sampled in 2019 for the American Community Survey (ACS) had lower self-response rates to the 2020 Census. The magnitude varied from -1.5 percentage points for households sampled in January 2019 to -15.1 percentage points for households sampled in December 2019. Similar effects are found for the Current Population Survey (CPS).
-
Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning
November 2021
Working Paper Number:
CES-21-35
This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
-
Small Business Pulse Survey Estimates by Owner Characteristics and Rural/Urban Designation
September 2021
Working Paper Number:
CES-21-24
In response to requests from policymakers for additional context for Small Business Pulse Survey (SBPS) measures of the impact of COVID-19 on small businesses, we researched developing estimates by owner characteristics and rural/urban locations. Leveraging geographic coding on the Business Register, we create estimates of the effect of the pandemic on small businesses by urban and rural designations. A more challenging exercise entails linking micro-level data from the SBPS with ownership data from the Annual Business Survey (ABS) to create estimates of the effect of the pandemic on small businesses by owner race, sex, ethnicity, and veteran status. Given important differences in survey design and concerns about nonresponse bias, we face significant challenges in producing estimates for owner demographics. We describe our attempts to meet these challenges and discuss the caution required in interpreting the results. The estimates produced for this paper are available for download. Reflecting the Census Bureau's commitment to scientific inquiry and transparency, the micro data from the SBPS will be available to qualified researchers on approved projects in the Federal Statistical Research Data Center network.
-
Measuring the Impact of COVID-19 on Businesses and People: Lessons from the Census Bureau's Experience
January 2021
Working Paper Number:
CES-21-02
We provide an overview of Census Bureau activities to enhance the consistency, timeliness, and relevance of our data products in response to the COVID-19 pandemic. We highlight new data products designed to provide timely and granular information on the pandemic's impact: the Small Business Pulse Survey, weekly Business Formation Statistics, the Household Pulse Survey, and Community Resilience Estimates. We describe pandemic-related content introduced to existing surveys such as the Annual Business Survey and the Current Population Survey. We discuss adaptations to ensure the continuity and consistency of existing data products such as principal economic indicators and the American Community Survey.
-
Addressing Data Gaps: Four New Lines of Inquiry in the 2017 Economic Census
September 2019
Working Paper Number:
CES-19-28
We describe four new lines of inquiry added to the 2017 Economic Census regarding (i) retail health clinics, (ii) management practices in health care services, (iii) self-service in retail and service industries, and (iv) water use in manufacturing and mining industries. These were proposed by economists from the U.S. Census Bureau's Center for Economic Studies in order to fill data gaps in current Census Bureau products concerning the U.S. economy. The new content addresses such issues as the rise in importance of health care and its complexity, the adoption of automation technologies, and the importance of measuring water, a critical input to many manufacturing and mining industries.
-
Disclosure Avoidance Techniques Used for the 1970 through 2010 Decennial Censuses of Population and Housing
November 2018
Working Paper Number:
CES-18-47
The U.S. Census Bureau conducts the decennial censuses under Title 13 of the U.S. Code with the Section 9 mandate to not "use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports (13 U.S.C. § 9 (2007))." The Census Bureau applies disclosure avoidance techniques to its publicly released statistical products in order to protect the confidentiality of its respondents and their data.