This paper describes a novel database and an associated suicide event prediction model that surmount longstanding barriers in suicide risk factor research. The database comingles person-level records from the National Violent Death Reporting System (NVDRS) and the American Community Survey (ACS) to establish a case-control study sample that includes all identified suicide cases, while faithfully reflecting general population sociodemographics, in sixteen U.S. states during the years 2005-2011. It supports a statistical model of individual suicide risk that accommodates person-level factors and the moderation of these factors by their community rates. Named the United States Multi-Level Suicide Data Set (US-MSDS), the database was developed outside the RDC laboratory using publicly available ACS microdata, and reconstructed inside the laboratory using restricted access ACS microdata. Analyses of the latter version yielded findings that largely amplified but also extended those obtained from analyses of the former. This experience shows that the analytic precision achievable using restricted access ACS data can play an important role in conducting social research, although it also indicates that publicly available ACS data have considerable value in conducting preliminary analyses and preparing to use an RDC laboratory. The database development strategy may interest scientists investigating sociodemographic risk factors for other types of low-frequency mortality.
-
The Case of the Missing Ethnicity: Indians without Tribes in the 21st Century
June 2011
Working Paper Number:
CES-11-17
Among American Indians and Alaska Natives, most aspects of ethnicity are tightly associated with the person's tribal origins. Language, history, foods, land, and traditions differ among the hundreds of tribes indigenous to the United States. Why did almost one million of them fail to respond to the tribal affiliation part of the Census 2000 race question? We investigate four hypotheses about why one-third of multiracial American Indians and one-sixth of single-race American Indians did not report a tribe: (1) survey item non-response, which undermines all fill-in-the-blank questions, (2) a non-salient tribal identity, (3) a genealogy-based affiliation, and (4) mestizo identity, which does not require a tribe. We use multivariate logistic regression models and high-density restricted-use Census 2000 data. We find support for the first two hypotheses and note that the predictors and results differ substantially for single-race versus multiple-race American Indians.
-
A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census
August 2025
Authors:
Lars Vilhuber,
John M. Abowd,
Ethan Lewis,
Nathan Goldschlag,
Michael B. Hawes,
Robert Ashmead,
Daniel Kifer,
Philip Leclerc,
Rolando A. Rodríguez,
Tamara Adams,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Scott Moore,
Ramy N. Tadros
Working Paper Number:
CES-25-57
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level, individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy.
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics' utility for the primary statutory use case: redrawing the boundaries of all of the nation's legislative and voting districts in compliance with the 1965 Voting Rights Act.
-
Shift or replenishment? Reassessing the prospect of stable Spanish bilingualism across contexts of ethnic change
June 2023
Working Paper Number:
CES-23-28
Much of the existing literature on Latinos' use of Spanish claims that a general pattern of intergenerational decline in the use of Spanish will produce an overall shift away from Spanish use in the U.S. (Rumbaut, Massey, and Bean 2006; Veltman 1983b, 1990). In contrast, recent works emphasize the importance of the social and linguistic context in reinforcing the use of Spanish as well as (pan)ethnic identities among U.S.-born Latinos (Linton 2004; Linton and Jiménez 2009; Stevens 1992). This literature suggests conditions under which Spanish-English bilingualism might become stable at the level of metropolitan areas; however, such conditions depend on how immigration shapes the context of language use for native-born Latinos. Given the declining levels of immigration from Latin America, will bilingualism subside in the U.S., or have certain communities created conditions in which bilingualism can be stable? Using geocoded data from restricted access versions of the Survey of Income and Program Participation (SIPP) and the American Community Survey (ACS), we model the probability of Spanish-English bilingualism among second- and third-generation Latinos using multilevel models with contextual measures of immigration and language use at both the neighborhood and metropolitan levels. We find evidence that U.S.-born Latinos are heavily influenced by the prevalence of Spanish use among U.S.-born Latinos at both the metropolitan and neighborhood levels. Further, the proportion of foreign-born Latinos has little effect on the native born, after controlling for Spanish use among U.S.-born Latinos. These results are a first step in understanding the link between ethnic or panethnic contexts and language practices, and also in producing a better characterization of stable bilingualism that can be tested quantitatively.
-
A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census: Full Technical Report
December 2023
Authors:
Lars Vilhuber,
John M. Abowd,
Ethan Lewis,
Nathan Goldschlag,
Robert Ashmead,
Daniel Kifer,
Philip Leclerc,
Rolando A. Rodríguez,
Tamara Adams,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Scott Moore,
Ramy N. Tadros
Working Paper Number:
CES-23-63R
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level, individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy.
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics' utility for the primary statutory use case: redrawing the boundaries of all of the nation's legislative and voting districts in compliance with the 1965 Voting Rights Act. You are reading the full technical report. For the summary paper see https://doi.org/10.1162/99608f92.4a1ebf70.
-
Mortality in a Multi-State Cohort of Former State Prisoners, 2010-2015
February 2022
Working Paper Number:
CES-22-06
Previous studies report that individuals who have been imprisoned have higher mortality rates than their demographic counterparts in the general population, particularly non-Hispanic white former prisoners. Most of these studies have been based on a single state's prison system, and the extent to which their findings can be generalized has not been established. In this study we explore the role that race/Hispanic origin, other demographic characteristics, and custodial/criminal history factors have on post-release mortality, including on the timing of deaths. We also assess whether conditional release to community supervision or reimprisonment may explain the higher post-release mortality found among non-Hispanic whites. In the second part of the analysis, we estimate standardized mortality ratios (SMRs) by sex, age group, and race/Hispanic origin, using the U.S. general population as the reference. The data come from state prison releases from the Bureau of Justice Statistics' (BJS) National Corrections Reporting Program (NCRP). The NCRP records were linked to the Census Numident to identify deaths occurring within five years of prison release. We also linked NCRP records to previous decennial censuses and survey responses to obtain self-reported race and Hispanic origin if available. We found that non-Hispanic white former prisoners were more likely to die within five years after prison release and more likely to die in the initial weeks after release compared to racial minorities and Hispanics. Reimprisonment, age at release, and a history of multiple prison terms had a similar influence on the odds of dying across all race/Hispanic origin groups. Other factors, such as the type of release and the duration of the last term in prison, were associated with higher risks of mortality for some groups but not for others.
-
Leaving Home: Modeling the Effect of Civic and Economic Structure on Individual Migration Patterns
June 2002
Working Paper Number:
CES-02-16
This research analyzes the effect of community structure upon individuals' probabilities of moving between 1985 and 1990. Using the full Census sample long form microdata for 1990, we re-allocate adult persons in 1990 to their 1985 county of residence. Then, using origin county macro-structural variables (derived from the Economic Census microdata) and individual characteristics (from Decennial Census microdata), we develop a two level hierarchical linear model. In level 1, we construct a logistic equation modeling individual probabilities of moving. In level 2, we model the contextual effects of origin community structure on these models. These contextual effects fall into two categories: 1) economic conditions that comprise the usual aggregate 'push' factors and 2) civic community factors that act to retain people in their community. Results specify the relationship between community context and individual migration patterns, and demonstrate effects of local economic structure and local civic structure on these individual probabilities. Most notably, we find that civic attributes of communities are associated with a propensity to stay in place, net of community economic factors and individual characteristics.
-
An In-Depth Examination of Requirements for Disclosure Risk Assessment
October 2023
Authors:
Ron Jarmin,
John M. Abowd,
Ian M. Schmutte,
Jerome P. Reiter,
Nathan Goldschlag,
Victoria A. Velkoff,
Michael B. Hawes,
Robert Ashmead,
Ryan Cumings-Menon,
Sallie Ann Keller,
Daniel Kifer,
Philip Leclerc,
Rolando A. Rodríguez,
Pavel Zhuravlev
Working Paper Number:
CES-23-49
The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. Following long-established precedent in economics and statistics, we argue that any proposal for quantifying disclosure risk should be based on pre-specified, objective criteria. Such criteria should be used to compare methodologies to identify those with the most desirable properties. We illustrate this approach, using simple desiderata, to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. Thus, more research is needed, but in the near term, the counterfactual approach appears best suited for privacy-utility analysis.
-
Dynamics of Race: Joining, Leaving, and Staying in the American Indian/Alaska Native Race Category between 2000 and 2010
August 2014
Working Paper Number:
carra-2014-10
Each census for decades has seen the American Indian and Alaska Native population increase substantially more than expected. Changes in racial reporting seem to play an important role in the observed net increases, though research has been hampered by data limitations. We address previously unanswerable questions about race response change among American Indian and Alaska Natives (hereafter 'American Indians') using uniquely-suited (but not nationally representative) linked data from the 2000 and 2010 decennial censuses (N = 3.1 million) and the 2006-2010 American Community Survey (N = 188,131). To what extent do people change responses to include or exclude American Indian? How are people who change responses similar to or different from those who do not? How are people who join a group similar to or different from those who leave it? We find considerable race response change by people in our data, especially by multiple-race and/or Hispanic American Indians. This turnover is hidden in cross-sectional comparisons because people joining the group are similar in number and characteristics to those who leave the group. People in our data who changed their race response to add or drop American Indian differ from those who kept the same race response in 2000 and 2010 and from those who moved between a single-race and multiple-race American Indian response. Those who consistently reported American Indian (including those who added or dropped another race response) were relatively likely to report a tribe, live in an American Indian area, report American Indian ancestry, and live in the West. There are significant differences between those who joined and those who left a specific American Indian response group, but poor model fit indicates general similarity between joiners and leavers. Response changes should be considered when conceptualizing and operationalizing 'the American Indian and Alaska Native population.'
-
Using Small-Area Estimation (SAE) to Estimate Prevalence of Child Health Outcomes at the Census Regional-, State-, and County-Levels
November 2022
Working Paper Number:
CES-22-48
In this study, we implement small-area estimation to assess the prevalence of child health outcomes at the county, state, and regional levels, using national survey data.
-
Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets
June 2024
Working Paper Number:
CES-24-27
This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third-party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file, and as such not all records will be assigned an identifier. This article is a tutorial for using the twangRDC package to generate nonresponse weights to account for non-linkage of person records across US Census Bureau datasets.