The U.S. Census Bureau's Person Identification Validation System facilitates anonymous linkages between survey and administrative records by assigning Protected Identification Keys (PIKs) to person records. While PIK assignment is generally accurate, some person records are not successfully assigned a PIK, which can lead to sample selection bias in analyses of linked data. Using the American Community Survey (ACS) and the Current Population Survey Annual Social and Economic Supplement (CPS ASEC) between 2005 and 2022, we corroborate and extend existing findings on the drivers of PIK assignment, showing that the rate of PIK assignment varies widely across socio-demographic subgroups. Using earnings as a test case, we then show that limiting a survey sample of wage earners to person records with PIKs or successful linkages to W-2 wage records tends to overestimate self-reported wage earnings, on average, indicative of linkage-induced selection bias. In a validation exercise, we demonstrate that reweighting methods, such as inverse probability weighting or entropy balancing, can mitigate this bias.
-
The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey
April 2014
Working Paper Number:
carra-2014-08
Record linkage across survey and administrative records sources can greatly enrich data and improve their quality. The linkage can reduce respondent burden and nonresponse follow-up costs. This is particularly important in an era of declining survey response rates and tight budgets. Record linkage also creates statistical bias, however. The U.S. Census Bureau links person records through its Person Identification Validation System (PVS), assigning each record a Protected Identification Key (PIK). It is not possible to reliably assign a PIK to every record, either due to insufficient identifying information or because the information does not uniquely match any of the administrative records used in the person validation process. Non-random ability to assign a PIK can potentially inject bias into statistics using linked data. This paper studies the nature of this bias using the 2009 and 2010 American Community Survey (ACS). The ACS is well-suited for this analysis, as it contains a rich set of person characteristics that can describe the bias. We estimate probit models for whether a record is assigned a PIK. The results suggest that young children, minorities, residents of group quarters, immigrants, recent movers, low-income individuals, and non-employed individuals are less likely to receive a PIK using 2009 ACS. Changes to the PVS process in 2010 significantly addressed the young children deficit, attenuated the other biases, and increased the validated records share from 88.1 to 92.6 percent (person-weighted).
View Full
Paper PDF
-
The Design of Sampling Strata for the National Household Food Acquisition and Purchase Survey
February 2025
Working Paper Number:
CES-25-13
The National Household Food Acquisition and Purchase Survey (FoodAPS), sponsored by the United States Department of Agriculture's (USDA) Economic Research Service (ERS) and Food and Nutrition Service (FNS), examines the food purchasing behavior of various subgroups of the U.S. population. These subgroups include participants in the Supplemental Nutrition Assistance Program (SNAP) and the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), as well as households who are eligible for but don't participate in these programs. Participants in these social protection programs constitute small proportions of the U.S. population; obtaining an adequate number of such participants in a survey would be challenging absent stratified sampling to target SNAP and WIC participating households. This document describes how the U.S. Census Bureau (which is planning to conduct future versions of the FoodAPS survey on behalf of USDA) created sampling strata to flag the FoodAPS targeted subpopulations using machine learning applications in linked survey and administrative data. We describe the data, modeling techniques, and how well the sampling flags target low-income households and households receiving WIC and SNAP benefits. We additionally situate these efforts in the nascent literature on the use of big data and machine learning for the improvement of survey efficiency.
View Full
Paper PDF
-
Differences in Disability Insurance Allowance Rates
August 2025
Working Paper Number:
CES-25-54
Allowance rates for disability insurance applications vary by race and ethnicity, but it is unclear to what extent these differences are artifacts of other differing socio-economic and health characteristics, or selection issues in SSA's race and ethnicity data. This paper uses the 2015 American Community Survey linked to 2015-2019 SSA administrative data to investigate DI application allowance rates among non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, non-Hispanic American Indian/Alaska Native, and Hispanic applicants aged 25-65. The analysis uses regression, propensity score matching, and inverse probability weighting to estimate differences in allowance rates among applicants who are similar on observable characteristics. Relative to raw comparisons, differences by race and ethnicity in multivariate analyses are substantially smaller in magnitude and are generally not statistically significant.
View Full
Paper PDF
-
Foreign-Born and Native-Born Migration in the U.S.: Evidence from IRS Administrative and Census Survey Records
July 2018
Working Paper Number:
carra-2018-07
This paper details efforts to link administrative records from the Internal Revenue Service (IRS) to American Community Survey (ACS) and 2010 Census microdata for the study of migration among foreign-born and native-born populations in the United States. Specifically, we (1) document our linkage strategy and methodology for inferring migration in IRS records; (2) model selection into and survival across IRS records to determine suitability for research applications; and (3) gauge the efficacy of the IRS records by demonstrating how they can be used to validate and potentially improve migration responses for native-born and foreign-born respondents in ACS microdata. Our results show little evidence of selection or survival bias in the IRS records, suggesting broad generalizability to the nation as a whole. Moreover, we find that the combined IRS 1040, 1099, and W2 records may provide important information on populations, such as the foreign-born, that may be difficult to reach with traditional Census Bureau surveys. Finally, while preliminary, the results of our comparison of IRS and ACS migration responses shows that IRS records may be useful in improving ACS migration measurement for respondents whose migration response is proxy, allocated, or imputed. Taking these results together, we discuss the potential application of our longitudinal IRS dataset to innovations in migration research on both the native-born and foreign-born populations of the United States.
View Full
Paper PDF
-
The Icing on the Cake: The Effects of Monetary Incentives on Income Data Quality in the SIPP
January 2024
Working Paper Number:
CES-24-03
Accurate measurement of key income variables plays a crucial role in economic research and policy decision-making. However, the presence of item nonresponse and measurement error in survey data can cause biased estimates. These biases can subsequently lead to sub-optimal policy decisions and inefficient allocation of resources. While there have been various studies documenting item nonresponse and measurement error in economic data, there have not been many studies investigating interventions that could reduce item nonresponse and measurement error. In our research, we investigate the impact of monetary incentives on reducing item nonresponse and measurement error for labor and investment income in the Survey of Income and Program Participation (SIPP). Our study utilizes a randomized incentive experiment in Waves 1 and 2 of the 2014 SIPP, which allows us to assess the effectiveness of incentives in reducing item nonresponse and measurement error. We find that households receiving incentives had item nonresponse rates that are 1.3 percentage points lower for earnings and 1.5 percentage points lower for Social Security income. Measurement error was 6.31 percentage points lower at the intensive margin for interest income, and 16.48 percentage points lower for dividend income compared to non-incentive recipient households. These findings provide valuable insights for data producers and users and highlight the importance of implementing strategies to improve data quality in economic research.
View Full
Paper PDF
-
Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets
June 2024
Working Paper Number:
CES-24-27
This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file and as such not all records will be assigned an identifier. This article is a tutorial for using the twangRDC to generate nonresponse weights to account for non-linkage of person records across US Census Bureau datasets.
View Full
Paper PDF
-
Self-Employment Income Reporting on Surveys
April 2023
Working Paper Number:
CES-23-19
We examine the relation between administrative income data and survey reports for self-employed and wage-earning respondents from 2000 - 2015. The self-employed report 40 percent more wages and self-employment income in the survey than in tax administrative records; this estimate nets out differences between these two sources that are also shared by wage-earners. We provide evidence that differential reporting incentives are an important explanation of the larger self-employed gap by exploiting a well-known artifact ' self-employed respondents exhibit substantial bunching at the
first EITC kink in their administrative records. We do not observe the same behavior in their survey responses even after accounting for survey measurement concerns.
View Full
Paper PDF
-
The Opportunities and Challenges of Linked IRS Administrative and Census Survey Records in the Study of Migration
July 2018
Working Paper Number:
carra-2018-06
This paper details efforts to link administrative records from the Internal Revenue Service (IRS) to American Community Survey (ACS) and 2010 Census microdata for the study of migration in the United States. Specifically, we (1) document our linkage strategy and methodology for inferring migration in IRS records; (2) model selection into and survival across IRS records to determine suitability for research applications; and (3) gauge the efficacy of the IRS records by demonstrating how they can be used to validate and potentially improve migration responses in ACS microdata. Our results show little evidence of selection or survival bias in the IRS records, suggesting broad generalizability to the nation as a whole. Moreover, we find that the combined IRS 1040, 1099, and W2 records may provide important information on populations that are hard to reach with traditional Census surveys. Finally, while preliminary, the results of our comparison of IRS and ACS migration responses shows that IRS records may be useful in improving ACS migration measurement for respondents whose migration response is proxy, allocated, or imputed. Taking these results together, we discuss the potential applications of our longitudinal IRS dataset to innovations in migration research in the United States.
View Full
Paper PDF
-
Understanding the Quality of Alternative Citizenship Data Sources for the 2020 Census
August 2018
Working Paper Number:
CES-18-38R
This paper examines the quality of citizenship data in self-reported survey responses compared to administrative records and evaluates options for constructing an accurate count of resident U.S. citizens. Person-level discrepancies between survey-collected citizenship data and administrative records are more pervasive than previously reported in studies comparing survey and administrative data aggregates. Our results imply that survey-sourced citizenship data produce significantly lower estimates of the noncitizen share of the population than would be produced from currently available administrative records; both the survey-sourced and administrative data have shortcomings that could contribute to this difference. Our evidence is consistent with noncitizen respondents misreporting their own citizenship status and failing to report that of other household members. At the same time, currently available administrative records may miss some naturalizations and capture others with a delay. The evidence in this paper also suggests that adding a citizenship question to the 2020 Census would lead to lower self-response rates in households potentially containing noncitizens, resulting in higher fieldwork costs and a lower-quality population count.
View Full
Paper PDF
-
Potential Bias When Using Administrative Data to Measure the Family Income of School-Aged Children
January 2025
Working Paper Number:
CES-25-03
Researchers and practitioners increasingly rely on administrative data sources to measure family income. However, administrative data sources are often incomplete in their coverage of the population, giving rise to potential bias in family income measures, particularly if coverage deficiencies are not well understood. We focus on the school-aged child population, due to its particular import to research and policy, and because of the unique challenges of linking children to family income information. We find that two of the most significant administrative sources of family income information that permit linking of children and parents'IRS Form 1040 and SNAP participation records'usefully complement each other, potentially reducing coverage bias when used together. In a case study considering how best to measure economic disadvantage rates in the public school student population, we demonstrate the sensitivity of family income statistics to assumptions about individuals who do not appear in administrative data sources.
View Full
Paper PDF