CREAT - Census Bureau

Working Paper

Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets

June 2024

Authors: Narayan Sastry, Todd Gardner, Matthew Cefalu, John Sullivan, Elizabeth Fussell

Working Paper Number:

CES-24-27

This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third-party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file, and as such not all records will be assigned an identifier. This article is a tutorial on using twangRDC to generate nonresponse weights that account for non-linkage of person records across US Census Bureau datasets.
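The weighting idea in this abstract can be sketched roughly as follows. This is not the twangRDC interface (an R package whose functions are not reproduced here), and it swaps the gradient-boosted propensity model for plain cell-level linkage rates to stay self-contained. The record fields (`age_group`, `linked`) and the function name `nonlinkage_weights` are hypothetical.

```python
from collections import defaultdict

def nonlinkage_weights(records, cell_key):
    """Weight each linked record by 1 / (linkage rate of its cell).

    Unlinked records get weight 0.0, so the linked records in a cell
    stand in for every record in that cell.
    """
    totals = defaultdict(int)   # all records per cell
    linked = defaultdict(int)   # linked records per cell
    for rec in records:
        cell = cell_key(rec)
        totals[cell] += 1
        if rec["linked"]:
            linked[cell] += 1

    weights = []
    for rec in records:
        cell = cell_key(rec)
        if rec["linked"]:
            # Inverse of the estimated linkage probability in this cell.
            weights.append(totals[cell] / linked[cell])
        else:
            weights.append(0.0)
    return weights

# Toy data (illustrative only): half of the "0-17" cell fails to link.
records = [
    {"age_group": "0-17", "linked": True},
    {"age_group": "0-17", "linked": False},
    {"age_group": "18+", "linked": True},
    {"age_group": "18+", "linked": True},
]
weights = nonlinkage_weights(records, lambda r: r["age_group"])
```

In this toy case the weights of the linked records sum to the full record count, which is the property a nonresponse-style adjustment aims for; a boosted model would play the same role with many covariates instead of a single cell variable.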
Working Paper

Where Are Your Parents? Exploring Potential Bias in Administrative Records on Children

March 2024

Authors: Jennifer Bernard, Kelsey Drotning, Katie R. Genadek

Working Paper Number:

CES-24-18

This paper examines potential bias in the Census Household Composition Key's (CHCK) probabilistic parent-child linkages. By linking CHCK data to the American Community Survey (ACS), we reveal disparities in parent-child linkages among specific demographic groups and find that characteristics of children that can and cannot be linked to the CHCK vary considerably from the larger population. In particular, we find that children from low-income, less educated households and of Hispanic origin are less likely to be linked to a mother or a father in the CHCK. We also highlight some data considerations when using the CHCK.
Working Paper

Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning

November 2021

Authors: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi, Dawn Zinsser

Working Paper Number:

CES-21-35

This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
Working Paper

Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data

March 2019

Authors: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi

Working Paper Number:

CES-19-08

This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents' misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
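To make the multiple-imputation step more concrete, the sketch below applies Rubin's combining rules, a standard way to pool estimates across imputed datasets so that linkage uncertainty shows up in the final variance. The abstract does not state the exact procedure used, so this is an assumption; the function name `combine_rubin` and the toy numbers are hypothetical.

```python
from statistics import mean, variance

def combine_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances (Rubin's rules).

    estimates: point estimate from each imputed (linked) dataset
    variances: squared standard error from each imputed dataset
    Requires at least two imputations.
    """
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Toy inputs (illustrative only): three imputed linkages of the same data.
pooled_estimate, total_variance = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the uncertainty from the linkage step: if every imputed link set gave the same estimate, the total variance would collapse to the within-imputation average.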
Working Paper

Squeezing More Out of Your Data: Business Record Linkage with Python

November 2018

Authors: Nathan Goldschlag, John Cuffe

Working Paper Number:

CES-18-46

Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
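As a rough illustration of what "multiple string comparators" can look like, the sketch below computes a few similarity features for a pair of business names using only the Python standard library. It is not MAMBA's code or interface; the comparators chosen, the normalization, and the function names are assumptions. In a full pipeline, features like these would feed a classifier that decides which candidate pairs are matches.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

def comparator_features(a, b):
    """Return several similarity scores for two business names."""
    a_n, b_n = normalize(a), normalize(b)
    # Character-level similarity in [0, 1].
    sequence_ratio = SequenceMatcher(None, a_n, b_n).ratio()
    # Token-level overlap: shared words / all distinct words.
    tokens_a, tokens_b = set(a_n.split()), set(b_n.split())
    union = tokens_a | tokens_b
    token_jaccard = len(tokens_a & tokens_b) / len(union) if union else 0.0
    # Exact agreement after normalization.
    exact = 1.0 if a_n == b_n else 0.0
    return {
        "sequence_ratio": sequence_ratio,
        "token_jaccard": token_jaccard,
        "exact": exact,
    }

# Hypothetical name pair (illustrative only).
features = comparator_features("Acme Widgets, Inc.", "ACME Widget Inc")
```

Here the character-level score is high while the token-level score is only moderate, which shows why combining several comparators can be more informative than relying on any single one.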
Working Paper

File Matching with Faulty Continuous Matching Variables

January 2017

Authors: Gale Boyd, Jerome P. Reiter, Nicole M. Dalzell

Working Paper Number:

CES-17-45

We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.
Working Paper

Playing with Matches: An Assessment of Accuracy in Linked Historical Data

June 2016

Authors: Catherine G. Massey

Working Paper Number:

carra-2016-05

This paper evaluates linkage quality achieved by various record linkage techniques used in historical demography. I create benchmark, or truth, data by linking the 2005 Current Population Survey Annual Social and Economic Supplement to the Social Security Administration's Numeric Identification System by Social Security Number. By comparing simulated linkages to the benchmark data, I examine the value added (in terms of number and quality of links) from incorporating text-string comparators, adjusting age, and using a probabilistic matching algorithm. I find that text-string comparators and probabilistic approaches are useful for increasing the linkage rate, but use of text-string comparators may decrease accuracy in some cases. Overall, probabilistic matching offers the best balance between linkage rates and accuracy.
Working Paper

Assessing Coverage and Quality of the 2007 Prototype Census Kidlink Database

September 2015

Authors: Adela Luque, Deborah Wagner

Working Paper Number:

carra-2015-07

The Census Bureau is conducting research to expand the use of administrative records data in censuses and surveys to decrease respondent burden and reduce costs while improving data quality. Much of this research (e.g., Rastogi and O'Hara (2012), Luque and Bhaskar (2014)) hinges on the ability to integrate multiple data sources by linking individuals across files. One of the Census Bureau's record linkage methodologies for data integration is the Person Identification Validation System or PVS. PVS assigns anonymous and unique IDs (Protected Identification Keys or PIKs) that serve as linkage keys across files. Prior research showed that integrating 'known associates' information into PVS's reference files could potentially enhance PVS's PIK assignment rates. The term 'known associates' refers to people that are likely to be associated with each other because of a known common link (such as family relationships or people sharing a common address), and thus, to be observed together in different files. One of the results from this prior research was the creation of the 2007 Census Kidlink file, a child-level file linking a child's Social Security Number (SSN) record to the SSN of those identified as the child's parents. In this paper, we examine to what extent the 2007 Census Kidlink methodology was able to link parents' SSNs to children's SSN records, and also evaluate the quality of those links. We find that in approximately 80 percent of cases, at least one parent was linked to the child's record. Younger children and noncitizens have a higher percentage of cases where neither parent could be linked to the child. Using 2007 tax data as a benchmark, our quality evaluation results indicate that in at least 90 percent of the cases, the parent-child link agreed with those found in the tax data.
Based on our findings, we propose improvements to the 2007 Kidlink methodology to increase child-parent links, and discuss how the creation of the file could be operationalized moving forward.
Working Paper

Person Matching in Historical Files using the Census Bureau's Person Validation System

September 2014

Authors: Amy B. O'Hara, Catherine G. Massey

Working Paper Number:

carra-2014-11

The recent release of the 1940 Census manuscripts enables the creation of longitudinal data spanning the whole of the twentieth century. Linked historical and contemporary data would allow unprecedented analyses of the causes and consequences of health, demographic, and economic change. The Census Bureau is uniquely equipped to provide high quality linkages of person records across datasets. This paper summarizes the linkage techniques employed by the Census Bureau and discusses utilization of these techniques to append protected identification keys to the 1940 Census.
Working Paper

Estimating Record Linkage False Match Rate for the Person Identification Validation System

July 2014

Authors: Deborah Wagner, Mary Layne, Cynthia Rothhaas

Working Paper Number:

carra-2014-02

The Census Bureau Person Identification Validation System (PVS) assigns unique person identifiers to federal, commercial, census, and survey data to facilitate linkages across files. PVS uses probabilistic matching to assign a unique Census Bureau identifier for each person. This paper presents a method to measure the false match rate in PVS following the approach of Belin and Rubin (1995). The Belin and Rubin methodology requires truth data to estimate a mixture model. The parameters from the mixture model are used to obtain point estimates of the false match rate for each of the PVS search modules. The truth data requirement is satisfied by the unique access the Census Bureau has to high quality name, date of birth, address, and Social Security Number (SSN) data. Truth data are quickly created for the Belin and Rubin model and do not involve a clerical review process. These truth data are used to create estimates for the Belin and Rubin parameters, making the approach more feasible. Both observed and modeled false match rates are computed for all search modules in federal administrative records data and commercial data.


Papers Containing Keyword(s): 'linkage'


Viewing papers 1 through 10 of 12
