CREAT - Census Bureau

Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data

March 2019

Written by: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi

Working Paper Number:

CES-19-08

Abstract

This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents' misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.

Document Tags and Keywords

Keywords:

census research, survey data, survey, respondent, linked census, imputation, census employment, linkage, employed census

Tags:

Social Security Administration, Longitudinal Research Database, National Science Foundation, National Bureau of Economic Research, Longitudinal Business Database, Employer Identification Numbers, Survey of Income and Program Participation, Cornell University, Unemployment Insurance, Health and Retirement Study, Alfred P Sloan Foundation, Longitudinal Employer Household Dynamics, Census Bureau Business Register, Business Register, Quarterly Workforce Indicators, Quarterly Census of Employment and Wages, University of Michigan, Disclosure Review Board

Similar Working Papers

The 10 most similar working papers to the working paper 'Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data' are listed below in order of similarity.

Working Paper

Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning

November 2021

Authors: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi, Dawn Zinsser

Working Paper Number:

CES-21-35

This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
Working Paper

Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files

January 2017

Authors: Lars Vilhuber, Mark J. Kutzbach, Andrew S. Green

Working Paper Number:

CES-17-34

Commuting flows and workplace employment data have a wide constituency of users including urban and regional planners, social science and transportation researchers, and businesses. The U.S. Census Bureau releases two national data products that give the magnitude and characteristics of home to work flows. The American Community Survey (ACS) tabulates households' responses on employment, workplace, and commuting behavior. The Longitudinal Employer-Household Dynamics (LEHD) program tabulates administrative records on jobs in the LEHD Origin-Destination Employment Statistics (LODES). Design differences across the datasets lead to divergence in a comparable statistic: county-to-county aggregate commute flows. To understand differences in the public use data, this study compares ACS and LEHD source files, using identifying information and probabilistic matching to join person and job records. In our assessment, we compare commuting statistics for job frames linked on person, employment status, employer, and workplace and we identify person and job characteristics as well as design features of the data frames that explain aggregate differences. We find a lower rate of within-county commuting and farther commutes in LODES. We attribute these greater distances to differences in workplace reporting and to uncertainty of establishment assignments in LEHD for workers at multi-unit employers. Minor contributing factors include differences in residence location and ACS workplace edits. The results of this analysis and the data infrastructure developed will support further work to understand and enhance commuting statistics in both datasets.
Working Paper

The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers

October 2002

Authors: Lars Vilhuber, John M. Abowd

Working Paper Number:

tp-2002-17

In this paper, we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators (QWI) before and after correcting for such errors in SSN-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from 0.25 percent up to 15 percent for flow statistics, and up to 5 percent for payroll aggregates.
Working Paper

MISCLASSIFICATION IN BINARY CHOICE MODELS

May 2013

Authors: Bruce Meyer, Nikolas Mittag

Working Paper Number:

CES-13-27

We derive the asymptotic bias from misclassification of the dependent variable in binary choice models. Measurement error is necessarily non-classical in this case, which leads to bias in linear and non-linear models even if only the dependent variable is mismeasured. A Monte Carlo study and an application to food stamp receipt show that the bias formulas are useful to analyze the sensitivity of substantive conclusions, to interpret biased coefficients and imply features of the estimates that are robust to misclassification. Using administrative records linked to survey data as validation data, we examine estimators that are consistent under misclassification. They can improve estimates if their assumptions hold, but can aggravate the problem if the assumptions are invalid. The estimators differ in their robustness to such violations, which can be improved by incorporating additional information. We propose tests for the presence and nature of misclassification that can help to choose an estimator.
Working Paper

LODES Design and Methodology Report: Methodology Version 7

August 2025

Authors: Matthew R. Graham, Mark J. Kutzbach, Andrew Foote

Working Paper Number:

CES-25-52

The purpose of this report is to document the important features of Version 7 of the LEHD Origin-Destination Employment Statistics (LODES) processing system. This includes data sources, data processing methodology, confidentiality protection methodology, some quality measures, and a high-level description of the published data. The intended audience for this document includes LODES data users, Local Employment Dynamics (LED) Partnership members, U.S. Census Bureau management, program quality auditors, and current and future research and development staff members.
Working Paper

Estimating Measurement Error in SIPP Annual Job Earnings: A Comparison of Census Survey and SSA Administrative Data

September 2002

Authors: Martha Stinson

Working Paper Number:

tp-2002-24

The third chapter investigates measurement error in SIPP annual job earnings data linked to SSA administrative earnings data. The multiple earnings measures provided by the survey and administrative data enable the identification of components of true variation and variation due to measurement error. We find that 18% of the variation in SIPP annual job earnings can be attributed to measurement error. We also find that in both the SIPP and the DER, measurement error is persistent over time. A lower level of auto-correlation in the SIPP measurement error than in the economic error component leads to a lower reliability ratio of .62 for first-differenced earnings.
Working Paper

The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators

January 2006

Authors: Lars Vilhuber, John M. Abowd, Kevin L. McKinney, Bryce Stephens, Fredrik Andersson, Marc Roemer, Simon Woodcock

Working Paper Number:

tp-2006-01

The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, has built a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. Beginning in 2003 and building on this infrastructure, the Census Bureau has published the Quarterly Workforce Indicators (QWI), a new collection of data series that offers unprecedented detail on the local dynamics of labor markets. Despite the fine detail, confidentiality is maintained due to the application of state-of-the-art confidentiality protection methods. This article describes how the input files are compiled and combined to create the infrastructure files. We describe the multiple imputation methods used to impute missing data and the statistical matching techniques used to combine and edit data when a direct identifier match requires improvement. Both of these innovations are crucial to the success of the final product. Finally, we pay special attention to the details of the confidentiality protection system used to protect the identity and micro data values of the underlying entities used to form the published estimates. We provide a brief description of public-use and restricted-access data files with pointers to further documentation for researchers interested in using these data.
Working Paper

Employees in the US Nonprofit Sector

May 2026

Authors: Stephanie Karol, Jennifer Mayo

Working Paper Number:

CES-26-33

The nonprofit sector employs roughly 10% of the American workforce, making it the third largest workforce behind the retail and manufacturing sectors. Despite this, relatively little is known about its employees. This paper is the first to use comprehensive administrative tax data, covering the near-universe of workers in the US, to quantify and explain the causes of the nonprofit pay differential. Unconditionally, we find the nonprofit earnings penalty to be 12% relative to for-profit workers. Estimating an 'AKM' worker-firm job ladder model, we show that most of the penalty is causal and not driven by selection. We also document considerable heterogeneity across industries, both in terms of earnings premia/penalties and worker selection, and show that nonprofit and for-profit earnings have been converging over time.
Working Paper

Squeezing More Out of Your Data: Business Record Linkage with Python

November 2018

Authors: Nathan Goldschlag, John Cuffe

Working Paper Number:

CES-18-46

Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
Working Paper

BIAS IN FOOD STAMPS PARTICIPATION ESTIMATES IN THE PRESENCE OF MISREPORTING ERROR

March 2013

Authors: Cathleen Li

Working Paper Number:

CES-13-13

This paper focuses on how survey misreporting of food stamp receipt can bias demographic estimation of program participation. Food stamps is a federally funded program which subsidizes the nutrition of low-income households. In order to improve the reach of this program, studies on how program participation varies by demographic groups have been conducted using census data. Census data are subject to a lot of misreporting error, both underreporting and over-reporting, which can bias the estimates. The impact of misreporting error on estimate bias is examined by calculating food stamp participation rates, misreporting rates, and bias for select household characteristics (covariates).
