CREAT - Census Bureau

Squeezing More Out of Your Data: Business Record Linkage with Python

November 2018

Written by: Nathan Goldschlag, John Cuffe

Working Paper Number:

CES-18-46

Abstract

Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
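To make the idea of "multiple string comparators" concrete, the following is a minimal sketch and not MAMBA's actual code. It scores one candidate pair of business names with three generic comparators (a `difflib` sequence ratio, token Jaccard, and character-bigram Dice). The function names, the choice of comparators, and the sample business names are all assumptions made for illustration; the abstract does not say which comparators or which classifier MAMBA uses. The resulting feature dictionary is the kind of input a machine learning classifier could be trained on to separate matches from non-matches; that classification step is omitted here.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (real pipelines clean far more)."""
    return " ".join(name.lower().split())


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct whitespace-delimited tokens the two names have in common."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Both strings are shorter than two characters.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def similarity_features(name_a: str, name_b: str) -> dict:
    """Score one candidate pair with several comparators (hypothetical feature set)."""
    a, b = normalize(name_a), normalize(name_b)
    return {
        "sequence_ratio": SequenceMatcher(None, a, b).ratio(),
        "token_jaccard": token_jaccard(a, b),
        "bigram_dice": bigram_dice(a, b),
    }


# Made-up example names; each comparator returns a score between 0 and 1.
features = similarity_features("Acme Widget Co.", "ACME Widgets Company")
```

Using several comparators side by side lets a downstream model learn which kinds of disagreement matter (a dropped token versus a typo, for instance) rather than relying on one hand-picked threshold.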

Document Tags and Keywords

Keywords:

data, database, microdata, classification, classifying, business data, record, matched, matching, datasets, identifier, linkage

Tags:

Metropolitan Statistical Area, Standard Statistical Establishment List, Service Annual Survey, Center for Economic Studies, Employer Identification Numbers, Census Bureau Business Register, Business Register, University of Michigan

Similar Working Papers

The 10 most similar working papers to the working paper 'Squeezing More Out of Your Data: Business Record Linkage with Python' are listed below in order of similarity.

Working Paper

Automating Response Evaluation For Franchising Questions On The 2017 Economic Census

July 2019

Authors: J. Bradford Jensen, Shawn Klimek, Joseph Staudt, Yifang Wei, Lisa Singh, Andrew L. Baer

Working Paper Number:

CES-19-20

Between the 2007 and 2012 Economic Censuses (EC), the count of franchise-affiliated establishments declined by 9.8%. One reason for this decline was a reduction in resources that the Census Bureau was able to dedicate to the manual evaluation of survey responses in the franchise section of the EC. Extensive manual evaluation in 2007 resulted in many establishments, whose survey forms indicated they were not franchise-affiliated, being recoded as franchise-affiliated. No such evaluation could be undertaken in 2012. In this paper, we examine the potential of using external data harvested from the web in combination with machine learning methods to automate the process of evaluating responses to the franchise section of the 2017 EC. Our method allows us to quickly and accurately identify and recode establishments that have been mistakenly classified as not being franchise-affiliated, increasing the unweighted number of franchise-affiliated establishments in the 2017 EC by 22%-42%.
Working Paper

Redesigning the Longitudinal Business Database

May 2021

Authors: Martha Stinson, T. Kirk White, Teresa C. Fort, Christopher Goetz, Nathan Goldschlag, Melissa Chow, Elisabeth Ruth Perlman, James Lawrence

Working Paper Number:

CES-21-08

In this paper we describe the U.S. Census Bureau's redesign and production implementation of the Longitudinal Business Database (LBD) first introduced by Jarmin and Miranda (2002). The LBD is used to create the Business Dynamics Statistics (BDS), tabulations describing the entry, exit, expansion, and contraction of businesses. The new LBD and BDS also incorporate information formerly provided by the Statistics of U.S. Businesses program, which produced similar year-to-year measures of employment and establishment flows. We describe in detail how the LBD is created from curation of the input administrative data, longitudinal matching, retiming of economic census-year births and deaths, creation of vintage consistent industry codes and noise factors, and the creation and cleaning of each year of LBD data. This documentation is intended to facilitate the proper use and understanding of the data by both researchers with approved projects accessing the LBD microdata and those using the BDS tabulations.
Working Paper

Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning

November 2021

Authors: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi, Dawn Zinsser

Working Paper Number:

CES-21-35

This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
Working Paper

The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications' (CARRA) Record Linkage Software

July 2014

Authors: Deborah Wagner, Mary Layne

Working Paper Number:

carra-2014-01

The Census Bureau's Person Identification Validation System (PVS) assigns unique person identifiers to federal, commercial, census, and survey data to facilitate linkages across and within files. PVS uses probabilistic matching to assign a unique Census Bureau identifier for each person. The PVS matches incoming files to reference files created with data from the Social Security Administration (SSA) Numerical Identification file, and SSA data with addresses obtained from federal files. This paper describes the PVS methodology from editing input data to creating the final file.
Working Paper

LEHD Data Documentation LEHD-OVERVIEW-S2008-rev1

December 2011

Authors: Lars Vilhuber, Kevin L. McKinney

Working Paper Number:

CES-11-43

Working Paper

Matching State Business Registration Records to Census Business Data

January 2020

Authors: Kristin McCue, J. Daniel Kim

Working Paper Number:

CES-20-03

We describe our methodology and results from matching state Business Registration Records (BRR) to Census business data. We use data from Massachusetts and California to develop methods and preliminary results that could be used to guide matching data for additional states. We obtain matches to Census business records for 45% of the Massachusetts BRR records and 40% of the California BRR records. We find higher match rates for incorporated businesses and businesses with higher startup-quality scores as assigned in Guzman and Stern (2018). Clerical reviews show that using relatively strict matching on address is important for match accuracy, while results are less sensitive to name matching strictness. Among matched BRR records, the modal timing of the first match to the BR is in the year in which the BRR record was filed. We use two sets of software to identify matches: SAS DQ Match and a machine-learning algorithm described in Cuffe and Goldschlag (2018). We find preliminary evidence that while the ML-based method yields more match results, SAS DQ tends to result in higher accuracy rates. To conclude, we provide suggestions on how to proceed with matching other states' data in light of our findings using these two states.
Working Paper

File Matching with Faulty Continuous Matching Variables

January 2017

Authors: Gale Boyd, Jerome P. Reiter, Nicole M. Dalzell

Working Paper Number:

CES-17-45

We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.
Working Paper

Methodology on Creating the U.S. Linked Retail Health Clinic (LiRHC) Database

March 2023

Authors: Alice Zawacki, Joey Marshall, Donald Cherry, Xianghua Yin, Brian W. Ward

Working Paper Number:

CES-23-10

Retail health clinics (RHCs) are a relatively new type of health care setting, and understanding the role they play as a source of ambulatory care in the United States is important. To better understand these settings, a joint project by the Census Bureau and National Center for Health Statistics used data science techniques to link together data on RHCs from the Convenient Care Association, the County Business Patterns Business Register, and the National Plan and Provider Enumeration System to create the Linked RHC (LiRHC, pronounced 'lyric') database of locations throughout the United States during the years 2018 to 2020. The matching methodology used to perform this linkage is described, as well as the benchmarking, match statistics, and manual review and quality checks used to assess the resulting matched data. The large majority (81%) of matches received quality scores at or above 75/100, and most matches were linked in the first two (of eight) matching passes, indicating high confidence in the final linked dataset. The LiRHC database contained 2,000 RHCs; 97% of these clinics were in metropolitan statistical areas and 950 were in the South region of the United States. Through this collaborative effort, the Census Bureau and National Center for Health Statistics strive to understand how RHCs can potentially impact population health as well as the access and provision of health care services across the nation.
Working Paper

Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data

March 2019

Authors: Kristin McCue, John M. Abowd, Matthew D. Shapiro, Trivellore Raghunathan, Margaret C. Levenstein, Joelle Abramowitz, Dhiren Patki, Ann M. Rodgers, Nada Wasi

Working Paper Number:

CES-19-08

This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents' misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
Working Paper

Releasing Earnings Distributions using Differential Privacy: Disclosure Avoidance System For Post Secondary Employment Outcomes (PSEO)

April 2019

Authors: Kevin L. McKinney, Andrew Foote, Ashwin Machanavajjhala

Working Paper Number:

CES-19-13

The U.S. Census Bureau recently released data on earnings percentiles of graduates from post secondary institutions. This paper describes and evaluates the disclosure avoidance system developed for these statistics. We propose a differentially private algorithm for releasing these data based on standard differentially private building blocks, by constructing a histogram of earnings and applying the Laplace mechanism to recover a differentially-private CDF of earnings. We demonstrate that our algorithm can release earnings distributions with low error, and our algorithm out-performs prior work based on the concept of smooth sensitivity from Nissim, Raskhodnikova and Smith (2007).
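The histogram-plus-Laplace construction named in this abstract can be illustrated with the textbook version of the technique. This is a sketch under stated assumptions, not the PSEO system's algorithm: the bin edges, the privacy parameter `epsilon`, the add-or-remove-one-record neighbor definition (which gives each bin count sensitivity 1), and the function names are all hypothetical choices for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_cdf(earnings, bin_edges, epsilon, rng):
    """Noisy histogram -> CDF at each bin's upper edge (illustrative only)."""
    counts = [0] * (len(bin_edges) - 1)
    for value in earnings:
        for i in range(len(counts)):
            if bin_edges[i] <= value < bin_edges[i + 1]:
                counts[i] += 1
                break  # values outside every bin are silently dropped

    # Assumption: adding or removing one record changes one count by 1,
    # so Laplace noise with scale 1/epsilon is applied to every bin.
    scale = 1.0 / epsilon
    noisy = [max(0.0, c + laplace_noise(scale, rng)) for c in counts]

    # Clamping and normalizing are post-processing of the noisy counts.
    cumulative, running = [], 0.0
    for c in noisy:
        running += c
        cumulative.append(running)
    if running == 0.0:
        return [0.0] * len(counts)
    return [c / running for c in cumulative]


# Made-up earnings values and bin edges, with a fixed seed for repeatability.
rng = random.Random(0)
earnings = [20000 + 500 * i for i in range(100)]
cdf = private_cdf(earnings, [0, 30000, 50000, 100000], epsilon=1.0, rng=rng)
```

Percentiles can then be read off the noisy CDF by finding the first bin edge at which it crosses the desired level; finer bins give finer percentiles at the cost of more noise relative to each count.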
