CREAT - Census Bureau

Matching Addresses between Household Surveys and Commercial Data

July 2015

Written by: Quentin Brummet

Working Paper Number:

carra-2015-04

Abstract

Matching third-party data sources to household surveys can benefit household surveys in a number of ways, but the utility of these new data sources depends critically on our ability to link units between data sets. To understand this better, this report discusses modifications to the existing match process that could improve our matches. While many changes to the matching procedure produce marginal improvements in match rates, substantial increases in match rates can only be achieved by relaxing the definition of a successful match. In the end, the results show that the most important factor determining the success of matching procedures is the quality and composition of the data sets being matched.
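
To illustrate what relaxing the definition of a successful match can mean in practice, the following is a minimal Python sketch, not the procedure used in the paper. It assumes a strict match requires identical standardized addresses, while a relaxed match ignores the unit designator and accepts agreement at the building level. The abbreviation table, the helper names (`normalize`, `strip_unit`, `match_rate`), and the sample addresses are all hypothetical.

```python
import re

# Hypothetical abbreviation table; not the paper's standardization rules.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "apartment": "apt", "unit": "apt"}

# Trailing unit designator such as " apt 4b" or " #4b" (an assumed convention).
UNIT_PATTERN = re.compile(r"\s+(?:apt\s+|#\s*)\w+$")


def normalize(address):
    """Lowercase, drop punctuation except '#', and abbreviate common words."""
    cleaned = re.sub(r"[^\w\s#]", "", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())


def strip_unit(address):
    """Remove a trailing unit designator from a normalized address."""
    return UNIT_PATTERN.sub("", address)


def match_rate(survey, commercial, relaxed=False):
    """Share of survey addresses that appear in the commercial list."""
    def key(address):
        norm = normalize(address)
        return strip_unit(norm) if relaxed else norm

    commercial_keys = {key(a) for a in commercial}
    hits = sum(key(a) in commercial_keys for a in survey)
    return hits / len(survey)


# Made-up example records for illustration only.
survey = ["12 Main Street Apt 4B", "12 Main Street Apt 2A", "7 Oak Avenue", "90 Pine Rd"]
commercial = ["12 Main St.", "7 Oak Ave", "15 Elm St"]

strict_rate = match_rate(survey, commercial)                 # exact standardized address
relaxed_rate = match_rate(survey, commercial, relaxed=True)  # building-level agreement
print(strict_rate, relaxed_rate)
```

In this toy data the relaxed definition raises the match rate, but it does so by linking two distinct apartments to the same commercial record, which is the kind of trade-off a looser match definition implies.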

Document Tags and Keywords

Keywords:

analysis, data, survey data, survey, research, information, record, matched, matching, housing, associate, sample, household surveys, residence, apartment, housing survey

Tags:

Service Annual Survey, American Housing Survey, Master Address File, Center for Administrative Records Research, Center for Administrative Records Research and Applications

Similar Working Papers

The 10 most similar working papers to the working paper 'Matching Addresses between Household Surveys and Commercial Data' are listed below in order of similarity.

Working Paper

Comparison of Survey, Federal, and Commercial Address Data Quality

June 2014

Authors: Quentin Brummet

Working Paper Number:

carra-2014-06

This report summarizes matching of survey, commercial, and administrative records housing units to the Census Bureau Master Address File (MAF). We document overall MAF match rates in each data set and evaluate differences in match rates across a variety of housing characteristics. Results show that over 90 percent of records in survey data from the American Housing Survey (AHS) match to the MAF. Commercial data from CoreLogic matches at much lower rates, in part due to missing address information and poor match rates for multi-unit buildings. MAF match rates for administrative records from the Department of Housing and Urban Development are also high, and open the possibility of using this information in surveys such as the AHS.
Working Paper

The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications' (CARRA) Record Linkage Software

July 2014

Authors: Deborah Wagner, Mary Layne

Working Paper Number:

carra-2014-01

The Census Bureau's Person Identification Validation System (PVS) assigns unique person identifiers to federal, commercial, census, and survey data to facilitate linkages across and within files. PVS uses probabilistic matching to assign a unique Census Bureau identifier for each person. The PVS matches incoming files to reference files created with data from the Social Security Administration (SSA) Numerical Identification file, and SSA data with addresses obtained from federal files. This paper describes the PVS methodology from editing input data to creating the final file.
Working Paper

Comparing the 2019 American Housing Survey to Contemporary Sources of Property Tax Records: Implications for Survey Efficiency and Quality

June 2022

Authors: John Voorheis, Ariel J. Binder, Emily Molfino

Working Paper Number:

CES-22-22

Given rising nonresponse rates and concerns about respondent burden, government statistical agencies have been exploring ways to supplement household survey data collection with administrative records and other sources of third-party data. This paper evaluates the potential of property tax assessment records to improve housing surveys by comparing these records to responses from the 2019 American Housing Survey. Leveraging the U.S. Census Bureau's linkage infrastructure, we compute the fraction of AHS housing units that could be matched to a unique property parcel (coverage rate), as well as the extent to which survey and property tax data contain the same information (agreement rate). We analyze heterogeneity in coverage and agreement across states, housing characteristics, and 11 AHS items of interest to housing researchers. Our results suggest that partial replacement of AHS data with property data, targeted toward certain survey items or single-family detached homes, could reduce respondent burden without altering data quality. Further research into partial-replacement designs is needed and should proceed on an item-by-item basis. Our work can guide this research as well as those who wish to conduct independent research with property tax records that is representative of the U.S. housing stock.
Working Paper

Person Matching in Historical Files using the Census Bureau's Person Validation System

September 2014

Authors: Amy B. O'Hara, Catherine G. Massey

Working Paper Number:

carra-2014-11

The recent release of the 1940 Census manuscripts enables the creation of longitudinal data spanning the whole of the twentieth century. Linked historical and contemporary data would allow unprecedented analyses of the causes and consequences of health, demographic, and economic change. The Census Bureau is uniquely equipped to provide high quality linkages of person records across datasets. This paper summarizes the linkage techniques employed by the Census Bureau and discusses utilization of these techniques to append protected identification keys to the 1940 Census.
Working Paper

Creating Linked Historical Data: An Assessment of the Census Bureau's Ability to Assign Protected Identification Keys to the 1960 Census

September 2014

Authors: Catherine G. Massey

Working Paper Number:

carra-2014-12

In order to study social phenomena over the course of the 20th century, the Census Bureau is investigating the feasibility of digitizing historical census records and linking them to contemporary data. However, historical censuses have limited personally identifiable information available to match on. In this paper, I discuss the problems associated with matching older censuses to contemporary data files, and I describe the matching process used to match a small sample of the 1960 census to the Social Security Administration Numeric Identification System.
Working Paper

Squeezing More Out of Your Data: Business Record Linkage with Python

November 2018

Authors: Nathan Goldschlag, John Cuffe

Working Paper Number:

CES-18-46

Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
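
As a rough illustration of combining several string comparators in fuzzy matching, the sketch below scores business names with two comparators from the Python standard library and averages them. This is not MAMBA's code: MAMBA uses a machine learning algorithm to disambiguate matches, whereas the simple average, the 0.6 threshold, the function names, and the sample names here are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Word-level Jaccard similarity: shared tokens over all tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_score(a, b):
    """Unweighted average of the two comparators (a stand-in for a trained model)."""
    return (sequence_similarity(a, b) + token_jaccard(a, b)) / 2


def best_match(name, candidates, threshold=0.6):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    score, candidate = max((combined_score(name, c), c) for c in candidates)
    return candidate if score >= threshold else None


# Made-up register entries for illustration only.
register = ["Acme Hardware Inc", "Apex Bakery LLC", "Zenith Plumbing Co"]
print(best_match("ACME Hardware Incorporated", register))
print(best_match("Totally Unrelated Name", register))
```

Using more than one comparator helps because each captures a different kind of imprecision: the character-level score tolerates abbreviations and misspellings, while the token-level score is insensitive to word order.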
Working Paper

When Race and Hispanic Origin Reporting are Discrepant Across Administrative Records and Third Party Sources: Exploring Methods to Assign Responses

December 2015

Authors: James M. Noon, Sharon R. Ennis, Ellen Zapata, Sonya R. Porter

Working Paper Number:

carra-2015-08

The U.S. Census Bureau is researching uses of administrative records and third party data in survey and decennial census operations. One potential use of administrative records is to utilize these data when race and Hispanic origin responses are missing. When federal and third party administrative records are compiled, race and Hispanic origin responses are not always the same for an individual across sources. We explore different methods to assign one race and one Hispanic response when these responses are discrepant. We also describe the characteristics of individuals with matching, non-matching, and missing race and Hispanic origin data by demographic, household, and contextual variables. We find that minorities, especially Hispanics, are more likely to have non-matching Hispanic origin and race responses in administrative records and third party data compared to the 2010 Census. Minority groups and individuals ages 0-17 are more likely to have missing race or Hispanic origin data in administrative records and third party data. Larger households tend to have more missing race data in administrative records and third party data than smaller households.
Working Paper

Estimating Record Linkage False Match Rate for the Person Identification Validation System

July 2014

Authors: Deborah Wagner, Mary Layne, Cynthia Rothhaas

Working Paper Number:

carra-2014-02

The Census Bureau Person Identification Validation System (PVS) assigns unique person identifiers to federal, commercial, census, and survey data to facilitate linkages across files. PVS uses probabilistic matching to assign a unique Census Bureau identifier for each person. This paper presents a method to measure the false match rate in PVS following the approach of Belin and Rubin (1995). The Belin and Rubin methodology requires truth data to estimate a mixture model. The parameters from the mixture model are used to obtain point estimates of the false match rate for each of the PVS search modules. The truth data requirement is satisfied by the unique access the Census Bureau has to high quality name, date of birth, address and Social Security (SSN) data. Truth data are quickly created for the Belin and Rubin model and do not involve a clerical review process. These truth data are used to create estimates for the Belin and Rubin parameters, making the approach more feasible. Both observed and modeled false match rates are computed for all search modules in federal administrative records data and commercial data.
Working Paper

Where Are Your Parents? Exploring Potential Bias in Administrative Records on Children

March 2024

Authors: Jennifer Bernard, Kelsey Drotning, Katie R. Genadek

Working Paper Number:

CES-24-18

This paper examines potential bias in the Census Household Composition Key's (CHCK) probabilistic parent-child linkages. By linking CHCK data to the American Community Survey (ACS), we reveal disparities in parent-child linkages among specific demographic groups and find that characteristics of children that can and cannot be linked to the CHCK vary considerably from the larger population. In particular, we find that children from low-income, less educated households and of Hispanic origin are less likely to be linked to a mother or a father in the CHCK. We also highlight some data considerations when using the CHCK.
Working Paper

Methodology on Creating the U.S. Linked Retail Health Clinic (LiRHC) Database

March 2023

Authors: Alice Zawacki, Joey Marshall, Donald Cherry, Xianghua Yin, Brian W. Ward

Working Paper Number:

CES-23-10

Retail health clinics (RHCs) are a relatively new type of health care setting and understanding the role they play as a source of ambulatory care in the United States is important. To better understand these settings, a joint project by the Census Bureau and National Center for Health Statistics used data science techniques to link together data on RHCs from Convenient Care Association, County Business Patterns Business Register, and National Plan and Provider Enumeration System to create the Linked RHC (LiRHC, pronounced 'lyric') database of locations throughout the United States during the years 2018 to 2020. The matching methodology used to perform this linkage is described, as well as the benchmarking, match statistics, and manual review and quality checks used to assess the resulting matched data. The large majority (81%) of matches received quality scores at or above 75/100, and most matches were linked in the first two (of eight) matching passes, indicating high confidence in the final linked dataset. The LiRHC database contained 2,000 RHCs; 97% of these clinics were in metropolitan statistical areas and 950 were in the South region of the United States. Through this collaborative effort, the Census Bureau and National Center for Health Statistics strive to understand how RHCs can potentially impact population health as well as the access and provision of health care services across the nation.
