-
Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets
June 2024
Working Paper Number: CES-24-27
This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third-party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file, so not all records are assigned an identifier. This article is a tutorial on using twangRDC to generate nonresponse weights that account for non-linkage of person records across US Census Bureau datasets.
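To make the weighting idea concrete, here is a minimal Python sketch. It is not the twangRDC API (twangRDC is an R package that estimates linkage propensities with gradient boosting); as a simpler stand-in, it estimates the linkage rate within covariate cells and weights each linked record by the inverse of that rate. The field names (`age_group`, `region`, `linked`) and the function `linkage_weights` are hypothetical.

```python
from collections import defaultdict


def linkage_weights(records, cell_keys):
    """Weight linked records by 1 / (linkage rate in their covariate cell).

    Unlinked records get weight 0.0. This is a cell-based stand-in for a
    model-based propensity estimate, not the gradient-boosting approach.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        totals[cell] += 1
        if rec["linked"]:
            linked[cell] += 1

    weights = []
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        if rec["linked"]:
            # Inverse of the estimated linkage probability for this cell.
            weights.append(totals[cell] / linked[cell])
        else:
            weights.append(0.0)
    return weights


# Hypothetical toy data: one of three young Southern records fails to link.
records = [
    {"age_group": "18-34", "region": "South", "linked": True},
    {"age_group": "18-34", "region": "South", "linked": False},
    {"age_group": "18-34", "region": "South", "linked": True},
    {"age_group": "65+", "region": "West", "linked": True},
    {"age_group": "65+", "region": "West", "linked": True},
]
weights = linkage_weights(records, ["age_group", "region"])
```

In this toy example the linked records in the cell with a missing link are weighted up so that the weighted linked sample reproduces the size of the full sample. A boosted model would replace the cell rates with smoother, covariate-rich propensity estimates.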
-
Revisiting Methods to Assign Responses when Race and Hispanic Origin Reporting are Discrepant Across Administrative Records and Third Party Sources
May 2024
Working Paper Number: CES-24-26
The Best Race and Ethnicity Administrative Records Composite file ('Best Race file') is a composite file that combines Census, federal, and Third Party Data (TPD) sources and applies business rules to assign race and ethnicity values to person records. The first version of the Best Race administrative records composite was constructed in 2015 and subsequently updated each year to include more recent vintages, when available, of the data sources originally included in the composite file. Where updates were available for data sources, the most recent information for persons was retained, and the business rules were reapplied to assign a single race and single Hispanic origin value to each person record. The majority of person records on the Best Race file have consistent race and ethnicity information across data sources. Where there are discrepancies in responses across data sources, we apply a series of business rules to assign a single race and ethnicity to each record. To improve the quality of the Best Race administrative records composite, we have begun revising the business rules, which were developed several years ago. This paper discusses the original business rules as well as the implemented changes and their impact on the composite file.
-
Mixed-Effects Methods For Search and Matching Research
September 2023
Working Paper Number: CES-23-43
We study mixed-effects methods for estimating equations containing person and firm effects. In economics such models are usually estimated using fixed-effects methods. Recent enhancements to those fixed-effects methods include corrections to the bias in estimating the covariance matrix of the person and firm effects, which we also consider.
-
Methodology on Creating the U.S. Linked Retail Health Clinic (LiRHC) Database
March 2023
Working Paper Number: CES-23-10
Retail health clinics (RHCs) are a relatively new type of health care setting, and understanding the role they play as a source of ambulatory care in the United States is important. To better understand these settings, a joint project by the Census Bureau and National Center for Health Statistics used data science techniques to link data on RHCs from the Convenient Care Association, the County Business Patterns Business Register, and the National Plan and Provider Enumeration System to create the Linked RHC (LiRHC, pronounced 'lyric') database of locations throughout the United States during the years 2018 to 2020. The matching methodology used to perform this linkage is described, as well as the benchmarking, match statistics, and manual review and quality checks used to assess the resulting matched data. The large majority (81%) of matches received quality scores at or above 75/100, and most matches were linked in the first two (of eight) matching passes, indicating high confidence in the final linked dataset. The LiRHC database contained 2,000 RHCs, of which 97% were in metropolitan statistical areas and 950 were in the South region of the United States. Through this collaborative effort, the Census Bureau and National Center for Health Statistics strive to understand how RHCs can potentially impact population health as well as the access and provision of health care services across the nation.
-
Improving Patent Assignee-Firm Bridge with Web Search Results
August 2022
Working Paper Number: CES-22-31
This paper constructs a longitudinal bridge between U.S. patent assignees and firms using firm-level administrative data from the U.S. Census Bureau. We match granted patents applied for between 1976 and 2016 to the U.S. firms recorded in the Census Bureau's Longitudinal Business Database (LBD). Building on existing algorithms in the literature, we first use the assignee name, address (state and city), and year information to link the two datasets. We then introduce a novel search-aided algorithm that significantly improves the matching results by 7% and 2.9% at the patent and the assignee level, respectively. Overall, we are able to match 88.2% and 80.1% of all U.S. patents and assignees, respectively. We contribute to the existing literature by 1) improving the match rates and quality with the web search-aided algorithm, and 2) providing the longest and longitudinally consistent crosswalk between patent assignees and LBD firms.
-
The impact of manufacturing credentials on earnings and the probability of employment
May 2022
Working Paper Number: CES-22-15
This paper examines the labor market returns to earning industry-certified credentials in the manufacturing sector. Specifically, we estimate the impact of a manufacturing credential on wages, the probability of employment, and the probability of employment in the manufacturing sector after credential attainment. We link students who earned manufacturing credentials to their enrollment and completion records, and then further link them to their IRS tax records (Forms W-2 and 1040) for earnings and employment and to the American Community Survey and decennial census for demographic information. We present earnings trajectories for workers with credentials by type of credential, industry of employment, age, race and ethnicity, gender, and state. To obtain a more causal estimate of the impact of a credential on earnings, we implement a coarsened exact matching strategy to compare outcomes between otherwise similar people with and without a manufacturing credential. We find that the attainment of a manufacturing industry credential is associated with higher earnings and a higher likelihood of labor market participation when we compare attainers to a group of non-attainers who are otherwise similar.
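Coarsened exact matching can be illustrated with a short Python sketch. The covariates, bin widths, field names, and toy figures below are hypothetical and are not taken from the paper. The idea shown is only the general technique: coarsen covariates into bins, keep strata that contain both credential holders and non-holders, and average the within-stratum earnings gaps.

```python
from collections import defaultdict


def coarsen(person):
    """Map a person to a coarse stratum (hypothetical bins)."""
    age_bin = min(person["age"] // 10, 6)  # decades, top-coded at 60+
    return (age_bin, person["state"])


def cem_effect(people):
    """Treated-minus-control mean earnings gap across matched strata.

    Strata lacking either group are dropped; remaining gaps are weighted
    by the number of treated (credentialed) people in each stratum.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for p in people:
        strata[coarsen(p)][p["credential"]].append(p["earnings"])

    total_gap, n_treated = 0.0, 0
    for groups in strata.values():
        treated, control = groups[True], groups[False]
        if not treated or not control:
            continue  # no exact match on the coarsened covariates
        gap = sum(treated) / len(treated) - sum(control) / len(control)
        total_gap += gap * len(treated)
        n_treated += len(treated)
    return total_gap / n_treated if n_treated else None


# Hypothetical toy data; the Michigan record has no comparison group.
people = [
    {"age": 23, "state": "OH", "credential": True, "earnings": 42000},
    {"age": 27, "state": "OH", "credential": False, "earnings": 38000},
    {"age": 29, "state": "OH", "credential": False, "earnings": 40000},
    {"age": 45, "state": "TX", "credential": True, "earnings": 55000},
    {"age": 41, "state": "TX", "credential": False, "earnings": 50000},
    {"age": 52, "state": "MI", "credential": True, "earnings": 60000},
]
effect = cem_effect(people)
```

Coarser bins retain more matched strata at the cost of comparing less similar people, which is the main tuning decision in this kind of design.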
-
Identifying U.S. Merchandise Traders: Integrating Customs Transactions with Business Administrative Data
September 2020
Working Paper Number: CES-20-28
This paper describes the construction of the Longitudinal Firm Trade Transactions Database (LFTTD), which enables the identification of merchandise traders - exporters and importers - in the U.S. Census Bureau's Business Register (BR). The LFTTD links merchandise export and import transactions from customs declaration forms to the BR from 1992 through the present. We employ a combination of deterministic and probabilistic matching algorithms to assign a unique firm identifier in the BR to a merchandise export or import transaction record. On average, we match 89 percent of export and import values to a firm identifier. In 1992, we match 79 (88) percent of export (import) value; in 2017, we match 92 (96) percent of export (import) value. Trade transactions in year t are matched to years between 1976 and t+1 of the BR. On average, 94 percent of the trade value matches to a firm in year t of the BR. The LFTTD provides the most comprehensive identification of, and the foundation for the analysis of, goods trading firms in the U.S. economy.
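A two-pass linkage of this general shape can be sketched in Python. This is not the LFTTD procedure: the record layout, the `ein` key, the state blocking, and the 0.85 threshold are assumptions, and the second pass uses a simple string-similarity cutoff as a stand-in for a true probabilistic match model.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Uppercase and strip punctuation so name variants compare cleanly."""
    return " ".join(name.upper().replace(".", "").replace(",", "").split())


def match_transactions(transactions, register, threshold=0.85):
    """Return {tx_id: (firm_id, method)} using an exact pass, then a fuzzy pass."""
    by_ein = {firm["ein"]: firm["firm_id"] for firm in register}
    results = {}
    for tx in transactions:
        # Pass 1: deterministic match on a shared identifier.
        ein = tx.get("ein")
        firm_id = by_ein.get(ein) if ein else None
        method = "deterministic"
        if firm_id is None:
            # Pass 2: best name similarity among firms in the same state.
            best_score, best_firm = 0.0, None
            for firm in register:
                if firm["state"] != tx["state"]:
                    continue
                score = SequenceMatcher(
                    None, normalize(tx["name"]), normalize(firm["name"])
                ).ratio()
                if score > best_score:
                    best_score, best_firm = score, firm["firm_id"]
            firm_id = best_firm if best_score >= threshold else None
            method = "fuzzy" if firm_id else "unmatched"
        results[tx["tx_id"]] = (firm_id, method)
    return results


# Hypothetical toy records.
register = [
    {"firm_id": "F1", "ein": "11-1111111", "name": "Acme Exports Inc.", "state": "NY"},
    {"firm_id": "F2", "ein": "22-2222222", "name": "Globex Trading LLC", "state": "CA"},
]
transactions = [
    {"tx_id": "T1", "ein": "11-1111111", "name": "ACME EXPORTS", "state": "NY"},
    {"tx_id": "T2", "ein": None, "name": "Globex Trading L.L.C.", "state": "CA"},
    {"tx_id": "T3", "ein": None, "name": "Unrelated Widgets", "state": "CA"},
]
results = match_transactions(transactions, register)
```

Running the exact pass first keeps high-confidence links separate from similarity-based ones, so match rates can be reported by method.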
-
Matching State Business Registration Records to Census Business Data
January 2020
Working Paper Number: CES-20-03
We describe our methodology and results from matching state Business Registration Records (BRR) to Census business data. We use data from Massachusetts and California to develop methods and preliminary results that could be used to guide matching data for additional states. We obtain matches to Census business records for 45% of the Massachusetts BRR records and 40% of the California BRR records. We find higher match rates for incorporated businesses and businesses with higher startup-quality scores as assigned in Guzman and Stern (2018). Clerical reviews show that using relatively strict matching on address is important for match accuracy, while results are less sensitive to name matching strictness. Among matched BRR records, the modal timing of the first match to the BR is in the year in which the BRR record was filed. We use two sets of software to identify matches: SAS DQ Match and a machine-learning algorithm described in Cuffe and Goldschlag (2018). We find preliminary evidence that while the ML-based method yields more match results, SAS DQ tends to result in higher accuracy rates. To conclude, we provide suggestions on how to proceed with matching other states' data in light of our findings using these two states.
-
Squeezing More Out of Your Data: Business Record Linkage with Python
November 2018
Working Paper Number: CES-18-46
Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records, using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
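The idea of combining several string comparators can be sketched in a few lines of Python. This is not MAMBA's code or interface: the three comparators and all function names are illustrative choices, and an unweighted mean stands in for the machine learning model that would normally learn how to combine the features from labeled matches.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity ratio from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Share of distinct words the two names have in common."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def prefix_similarity(a, b, n=4):
    """Share of the first n character positions that agree."""
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n


def similarity_features(a, b):
    """Feature vector of comparator scores for one candidate pair."""
    a, b = a.lower(), b.lower()
    return [char_similarity(a, b), token_jaccard(a, b), prefix_similarity(a, b)]


def match_score(a, b):
    """Unweighted mean of the comparators (stand-in for a trained model)."""
    feats = similarity_features(a, b)
    return sum(feats) / len(feats)


# Hypothetical business names.
close = match_score("Acme Hardware Inc", "ACME Hardware")
far = match_score("Acme Hardware Inc", "Zenith Foods")
```

In a fuller pipeline the feature vectors would be passed to a classifier trained on clerically reviewed pairs, so that the data rather than a fixed average decides how much each comparator counts.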
-
Who Moves Up the Job Ladder?
January 2017
Working Paper Number: CES-17-63
In this paper, we use linked employer-employee data to study the reallocation of heterogeneous workers between heterogeneous firms. We build on recent evidence of a cyclical job ladder that reallocates workers from low productivity to high productivity firms through job-to-job moves. We turn to the question of who moves up this job ladder, and the implications for worker sorting across firms. Not surprisingly, we find that job-to-job moves reallocate younger workers disproportionately from less productive to more productive firms. More surprisingly, especially in the context of the recent literature on assortative matching with on-the-job search, we find that job-to-job moves disproportionately reallocate less-educated workers up the job ladder. This finding holds even though we find that more educated workers are more likely to work with more productive firms. We find that while highly educated workers are less likely to match to low productivity firms, they are also less likely to separate from them, with less-educated workers both more likely to separate to a better employer in expansions and to be shaken off the ladder (separate to nonemployment) in contractions. Our findings underscore the cyclical role job-to-job moves play in matching workers to better paying employers.