We describe our methodology and results from matching state Business Registration Records (BRR) to Census business data. We use data from Massachusetts and California to develop methods and preliminary results that could be used to guide matching data for additional states. We obtain matches to Census business records for 45% of the Massachusetts BRR records and 40% of the California BRR records. We find higher match rates for incorporated businesses and businesses with higher startup-quality scores as assigned in Guzman and Stern (2018). Clerical reviews show that using relatively strict matching on address is important for match accuracy, while results are less sensitive to name matching strictness. Among matched BRR records, the modal timing of the first match to the Census Business Register (BR) is in the year in which the BRR record was filed. We use two sets of software to identify matches: SAS DQ Match and a machine-learning algorithm described in Cuffe and Goldschlag (2018). We find preliminary evidence that while the ML-based method yields more match results, SAS DQ tends to result in higher accuracy rates. To conclude, we provide suggestions on how to proceed with matching other states' data in light of our findings using these two states.
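The finding that strict address matching matters more than strict name matching can be illustrated with a minimal sketch. This is not the SAS DQ Match or ML procedure used in the paper; the record layout (`id`, `name`, `address`), the `normalize` helper, and the `name_threshold` value are all assumptions made for illustration. Candidates are restricted to records with an identical normalized address, and the name comparison is deliberately looser.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace (illustrative only)."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def match_records(brr_records, census_records, name_threshold=0.8):
    """Strict (exact, normalized) address match; fuzzy name match above a threshold.

    Both inputs are lists of dicts with hypothetical keys 'id', 'name', 'address'.
    Returns a list of (brr_id, census_id, name_score) tuples.
    """
    # Index Census records by normalized address so only same-address pairs are compared.
    by_address = {}
    for rec in census_records:
        by_address.setdefault(normalize(rec["address"]), []).append(rec)

    matches = []
    for brr in brr_records:
        best, best_score = None, 0.0
        for cand in by_address.get(normalize(brr["address"]), []):
            score = SequenceMatcher(
                None, normalize(brr["name"]), normalize(cand["name"])
            ).ratio()
            if score > best_score:
                best, best_score = cand, score
        # Keep the best same-address candidate only if its name is similar enough.
        if best is not None and best_score >= name_threshold:
            matches.append((brr["id"], best["id"], best_score))
    return matches
```

Lowering `name_threshold` loosens the name criterion while leaving the address requirement unchanged, which is the kind of sensitivity check the clerical review describes.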
-
Business Dynamics of Innovating Firms: Linking U.S. Patents with Administrative Data on Workers and Firms
July 2015
Working Paper Number: CES-15-19
This paper discusses the construction of a new longitudinal database tracking inventors and patent-owning firms over time. We match granted patents between 2000 and 2011 to administrative databases of firms and workers housed at the U.S. Census Bureau. We use inventor information in addition to the patent assignee firm name to improve on previous efforts linking patents to firms. The triangulated database allows us to maximize match rates and provide validation for a large fraction of matches. In this paper, we describe the construction of the database and explore basic features of the data. We find patenting firms, particularly young patenting firms, disproportionately contribute jobs to the U.S. economy. We find patenting is a relatively rare event among small firms but that most patenting firms are nevertheless small, and that patenting is not as rare an event for the youngest firms compared to the oldest firms. While manufacturing firms are more likely to patent than firms in other sectors, we find most patenting firms are in the services and wholesale sectors. These new data are a product of collaboration within the U.S. Department of Commerce, between the U.S. Census Bureau and the U.S. Patent and Trademark Office.
-
NBER Patent Data-BR Bridge: User Guide and Technical Documentation
October 2010
Working Paper Number: CES-10-36
This note provides details about the construction of the NBER Patent Data-BR concordance, and is intended for researchers planning to use this concordance. In addition to describing the matching process used to construct the concordance, this note provides a discussion of the benefits and limitations of this concordance.
-
Improving Patent Assignee-Firm Bridge with Web Search Results
August 2022
Working Paper Number: CES-22-31
This paper constructs a patent assignee-firm longitudinal bridge between U.S. patent assignees and firms using firm-level administrative data from the U.S. Census Bureau. We match granted patents applied for between 1976 and 2016 to the U.S. firms recorded in the Longitudinal Business Database (LBD) in the Census Bureau. Building on existing algorithms in the literature, we first use the assignee name, address (state and city), and year information to link the two datasets. We then introduce a novel search-aided algorithm that significantly improves the matching results by 7% and 2.9% at the patent and the assignee level, respectively. Overall, we are able to match 88.2% and 80.1% of all U.S. patents and assignees, respectively. We contribute to the existing literature by 1) improving the match rates and quality with the web search-aided algorithm, and 2) providing the longest and longitudinally consistent crosswalk between patent assignees and LBD firms.
-
Starting Up AI
March 2024
Working Paper Number: CES-24-09
Using comprehensive administrative data on business applications over the period 2004-2023, we study emerging business ideas for developing AI technologies or producing goods or services that use, integrate, or rely on AI. The annual number of new AI business applications is stable between 2004 and 2012 but begins to rise after 2012, and increases faster from 2016 onward into the pandemic, with a large, discrete jump in 2023. The distribution of AI business applications is highly uneven across states and sectors. AI business applications have a higher likelihood of becoming employer startups and higher expected initial employment compared to other business applications. Moreover, controlling for application characteristics, employer businesses originating from AI business applications exhibit higher employment, revenue, payroll, average pay per employee, and labor share, but have similar labor productivity and lower survival rate, compared to those originating from other business applications. While these early patterns may change as the diffusion of AI progresses, the rapid rise in AI business applications, combined with their generally higher rate of transition to employers and better performance in some post-transition outcomes, suggests a small but growing contribution from these applications to business dynamism.
-
Squeezing More Out of Your Data: Business Record Linkage with Python
November 2018
Working Paper Number: CES-18-46
Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
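As a rough illustration of what "multiple string comparators" can mean in practice, the sketch below computes two similarity scores for a pair of business names using only the Python standard library. It is not MAMBA's implementation: the function names, the choice of comparators, and the normalization step are assumptions. In a full pipeline, a feature dictionary like this one would be passed to a trained classifier that decides whether the pair is a match; that step is omitted here.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace (illustrative only)."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def sequence_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def token_jaccard(a, b):
    """Word-level Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def comparator_features(a, b):
    """Collect several comparator scores for one candidate pair of names."""
    return {
        "sequence": sequence_similarity(a, b),
        "jaccard": token_jaccard(a, b),
    }
```

The two comparators disagree in useful ways: a single-character typo barely moves the character-level score but removes a whole token from the Jaccard overlap, which is one reason to keep several scores rather than pick one.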
-
When Race and Hispanic Origin Reporting are Discrepant Across Administrative Records and Third Party Sources: Exploring Methods to Assign Responses
December 2015
Working Paper Number: carra-2015-08
The U.S. Census Bureau is researching uses of administrative records and third party data in survey and decennial census operations. One potential use of administrative records is to utilize these data when race and Hispanic origin responses are missing. When federal and third party administrative records are compiled, race and Hispanic origin responses are not always the same for an individual across sources. We explore different methods to assign one race and one Hispanic response when these responses are discrepant. We also describe the characteristics of individuals with matching, non-matching, and missing race and Hispanic origin data by demographic, household, and contextual variables. We find that minorities, especially Hispanics, are more likely to have non-matching Hispanic origin and race responses in administrative records and third party data compared to the 2010 Census. Minority groups and individuals ages 0-17 are more likely to have missing race or Hispanic origin data in administrative records and third party data. Larger households tend to have more missing race data in administrative records and third party data than smaller households.
-
Methodology on Creating the U.S. Linked Retail Health Clinic (LiRHC) Database
March 2023
Working Paper Number: CES-23-10
Retail health clinics (RHCs) are a relatively new type of health care setting and understanding the role they play as a source of ambulatory care in the United States is important. To better understand these settings, a joint project by the Census Bureau and National Center for Health Statistics used data science techniques to link together data on RHCs from Convenient Care Association, County Business Patterns Business Register, and National Plan and Provider Enumeration System to create the Linked RHC (LiRHC, pronounced 'lyric') database of locations throughout the United States during the years 2018 to 2020. The matching methodology used to perform this linkage is described, as well as the benchmarking, match statistics, and manual review and quality checks used to assess the resulting matched data. The large majority (81%) of matches received quality scores at or above 75/100, and most matches were linked in the first two (of eight) matching passes, indicating high confidence in the final linked dataset. The LiRHC database contained 2,000 RHCs and found that 97% of these clinics were in metropolitan statistical areas and 950 were in the South region of the United States. Through this collaborative effort, the Census Bureau and National Center for Health Statistics strive to understand how RHCs can potentially impact population health as well as the access and provision of health care services across the nation.
-
Measuring the Dynamics of Young and Small Businesses: Integrating the Employer and Nonemployer Universes
February 2006
Working Paper Number: CES-06-04
We develop a preliminary version of an Integrated Longitudinal Business Database (ILBD) that combines administrative records and survey-based data for virtually all employer and nonemployer business units in the United States. In the process, we confront conceptual and practical issues that arise in measuring the importance and dynamic behavior of younger and smaller businesses. We also document some basic facts about younger and smaller businesses. In doing so, we exploit the ability of the ILBD to follow business transitions between employer and nonemployer status, and vice-versa. This aspect of the ILBD opens a new frontier for the study of business formation and the precursors to job creation in the U.S. economy.
-
Age and High-Growth Entrepreneurship
April 2018
Working Paper Number: CES-18-23
Many observers, and many investors, believe that young people are especially likely to produce the most successful new firms. We use administrative data at the U.S. Census Bureau to study the ages of founders of growth-oriented start-ups in the past decade. Our primary finding is that successful entrepreneurs are middle-aged, not young. The mean founder age for the 1 in 1,000 fastest growing new ventures is 45.0. The findings are broadly similar when considering high-technology sectors, entrepreneurial hubs, and successful firm exits. Prior experience in the specific industry predicts much greater rates of entrepreneurial success. These findings strongly reject common hypotheses that emphasize youth as a key trait of successful entrepreneurs.
-
Successor/Predecessor Firms
March 2002
Working Paper Number: tp-2002-04
The goal of this research was to investigate the value added from using worker flows to identify the spurious births and deaths of businesses. We identify four types of "at risk" businesses from ES202 using the successor/predecessor flag and mimic the same categories using UI wage record data. We use two critical decision rules in the analysis: a successor firm has to have at least 80% of employment coming from the donor firm and (in two of the four categories) at least 5 employees have to come from the donor firm. We examine the sensitivity of the categories based on the percentage definition, and find that the results stay very similar, with the exception of the identification of the pure successor. We examine the sensitivity based on the count threshold, and find that there are enormous differences, particularly with identifying spinoff businesses.
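The two decision rules stated above (at least 80% of the successor's employment coming from the donor firm and, in some categories, at least 5 employees coming from the donor) can be written as a short predicate. This is a minimal sketch under assumptions: the function and parameter names are hypothetical, and the abstract does not specify which two of the four categories apply the count rule, so that choice is left to the caller through `apply_count_rule`.

```python
def is_successor(flows_from_donor, successor_employment,
                 min_share=0.80, min_count=5, apply_count_rule=True):
    """Apply the share and (optionally) count thresholds to one donor/successor pair.

    flows_from_donor: number of workers at the candidate successor who came from the donor.
    successor_employment: total employment at the candidate successor.
    """
    if successor_employment <= 0:
        return False  # no employment, so no share can be computed

    # Rule 1: share of the successor's employment that came from the donor firm.
    share = flows_from_donor / successor_employment
    if share < min_share:
        return False

    # Rule 2 (only in some categories): minimum number of workers from the donor.
    if apply_count_rule and flows_from_donor < min_count:
        return False

    return True
```

Varying `min_share` and `min_count` reproduces the kind of sensitivity analysis the abstract describes; for example, a 5-person firm with 4 workers from the donor passes the share rule but fails the count rule.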