CREAT: Census Research Exploration and Analysis Tool

Papers Containing Keyword(s): 'classifying'

Viewing papers 1 through 10 of 10


  • Working Paper

    Exploring New Ways to Classify Industries for Energy Analysis and Modeling

    November 2022

    Working Paper Number: CES-22-49

    Combustion, other emitting processes, and fossil energy use outside the power sector have become urgent concerns given the United States' commitment to achieving net-zero greenhouse gas emissions by 2050. Industry is an important end user of energy and relies on fossil fuels used directly for process heating and as feedstocks for a diverse range of applications. Fuel and energy use by industry is heterogeneous, meaning even a single product group can vary broadly in its production routes and associated energy use. In the United States, the North American Industry Classification System (NAICS) serves as the standard for statistical data collection and reporting. In turn, data based on NAICS are the foundation of most United States energy modeling. Thus, the effectiveness of NAICS at representing energy use is a limiting condition for current expansive planning to improve energy efficiency and alternatives to fossil fuels in industry. Facility-level data could be used to build more detail into heterogeneous sectors, and thus supplement data from Bureau of the Census and U.S. Energy Information Administration reporting at NAICS code levels, but such data are scarce. This work explores alternative classification schemes for industry based on energy use characteristics and validates an approach to estimate facility-level energy use from publicly available greenhouse gas emissions data from the U.S. Environmental Protection Agency (EPA). The approaches in this study can facilitate understanding of current, as well as possible future, energy demand. First, current approaches to the construction of industrial taxonomies are summarized along with their usefulness for industrial energy modeling. Unsupervised machine learning techniques are then used to detect clusters in data reported from the U.S. Department of Energy's Industrial Assessment Center program.
    Clusters of Industrial Assessment Center (IAC) data show similar levels of correlation between energy use and explanatory variables as three-digit NAICS codes. Interestingly, the clusters each include a large cross section of NAICS codes, which lends additional support to the idea that NAICS may not be particularly suited for correlation between energy use and the variables studied. Fewer clusters are needed for the same level of correlation as shown in NAICS codes. Initial assessment shows a reasonable level of separation using support vector machines with higher than 80% accuracy, so machine learning approaches may be promising for further analysis. The IAC data focus on small and medium-sized facilities and are biased toward higher energy users for a given facility type. Cladistics, an approach for classification developed in biology, is adapted to energy and process characteristics of industries. Cladistics applied to industrial systems seeks to understand the progression of organizations and technology as a type of evolution, wherein traits are inherited from previous systems but evolve due to the emergence of inventions and variations and a selection process driven by adaptation to pressures and favorable outcomes. A cladogram is presented for evolutionary directions in the iron and steel sector. Cladograms are a promising tool for constructing scenarios and summarizing directions of sectoral innovation. The cladogram of iron and steel is based on the drivers of energy use in the sector. Phylogenetic inference is similar to machine learning approaches as it is based on a machine-led search of the solution space, thereby avoiding some of the subjectivity of other classification systems. Our prototype approach for constructing an industry cladogram is based on process characteristics according to the innovation framework derived from Schumpeter to capture evolution in a given sector.
    The resulting cladogram represents a snapshot in time based on detailed study of process characteristics. This work could be an important tool for the design of scenarios for more detailed modeling. Cladograms reveal groupings of emerging or dominant processes and their implications in a way that may be helpful for policymakers and entrepreneurs, allowing them to see the larger picture, other good ideas, or competitors. Constructing a cladogram could be a good first step in the analysis of many industries (e.g., nitrogenous fertilizer production, ethyl alcohol manufacturing), to understand their heterogeneity, emerging trends, and coherent groupings of related innovations. Finally, validation is performed for facility-level energy estimates from the EPA Greenhouse Gas Reporting Program. Facility-level data availability continues to be a major challenge for industrial modeling. The method outlined by McMillan et al. (2016) and McMillan and Ruth (2019) allows facility-level energy use to be estimated from mandatory greenhouse gas reporting. The validation provided here is an important step for further use of these data for industrial energy modeling.
  • Working Paper

    Squeezing More Out of Your Data: Business Record Linkage with Python

    November 2018

    Working Paper Number: CES-18-46

    Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
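The idea of scoring a candidate pair with several string comparators can be illustrated with a minimal sketch. This is not MAMBA's code or interface: the function names, the three comparators, and the 0.75 threshold are all assumptions chosen for illustration, and the simple averaging rule only stands in for the machine learning step the abstract mentions.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two names have in common."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def comparator_scores(a: str, b: str) -> dict:
    """Score one candidate pair with several (hypothetical) string comparators."""
    na, nb = normalize(a), normalize(b)
    return {
        "sequence_ratio": SequenceMatcher(None, na, nb).ratio(),
        "token_jaccard": token_jaccard(na, nb),
        "same_first_token": float(na.split()[:1] == nb.split()[:1]),
    }


def is_match(scores: dict, threshold: float = 0.75) -> bool:
    # Stand-in decision rule: a plain average compared with an assumed
    # threshold. A trained classifier would take this place in practice.
    return sum(scores.values()) / len(scores) >= threshold


pair_scores = comparator_scores("Acme Widget Co.", "ACME WIDGET CO")
print(pair_scores, is_match(pair_scores))
```

Keeping each comparator as a separate named score, rather than collapsing them early, is what would let a classifier learn which comparators matter for a given pair of datasets.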
  • Working Paper

    Occupational Classifications: A Machine Learning Approach

    August 2018

    Working Paper Number: CES-18-37

    Characterizing the work that people do on their jobs is a longstanding and core issue in labor economics. Traditionally, classification has been done manually. If it were possible to combine new computational tools and administrative wage records to generate an automated crosswalk between job titles and occupations, millions of dollars could be saved in labor costs, data processing could be sped up, data could become more consistent, and it might be possible to generate, without a lag, current information about the changing occupational composition of the labor market. This paper examines the potential to assign occupations to job titles contained in administrative data using automated, machine-learning approaches. We use a new, extraordinarily rich and detailed set of data on transactional HR records of large firms (universities) in a relatively narrowly defined industry (public institutions of higher education) to identify the potential for machine-learning approaches to classify occupations.
  • Working Paper

    The Effects of Industry Classification Changes on US Employment Composition

    June 2018

    Working Paper Number: CES-18-28

    This paper documents the extent to which compositional changes in US employment from 1976 to 2009 are due to changes in the industry classification scheme used to categorize economic activity. In 1997, US statistical agencies began implementation of a change from the Standard Industrial Classification System (SIC) to the North American Industry Classification System (NAICS). NAICS was designed to provide a consistent classification scheme that consolidated declining or obsolete industries and added categories for new industries. Under NAICS, many activities previously classified as Manufacturing, Wholesale Trade, or Retail Trade were re-classified into the Services sector. This re-classification resulted in a significant shift of measured activities across sectors without any change in underlying economic activity. Using a newly developed establishment-level database of employment activity that is consistently classified on a NAICS basis, this paper shows that the change from SIC to NAICS increased the share of Services employment by approximately 36 percent. About 7.6 percent of US manufacturing employment, equal to approximately 1.4 million jobs, was reclassified to services. Retail trade and wholesale trade also experienced a significant reclassification of activities in the transition.
  • Working Paper

    Getting Patents and Economic Data to Speak to Each Other: An 'Algorithmic Links with Probabilities' Approach for Joint Analyses of Patenting and Economic Activity

    September 2012

    Working Paper Number: CES-12-16

    International technological diffusion is a key determinant of cross-country differences in economic performance. While patents can be a useful proxy for innovation and technological change and diffusion, fully exploiting patent data for such economic analyses requires patents to be tied to measures of economic activity. In this paper, we describe and explore a new algorithmic approach to constructing concordances between the International Patent Classification (IPC) system that organizes patents by technical features and industry classification systems that organize economic data, such as the Standard International Trade Classification (SITC), the International Standard Industrial Classification (ISIC) and the Harmonized System (HS). This 'Algorithmic Links with Probabilities' (ALP) approach incorporates text analysis software and keyword extraction programs and applies them to a comprehensive patent dataset. We compare the results of several ALP concordances to existing technology concordances. Based on these comparisons, we select a preferred ALP approach and discuss advantages of this approach relative to conventional approaches. We conclude with a discussion on some of the possible applications of the concordance and provide a sample analysis that uses our preferred ALP concordance to analyze international patent flows based on trade patterns.
  • Working Paper

    Concording U.S. Harmonized System Categories Over Time

    May 2009

    Working Paper Number: CES-09-11

    This paper: outlines an algorithm for concording U.S. ten-digit Harmonized System export and import codes over time; describes the concordances we construct for 1989 to 2004; and provides Stata code that can be used to construct similar concordances for arbitrary beginning and ending years from 1989 to 2007.
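The abstract does not spell out the algorithm, so the following is only a generic sketch of one way to concord codes over time: pool the year-to-year (old code, new code) changes and group every code connected by any chain of changes into a single "family" that can be tracked consistently. The function name, the union-find approach, and the ten-digit codes are all hypothetical and are not taken from the paper's Stata code.

```python
def build_families(changes):
    """Group codes connected by any chain of old -> new changes.

    `changes` is an iterable of (old_code, new_code) pairs pooled over years.
    Returns a dict mapping each code to its family id, which is the
    smallest code in its group.
    """
    parent = {}

    def find(code):
        parent.setdefault(code, code)
        while parent[code] != code:
            parent[code] = parent[parent[code]]  # path halving
            code = parent[code]
        return code

    for old, new in changes:
        root_old, root_new = find(old), find(new)
        if root_old != root_new:
            # Keep the smaller code as the root so the family id is stable.
            lo, hi = sorted((root_old, root_new))
            parent[hi] = lo

    return {code: find(code) for code in list(parent)}


# Made-up codes: one code splits in two, one of the pieces is later
# renumbered, and an unrelated code is renumbered once.
changes = [
    ("0101000010", "0101000015"),
    ("0101000010", "0101000019"),
    ("0101000019", "0101000020"),
    ("0202000000", "0202000050"),
]
families = build_families(changes)
print(families)
```

Aggregating trade values by family id rather than by raw code would then give series that are comparable across the years covered by the pooled changes.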
  • Working Paper

    Testing the Advantages of Using Product Level Data to Create Linkages Across Industrial Coding Systems

    October 1993

    Authors: Suzanne Peck

    Working Paper Number: CES-93-14

    After the major revision of the U.S. Standard Industrial Classification system (SIC) in 1987, the problem arose of how to evaluate industrial performance over time. The revision resulted in the creation of new industries, the combination of old industries, and the remixing of other industries to better reflect the present U.S. economy. A method had to be developed to make the old and new sets of industries comparable over time. Ryten (1991) argues for performing the conversion at the "most micro level," the product level. Linking industries should be accomplished by reclassifying product data of each establishment to a standard system, reassigning the primary activity of the establishment, reaggregating the data to the industry level, and then making the desired statistical comparison (Ryten, 1991). This paper discusses linking the data at the very micro, product level, and at the more macro, industry level. The results suggest that with complete product information the product level conversion is preferable for most industries in manufacturing because it recognizes that establishments may switch their primary industry because of the conversion. For some industries, especially those having no substantial changes in SIC codes over time, the conversion at the industry level is fairly accurate. A small group of industries lacks complete product information in 1982 to link the 1982 product codes to the 1987 codes. This results in having to rely on the industry concordance to create a time series of statistics.
  • Working Paper

    Manufacturing Establishments Reclassified Into New Industries: The Effect Of Survey Design Rules

    November 1992

    Working Paper Number: CES-92-14

    Establishment reclassification occurs when an establishment classified in one industry in one year is reclassified into another industry in another year. Because of survey design rules at the Census Bureau these reclassifications occur systematically over time, and affect the industry-level time series of output and employment. The evidence shows that reclassified establishments occur most often in two distinct years over the life of a sample panel. Switches are not only numerous in these years, they also contribute significantly to measured industry change in industry output and employment. The problem is that reclassifications are not necessarily processed in the year that they occur. The survey rules restrict most change to certain years. The effect of these rules is evidenced by looking at the variance across industry growth rates which increases greatly in these two years. Whatever the reason for reclassifying an establishment, the way the switches are processed raises the possibility of measurement errors in the industry level statistics. Researchers and policymakers relying upon observations in annual changes in industry statistics should be aware of these systematic discontinuities, discrepancies and potential data distortions.
  • Working Paper

    Multiple Classification Systems For Economic Data: Can A Thousand Flowers Bloom? And Should They?

    December 1991

    Working Paper Number: CES-91-08

    The principle that the statistical system should provide flexibility, that is, possibilities for generating multiple groupings of data to satisfy multiple objectives, if it is to satisfy users is universally accepted. Yet in practice, this goal has not been achieved. This paper discusses the feasibility of providing flexibility in the statistical system to accommodate multiple uses of the industrial data now primarily examined within the Standard Industrial Classification (SIC) system. In one sense, the question of feasibility is almost trivial. With today's computer technology, vast amounts of data can be manipulated and stored at very low cost. Reconfigurations of the basic data are very inexpensive compared to the cost of collecting the data. Flexibility in the statistical system implies more than the technical ability to regroup data. It requires that the basic data are sufficiently detailed to support user needs and are processed and maintained in a fashion that makes the use of a variety of aggregation rules possible. For this to happen, statistical agencies must recognize the need for high quality microdata and build this into their planning processes. Agencies need to view their missions from a multiple use perspective and move away from use of a primary reporting and collection vehicle. Although the categories used to report data must be flexible, practical considerations dictate that data collection proceed within a fixed classification system. It is simply too expensive for both respondents and statistical agencies to process survey responses in the absence of standardized forms, data entry programs, etc. I argue for a basic classification centered on commodities (products, services, raw materials and labor inputs) as the focus of data collection. The idea is to make the principal variables of interest, the commodities, the vehicle for the collection and processing of the data.
    For completeness, the basic classification should include labor usage through some form of occupational classification. In most economic surveys at the Census Bureau, the reporting unit and the classified unit have been the establishment. But there is no need for this to be so. The basic principle to be followed in data collection is that the data should be collected in the most efficient way, with efficiency defined jointly in terms of statistical agency collection costs and respondent burdens.
  • Working Paper

    The Classification of Manufacturing Industries: an Input-Based Clustering of Activity

    August 1990

    Working Paper Number: CES-90-07

    The classification and aggregation of manufacturing data is vital for the analysis and reporting of economic activity. Most organizations and researchers use the Standard Industrial Classification (SIC) system for this purpose. This is, however, not the only option. Our paper examines an alternative classification based on clustering activity using production technologies. While this approach yields results which are similar to the SIC, there are important differences between the two classifications in terms of the specific industrial categories and the amount of information lost through aggregation.