We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.
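To make the mixture-on-distances idea concrete, the sketch below is a toy Gibbs sampler for a two-component normal mixture on pairwise distances, with one component standing in for linked pairs and the other for non-linked pairs. It is only an illustration of the general machinery, not the LFCMV model itself; the priors, parameter names, and simulated distances are assumptions, and the full model (including the linking vector) is richer than this.

```python
# Illustrative sketch only: a toy Gibbs sampler for a two-component normal
# mixture on pairwise distances between continuous matching variables.
# Priors, parameter names, and data are assumptions, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

# Simulated distances: small for "linked" pairs, large for "non-linked" pairs
d = np.concatenate([rng.normal(0.2, 0.1, 50), rng.normal(2.0, 0.5, 150)])

n_iter = 2000
mu = np.array([0.0, 1.0])        # component means (linked, non-linked)
sigma2 = np.array([1.0, 1.0])    # component variances
pi = 0.5                         # probability a pair is linked

for it in range(n_iter):
    # 1. Sample latent link indicators given current parameters
    lik_link = pi * np.exp(-0.5 * (d - mu[0])**2 / sigma2[0]) / np.sqrt(sigma2[0])
    lik_non = (1 - pi) * np.exp(-0.5 * (d - mu[1])**2 / sigma2[1]) / np.sqrt(sigma2[1])
    p_link = lik_link / (lik_link + lik_non)
    z = rng.random(d.size) < p_link

    # 2. Sample component means and variances given the indicators
    for k, mask in enumerate([z, ~z]):
        n_k = max(mask.sum(), 1)
        mu[k] = rng.normal(d[mask].mean() if mask.any() else 0.0,
                           np.sqrt(sigma2[k] / n_k))
        ss = ((d[mask] - mu[k])**2).sum() if mask.any() else 1.0
        sigma2[k] = 1.0 / rng.gamma(n_k / 2 + 1, 1.0 / (ss / 2 + 1e-3))

    # 3. Sample the mixing weight (Beta posterior under a uniform prior)
    pi = rng.beta(1 + z.sum(), 1 + (~z).sum())

print(f"Posterior draw: pi={pi:.2f}, mu={mu.round(2)}, sigma2={sigma2.round(3)}")
```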
-
USING IMPUTATION TECHNIQUES TO EVALUATE STOPPING RULES IN ADAPTIVE SURVEY DESIGN
October 2014
Working Paper Number:
CES-14-40
Adaptive design methods for social surveys use information from the data as they are collected to make decisions about the sampling design. In some cases, the decision is either to continue or stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
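A minimal sketch of the stop-versus-continue comparison, under assumed missingness scenarios and a made-up cost measure, is given below. The imputation rule, follow-up size, and utility measure are illustrative assumptions, not those used in the paper.

```python
# Hedged illustration: compare a "stop now and impute" estimate against a
# "continue with a follow-up sample" estimate under two missingness scenarios.
import numpy as np

rng = np.random.default_rng(1)
N = 1000
y = rng.lognormal(mean=3.0, sigma=0.7, size=N)       # true values for the frame

# Nonresponse scenarios: MCAR vs. a simple MNAR where large values respond less
p_respond = {"MCAR": np.full(N, 0.6),
             "MNAR": np.clip(0.9 - 0.1 * (y / y.mean()), 0.2, 0.9)}

cost_per_followup = 1.0
for scenario, p in p_respond.items():
    responded = rng.random(N) < p
    # Option A: stop now, fill nonrespondents with a respondent-mean imputation
    est_stop = np.where(responded, y, y[responded].mean()).mean()
    # Option B: draw a follow-up subsample of nonrespondents and use their values
    nonresp_idx = np.flatnonzero(~responded)
    followup = rng.choice(nonresp_idx, size=min(100, nonresp_idx.size), replace=False)
    est_follow = np.concatenate([y[responded], y[followup]]).mean()
    cost = cost_per_followup * followup.size
    print(f"{scenario}: stop={est_stop:.2f}, follow-up={est_follow:.2f}, "
          f"truth={y.mean():.2f}, follow-up cost={cost:.0f}")
```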
View Full
Paper PDF
-
Exploring New Ways to Classify Industries for Energy Analysis and Modeling
November 2022
Working Paper Number:
CES-22-49
Combustion, other emitting processes, and fossil energy use outside the power sector have become urgent concerns given the United States' commitment to achieving net-zero greenhouse gas emissions by 2050. Industry is an important end user of energy and relies on fossil fuels used directly for process heating and as feedstocks for a diverse range of applications. Fuel and energy use by industry is heterogeneous, meaning even a single product group can vary broadly in its production routes and associated energy use. In the United States, the North American Industry Classification System (NAICS) serves as the standard for statistical data collection and reporting. In turn, data based on NAICS are the foundation of most United States energy modeling. Thus, the effectiveness of NAICS at representing energy use is a limiting condition for current expansive planning to improve energy efficiency and alternatives to fossil fuels in industry. Facility-level data could be used to build more detail into heterogeneous sectors, supplementing data from the Bureau of the Census and U.S. Energy Information Administration reported at NAICS code levels, but such data are scarce. This work explores alternative classification schemes for industry based on energy use characteristics and validates an approach to estimate facility-level energy use from publicly available greenhouse gas emissions data from the U.S. Environmental Protection Agency (EPA). The approaches in this study can facilitate understanding of current, as well as possible future, energy demand.
First, current approaches to the construction of industrial taxonomies are summarized along with their usefulness for industrial energy modeling. Unsupervised machine learning techniques are then used to detect clusters in data reported from the U.S. Department of Energy's Industrial Assessment Center (IAC) program. Clusters of IAC data show similar levels of correlation between energy use and explanatory variables as three-digit NAICS codes. Interestingly, the clusters each include a large cross section of NAICS codes, which lends additional support to the idea that NAICS may not be particularly suited to capturing the relationship between energy use and the variables studied. Fewer clusters are needed to achieve the same level of correlation as NAICS codes. Initial assessment shows a reasonable level of separation using support vector machines, with higher than 80% accuracy, so machine learning approaches may be promising for further analysis. The IAC data are focused on small and medium-sized facilities and are biased toward higher energy users for a given facility type. Cladistics, an approach to classification developed in biology, is adapted to the energy and process characteristics of industries. Cladistics applied to industrial systems seeks to understand the progression of organizations and technology as a type of evolution, wherein traits are inherited from previous systems but evolve due to the emergence of inventions and variations and a selection process driven by adaptation to pressures and favorable outcomes. A cladogram is presented for evolutionary directions in the iron and steel sector. Cladograms are a promising tool for constructing scenarios and summarizing directions of sectoral innovation.
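A minimal sketch of the general cluster-then-check-separability workflow described above is shown here, assuming scikit-learn and synthetic stand-in data; the feature set, cluster count, and accuracy are illustrative assumptions and not results from the IAC analysis.

```python
# Hedged sketch: cluster facilities on energy-use features, then check how
# separable the detected clusters are with a support vector machine.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in facility features: e.g., log energy use, employment, production hours
X = rng.normal(size=(500, 3)) + rng.choice([0.0, 2.0, 4.0], size=(500, 1))

X_std = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_std)

# Assess separability of the detected clusters with an SVM classifier
X_tr, X_te, y_tr, y_te = train_test_split(X_std, clusters, test_size=0.3,
                                          random_state=0)
acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
print(f"SVM accuracy recovering cluster labels: {acc:.2f}")
```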
The cladogram of iron and steel is based on the drivers of energy use in the sector. Phylogenetic inference is similar to machine learning approaches in that it is based on a machine-led search of the solution space, thereby avoiding some of the subjectivity of other classification systems. Our prototype approach for constructing an industry cladogram is based on process characteristics, according to an innovation framework derived from Schumpeter, to capture evolution in a given sector. The resulting cladogram represents a snapshot in time based on detailed study of process characteristics. This work could be an important tool for the design of scenarios for more detailed modeling. Cladograms reveal groupings of emerging or dominant processes and their implications in a way that may be helpful for policymakers and entrepreneurs, allowing them to see the larger picture, other good ideas, or competitors. Constructing a cladogram could be a good first step in the analysis of many industries (e.g., nitrogenous fertilizer production, ethyl alcohol manufacturing), to understand their heterogeneity, emerging trends, and coherent groupings of related innovations.
Finally, validation is performed for facility-level energy estimates from the EPA Greenhouse Gas Reporting Program. Facility-level data availability continues to be a major challenge for industrial modeling. The method outlined by McMillan et al. (2016) and McMillan and Ruth (2019) allows estimation of facility-level energy use based on mandatory greenhouse gas reporting. The validation provided here is an important step for further use of these data for industrial energy modeling.
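The intuition behind estimating facility energy use from reported emissions is back-calculation through fuel-specific emission factors; a hedged numerical sketch follows. The CO2 quantities and factors below are illustrative assumptions, not official values or the McMillan et al. implementation.

```python
# Hedged sketch: back out fuel energy from reported CO2 by fuel using assumed
# fuel-specific emission factors (tCO2 per MMBtu).
co2_tonnes = {"natural_gas": 5_300.0, "coal": 9_400.0}      # reported CO2 by fuel
emission_factor = {"natural_gas": 0.053, "coal": 0.095}     # assumed tCO2/MMBtu

energy_mmbtu = {fuel: co2 / emission_factor[fuel] for fuel, co2 in co2_tonnes.items()}
total_mmbtu = sum(energy_mmbtu.values())
print({f: round(e) for f, e in energy_mmbtu.items()}, "total:", round(total_mmbtu))
```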
View Full
Paper PDF
-
Squeezing More Out of Your Data: Business Record Linkage with Python
November 2018
Working Paper Number:
CES-18-46
Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records and uses a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
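The general pattern of combining several string comparators with a learned classifier can be sketched as follows. This is not the MAMBA implementation; the comparators, features, and toy labels are assumptions chosen for illustration.

```python
# Illustrative sketch only: build several string-similarity features for
# candidate business-name pairs and let a classifier combine them.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def features(a: str, b: str) -> list[float]:
    a, b = a.lower().strip(), b.lower().strip()
    seq = SequenceMatcher(None, a, b).ratio()                # character similarity
    ta, tb = set(a.split()), set(b.split())
    jaccard = len(ta & tb) / max(len(ta | tb), 1)            # token overlap
    prefix = float(a[:4] == b[:4])                           # shared prefix
    return [seq, jaccard, prefix]

# Tiny hand-labeled training pairs (1 = same business, 0 = different)
train_pairs = [("acme steel co", "acme steel company", 1),
               ("acme steel co", "apex cleaning services", 0),
               ("smith & sons llc", "smith and sons", 1),
               ("smith & sons llc", "jones brothers inc", 0)]
X = [features(a, b) for a, b, _ in train_pairs]
y = [label for _, _, label in train_pairs]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Score a new candidate pair: probability the two names refer to one business
print(clf.predict_proba([features("acme steel corp", "acme steel co")])[0][1])
```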
View Full
Paper PDF
-
Plant-Level Productivity and Imputation of Missing Data in the Census of Manufactures
January 2011
Working Paper Number:
CES-11-02
In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well known that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
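A minimal sketch of tree-based imputation in a chained-equations loop, assuming scikit-learn and simulated data, is shown below. The real procedure repeats this to produce multiple completed datasets and adds posterior predictive checks; the variables, missingness mechanism, and donor-draw rule here are illustrative assumptions.

```python
# Hedged sketch: one pass of sequential regression-tree imputation, drawing
# imputations from observed donors in the same leaf to preserve variability.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.lognormal(size=(n, 3))           # stand-in for plant-level variables
mask = rng.random((n, 3)) < 0.15         # simulated item nonresponse
X_obs = X.copy()
X_obs[mask] = np.nan

X_imp = np.where(np.isnan(X_obs), np.nanmean(X_obs, axis=0), X_obs)  # start values
for _ in range(5):                        # a few chained-equation sweeps
    for j in range(X_imp.shape[1]):
        miss_j = np.isnan(X_obs[:, j])
        if not miss_j.any():
            continue
        other = np.delete(X_imp, j, axis=1)
        tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0)
        tree.fit(other[~miss_j], X_obs[~miss_j, j])
        # Draw from observed donors in the same leaf rather than the leaf mean
        leaves_obs = tree.apply(other[~miss_j])
        leaves_mis = tree.apply(other[miss_j])
        donors = X_obs[~miss_j, j]
        draws = np.array([rng.choice(donors[leaves_obs == leaf])
                          for leaf in leaves_mis])
        X_imp[miss_j, j] = draws

print("column means (truth vs. one completed dataset):",
      X.mean(axis=0).round(2), X_imp.mean(axis=0).round(2))
```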
View Full
Paper PDF
-
Empirical Distribution of the Plant-Level Components of Energy and Carbon Intensity at the Six-digit NAICS Level Using a Modified KAYA Identity
September 2024
Working Paper Number:
CES-24-46
Three basic pillars of industry-level decarbonization are energy efficiency, decarbonization of energy sources, and electrification. This paper provides estimates of a decomposition of carbon emissions by industry into these three components: energy intensity, carbon intensity of energy, and energy (fuel) mix. These estimates are constructed at the six-digit NAICS level from non-public, plant-level data collected by the Census Bureau. Four quintiles of the distribution of each of the three components are constructed, using multiple imputation (MI) to deal with non-reported energy variables in the Census data. MI allows the estimates to avoid non-reporting bias. MI also allows more six-digit NAICS industries to be estimated under Census non-disclosure rules, since dropping non-reported observations would have reduced the sample sizes unnecessarily. The estimates show wide variation in each of these three components of emissions intensity and provide a first empirical look into the plant-level variation that underlies carbon emissions.
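A hedged numerical illustration of a Kaya-style decomposition into the three components named above follows. The plant values and emission factors are made up for illustration, and the exact identity used in the paper may differ.

```python
# Hedged sketch: emissions = output * energy intensity * carbon intensity of
# energy, where the carbon intensity of energy is a fuel-mix-weighted average.
output = 100.0                                                   # real output
energy_by_fuel = {"electricity": 40.0, "natural_gas": 160.0}     # MMBtu
carbon_factor = {"electricity": 0.05, "natural_gas": 0.053}      # assumed tCO2/MMBtu

total_energy = sum(energy_by_fuel.values())
energy_intensity = total_energy / output                         # MMBtu per unit output
fuel_shares = {f: e / total_energy for f, e in energy_by_fuel.items()}
carbon_intensity_of_energy = sum(fuel_shares[f] * carbon_factor[f]
                                 for f in fuel_shares)            # tCO2 per MMBtu

emissions = output * energy_intensity * carbon_intensity_of_energy
direct = sum(energy_by_fuel[f] * carbon_factor[f] for f in energy_by_fuel)
print(f"{emissions:.2f} tCO2 (identity) vs {direct:.2f} tCO2 (direct sum)")
```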
View Full
Paper PDF
-
Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database
February 2011
Working Paper Number:
CES-11-04
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
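The core idea of fully synthetic microdata can be sketched minimally as below: fit a model to confidential records and release draws from it rather than the records themselves. The SynLBD uses far richer synthesis models; the lognormal model and the payroll variable here are assumptions for illustration only.

```python
# Hedged sketch: fit a simple "synthesis model" and release synthetic draws.
import numpy as np

rng = np.random.default_rng(0)
confidential_payroll = rng.lognormal(mean=5.0, sigma=1.2, size=10_000)

# Fit a lognormal model to the confidential data
log_p = np.log(confidential_payroll)
mu_hat, sigma_hat = log_p.mean(), log_p.std(ddof=1)

# Release synthetic draws rather than the underlying records
synthetic_payroll = rng.lognormal(mean=mu_hat, sigma=sigma_hat,
                                  size=confidential_payroll.size)
print("confidential vs synthetic medians:",
      round(float(np.median(confidential_payroll)), 1),
      round(float(np.median(synthetic_payroll)), 1))
```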
View Full
Paper PDF
-
A Concordance Between Ten-Digit U.S. Harmonized System Codes and SIC/NAICS Product Classes and Industries
November 2009
Working Paper Number:
CES-09-41
This paper provides and describes concordances between the ten-digit Harmonized System (HS) categories used to classify products in U.S. international trade and the four-digit SIC and six-digit NAICS industries, covering the years 1989 to 2006. We also provide concordances between ten-digit HS codes and the five-digit SIC and seven-digit NAICS product classes used to classify U.S. manufacturing production. Finally, we briefly describe how these concordances might be applied in current empirical international trade research.
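A hedged sketch of how such a concordance is typically applied is shown below: merge a trade file keyed by ten-digit HS codes onto NAICS industries through a concordance table and aggregate. The HS codes, NAICS codes, and values are invented for illustration and do not come from the paper's concordance files.

```python
# Hedged sketch: apply an HS-to-NAICS concordance with a pandas merge.
import pandas as pd

concordance = pd.DataFrame({
    "hs10": ["0101210010", "8703230190"],     # invented example codes
    "naics6": ["112920", "336111"],
})
trade = pd.DataFrame({
    "hs10": ["0101210010", "8703230190", "8703230190"],
    "import_value": [1_200_000, 54_000_000, 31_000_000],
})

by_industry = (trade.merge(concordance, on="hs10", how="left")
                    .groupby("naics6", as_index=False)["import_value"].sum())
print(by_industry)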
View Full
Paper PDF
-
Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
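The error-localization side of the problem, flagging records that violate the kinds of linear constraints described above, can be sketched as follows. The full method replaces flagged values via the Bayesian hierarchical model; the variable names, tolerance, and ratio bounds here are assumptions for illustration.

```python
# Hedged sketch: flag records violating (i) a component-sum constraint and
# (ii) an expert-specified ratio bound, as candidates for edit-imputation.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary_wages": [800.0, 500.0, 300.0],
    "benefits":     [200.0, 100.0,  90.0],
    "total_comp":   [1000.0, 650.0, 390.0],   # should equal the two components
    "employment":   [20.0, 10.0, 2.0],
})

# Constraint 1: components must sum to the reported total (within a tolerance)
sum_violation = ~np.isclose(df["salary_wages"] + df["benefits"], df["total_comp"])

# Constraint 2: pay per worker must fall within expert-specified bounds
pay_per_worker = df["total_comp"] / df["employment"]
ratio_violation = (pay_per_worker < 10.0) | (pay_per_worker > 150.0)

df["needs_edit"] = sum_violation | ratio_violation
print(df)
```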
View Full
Paper PDF
-
IMPROVING THE SYNTHETIC LONGITUDINAL BUSINESS DATABASE
February 2014
Working Paper Number:
CES-14-12
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau's Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.
View Full
Paper PDF
-
The Energy Efficiency Gap and Energy Price Responsiveness in Food Processing
June 2020
Working Paper Number:
CES-20-18
This paper estimates stochastic frontier energy demand functions with non-public, plant-level data from the U.S. Census Bureau to measure the energy efficiency gap and energy price elasticities in the food processing industry. The estimates are for electricity and fuel use in four food processing sectors, based on the disaggregation of this industry used by the National Energy Modeling System Industrial Demand Module. The estimated demand functions control for plant inputs and output, energy prices, and other observables including six-digit NAICS industry designations. Own-price elasticities range from -0.6 to -0.9, with little evidence of fuel/electricity substitution. The magnitude of the efficiency estimates is sensitive to the assumptions, but the estimates consistently reveal that few plants achieve 100% efficiency. Defining a 'practical level of energy efficiency' as the 95th percentile of the efficiency distributions and averaging across all the models results in an efficiency gap of roughly 20%. However, most of the potential reductions in energy use from closing this efficiency gap are from plants that are 'low-hanging fruit': 13% of the 20% potential reduction in the efficiency gap can be obtained by bringing the lower half of the efficiency distribution up to just the median level of observed performance. New plants do exhibit statistically significantly higher energy efficiency than existing plants, but the difference is small for most of the industry, ranging from a low of 0.4% to a high of 5.7%.
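A hedged sketch of the efficiency-gap calculation described above is given below: given plant-level efficiency scores (which the paper estimates with stochastic frontier models), treat the 95th percentile as the practical benchmark and compute the implied reduction in energy use. The simulated scores, energy values, and the scaling rule are assumptions for illustration only.

```python
# Hedged sketch: efficiency gap relative to a 95th-percentile benchmark, and
# the share of the gap coming from plants below the median efficiency.
import numpy as np

rng = np.random.default_rng(0)
n_plants = 1_000
efficiency = np.clip(rng.beta(8, 3, n_plants), 0.05, 1.0)    # 1.0 = frontier
energy_use = rng.lognormal(mean=4.0, sigma=0.8, size=n_plants)

practical = np.quantile(efficiency, 0.95)
# Energy needed if every plant below the benchmark operated at the benchmark
energy_at_practical = energy_use * np.minimum(efficiency / practical, 1.0)
gap_share = 1.0 - energy_at_practical.sum() / energy_use.sum()

# Share of the gap attributable to the lower half of the efficiency distribution
low_half = efficiency < np.median(efficiency)
energy_at_median = energy_use.copy()
energy_at_median[low_half] *= efficiency[low_half] / np.median(efficiency)
low_half_share = 1.0 - energy_at_median.sum() / energy_use.sum()

print(f"efficiency gap ~{gap_share:.1%}; from below-median plants ~{low_half_share:.1%}")
```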
View Full
Paper PDF