CREAT - Census Bureau

File Matching with Faulty Continuous Matching Variables

January 2017

Written by: Gale Boyd, Jerome P. Reiter, Nicole M. Dalzell

Working Paper Number:

CES-17-45

Abstract

We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.

Document Tags and Keywords

Keywords:

data, statistical, model, record, matched, matching, associate, sampling, sample, inference, linkage

Tags:

Census of Manufactures, National Science Foundation, Department of Energy, Census of Manufacturing Firms, North American Industry Classification System

Similar Working Papers

The 10 most similar working papers to the working paper 'File Matching with Faulty Continuous Matching Variables' are listed below in order of similarity.

Working Paper

USING IMPUTATION TECHNIQUES TO EVALUATE STOPPING RULES IN ADAPTIVE SURVEY DESIGN

October 2014

Authors: Thais Paiva, Jerry Reiter

Working Paper Number:

CES-14-40

Adaptive Design methods for social surveys utilize the information from the data as it is collected to make decisions about the sampling design. In some cases, the decision is either to continue or stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufacturers.
View Full Paper PDF
Working Paper

Squeezing More Out of Your Data: Business Record Linkage with Python

November 2018

Authors: Nathan Goldschlag, John Cuffe

Working Paper Number:

CES-18-46

Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to ''fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across dfferent datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
View Full Paper PDF
Working Paper

Distribution Preserving Statistical Disclosure Limitation

September 2006

Authors: Simon Woodcock, Gary Benedetto

Working Paper Number:

tp-2006-04

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
View Full Paper PDF
Working Paper

Plant-Level Productivity and Imputation of Missing Data in the Census of Manufactures

January 2011

Authors: T. Kirk White, Amil Petrin, Jerome P. Reiter

Working Paper Number:

CES-11-02

In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is wellknown that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
View Full Paper PDF
Working Paper

Empirical Distribution of the Plant-Level Components of Energy and Carbon Intensity at the Six-digit NAICS Level Using a Modified KAYA Identity

September 2024

Authors: Gale Boyd, Matthew Doolin, Yu Ma

Working Paper Number:

CES-24-46

Three basic pillars of industry-level decarbonization are energy efficiency, decarbonization of energy sources, and electrification. This paper provides estimates of a decomposition of these three components of carbon emissions by industry: energy intensity, carbon intensity of energy, and energy (fuel) mix. These estimates are constructed at the six-digit NAICS level from non-public, plant-level data collected by the Census Bureau. Four quintiles of the distribution of each of the three components are constructed, using multiple imputation (MI) to deal with non-reported energy variables in the Census data. MI allows the estimates to avoid non-reporting bias. MI also allows more six-digit NAICS to be estimated under Census non-disclosure rules, since dropping non-reported observations may have reduced the sample sizes unnecessarily. The estimates show wide variation in each of these three components of emissions (intensity) and provide a first empirical look into the plant-level variation that underlies carbon emissions.
View Full Paper PDF
Working Paper

Simultaneous Edit-Imputation for Continuous Microdata

December 2015

Authors: Jerome P. Reiter, Hang J. Kim, Lawrence H. Cox, Alan F. Karr, Quanli Wang

Working Paper Number:

CES-15-44

Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
View Full Paper PDF
Working Paper

Exploring New Ways to Classify Industries for Energy Analysis and Modeling

November 2022

Authors: Gale Boyd, Matthew Doolin, Liz Wachs, Colin McMillan

Working Paper Number:

CES-22-49

Combustion, other emitting processes and fossil energy use outside the power sector have become urgent concerns given the United States' commitment to achieving net-zero greenhouse gas emissions by 2050. Industry is an important end user of energy and relies on fossil fuels used directly for process heating and as feedstocks for a diverse range of applications. Fuel and energy use by industry is heterogeneous, meaning even a single product group can vary broadly in its production routes and associated energy use. In the United States, the North American Industry Classification System (NAICS) serves as the standard for statistical data collection and reporting. In turn, data based on NAICS are the foundation of most United States energy modeling. Thus, the effectiveness of NAICS at representing energy use is a limiting condition for current expansive planning to improve energy efficiency and alternatives to fossil fuels in industry. Facility-level data could be used to build more detail into heterogeneous sectors and thus supplement data from Bureau of the Census and U.S Energy Information Administration reporting at NAICS code levels but are scarce. This work explores alternative classification schemes for industry based on energy use characteristics and validates an approach to estimate facility-level energy use from publicly available greenhouse gas emissions data from the U.S. Environmental Protection Agency (EPA). The approaches in this study can facilitate understanding of current, as well as possible future, energy demand. First, current approaches to the construction of industrial taxonomies are summarized along with their usefulness for industrial energy modeling. Unsupervised machine learning techniques are then used to detect clusters in data reported from the U.S. Department of Energy's Industrial Assessment Center program. Clusters of Industrial Assessment Center data show similar levels of correlation between energy use and explanatory variables as three-digit NAICS codes. Interestingly, the clusters each include a large cross section of NAICS codes, which lends additional support to the idea that NAICS may not be particularly suited for correlation between energy use and the variables studied. Fewer clusters are needed for the same level of correlation as shown in NAICS codes. Initial assessment shows a reasonable level of separation using support vector machines with higher than 80% accuracy, so machine learning approaches may be promising for further analysis. The IAC data is focused on smaller and medium-sized facilities and is biased toward higher energy users for a given facility type. Cladistics, an approach for classification developed in biology, is adapted to energy and process characteristics of industries. Cladistics applied to industrial systems seeks to understand the progression of organizations and technology as a type of evolution, wherein traits are inherited from previous systems but evolve due to the emergence of inventions and variations and a selection process driven by adaptation to pressures and favorable outcomes. A cladogram is presented for evolutionary directions in the iron and steel sector. Cladograms are a promising tool for constructing scenarios and summarizing directions of sectoral innovation. The cladogram of iron and steel is based on the drivers of energy use in the sector. Phylogenetic inference is similar to machine learning approaches as it is based on a machine-led search of the solution space, therefore avoiding some of the subjectivity of other classification systems. Our prototype approach for constructing an industry cladogram is based on process characteristics according to the innovation framework derived from Schumpeter to capture evolution in a given sector. The resulting cladogram represents a snapshot in time based on detailed study of process characteristics. This work could be an important tool for the design of scenarios for more detailed modeling. Cladograms reveal groupings of emerging or dominant processes and their implications in a way that may be helpful for policymakers and entrepreneurs, allowing them to see the larger picture, other good ideas, or competitors. Constructing a cladogram could be a good first step to analysis of many industries (e.g. nitrogenous fertilizer production, ethyl alcohol manufacturing), to understand their heterogeneity, emerging trends, and coherent groupings of related innovations. Finally, validation is performed for facility-level energy estimates from the EPA Greenhouse Gas Reporting Program. Facility-level data availability continues to be a major challenge for industrial modeling. The method outlined by (McMillan et al. 2016; McMillan and Ruth 2019) allows estimating of facility level energy use based on mandatory greenhouse gas reporting. The validation provided here is an important step for further use of this data for industrial energy modeling.
View Full Paper PDF
Working Paper

Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database

February 2011

Authors: Arnold P Reznek, Ron Jarmin, Javier Miranda, John M. Abowd, Jerome P. Reiter, Satkartar K. Kinney

Working Paper Number:

CES-11-04

In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments\' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
View Full Paper PDF
Working Paper

Estimating the Distribution of Plant-Level Manufacturing Energy Efficiency with Stochastic Frontier Regression

March 2007

Authors: Gale Boyd

Working Paper Number:

CES-07-07

A feature commonly used to distinguish between parametric/statistical models and engineering models is that engineering models explicitly represent best practice technologies while the parametric/statistical models are typically based on average practice. Measures of energy intensity based on average practice are less useful in the corporate management of energy or for public policy goal setting. In the context of company or plant level energy management, it is more useful to have a measure of energy intensity capable of representing where a company or plant lies within a distribution of performance. In other words, is the performance close (or far) from the industry best practice? This paper presents a parametric/statistical approach that can be used to measure best practice, thereby providing a measure of the difference, or 'efficiency gap' at a plant, company or overall industry level. The approach requires plant level data and applies a stochastic frontier regression analysis to energy use. Stochastic frontier regression analysis separates the energy intensity into three components, systematic effects, inefficiency, and statistical (random) error. The stochastic frontier can be viewed as a sub-vector input distance function. One advantage of this approach is that physical product mix can be included in the distance function, avoiding the problem of aggregating output to define a single energy/output ratio to measure energy intensity. The paper outlines the methods and gives an example of the analysis conducted for a non-public micro-dataset of wet corn refining plants.
View Full Paper PDF
Working Paper

Cogeneration Technology Adoption in the U.S.

January 2016

Authors: Mary Jialin Li

Working Paper Number:

CES-16-30

Well over half of all electricity generated in recent years in Denmark is through cogeneration. In U.S., however, this number is only roughly eight percent. While both the federal and state governments provided regulatory incentives for more cogeneration adoption, the capacity added in the past five years have been the lowest since late 1970s. My goal is to first understand what are and their relative importance of the factors that drive cogeneration technology adoption, with an emphasis on estimating the elasticity of adoption with respect to relative energy input prices and regulatory factors. Very preliminary results show that with a 1 cent increase in purchased electricity price from 6 cents (roughly current average) to 7 cents per kwh, the likelihood of cogeneration technology adoption goes up by about 0.7-1 percent. Then I will try to address the general equilibrium effect of cogeneration adoption in the electricity generation sector as a whole and potentially estimate some key parameters that the social planner would need to determine the optimal cogeneration investment amount. Partial equilibrium setting does not consider the decrease in investment in the utilities sector when facing competition from the distributed electricity generators, and therefore ignore the effects from the change in equilibrium price of electricity. The competitive market equilibrium setting does not consider the externality in the reduction of CO2 emissions, and leads to socially sub-optimal investment in cogeneration. If we were to achieve the national goal to increase cogeneration capacity half of the current capacity by 2020, the US Department of Energy (DOE) estimated an annual reduction of 150 million metric tons of CO2 annually ' equivalent to the emissions from over 25 million cars. This is about five times the annual carbon reduction from deregulation and consolidation in the US nuclear power industry (Davis, Wolfram 2012). Although the DOE estimates could be an overly optimistic estimate, it nonetheless suggests the large potential in the adoption of cogeneration technology.
View Full Paper PDF

File Matching with Faulty Continuous Matching Variables

January 2017

Working Paper Number:

CES-17-45

Abstract

Document Tags and Keywords

The 10 most similar working papers to the working paper 'File Matching with Faulty Continuous Matching Variables' are listed below in order of similarity.

October 2014

Working Paper Number:

CES-14-40

November 2018

Working Paper Number:

CES-18-46

September 2006

Working Paper Number:

tp-2006-04

January 2011

Working Paper Number:

CES-11-02

September 2024

Working Paper Number:

CES-24-46

December 2015

Working Paper Number:

CES-15-44

November 2022

Working Paper Number:

CES-22-49

February 2011

Working Paper Number:

CES-11-04

March 2007

Working Paper Number:

CES-07-07

January 2016

Working Paper Number:

CES-16-30