In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well-known that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
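The CART-based imputation described above fits a tree for each incomplete variable and fills missing values by drawing from the matching leaf. A deliberately simplified stand-in for one step of that sequence is sketched below: a depth-1 tree (a stump) of one variable on one predictor, with missing entries filled by hot-deck draws from the matching leaf. All names are hypothetical, and real CART imputation fits much deeper trees over many predictors.

```python
import random

def stump_impute(x, y, seed=0):
    """One-variable, one-split stand-in for the CART step in
    sequential (chained) imputation: fit a depth-1 tree of y on x
    using complete cases, then fill each missing y (None) by drawing
    a donor value from the matching leaf (hot-deck within leaves)."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

    def sse(vals):
        # within-leaf sum of squared errors around the leaf mean
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    # choose the split on x that minimises within-leaf variance of y
    best_t, best_score = None, float("inf")
    for t, _ in obs:
        left = [yi for xi, yi in obs if xi <= t]
        right = [yi for xi, yi in obs if xi > t]
        if not left or not right:
            continue
        score = sse(left) + sse(right)
        if score < best_score:
            best_t, best_score = t, score

    filled = list(y)
    for i, (xi, yi) in enumerate(zip(x, y)):
        if yi is None:
            leaf = [yj for xj, yj in obs
                    if (xj <= best_t) == (xi <= best_t)]
            filled[i] = rng.choice(leaf)  # draw, not mean: preserves spread
    return filled
```

Drawing a donor value rather than the leaf mean is what lets multiple imputation propagate uncertainty: repeating the fill with different seeds yields the multiple completed data sets over which estimates are combined.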
-
USING IMPUTATION TECHNIQUES TO EVALUATE STOPPING RULES IN ADAPTIVE SURVEY DESIGN
October 2014
Working Paper Number:
CES-14-40
Adaptive Design methods for social surveys utilize the information from the data as it is collected to make decisions about the sampling design. In some cases, the decision is either to continue or stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
-
RECOVERING THE ITEM-LEVEL EDIT AND IMPUTATION FLAGS IN THE 1977-1997 CENSUSES OF MANUFACTURES
September 2014
Working Paper Number:
CES-14-37
As part of processing the Census of Manufactures, the Census Bureau edits some data items and imputes values for missing data and for some data deemed erroneous. Until recently it was difficult for researchers using the plant-level microdata to determine which data items were changed or imputed during the editing and imputation process, because the edit/imputation processing flags were not available to researchers. This paper describes the process of reconstructing the edit/imputation flags for variables in the 1977, 1982, 1987, 1992, and 1997 Censuses of Manufactures using recently recovered Census Bureau files. The paper also reports summary statistics for the percentage of cases that are imputed for key variables. Excluding plants with fewer than 5 employees, imputation rates for several key variables range from 8% to 54% for the manufacturing sector as a whole, and from 1% to 72% at the 2-digit SIC industry level.
-
Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
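The linear edit constraints described above, components summing to totals and bounded ratios, can be made concrete with a small checker. The rules and field names below are illustrative only, not the Census Bureau's actual edit rules, and this sketch covers only error detection, not the paper's integrated Bayesian localization and imputation.

```python
def violated_edits(record, tol=1e-6):
    """Flag which linear edit rules a plant record fails.
    Rules here are illustrative, not the Census Bureau's actual edits:
      (1) balance: materials + labor + other == total_cost
      (2) ratio bound: 0.05 <= labor / total_cost <= 0.95
    Returns a list of names of the rules that are violated."""
    failures = []
    components = record["materials"] + record["labor"] + record["other"]
    if abs(components - record["total_cost"]) > tol:
        failures.append("balance")
    ratio = record["labor"] / record["total_cost"]
    if not (0.05 <= ratio <= 0.95):
        failures.append("labor_ratio")
    return failures
```

A separate-steps workflow would pass the flagged fields to an optimizer and then a hot-deck imputer; the paper's point is that modeling true values, error indicators, and reported responses jointly avoids that separation.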
-
Regulating Mismeasured Pollution: Implications of Firm Heterogeneity for Environmental Policy
August 2018
Working Paper Number:
CES-18-03R
This paper provides the first estimates of within-industry heterogeneity in energy and CO2 productivity for the entire U.S. manufacturing sector. We measure energy and CO2 productivity as output per dollar energy input or per ton CO2 emitted. Three findings emerge. First, within narrowly defined industries, heterogeneity in energy and CO2 productivity across plants is enormous. Second, heterogeneity in energy and CO2 productivity exceeds heterogeneity in most other productivity measures, like labor or total factor productivity. Third, heterogeneity in energy and CO2 productivity has important implications for environmental policies targeting industries rather than plants, including technology standards and carbon border adjustments.
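The productivity measure defined above, output per dollar of energy input, and its within-industry dispersion can be computed with a few lines. The data layout and field names below are hypothetical; dispersion is summarized here as the standard deviation of log productivity within each industry.

```python
import math
from collections import defaultdict
from statistics import stdev

def energy_productivity_dispersion(plants):
    """Within-industry dispersion of energy productivity, measured
    (as in the paper) as output per dollar of energy input. Input is
    a list of dicts with hypothetical keys: industry, output,
    energy_cost. Returns {industry: std dev of log productivity}
    for industries with at least two plants."""
    by_ind = defaultdict(list)
    for p in plants:
        by_ind[p["industry"]].append(math.log(p["output"] / p["energy_cost"]))
    return {ind: stdev(vals) for ind, vals in by_ind.items() if len(vals) > 1}
```

The same skeleton works for CO2 productivity by swapping `energy_cost` for tons of CO2 emitted.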
-
Materials Prices and Productivity
June 2012
Working Paper Number:
CES-12-11
There is substantial within-industry variation, even within industries that use and produce homogeneous inputs and outputs, in the prices that plants pay for their material inputs. I explore, using plant-level data from the U.S. Census Bureau, the consequences and sources of this variation in materials prices. For a sample of industries with relatively homogeneous products, the standard deviation of plant-level productivities would be 7% lower if all plants faced the same materials prices. Moreover, plant-level materials prices are both persistent across time and predictive of exit. The contribution of net entry to aggregate productivity growth is smaller for productivity measures that strip out differences in materials prices. After documenting these patterns, I discuss three potential sources of materials price variation: geography, differences in suppliers' marginal costs, and suppliers' price-discriminatory behavior. Together, these variables account for 13% of the dispersion of materials prices. Finally, I demonstrate that plants' marginal costs are correlated with the marginal costs of their intermediate input suppliers.
-
File Matching with Faulty Continuous Matching Variables
January 2017
Working Paper Number:
CES-17-45
We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.
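The core intuition, that records should be linked when their continuous matching variables are close, can be illustrated with a crude greedy linker. This is emphatically not LFCMV: it produces a single point-estimate linkage by nearest distance, whereas LFCMV models the distance distribution with a mixture and samples the full linking vector from its posterior via Gibbs sampling.

```python
def greedy_link(file_a, file_b):
    """Greedy one-to-one linkage on continuous matching variables:
    repeatedly link the globally closest unmatched pair of records.
    A non-Bayesian simplification of LFCMV, which instead places a
    mixture model on pair distances and draws the linking vector
    from its posterior. Records are tuples of matching variables."""
    pairs = []
    for i, a in enumerate(file_a):
        for j, b in enumerate(file_b):
            d = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            pairs.append((d, i, j))
    pairs.sort()  # closest candidate pairs first
    used_a, used_b, links = set(), set(), {}
    for d, i, j in pairs:
        if i not in used_a and j not in used_b:
            links[i] = j  # index in file_a -> index in file_b
            used_a.add(i)
            used_b.add(j)
    return links
```

The Bayesian treatment matters precisely because a greedy point estimate like this carries no measure of linkage uncertainty into downstream analyses.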
-
The Impact of Plant-Level Resource Reallocations and Technical Progress on U.S. Macroeconomic Growth
December 2009
Working Paper Number:
CES-09-43
We build up from the plant level an "aggregate(d) Solow residual" by estimating every U.S. manufacturing plant's contribution to the change in aggregate final demand between 1976 and 1996. We decompose these contributions into plant-level resource reallocations and plant-level technical efficiency changes. We allow for 459 different production technologies, one for each 4-digit SIC code. Our framework uses the Petrin and Levinsohn (2008) definition of aggregate productivity growth, which aggregates plant-level changes to changes in aggregate final demand in the presence of imperfect competition and other distortions and frictions. On average, we find that aggregate reallocation made a larger contribution than aggregate technical efficiency growth. Our estimates of the contribution of reallocation range from 1.7% to 2.1% per year, while our estimates of the average contribution of aggregate technical efficiency growth range from 0.2% to 0.6% per year. In terms of cyclicality, the aggregate technical efficiency component has a standard deviation that is roughly 50% to 100% larger than that of aggregate total reallocation, pointing to an important role for technical efficiency in macroeconomic fluctuations. Aggregate reallocation is negative in only 3 of the 20 years of our sample, suggesting that the movement of inputs to more highly valued activities on average plays a stabilizing role in manufacturing growth.
-
Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
-
R&D, Attrition and Multiple Imputation in BRDIS
January 2017
Working Paper Number:
CES-17-13
Multiple imputation in business establishment surveys like BRDIS, an annual business survey in which some companies are sampled every year or in multiple years, may improve estimates of total R&D and help researchers estimate models for subpopulations with small sample sizes. Using a panel of BRDIS companies from 2008 to 2013 linked to LBD data, this paper draws on missing-data visualization and other exploratory analyses to develop a multiple imputation strategy appropriate for item nonresponse in R&D expenditures. Because survey design characteristics drive much of the item and unit nonresponse, multiple imputation of missing data in BRDIS changes estimates of total R&D significantly and alters the conclusions of models of the determinants of R&D investment estimated with complete-case analysis.
-
Productivity Races I: Are Some Productivity Measures Better Than Others?
January 1997
Working Paper Number:
CES-97-02
In this study we construct twelve different measures of productivity at the plant level and test which measures of productivity are most closely associated with direct measures of economic performance. We first examine how closely correlated these measures are with various measures of profits. We then evaluate the extent to which each productivity measure is associated with lower rates of plant closure and faster plant growth (growth in employment, output, and capital). All measures of productivity considered are credible in the sense that highly productive plants, regardless of measure, are clearly more profitable, less likely to close, and grow faster. Nevertheless, labor productivity and measures of total factor productivity that are based on regression estimates of production functions are better predictors of plant growth and survival than factor share-based measures of total factor productivity (TFP). Measures of productivity that are based on several years of data appear to outperform measures of productivity that are based solely on data from the most recent year.
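Two of the measure families compared above can be sketched side by side in a stripped-down, one-input form: labor productivity log(Y/L), and a "regression TFP" residual from OLS of log output on log labor. The paper uses twelve measures and full multi-input production functions; the names below are hypothetical.

```python
import math

def productivity_measures(output, labor):
    """Plant-level labor productivity log(Y/L) and a one-input
    'regression TFP': the residual from OLS of log Y on log L.
    A stripped-down stand-in for the paper's regression-based TFP
    measures, which use multiple inputs and several years of data."""
    ly = [math.log(v) for v in output]
    ll = [math.log(v) for v in labor]
    n = len(ly)
    ml, my = sum(ll) / n, sum(ly) / n
    # OLS slope and intercept of log Y on log L
    beta = sum((a - ml) * (b - my) for a, b in zip(ll, ly)) / \
           sum((a - ml) ** 2 for a in ll)
    alpha = my - beta * ml
    labor_prod = [a - b for a, b in zip(ly, ll)]
    tfp_resid = [a - alpha - beta * b for a, b in zip(ly, ll)]
    return labor_prod, tfp_resid
```

Note that when output scales exactly with labor the two measures agree up to a constant; they diverge, and the paper's horse race becomes informative, once returns to scale and other inputs enter.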