In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well-known that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
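As a minimal sketch of the sequential-trees idea, the Python fragment below chains CART fits over incomplete columns and draws imputations from leaf-level donor pools; the function name, the donor-draw step, and the use of scikit-learn are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: multiple imputation via chained regression trees (illustrative).
# Assumes a numeric pandas DataFrame with missing entries.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def cart_impute(df, n_imputations=5, n_sweeps=10, seed=0):
    rng = np.random.default_rng(seed)
    missing = df.isna()
    incomplete = [c for c in df.columns if missing[c].any()]
    completed = []
    for _ in range(n_imputations):
        data = df.fillna(df.mean())          # crude starting values
        for _ in range(n_sweeps):            # cycle until imputations settle
            for col in incomplete:
                obs = ~missing[col]
                X = data.drop(columns=col)
                tree = DecisionTreeRegressor(min_samples_leaf=5)
                tree.fit(X[obs], data.loc[obs, col])
                # For each missing cell, draw a donor from the observed
                # values landing in the same leaf of the fitted tree; this
                # propagates uncertainty across the completed data sets.
                leaves = tree.apply(X)
                for i in np.flatnonzero(missing[col].to_numpy()):
                    donors = data.loc[obs & (leaves == leaves[i]), col]
                    data.iloc[i, data.columns.get_loc(col)] = rng.choice(
                        donors.to_numpy())
        completed.append(data)
    return completed                          # analyze each, then pool
```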
-
Recovering the Item-Level Edit and Imputation Flags in the 1977-1997 Censuses of Manufactures
September 2014
Working Paper Number:
CES-14-37
As part of processing the Census of Manufactures, the Census Bureau edits some data items and imputes values for missing data and for some data deemed erroneous. Until recently, it was difficult for researchers using the plant-level microdata to determine which data items were changed or imputed during the editing and imputation process, because the edit/imputation processing flags were not available to researchers. This paper describes the process of reconstructing the edit/imputation flags for variables in the 1977, 1982, 1987, 1992, and 1997 Censuses of Manufactures using recently recovered Census Bureau files. The paper also reports summary statistics for the percentage of cases that are imputed for key variables. Excluding plants with fewer than 5 employees, imputation rates for several key variables range from 8% to 54% for the manufacturing sector as a whole, and from 1% to 72% at the 2-digit SIC industry level.
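Given recovered item-level flags, summary rates like those above reduce to a short tabulation; the file layout and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical long-format layout: one row per plant-item, with an
# imputation flag, plant employment, and a 2-digit SIC code.
flags = pd.read_csv("cm_flags_1987.csv")       # hypothetical file name
flags = flags[flags["employment"] >= 5]        # exclude very small plants

# Share of imputed cases for each key variable, overall and by SIC2.
overall = flags.groupby("variable")["imputed"].mean()
by_sic2 = flags.groupby(["sic2", "variable"])["imputed"].mean()
print(overall.round(3))
```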
-
Using Imputation Techniques to Evaluate Stopping Rules in Adaptive Survey Design
October 2014
Working Paper Number:
CES-14-40
Adaptive design methods for social surveys use information from the data as it is collected to make decisions about the sampling design. In some cases, the decision is either to continue or to stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
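The flavor of the scenario comparison can be sketched in a few lines; the lognormal toy data and the multiplicative shift standing in for a Missing Not at Random mechanism are illustrative assumptions, not the paper's utility measures.

```python
import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.lognormal(3.0, 1.0, size=800)      # respondents (toy data)
n_miss = 200                                   # current nonrespondents

def impute_and_estimate(y, n, shift=0.0):
    """Impute nonrespondents from the respondent distribution; a nonzero
    shift mimics a Missing-Not-at-Random departure."""
    draws = rng.choice(y, size=n) * np.exp(shift)
    return np.concatenate([y, draws]).mean()

# If the estimate is stable across scenarios, stopping looks safe relative
# to the cost of a follow-up sample.
for shift in (0.0, -0.3, 0.3):                 # MAR vs two MNAR scenarios
    est = impute_and_estimate(y_obs, n_miss, shift)
    print(f"shift={shift:+.1f}  estimated mean={est:,.1f}")
```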
-
Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
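For concreteness, the kinds of linear edit rules involved, balance of components against totals and expert-specified ratio bounds, can be expressed as a simple record-level check; the column names and bounds below are hypothetical, and the paper's Bayesian model goes well beyond this flagging step.

```python
import numpy as np
import pandas as pd

def violates_edits(df: pd.DataFrame) -> pd.Series:
    """Flag records that break two illustrative linear edit rules."""
    # Rule 1: component costs should sum to the reported total.
    bad_sum = ~np.isclose(df["wages"] + df["materials"] + df["other"],
                          df["total_cost"], rtol=1e-6)
    # Rule 2: wages per employee bounded by expert-set constants.
    ratio = df["wages"] / df["employment"]
    bad_ratio = (ratio < 5_000) | (ratio > 500_000)
    return bad_sum | bad_ratio

# Flagged records would then enter the joint error-localization and
# imputation step rather than being handled in two separate passes.
```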
-
Regulating Mismeasured Pollution: Implications of Firm Heterogeneity for Environmental Policy
August 2018
Working Paper Number:
CES-18-03R
This paper provides the first estimates of within-industry heterogeneity in energy and CO2 productivity for the entire U.S. manufacturing sector. We measure energy and CO2 productivity as output per dollar of energy input or per ton of CO2 emitted. Three findings emerge. First, within narrowly defined industries, heterogeneity in energy and CO2 productivity across plants is enormous. Second, heterogeneity in energy and CO2 productivity exceeds heterogeneity in most other productivity measures, like labor or total factor productivity. Third, heterogeneity in energy and CO2 productivity has important implications for environmental policies targeting industries rather than plants, including technology standards and carbon border adjustments.
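A sketch of the dispersion calculation, assuming hypothetical plant-level columns; the 90/10 ratio is one standard summary of within-industry heterogeneity.

```python
import pandas as pd

# Hypothetical plant-level columns: output ($), energy_cost ($), co2 (tons),
# and a narrowly defined industry code.
plants = pd.read_csv("plants.csv")
plants["energy_prod"] = plants["output"] / plants["energy_cost"]
plants["co2_prod"] = plants["output"] / plants["co2"]

def p90_p10(s: pd.Series) -> float:
    """Ratio of the 90th to the 10th percentile within a group."""
    return s.quantile(0.9) / s.quantile(0.1)

dispersion = plants.groupby("industry")[["energy_prod", "co2_prod"]].agg(p90_p10)
print(dispersion.describe())
```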
-
File Matching with Faulty Continuous Matching Variables
January 2017
Working Paper Number:
CES-17-45
We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.
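As a point of contrast, the deterministic baseline that such Bayesian methods improve upon is a nearest-neighbor match on the noisy continuous variables; the sketch below is not LFCMV itself, which instead models the distances with a mixture and samples the linking vector.

```python
import numpy as np

def naive_link(file_a, file_b):
    """Nearest-neighbor baseline for linking on noisy continuous variables.
    Returns (record in A, best match in B, distance) triples."""
    links = []
    for i, a in enumerate(file_a):
        d = np.linalg.norm(file_b - a, axis=1)   # Euclidean distances
        links.append((i, int(np.argmin(d)), float(d.min())))
    return links

# Toy example: the second file is a shuffled, noise-contaminated copy
# of the first, so matching variables never agree exactly.
rng = np.random.default_rng(7)
A = rng.normal(size=(5, 3))
B = A[rng.permutation(5)] + rng.normal(scale=0.05, size=(5, 3))
print(naive_link(A, B))
```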
-
R&D, Attrition and Multiple Imputation in BRDIS
January 2017
Working Paper Number:
CES-17-13
Multiple imputation in business establishment surveys like BRDIS, an annual business survey in which some companies are sampled every year or in multiple years, can improve estimates of total R&D and help researchers estimate models for subpopulations with small sample sizes. Using a panel of BRDIS companies for 2008 to 2013 linked to LBD data, this paper draws on missing-data visualization and other exploratory analyses to develop a multiple imputation strategy suited to item nonresponse in R&D expenditures. Because survey design characteristics drive much of the item and unit nonresponse, multiple imputation of missing data in BRDIS changes the estimates of total R&D significantly and alters the conclusions reached by models of the determinants of R&D investment estimated with complete-case analysis.
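Whatever the imputation engine, the m completed data sets are typically pooled with Rubin's combining rules; a sketch, with illustrative numbers.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine point estimates and variances from m completed data sets."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b
    return q_bar, total_var

# e.g. a total-R&D estimate from each of five completed panels (toy numbers):
q, v = rubin_pool([10.2, 9.8, 10.5, 10.1, 9.9], [0.4, 0.5, 0.45, 0.4, 0.5])
```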
-
Automation, Labor Share, and Productivity: Plant-Level Evidence from U.S. Manufacturing
September 2018
Working Paper Number:
CES-18-39
This paper provides new evidence on the plant-level relationship between automation, labor and capital usage, and productivity. The evidence, based on the U.S. Census Bureau's Survey of Manufacturing Technology, indicates that more automated establishments have lower production labor share and higher capital share, and a smaller fraction of workers in production, who receive higher wages. These establishments also have higher labor productivity and experience larger long-term labor share declines. The relationship between automation and relative factor usage is modeled using a CES production function with endogenous technology choice. This deviation from the standard Cobb-Douglas assumption is necessary if the within-industry differences in the capital-labor ratio are determined by relative input price differences. The CES-based total factor productivity estimates are significantly different from the ones derived under Cobb-Douglas production and positively related to automation. The results, taken together with earlier findings of the productivity literature, suggest that the adoption of automation may be one mechanism associated with the rise of superstar firms.
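A minimal sketch of how TFP is backed out once a CES aggregator replaces Cobb-Douglas; the parameters alpha and rho are estimated in the paper, and the function below simply inverts the production function given them.

```python
import numpy as np

def ces_tfp(y, k, l, alpha, rho):
    """Back out TFP A from y = A * (alpha*k**rho + (1-alpha)*l**rho)**(1/rho).
    alpha and rho are estimated parameters; rho -> 0 approaches the
    Cobb-Douglas case A * k**alpha * l**(1-alpha)."""
    composite = (alpha * k**rho + (1.0 - alpha) * l**rho) ** (1.0 / rho)
    return y / composite

# Under Cobb-Douglas, plants with the same composite input get the same TFP
# regardless of their capital-labor ratio; under CES with rho != 0 the ratio
# matters, which is why the two sets of TFP estimates can differ.
```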
-
Productivity Dispersion and Plant Selection in the Ready-Mix Concrete Industry
September 2011
Working Paper Number:
CES-11-25
This paper presents a quantitative model of productivity dispersion to explain why inefficient producers are slowly selected out of the ready-mix concrete industry. Measured productivity dispersion between the 10th and 90th percentile falls from a 4-to-1 difference using OLS to a 2-to-1 difference using a control function. Because productivity is volatile and sunk entry costs are high, a dynamic oligopoly model shows that, to rationalize the small gaps in exit rates between high- and low-productivity plants, a plant in the top quintile must produce 1.5 times more than a plant in the bottom quintile.
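The OLS side of that comparison amounts to taking residuals from a log-linear production function regression and exponentiating the 90-10 gap; a sketch, with the control-function correction omitted.

```python
import numpy as np

def ols_tfp_p90_p10(y, k, l, m):
    """90th-to-10th percentile TFP ratio from OLS residuals.
    y, k, l, m: logs of output, capital, labor, and materials."""
    X = np.column_stack([np.ones_like(y), k, l, m])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta                    # log TFP up to a constant
    p10, p90 = np.percentile(resid, [10, 90])
    return np.exp(p90 - p10)                # e.g. roughly 4 under OLS here
```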
-
Macro and Micro Dynamics of Productivity: From Devilish Details to Insights
January 2017
Working Paper Number:
CES-17-41R
Researchers use a variety of methods to estimate total factor productivity (TFP) at the firm level and, while these may seem broadly equivalent, how the resulting measures relate to the TFP concept in theoretical models depends on the assumptions about the environment in which firms operate. Interpreting these measures and drawing insights from their characteristics thus must take these conceptual differences into account. Absent data on prices and quantities, most methods yield 'revenue productivity' measures. We focus on two broad classes of revenue productivity measures in our examination of the relationship between measured and conceptual TFP (TFPQ). The first measure has been increasingly used as a measure of idiosyncratic distortions and to assess the degree of misallocation. The second measure is, under standard assumptions, a function of fundamentals (e.g., TFPQ). Using plant-level U.S. manufacturing data, we find that these alternative measures (i) are highly correlated; (ii) exhibit similar dispersion; and (iii) have similar relationships with growth and survival. These findings raise questions about interpreting the first measure as a measure of idiosyncratic distortions. We also explore the sensitivity of estimates of the contribution of reallocation to aggregate productivity growth to these alternative approaches. We use recently developed structural decompositions of aggregate productivity growth that depend critically on estimates of output versus revenue elasticities. We find that the alternative approaches all yield a significant contribution of reallocation to productivity growth, although the quantitative contribution varies across approaches.
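A toy illustration of comparing two revenue productivity measures built from different elasticity vectors; all numbers are made up, and the point is only the mechanics of computing correlation and dispersion across the two measures.

```python
import numpy as np

def revenue_tfp(log_rev, log_inputs, elasticities):
    """Log revenue productivity: log revenue minus elasticity-weighted
    log inputs. Different elasticity choices give different measures."""
    return log_rev - log_inputs @ np.asarray(elasticities)

rng = np.random.default_rng(3)
logs = rng.normal(size=(1000, 3))              # log capital, labor, materials
log_rev = logs @ np.array([0.3, 0.3, 0.4]) + rng.normal(0.0, 0.2, 1000)
m1 = revenue_tfp(log_rev, logs, [0.3, 0.3, 0.4])     # output elasticities
m2 = revenue_tfp(log_rev, logs, [0.25, 0.25, 0.35])  # revenue elasticities
print(np.corrcoef(m1, m2)[0, 1], m1.std(), m2.std())
```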
-
Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and an application to a large linked employer-employee database.
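A simplified analogue of the monotone-transformation idea: map the confidential values to normal scores, synthesize on the transformed scale, and invert empirically. The function below is a rank-based sketch, not the paper's estimator.

```python
import numpy as np
from scipy import stats

def synthesize(confidential, n_synth, seed=0):
    """Transformation-then-synthesize sketch: this preserves the marginal
    distribution up to sampling error."""
    rng = np.random.default_rng(seed)
    x = np.sort(np.asarray(confidential))
    n = len(x)
    # Normal scores of the observed ranks (the monotone transformation).
    z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    # Synthesize on the transformed scale from the fitted normal.
    draws = rng.normal(z.mean(), z.std(), size=n_synth)
    # Back-transform by interpolating the empirical quantile function.
    return np.interp(draws, z, x)
```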