Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Abstract
Document Tags and Keywords
Keywords
Keywords are automatically generated using KeyBERT, a keyword extraction tool that uses BERT embeddings to produce contextually relevant keywords.
By analyzing the content of working papers, KeyBERT identifies terms and phrases that capture the essence of the
text, highlighting the most significant topics and trends. This approach not only improves searchability but
also surfaces connections that go beyond potentially domain-specific author-defined keywords. A minimal extraction sketch appears after the keyword list below.
The keywords identified for this paper are:
estimating,
data,
statistical,
microdata,
model,
imputation,
reporting,
unobserved,
budget,
inference,
datasets,
imputation model
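For illustration, here is a minimal sketch of KeyBERT-style keyword extraction. The model choice, n-gram range, and input file are assumptions for the example, not the exact pipeline used to generate the keywords above.

```python
# Minimal KeyBERT sketch: extract keyphrases from a working paper's text.
# The input file and parameter choices below are illustrative assumptions.
from keybert import KeyBERT

with open("working_paper.txt") as f:  # hypothetical plain-text extract of the paper
    text = f.read()

kw_model = KeyBERT()  # loads a default sentence-transformers BERT model
keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=12,
)

# Each result is a (phrase, similarity score) pair, highest scoring first.
for phrase, score in keywords:
    print(f"{phrase}\t{score:.3f}")
```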
Tags
Tags are automatically generated using a pretrained language model from spaCy, which performs several tasks, including named-entity recognition.
The model labels words and phrases with entity types, including "organizations." By filtering for frequently occurring phrases labeled as organizations, papers are
identified as containing references to specific institutions, datasets, and other organizations. A minimal tagging sketch appears after the list of organizations below.
The organizations identified in this paper are:
National Science Foundation,
Statistics Canada,
Economic Census,
Research Data Center,
Duke University
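For illustration, here is a minimal sketch of the spaCy-based organization tagging described above. The pipeline name, input file, and frequency threshold are assumptions for the example, not the exact settings used.

```python
# Minimal spaCy sketch: collect frequently mentioned organization entities as tags.
# The model name, input file, and threshold are illustrative assumptions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with an NER component

with open("working_paper.txt") as f:  # hypothetical plain-text extract of the paper
    doc = nlp(f.read())

# Count spans the named-entity recognizer labels as organizations ("ORG").
org_counts = Counter(ent.text for ent in doc.ents if ent.label_ == "ORG")

# Keep only organizations mentioned more than once as tags for the paper.
tags = [org for org, n in org_counts.most_common() if n >= 2]
print(tags)
```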
Similar Working Papers
Similarity between working papers is determined by an unsupervised neural network model known as Doc2Vec.
Doc2Vec represents entire documents as fixed-length vectors, capturing semantic meaning from the context of words within the document. The model learns to
associate a unique vector with each document while simultaneously learning word vectors, enabling tasks such as
document classification, clustering, and similarity detection while partially capturing local word order. The
document vectors are compared using cosine similarity to determine the most similar working papers.
Papers identified with 🔥 are in the top 20% of similarity.
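For illustration, here is a minimal sketch of Doc2Vec training and cosine-similarity ranking using gensim. The toy two-paper corpus, vector size, and training epochs are assumptions for the example; in practice the model would be trained on the full collection of working papers.

```python
# Minimal gensim Doc2Vec sketch: embed each paper and rank the most similar ones.
# The two-paper corpus and hyperparameters below are illustrative assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

papers = {
    "CES-15-44": "simultaneous edit-imputation for continuous microdata ...",
    "CES-16-50R": "measuring cross-country differences in misallocation ...",
}

# One TaggedDocument per paper, tagged with its working paper number.
corpus = [TaggedDocument(simple_preprocess(text), [paper_id])
          for paper_id, text in papers.items()]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Rank papers by cosine similarity to the target paper's document vector.
print(model.dv.most_similar("CES-15-44", topn=10))
```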
The 10 most similar working papers to the working paper 'Simultaneous Edit-Imputation for Continuous Microdata' are listed below in order of similarity.
-
Working Paper: Measuring Cross-Country Differences in Misallocation 🔥
January 2016
Working Paper Number:
CES-16-50R
We describe differences between the commonly used version of the U.S. Census of Manufactures available at the RDCs and what establishments themselves report. The originally reported data has substantially more dispersion in measured establishment productivity. Measured allocative efficiency is substantially higher in the cleaned data than the raw data: 4x higher in 2002, 20x in 2007, and 80x in 2012. Many of the important editing strategies at the Census, including industry analysts' manual edits and edits using tax records, are infeasible in non-U.S. datasets. We describe a new Bayesian approach for editing and imputation that can be used across contexts.
-
Working Paper: Using Imputation Techniques to Evaluate Stopping Rules in Adaptive Survey Design
October 2014
Working Paper Number:
CES-14-40
Adaptive Design methods for social surveys utilize the information from the data as it is collected to make decisions about the sampling design. In some cases, the decision is either to continue or stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufacturers.
-
Working Paper: Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data
March 2019
Working Paper Number:
CES-19-08
This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents' misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
-
Working Paper: Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning
November 2021
Working Paper Number:
CES-21-35
This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
-
Working Paper: Why the Economics Profession Must Actively Participate in the Privacy Protection Debate
March 2019
Working Paper Number:
CES-19-09
When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
-
Working Paper: Plant-Level Productivity and Imputation of Missing Data in the Census of Manufactures
January 2011
Working Paper Number:
CES-11-02
In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well known that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
-
Working Paper: Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
-
Working Paper: Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database
February 2011
Working Paper Number:
CES-11-04
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.
-
Working Paper: The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers
October 2002
Working Paper Number:
tp-2002-17
In this paper, we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators (QWI) before and after correcting for such errors in SSN-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from 0.25 percent up to 15 percent for flow statistics, and up to 5 percent for payroll aggregates.
-
Working Paper: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
January 2017
Working Paper Number:
CES-17-37
We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner's problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.