Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Abstract
Document Tags and Keywords
Keywords
Keywords are automatically generated using KeyBERT, a keyword extraction tool that uses BERT embeddings to produce high-quality, contextually relevant keywords.
By analyzing the content of working papers, KeyBERT identifies terms and phrases that capture the essence of the text, highlighting its most significant topics and trends. This approach not only enhances searchability but also surfaces connections that go beyond potentially domain-specific author-defined keywords (a minimal code sketch of this extraction step follows the list below):
estimating,
data,
statistical,
microdata,
model,
imputation,
reporting,
unobserved,
budget,
inference,
datasets,
imputation model
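The extraction step described above can be illustrated with a short, hedged sketch using the keybert Python package. The sample text, n-gram range, and top_n below are illustrative assumptions, not the exact configuration used to generate this page.
```python
# Minimal sketch of BERT-based keyword extraction with KeyBERT.
# The sample text, n-gram range, and top_n are illustrative assumptions.
from keybert import KeyBERT

sample_text = (
    "We describe a Bayesian approach for simultaneous editing and "
    "imputation of continuous microdata subject to reporting errors."
)

kw_model = KeyBERT()  # loads a default sentence-transformers embedding model
keywords = kw_model.extract_keywords(
    sample_text,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=10,
)
for phrase, score in keywords:
    print(f"{phrase}\t{score:.3f}")
```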
Tags
Tags are automatically generated using a pretrained language model from spaCy, which performs several tasks, including named entity recognition.
The model labels words and phrases by entity type, including "organizations." By filtering for frequently occurring words and phrases labeled as "organizations," papers are identified that reference specific institutions, datasets, and other organizations (a minimal code sketch of this tagging step follows the list below):
National Science Foundation,
Statistics Canada,
Economic Census,
Research Data Center,
Duke University
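As a rough illustration of the tagging step described above, the sketch below uses spaCy's pretrained English pipeline to collect "ORG" entities from a text. The pipeline name, sample text, and frequency cutoff are assumptions for illustration only.
```python
# Minimal sketch of organization tagging with spaCy's pretrained NER.
# The pipeline name, sample text, and frequency cutoff are assumptions.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with NER

sample_text = (
    "This research was supported by the National Science Foundation and uses "
    "Economic Census microdata accessed through a Research Data Center."
)

doc = nlp(sample_text)
org_counts = Counter(ent.text for ent in doc.ents if ent.label_ == "ORG")

# Keep organizations that appear at least a minimum number of times.
tags = [org for org, count in org_counts.most_common() if count >= 1]
print(tags)
```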
Similar Working Papers
Similarity between working papers is determined by an unsupervised neural network model known as Doc2Vec.
Doc2Vec represents entire documents as fixed-length vectors, capturing semantic meaning in a way that reflects the context of words within each document. The model learns a unique vector for each document while simultaneously learning word vectors, enabling tasks such as document classification, clustering, and similarity detection that take local word context into account. The document vectors are compared using cosine similarity to determine the most similar working papers, as sketched below.
Papers identified with 🔥 are in the top 20% of similarity.
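The similarity ranking can be sketched with gensim's Doc2Vec implementation. The toy corpus, hyperparameters, and paper identifiers below are illustrative assumptions rather than the configuration actually used for this page.
```python
# Minimal sketch of Doc2Vec training and cosine-similarity ranking with gensim.
# The toy corpus, hyperparameters, and paper IDs are illustrative assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = {
    "CES-15-44": "simultaneous edit imputation for continuous microdata ...",
    "CES-16-50R": "measuring cross country differences in misallocation ...",
    "CES-11-02": "plant level productivity and imputation of missing data ...",
}

# Each working paper becomes a TaggedDocument keyed by its paper number.
documents = [
    TaggedDocument(words=text.split(), tags=[paper_id])
    for paper_id, text in corpus.items()
]

model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

# Rank the other papers by cosine similarity to the target paper's vector.
for paper_id, similarity in model.dv.most_similar("CES-15-44", topn=10):
    print(paper_id, round(similarity, 3))
```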
The 10 most similar working papers to the working paper 'Simultaneous Edit-Imputation for Continuous Microdata' are listed below in order of similarity.
-
Working Paper: Measuring Cross-Country Differences in Misallocation 🔥
January 2016
Working Paper Number:
CES-16-50R
We describe differences between the commonly used version of the U.S. Census of Manufactures available at the RDCs and what establishments themselves report. The originally reported data has substantially more dispersion in measured establishment productivity. Measured allocative efficiency is substantially higher in the cleaned data than the raw data: 4x higher in 2002, 20x in 2007, and 80x in 2012. Many of the important editing strategies at the Census, including industry analysts' manual edits and edits using tax records, are infeasible in non-U.S. datasets. We describe a new Bayesian approach for editing and imputation that can be used across contexts.
-
Working Paper: Using Imputation Techniques to Evaluate Stopping Rules in Adaptive Survey Design
October 2014
Working Paper Number:
CES-14-40
Adaptive Design methods for social surveys utilize the information from the data as it is collected to make decisions about the sampling design. In some cases, the decision is either to continue or stop the data collection. We evaluate this decision by proposing measures to compare the collected data with follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios, including Missing Not at Random. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
-
Working Paper: Why the Economics Profession Must Actively Participate in the Privacy Protection Debate
March 2019
Working Paper Number:
CES-19-09
When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
-
Working Paper: Plant-Level Productivity and Imputation of Missing Data in the Census of Manufactures
January 2011
Working Paper Number:
CES-11-02
In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well known that imputations based on these methods can result in underestimation of variability and potential bias in multivariate inferences. We show that this appears to be the case for the existing imputations in the Census of Manufactures. We then present an alternative strategy for handling the missing data based on multiple imputation. Specifically, we impute missing values via sequences of classification and regression trees, which offer a computationally straightforward and flexible approach for semi-automatic, large-scale multiple imputation. We also present an approach to evaluating these imputations based on posterior predictive checks. We use the multiple imputations, and the imputations currently employed by the Census Bureau, to estimate production function parameters and productivity dispersions. The results suggest that the two approaches provide quite different answers about productivity.
-
Working Paper: Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning
November 2021
Working Paper Number:
CES-21-35
This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents' workplace characteristics.
-
Working Paper: Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data
March 2019
Working Paper Number:
CES-19-08
This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents' misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
-
Working Paper: A First Step Towards a German SynLBD: Constructing a German Longitudinal Business Database
February 2014
Working Paper Number:
CES-14-13
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
-
Working Paper: Distribution Preserving Statistical Disclosure Limitation
September 2006
Working Paper Number:
tp-2006-04
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and application to a large linked employer-employee database.
-
Working Paper: Labor Market Segmentation and the Distribution of Income: New Evidence from Internal Census Bureau Data
August 2023
Working Paper Number:
CES-23-41
In this paper, we present new findings that validate earlier literature on the apparent segmentation of the US earnings distribution. Previous contributions posited that the observed distribution of earnings combined two or three distinct signals and was thus appropriately modeled as a finite mixture of distributions. Furthermore, each component in the mixture appeared to have distinct distributional features hinting at qualitatively distinct generating mechanisms behind each component, providing strong evidence for some form of labor market segmentation. This paper presents new findings that support these earlier conclusions using internal CPS ASEC data spanning a much longer study period from 1974 to 2016. The restricted-access internal data is not subject to the same level of top-coding as the public-use data that earlier contributions to the literature were based on. The evolution of the mixture components provides new insights about changes in the earnings distribution including earnings inequality. In addition, we correlate component membership with worker type to provide a tacit link to various theoretical explanations for labor market segmentation, while solving the problem of assigning observations to labor market segments a priori.
-
Working Paper: Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
January 2017
Working Paper Number:
CES-17-37
We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner's problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.