CREAT: Census Research Exploration and Analysis Tool

The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers

October 2002

Working Paper Number: tp-2002-17

Abstract

In this paper, we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators (QWI) before and after correcting for such errors in SSN-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from 0.25 percent up to 15 percent for flow statistics, and up to 5 percent for payroll aggregates.

Document Tags and Keywords

Keywords

Keywords are automatically generated using KeyBERT, a keyword extraction tool that uses BERT embeddings to select contextually relevant keywords.

By analyzing the content of working papers, KeyBERT identifies terms and phrases that capture the essence of the text, highlighting the most significant topics and trends. This approach not only enhances searchability but also surfaces connections that go beyond potentially domain-specific author-defined keywords.
payroll, statistical, report, quarterly, survey, agency, earnings, employee, employ, employed, statistician, tenure, workforce, hire, matching, labor statistics, employment count, clerical, employment statistics, wage data, ssa, employee data, workforce indicators

Tags

Tags are automatically generated using a pretrained language model from spaCy, which handles several tasks, including entity tagging.

The model labels words and phrases by part of speech and by entity type, including "organizations." Frequent words and phrases labeled as "organizations" are kept as tags, which identifies the specific institutions, datasets, and other organizations that a paper references.
Bureau of Labor Statistics, Social Security Administration, National Science Foundation, Bureau of Labor, Employer Identification Numbers, Cornell University, Social Security, Economic Census, Social Security Number, National Institute on Aging, Alfred P Sloan Foundation, Longitudinal Employer Household Dynamics, Cornell Institute for Social and Economic Research, LEHD Program, Quarterly Workforce Indicators
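
The filtering step described for the tags can be illustrated with a minimal sketch. It assumes the entity labeling has already been done and that each entity arrives as a `(text, label)` pair using spaCy's `ORG` label for organizations. The function name `frequent_organizations` and the `min_count` threshold are hypothetical and not taken from the site's actual pipeline.

```python
from collections import Counter


def frequent_organizations(entities, min_count=2):
    """Return organization names seen at least `min_count` times.

    `entities` is an iterable of (text, label) pairs, as an entity
    tagger might produce. The threshold is an illustrative assumption.
    """
    # Count only the entities labeled as organizations.
    counts = Counter(text for text, label in entities if label == "ORG")
    # most_common() orders by descending count, so frequent names come first.
    return [name for name, n in counts.most_common() if n >= min_count]


# Hypothetical labeled entities from one paper.
sample = [
    ("Cornell University", "ORG"),
    ("Census Bureau", "ORG"),
    ("Cornell University", "ORG"),
    ("October", "DATE"),
    ("Census Bureau", "ORG"),
    ("Census Bureau", "ORG"),
    ("LEHD Program", "ORG"),
]
tags = frequent_organizations(sample)
```

Here `LEHD Program` is dropped because it appears only once, and `October` is dropped because it is not labeled as an organization.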

Similar Working Papers

Similarity between working papers is determined by an unsupervised neural network model known as Doc2Vec.

Doc2Vec is a model that represents entire documents as fixed-length vectors, allowing for the capture of semantic meaning in a way that relates to the context of words within the document. The model learns to associate a unique vector with each document while simultaneously learning word vectors, enabling tasks such as document classification, clustering, and similarity detection by preserving the order and structure of words. The document vectors are compared using cosine similarity/distance to determine the most similar working papers. Papers identified with 🔥 are in the top 20% of similarity.
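
The comparison step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the site's implementation: the document vectors are taken as given (for example, from a trained Doc2Vec model), the names `cosine_similarity` and `rank_similar` are hypothetical, and "top 20% of similarity" is interpreted here as the top 20% of all candidate papers ranked by score.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_similar(target, candidates, top_n=10, hot_fraction=0.2):
    """Rank candidate papers by cosine similarity to `target`.

    `candidates` maps a paper name to its document vector.
    Returns up to `top_n` tuples of (name, score, is_hot), where is_hot
    marks papers in the top `hot_fraction` of all candidates (assumed
    reading of the "top 20%" rule).
    """
    scored = sorted(
        ((cosine_similarity(target, vec), name) for name, vec in candidates.items()),
        reverse=True,
    )
    hot_count = math.ceil(len(scored) * hot_fraction)
    return [
        (name, score, rank < hot_count)
        for rank, (score, name) in enumerate(scored[:top_n])
    ]


# Toy two-dimensional vectors; real Doc2Vec vectors have many more dimensions.
papers = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
    "e": [2.0, 0.1],
}
result = rank_similar([1.0, 0.0], papers, top_n=3)
```

Cosine similarity ignores vector length, so paper `e` ranks just behind `a` even though its vector is about twice as long.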

The 10 most similar working papers to the working paper 'The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers' are listed below in order of similarity.