Public-Use vs. Restricted-Use: An Analysis Using the American Community Survey
January 2017
Working Paper Number:
CES-17-12
Abstract
Document Tags and Keywords
Keywords
Keywords are automatically generated using KeyBERT, a keyword extraction tool that uses BERT embeddings to
produce contextually relevant keywords.
By analyzing the content of working papers, KeyBERT identifies terms and phrases that capture the essence of the
text, highlighting the most significant topics and trends. This approach not only enhances searchability but
also provides connections that go beyond potentially domain-specific author-defined keywords.
data,
work census,
statistical,
microdata,
analyst,
survey,
agency,
respondent,
confidentiality,
information,
employ,
workforce,
salary,
federal,
labor statistics,
census bureau,
research census,
use census,
datasets,
statistical disclosure,
public,
publicly
Tags
Tags are automatically generated using a pretrained language model from spaCy, which supports
several tasks, including entity tagging.
The model labels words and phrases by entity type,
including "organizations." By filtering for frequent words and phrases labeled as "organizations", the tags
identify the specific institutions, datasets, and other organizations that each paper references.
Standard Industrial Classification,
Bureau of Labor Statistics,
North American Industry Classification System,
American Community Survey,
National Center for Health Statistics,
Public Use Micro Sample,
Occupational Employment Statistics,
Special Sworn Status,
Integrated Public Use Microdata Series,
North American Industry Classi,
Personally Identifiable Information,
Federal Statistical Research Data Center
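The frequency filter described for the tags can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: it assumes the entity tagger has already produced (text, label) pairs, uses the spaCy label "ORG" for organizations, and applies a hypothetical cutoff `min_count`, since the page does not state what counts as "frequent". The sample entities are invented for illustration.

```python
from collections import Counter


def frequent_org_tags(entities, min_count=2):
    """Return ORG-labeled phrases that occur at least min_count times.

    entities: iterable of (text, label) pairs, as an entity tagger might emit.
    min_count: hypothetical frequency cutoff (an assumption, not from the source).
    """
    counts = Counter(text for text, label in entities if label == "ORG")
    # most_common() orders by count, highest first
    return [text for text, n in counts.most_common() if n >= min_count]


# Invented entity output for a single paper (not real model output)
sample_entities = [
    ("Bureau of Labor Statistics", "ORG"),
    ("American Community Survey", "ORG"),
    ("January 2017", "DATE"),
    ("American Community Survey", "ORG"),
    ("Bureau of Labor Statistics", "ORG"),
    ("American Community Survey", "ORG"),
    ("Duke University", "ORG"),
]

tags = frequent_org_tags(sample_entities, min_count=2)
print(tags)
```

Phrases that appear only once, or that carry a non-organization label such as a date, are dropped from the tag list.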
Similar Working Papers
Similarity between working papers is determined by an unsupervised neural
network model
known as Doc2Vec.
Doc2Vec is a model that represents entire documents as fixed-length vectors, allowing for the
capture of semantic meaning in a way that relates to the context of words within the document. The model learns to
associate a unique vector with each document while simultaneously learning word vectors, enabling tasks such as
document classification, clustering, and similarity detection by preserving the order and structure of words. The
document vectors are compared using cosine similarity/distance to determine the most similar working papers.
Papers identified with 🔥 are in the top 20% of similarity.
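The comparison step can be sketched in a few lines. This is a hedged illustration only: the two-dimensional vectors stand in for learned Doc2Vec document vectors, the paper names are hypothetical, and the "top 20%" flag is assumed to mean the top fifth of all scored candidates, which the page does not spell out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(target, candidates, top_n=10):
    """Return the top_n (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def flag_top_fraction(scored, fraction=0.2):
    """Names whose score falls in the top `fraction` of all scored papers (assumed rule)."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(len(ordered) * fraction))
    return {name for name, _ in ordered[:keep]}


# Toy document vectors; real Doc2Vec vectors have many more dimensions
target = [1.0, 0.0]
candidates = {
    "paper_a": [1.0, 0.0],
    "paper_b": [1.0, 1.0],
    "paper_c": [0.0, 1.0],
    "paper_d": [-1.0, 0.0],
    "paper_e": [3.0, 4.0],
}

ranked = rank_similar(target, candidates, top_n=3)
flagged = flag_top_fraction(rank_similar(target, candidates, top_n=len(candidates)))
print(ranked, flagged)
```

Cosine similarity ignores vector length, so `paper_e` is scored only on its direction relative to the target, not on its larger magnitude.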
The 10 most similar working papers to the working paper 'Public-Use vs. Restricted-Use: An Analysis Using the American Community Survey' are listed below in order of similarity.
-
Working Paper: Estimating the Local Productivity Spillovers from Science
January 2017
Working Paper Number:
CES-17-56
We estimate the local productivity spillovers from science by relating wages and real estate prices across metros to measures of scientific activity in those metros. We address three fundamental challenges: (1) factor input adjustments using wages and real estate prices, along with Shephard's Lemma, to estimate changes in metros' productivity, which must equal changes in unit production cost; (2) unobserved differences in metros/causality using a share shift index that exploits historic variation in the mix of research in metros interacted with trends in federal funding for specific fields as an instrument; (3) unobserved differences in workers using data on the states in which people are born. Our estimates show a strong positive relationship between wages and scientific research and a weak positive relationship for real estate prices. Overall, we estimate a high rate of return to research.
-
Working Paper: LOOKING BACK ON THREE YEARS OF USING THE SYNTHETIC LBD BETA
February 2014
Working Paper Number:
CES-14-11
Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
-
Working Paper: Total Error and Variability Measures with Integrated Disclosure Limitation for Quarterly Workforce Indicators and LEHD Origin Destination Employment Statistics in On The Map
January 2017
Working Paper Number:
CES-17-71
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
-
Working Paper: LEHD Data Documentation LEHD-OVERVIEW-S2008-rev1
December 2011
Working Paper Number:
CES-11-43
-
Working Paper: Who Values Human Capitalists' Human Capital? Healthcare Spending and Physician Earnings
July 2020
Working Paper Number:
CES-20-23
Is government guiding the invisible hand at the top of the labor market? We study this question among physicians, the most common occupation among the top one percent of income earners, and whose billings comprise one-fifth of healthcare spending. We use a novel linkage of population-wide tax records with the administrative registry of all physicians in the U.S. to study the characteristics of these high earnings, and the influence of government payments in particular. We find a major role for government on the margin, with half of direct changes to government reimbursement rates flowing directly into physicians' incomes. These policies move physicians' relative and absolute incomes more than any reasonable changes to marginal tax rates. At the same time, the overall level of physician earnings can largely be explained by labor market fundamentals of long work and training hours. Competing occupations also pay well and provide a natural lower bound for physician earnings. We conclude that government plays a major role in determining the value of physicians' human capital, but it is unrealistic to use this power to reduce healthcare spending substantially.
-
Working Paper: What Caused Racial Disparities in Particulate Exposure to Fall? New Evidence from the Clean Air Act and Satellite-Based Measures of Air Quality
January 2020
Working Paper Number:
CES-20-02
Racial differences in exposure to ambient air pollution have declined significantly in the United States over the past 20 years. This project links restricted-access Census Bureau microdata to newly available, spatially continuous high resolution measures of ambient particulate pollution (PM2.5) to examine the underlying causes and consequences of differences in black-white pollution exposures. We begin by decomposing differences in pollution exposure into components explained by observable population characteristics (e.g., income) versus those that remain unexplained. We then use quantile regression methods to show that a significant portion of the 'unexplained' convergence in black-white pollution exposure can be attributed to differential impacts of the Clean Air Act (CAA) in non-Hispanic African American and non-Hispanic white communities. Areas with larger black populations saw greater CAA-related declines in PM2.5 exposure. We show that the CAA has been the single largest contributor to racial convergence in PM2.5 pollution exposure in the U.S. since 2000, accounting for over 60 percent of the reduction.
-
Working Paper: SYNTHETIC DATA FOR SMALL AREA ESTIMATION IN THE AMERICAN COMMUNITY SURVEY
April 2013
Working Paper Number:
CES-13-19
Small area estimates provide a critical source of information used to study local populations. Statistical agencies regularly collect data from small areas but are prevented from releasing detailed geographical identifiers in public-use data sets due to disclosure concerns. Alternative data dissemination methods used in practice include releasing summary/aggregate tables, suppressing detailed geographic information in public-use data sets, and accessing restricted data via Research Data Centers. This research examines an alternative method for disseminating microdata that contains more geographical details than are currently being released in public-use data files. Specifically, the method replaces the observed survey values with imputed, or synthetic, values simulated from a hierarchical Bayesian model. Confidentiality protection is enhanced because no actual values are released. The method is demonstrated using restricted data from the 2005-2009 American Community Survey. The analytic validity of the synthetic data is assessed by comparing small area estimates obtained from the synthetic data with those obtained from the observed data.
-
Working Paper: LEHD Infrastructure Files in the Census RDC: Overview of S2004 Snapshot
April 2011
Working Paper Number:
CES-11-13
The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, has built a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2004 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau's Research Data Center network.
-
Working Paper: Measuring Labor Earnings Inequality Using Public-Use March Current Population Survey Data: The Value of Including Variances and Cell Means When Imputing Topcoded Values
November 2008
Working Paper Number:
CES-08-38
Using the Census Bureau's internal March Current Population Surveys (CPS) file, we construct and make available variances and cell means for all topcoded income values in the public-use version of these data. We then provide a procedure that allows researchers with access only to the public-use March CPS data to take advantage of this added information when imputing its topcoded income values. As an example of its value we show how our new procedure improves on existing imputation methods in the labor earnings inequality literature.
-
Working Paper: County-Level Estimates of the Employment Prospects of Low-Skill Workers
July 2000
Working Paper Number:
CES-00-11
This study examines low-skill wage and employment opportunities for men and women at the county level over the period 1989-96. Currently, reliable direct measures of wages and employment rates for different demographic and skill groups are only available for large geographic areas such as regions and populous states or at infrequent intervals (e.g., from the Decennial Census) for some smaller areas. This study constructs indirect annual measures for all counties from 1989-96 by combining skill-specific information on earnings and employment from the Sample Edited Detail File (SEDF) of the 1990 Decennial Census and the 1990-97 Annual Demographic files of the Current Population Survey (CPS) with annual industry-specific information from the Regional Economic Information System (REIS). Special versions of the SEDF and CPS files that identify county of residence are used. The study regresses the low-skill wage and employment data from the SEDF and CPS files on a set of personal variables from the combined files and local employment measures derived from the REIS. The wage regressions are corrected for selectivity from the employment decision and account for county-specific effects as well as general time effects. Estimates from the regressions are then combined with the available employment data from the REIS to impute wage and employment rates for low-skill adults across counties.