-
Why the Economics Profession Must Actively Participate in the Privacy Protection Debate
March 2019
Working Paper Number:
CES-19-09
When Google or the U.S. Census Bureau publishes detailed statistics on browsing habits or neighborhood characteristics, everybody loses some privacy in exchange for the public information supplied. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, which is a social choice problem. Second, economists must ensure new privacy models preserve the validity of public data for economic research.
-
Disclosure Avoidance Techniques Used for the 1970 through 2010 Decennial Censuses of Population and Housing
November 2018
Working Paper Number:
CES-18-47
The U.S. Census Bureau conducts the decennial censuses under Title 13 of the U.S. Code with the Section 9 mandate to not 'use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports' (13 U.S.C. § 9 (2007)). The Census Bureau applies disclosure avoidance techniques to its publicly released statistical products in order to protect the confidentiality of its respondents and their data.
-
Squeezing More Out of Your Data: Business Record Linkage with Python
November 2018
Working Paper Number:
CES-18-46
Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records, using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
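As a rough illustration of combining several string comparators for fuzzy business-name matching, the sketch below scores candidate names with two comparators from the Python standard library and averages the scores. It is not MAMBA's code or API: the function names, the choice of comparators, and the example business names are all hypothetical, and the simple average only stands in for the machine learning step that the abstract describes.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def comparison_features(a: str, b: str) -> list[float]:
    """One score per comparator; a trained classifier could consume this vector."""
    return [char_similarity(a, b), token_jaccard(a, b)]


def best_match(name: str, candidates: list[str]) -> tuple[float, str]:
    """Return (score, candidate) for the candidate with the highest average score.

    Averaging is an assumption made for brevity, not a recommended decision rule.
    """
    scored = []
    for candidate in candidates:
        features = comparison_features(name, candidate)
        scored.append((sum(features) / len(features), candidate))
    return max(scored)


# Hypothetical example names, not drawn from any Census file.
score, match = best_match(
    "Acme Widget Co", ["Acme Widgets Company", "Apex Wiring Inc"]
)
print(match, round(score, 3))
```

Keeping each comparator's score as a separate feature, rather than collapsing to one number early, is what would let a learned model weigh the comparators differently.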
-
Investigating the Use of Administrative Records in the Consumer Expenditure Survey
March 2018
Working Paper Number:
carra-2018-01
In this paper, we investigate the potential of applying administrative records income data to the Consumer Expenditure (CE) survey to inform measurement error properties of CE estimates, supplement respondent-collected data, and estimate the representativeness of the CE survey by income level. We match individual responses to Consumer Expenditure Quarterly Interview Survey data collected from July 2013 through December 2014 to IRS administrative data in order to analyze CE questions on wages, Social Security payroll deductions, self-employment income receipt, and retirement income. We find that while wage amounts are largely in alignment between the CE and administrative records in the middle of the wage distribution, there is evidence that wages are over-reported to the CE at the bottom of the wage distribution and under-reported at the top of the wage distribution. We find mixed evidence for alignment between the CE and administrative records on questions covering payroll deductions and self-employment income receipt, but find substantial divergence between CE responses and administrative records when examining retirement income. In addition to the analysis using person-based linkages, we also match responding and non-responding CE sample units to the universe of IRS 1040 tax returns by address to examine non-response bias. We find that non-responding households are substantially richer than responding households, and that very-high-income households are less likely to respond to the CE.
-
Disclosure Limitation and Confidentiality Protection in Linked Data
January 2018
Working Paper Number:
CES-18-07
Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion, and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.
-
The Need to Account for Complex Sampling Features when Analyzing Establishment Survey Data: An Illustration using the 2013 Business Research and Development and Innovation Survey (BRDIS)
January 2017
Working Paper Number:
CES-17-62
The importance of correctly accounting for complex sampling features when generating finite population inferences based on complex sample survey data sets has now been clearly established in a variety of fields, including those in both statistical and nonstatistical domains. Unfortunately, recent studies of analytic error have suggested that many secondary analysts of survey data do not ultimately account for these sampling features when analyzing their data, for a variety of possible reasons (e.g., poor documentation, or a data producer may not provide the information in a public-use data set). The research in this area has focused exclusively on analyses of household survey data and individual respondents. No research to date has considered how analysts are approaching the data collected in establishment surveys, and whether published articles advancing science based on analyses of establishment behaviors and outcomes are correctly accounting for complex sampling features. This article presents alternative analyses of real data from the 2013 Business Research and Development and Innovation Survey (BRDIS), and shows that a failure to account for the complex design features of the sample underlying these data can lead to substantial differences in inferences about the target population of establishments for the BRDIS.
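To make the point about ignoring design features concrete, the sketch below compares an unweighted sample mean with a weighted mean that uses sampling weights. The values and weights are invented for illustration and have nothing to do with the BRDIS data or its actual design. The sketch covers only weights; a full design-based analysis would also use stratum and cluster information when estimating variances.

```python
def unweighted_mean(values: list[float]) -> float:
    """Naive sample mean that treats every sampled unit as equally representative."""
    return sum(values) / len(values)


def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Weighted estimate of the population mean.

    Each weight is the number of population units the sampled unit represents
    (roughly the inverse of its selection probability).
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: one large establishment sampled with certainty
# (weight 1) and two small establishments that each represent 50 others.
rd_spending = [100.0, 10.0, 10.0]
sampling_weights = [1.0, 50.0, 50.0]

print("unweighted:", unweighted_mean(rd_spending))
print("weighted:  ", weighted_mean(rd_spending, sampling_weights))
```

Because large units are often sampled at higher rates in establishment surveys, the unweighted mean in this toy sample sits far above the weighted estimate.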
-
Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?
January 2017
Authors:
Lars Vilhuber,
John M. Abowd,
Daniel Weinberg,
Jerome P. Reiter,
Matthew D. Shapiro,
Robert F. Belli,
Noel Cressie,
David C. Folch,
Scott H. Holan,
Margaret C. Levenstein,
Kristen M. Olson,
Jolene Smyth,
Leen-Kiat Soh,
Bruce D. Spencer,
Seth E. Spielman,
Christopher K. Wikle
Working Paper Number:
CES-17-59R
The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN's research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.
-
The Potential for Using Combined Survey and Administrative Data Sources to Study Internal Labor Migration
January 2017
Working Paper Number:
CES-17-55
This paper introduces a novel data set combining survey data from the American Community Survey (ACS) with administrative data on employment from the Longitudinal Employer-Household Dynamics program, in order to study geographic labor mobility. With its rich set of information about individuals at the time of the migration decision, large sample size, and near-comprehensive ability to detect labor mobility, the new combined ACS-LEHD data offers several advantages over the existing data sets that are typically used in the study of migration, such as the Decennial Census, Current Population Survey, and Internal Revenue Service data. An overview of how these different data sets can be employed, and examples demonstrating the usefulness of the newly proposed data set, are provided.
Aggregate statistics and stylized facts are generated from the ACS-LEHD data, which reveal many of the same features as the existing data sets, including the decline of aggregate mobility throughout the past decade, as well as many of the known demographic differences in migration propensity.
-
A Comparison of Training Modules for Administrative Records Use in Nonresponse Followup Operations: The 2010 Census and the American Community Survey
January 2017
Working Paper Number:
CES-17-47
While modeling work in preparation for the 2020 Census has shown that administrative records can be predictive of Nonresponse Followup (NRFU) enumeration outcomes, there is scope to examine the robustness of the models by using more recent training data. The models deployed for workload removal from the 2015 and 2016 Census Tests were based on associations of the 2010 Census with administrative records. Training the same models with more recent data from the American Community Survey (ACS) can identify any changes in parameter associations over time that might reduce the accuracy of model predictions. Furthermore, more recent training data would allow for the incorporation of new administrative record sources not available in 2010. However, differences in ACS methodology and the smaller sample size may limit its applicability. This paper replicates earlier results and examines model predictions based on the ACS in comparison with NRFU outcomes. The evaluation consists of a comparison of predicted counts and household compositions with actual 2015 NRFU outcomes. The main findings are an overall validation of the methodology using independent data.
-
File Matching with Faulty Continuous Matching Variables
January 2017
Working Paper Number:
CES-17-45
We present LFCMV, a Bayesian file linking methodology designed to link records using continuous matching variables in situations where we do not expect values of these matching variables to agree exactly across matched pairs. The method involves a linking model for the distance between the matching variables of records in one file and the matching variables of their linked records in the second. This linking model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between matching variables in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We describe the Gibbs sampling algorithm for sampling from the posterior distribution of this linkage model and use artificial data to illustrate model performance. We also introduce a linking application using public survey information and data from the U.S. Census of Manufactures and use LFCMV to link the records.