CREAT: Census Research Exploration and Analysis Tool

Disclosure Limitation and Confidentiality Protection in Linked Data

January 2018

Working Paper Number: CES-18-07

Abstract

Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.

Document Tags and Keywords

Keywords

Keywords are automatically generated using KeyBERT, a keyword extraction tool that uses BERT embeddings to select contextually relevant keywords. By analyzing the content of working papers, KeyBERT identifies terms and phrases that capture the main topics of the text. This approach improves searchability and surfaces connections beyond author-defined keywords, which can be domain-specific.

statistical, data, microdata, database, census data, agency, respondent, survey, disclosure, confidentiality, information, statistician, privacy, record, employment statistics, irs, ssa, filing, employee data, datasets, statistical disclosure, linked census

Tags

Tags are automatically generated using a pretrained language model from spaCy, which supports several tasks, including entity tagging. The model labels words and phrases by entity type, including "organizations." Filtering for frequent words and phrases labeled as "organizations" identifies the specific institutions, datasets, and other organizations that a paper references.

Bureau of Labor Statistics, National Science Foundation, Standard Industrial Classification, American Economic Association, Internal Revenue Service, Social Security Administration, Service Annual Survey, American Statistical Association, Center for Economic Studies, Stern School of Business, County Business Patterns, Federal Reserve Bank, Statistics Canada, National Longitudinal Survey of Youth, Department of Economics, Chicago Census Research Data Center, Survey of Income and Program Participation, Cornell University, Social Security, Unemployment Insurance, Research Data Center, North American Industry Classification System, American Community Survey, Social Security Number, Health and Retirement Study, Longitudinal Employer Household Dynamics, Alfred P Sloan Foundation, National Institute on Aging, Detailed Earnings Records, Summary Earnings Records, Federal Insurance Contributions Act, Sloan Foundation, National Center for Health Statistics, Quarterly Workforce Indicators, European Union, National Institutes of Health, Quarterly Census of Employment and Wages, University of Michigan, Census Bureau Disclosure Review Board, Commodity Flow Survey, United Nations, Disclosure Review Board, Business Dynamics Statistics, Federal Statistical Research Data Center, Centers for Medicare, Medicaid Services
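The frequency filter described for tags can be illustrated with a minimal sketch. It assumes the entity tagger has already produced (text, label) pairs and that organizations carry the label "ORG", as in spaCy's pretrained English pipelines. The function name `frequent_organizations`, the `min_count` threshold, and the sample data are hypothetical; the page does not state the actual cutoff used.

```python
from collections import Counter

def frequent_organizations(entities, min_count=2):
    """Return organization names seen at least min_count times, most frequent first.

    entities: iterable of (text, label) pairs, e.g. from an entity tagger.
    min_count: hypothetical frequency threshold (not given by the source).
    """
    counts = Counter(text for text, label in entities if label == "ORG")
    return [name for name, n in counts.most_common() if n >= min_count]

# Hypothetical tagger output for one document.
tagged = [
    ("Census Bureau", "ORG"),
    ("Internal Revenue Service", "ORG"),
    ("Census Bureau", "ORG"),
    ("2018", "DATE"),
    ("Internal Revenue Service", "ORG"),
    ("Census Bureau", "ORG"),
    ("Cornell University", "ORG"),
]

tags = frequent_organizations(tagged)
print(tags)
```

Here "Cornell University" is dropped because it appears only once, and the "DATE" entity is ignored because it is not an organization.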

Similar Working Papers

Similarity between working papers is determined by an unsupervised neural network model known as Doc2Vec.

Doc2Vec is a model that represents entire documents as fixed-length vectors, allowing for the capture of semantic meaning in a way that relates to the context of words within the document. The model learns to associate a unique vector with each document while simultaneously learning word vectors, enabling tasks such as document classification, clustering, and similarity detection by preserving the order and structure of words. The document vectors are compared using cosine similarity/distance to determine the most similar working papers. Papers identified with 🔥 are in the top 20% of similarity.
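The comparison step can be sketched in a few lines. This is a minimal illustration, assuming document vectors have already been learned; the three-dimensional vectors, the paper names, and the helper `rank_similar` are hypothetical, and real Doc2Vec vectors typically have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

def rank_similar(target, candidates, top_n=10):
    """Return up to top_n (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(target, vec))
              for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy document vectors.
target_vec = [1.0, 0.0, 1.0]
paper_vecs = {
    "paper_a": [1.0, 0.0, 1.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [1.0, 1.0, 1.0],
}

ranking = rank_similar(target_vec, paper_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not their length, so two documents with similar topical content score highly even if their vectors differ in magnitude.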

The 10 most similar working papers to the working paper 'Disclosure Limitation and Confidentiality Protection in Linked Data' are listed below in order of similarity.