Official statistical agencies such as the Census Bureau and the Bureau of Labor Statistics collect enormous quantities of microdata in statistical surveys. These data are valuable for economic research and market and policy analysis. However, the data cannot be released to the public because of confidentiality commitments to individual respondents. These commitments, coupled with the strong research demand for microdata, have led the agencies to consider various proposals for releasing public use microdata. Most proposals for public use microdata call for the development of surrogate data that disguise the original data. Thus, they involve the addition of measurement errors to the data. In this paper, we examine disclosure issues and explore alternative masking methods for generating panels of useful economic microdata that can be released to researchers. While our analysis applies to all confidential microdata, applications using the Census Bureau's Longitudinal Research Data Base (LRD) are used for illustrative purposes throughout the discussion.
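As a rough illustration of what "adding measurement error" to microdata can look like, the sketch below applies independent multiplicative noise to a single variable. It is not the masking method studied in the paper, which the abstract does not specify; the function name `mask_with_noise`, the `rel_sd` parameter, and the sample values are all hypothetical.

```python
import random


def mask_with_noise(values, rel_sd=0.1, seed=None):
    """Return a copy of `values` with independent multiplicative noise.

    Each value v becomes v * (1 + e), where e is drawn from a normal
    distribution with mean 0 and standard deviation `rel_sd`.
    This is an illustrative assumption, not an agency-approved method.
    """
    rng = random.Random(seed)
    return [v * (1.0 + rng.gauss(0.0, rel_sd)) for v in values]


# Made-up plant-level values, for illustration only.
shipments = [120.0, 4500.0, 87.5, 990.0]
masked = mask_with_noise(shipments, rel_sd=0.1, seed=42)

for original, disguised in zip(shipments, masked):
    print(f"{original:10.1f} -> {disguised:10.1f}")
```

A larger `rel_sd` disguises individual records more strongly but also degrades estimates computed from the released file, which is the trade-off between disclosure protection and data usefulness that the abstract raises.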
-
The Longitudinal Research Database (LRD): Status And Research Possibilities
July 1988
Working Paper Number:
CES-88-02
This paper discusses the development and use of the Longitudinal Research Data available at the Center for Economic Studies of the Bureau of the Census in terms of what has been accomplished thus far, what projects are currently in progress, and what plans are in place for the near future. The major achievement to date is the construction of the database itself, which contains data for manufacturing establishments collected by the Census in 1963, 1967, 1972, 1977 and 1982, and the Annual Survey of Manufactures for non-Census years from 1973 to 1985. These data now reside in the Center's computer in a consistent format across all years. In addition, a large software development effort that greatly simplifies the selection of subsets of the database for specific research projects is well underway. Finally, a number of powerful microcomputers have been purchased for use by researchers for their statistical analysis. Current efforts at the Center include research on such policy-relevant issues as mergers and their impact on profits and production, high technology trade, import competition, plant level productivity, entry and exit, and productivity differences between large and small firms. Due to the confidentiality requirements of the Census data, most of this research is performed by Center staff and Special Sworn Employees. Under certain circumstances, the Center accepts user-written programs from outside researchers. These routines are executed by Center staff, and the resultant output is reviewed thoroughly for disclosure problems. The Center is also an active member of a task force working on methods to release "masked" or "cloned" microdata in public-use files that will protect the confidentiality of the data while at the same time providing a research tool for outside users. The Center research program contributes directly to future research possibilities. The current batch of research projects is adding insight into the nature of the LRD.
This information is continually being incorporated into the Center's software system, thus facilitating yet more research activity. Moreover, since a good portion of the research involves linking the Longitudinal Research Data to other data files, such as the NSF/Census R&D data, the scope of the database is continually being expanded. Furthermore, the Center is exploring the possibility of linking the demographic data collected by the Census Bureau to the LRD.
-
Multiple Classification Systems For Economic Data: Can A Thousand Flowers Bloom? And Should They?
December 1991
Working Paper Number:
CES-91-08
The principle that the statistical system should provide flexibility--possibilities for generating multiple groupings of data to satisfy multiple objectives--if it is to satisfy users is universally accepted. Yet in practice, this goal has not been achieved. This paper discusses the feasibility of providing flexibility in the statistical system to accommodate multiple uses of the industrial data now primarily examined within the Standard Industrial Classification (SIC) system. In one sense, the question of feasibility is almost trivial. With today's computer technology, vast amounts of data can be manipulated and stored at very low cost. Reconfigurations of the basic data are very inexpensive compared to the cost of collecting the data. However, flexibility in the statistical system implies more than the technical ability to regroup data. It requires that the basic data be sufficiently detailed to support user needs and be processed and maintained in a fashion that makes the use of a variety of aggregation rules possible. For this to happen, statistical agencies must recognize the need for high quality microdata and build this into their planning processes. Agencies need to view their missions from a multiple use perspective and move away from use of a primary reporting and collection vehicle. Although the categories used to report data must be flexible, practical considerations dictate that data collection proceed within a fixed classification system. It is simply too expensive for both respondents and statistical agencies to process survey responses in the absence of standardized forms, data entry programs, etc. I argue for a basic classification centered on commodities--products, services, raw materials and labor inputs--as the focus of data collection. The idea is to make the principal variables of interest--the commodities--the vehicle for the collection and processing of the data.
For completeness, the basic classification should include labor usage through some form of occupational classification. In most economic surveys at the Census Bureau, the reporting unit and the classified unit have been the establishment. But there is no need for this to be so. The basic principle to be followed in data collection is that the data should be collected in the most efficient way--efficiency being defined jointly in terms of statistical agency collection costs and respondent burdens.
-
Analytic Use Of Economic Microdata; A Model For Researcher Access With Confidentiality Protection
August 1992
Working Paper Number:
CES-92-08
A primary responsibility of the Center for Economic Studies (CES) of the U.S. Bureau of the Census is to facilitate researcher access to confidential economic microdata files. Benefits from this program accrue not only to policy makers--there is a growing awareness of the importance of microdata for analyzing both the descriptive and welfare implications of regulatory and environmental changes--but also, importantly, to the statistical agencies themselves. In fact, there is a substantial recent literature arguing that the largest single improvement that the U.S. statistical system could make is to improve its analytic capabilities. In this paper I briefly discuss the benefits of greater access for analytical work and ways to achieve them. Due to the nature of business data, public use databases and masking technologies are not available as vehicles for releasing useful microdata files. I conclude that a combination of outside and inside research programs, carefully coordinated and integrated, is the best model for ensuring that statistical agencies reap the gains from analytic data users. For the United States, at least, this is fortuitous with respect to justifying access since any direct research with confidential data by outsiders must have a "statistical purpose". Until the advent of CES, it was virtually impossible for researchers to work with the economic microdata collected by the various economic censuses. While the CES program is quite large, as it now stands, researchers, or their representatives, must come to the Census Bureau in Washington, D.C. to access the data. The success of the program has led to increasing demands for data access in facilities outside of the Washington, D.C. area. Two options are considered: 1) Establish Census Bureau facilities in various universities or similar nonprofit research facilities and 2) Develop CES regional operations in existing Census Bureau regional offices.
-
Simultaneous Edit-Imputation for Continuous Microdata
December 2015
Working Paper Number:
CES-15-44
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
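The two kinds of linear edit constraints named above (components summing to a total, and bounded ratios) can be sketched as a simple check that flags which edits a reported record fails. This covers only the detection of violations, not the Bayesian edit-imputation model the paper proposes; the function name, field names, bounds, and record values are hypothetical.

```python
def failed_edits(record, components, total, ratio_bounds, tol=1e-9):
    """List the edit constraints that `record` violates.

    components   : names of fields that should sum to `total`
    ratio_bounds : {(numerator, denominator): (lo, hi)} expert-set bounds
    """
    failures = []

    # Balance edit: component variables should sum to the total variable.
    comp_sum = sum(record[name] for name in components)
    if abs(comp_sum - record[total]) > tol:
        failures.append(f"sum({', '.join(components)}) != {total}")

    # Ratio edits, written in linear form: lo * den <= num <= hi * den.
    # This assumes the denominator is positive.
    for (num, den), (lo, hi) in ratio_bounds.items():
        if not (lo * record[den] <= record[num] <= hi * record[den]):
            failures.append(f"{num}/{den} outside [{lo}, {hi}]")

    return failures


# Made-up record: the two wage components do not add up to the total.
record = {"wages_prod": 300.0, "wages_nonprod": 200.0,
          "wages_total": 550.0, "employment": 10.0}
bounds = {("wages_total", "employment"): (5.0, 100.0)}

print(failed_edits(record, ["wages_prod", "wages_nonprod"], "wages_total", bounds))
```

A check like this says only that some value among the fields in a failed edit is wrong, not which one. Deciding which variables to change is the error localization step that, per the abstract, the paper integrates with imputation.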
-
Estimating the Distribution of Plant-Level Manufacturing Energy Efficiency with Stochastic Frontier Regression
March 2007
Working Paper Number:
CES-07-07
A feature commonly used to distinguish between parametric/statistical models and engineering models is that engineering models explicitly represent best practice technologies while the parametric/statistical models are typically based on average practice. Measures of energy intensity based on average practice are less useful in the corporate management of energy or for public policy goal setting. In the context of company or plant level energy management, it is more useful to have a measure of energy intensity capable of representing where a company or plant lies within a distribution of performance. In other words, is the performance close to (or far from) the industry best practice? This paper presents a parametric/statistical approach that can be used to measure best practice, thereby providing a measure of the difference, or 'efficiency gap', at a plant, company or overall industry level. The approach requires plant level data and applies a stochastic frontier regression analysis to energy use. Stochastic frontier regression analysis separates the energy intensity into three components: systematic effects, inefficiency, and statistical (random) error. The stochastic frontier can be viewed as a sub-vector input distance function. One advantage of this approach is that physical product mix can be included in the distance function, avoiding the problem of aggregating output to define a single energy/output ratio to measure energy intensity. The paper outlines the methods and gives an example of the analysis conducted for a non-public micro-dataset of wet corn refining plants.
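The three-way separation of energy use described above can be written in a generic stochastic frontier form. The notation and the distributional assumptions below follow the common textbook setup (normal noise, half-normal inefficiency) and are assumptions for illustration, not details taken from the paper.

```latex
% E_i : energy use of plant i
% x_i : systematic drivers (e.g., product mix, input prices, plant size)
% v_i : two-sided statistical noise
% u_i : one-sided inefficiency (energy use in excess of the frontier)
\ln E_i = f(x_i;\beta) + v_i + u_i,
\qquad v_i \sim N(0,\sigma_v^2),
\qquad u_i \ge 0,\; u_i \sim N^{+}(0,\sigma_u^2)
```

Under this setup the frontier $f(x_i;\beta)$ plays the role of best practice, and a plant's "efficiency gap" corresponds to its estimated $u_i$. Because energy is an input, inefficiency enters with a positive sign, so inefficient plants lie above the frontier.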
-
New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers
June 2004
Working Paper Number:
tp-2004-03
-
Newly Recovered Microdata on U.S. Manufacturing Plants from the 1950s and 1960s: Some Early Glimpses
September 2011
Working Paper Number:
CES-11-29
Longitudinally-linked microdata on U.S. manufacturing plants are currently available to researchers for 1963, 1967, and 1972-2009. In this paper, we provide a first look at recently recovered manufacturing microdata files from the 1950s and 1960s. We describe their origins and background, discuss their contents, and begin to explore their sample coverage. We also begin to examine whether the available establishment identifier(s) allow record linking. Our preliminary analyses suggest that longitudinally-linked Annual Survey of Manufactures microdata from the mid-1950s through the present--containing 16 years of additional data--appears possible though challenging. While a great deal of work remains, we see tremendous value in extending the manufacturing microdata series back into time. With these data, new lines of research become possible and many others can be revisited.
-
Large Plant Data in the LRD: Selection of a Sample for Estimation
March 1999
Working Paper Number:
CES-99-06
This paper describes preliminary work with the LRD during our tenure at the Census Bureau as participants in the ASA/NSF/Census Research Program. The objectives of the work described here were twofold. First, we wanted to examine the suitability of these data for the calculation of plant-level productivity indexes, following procedures typically implemented with time series data. Second, we wanted to select a small number of 2-digit industry groups that would be well suited to the estimation of production functions and systems of factor share equations and factor demand forecasting equations with system-wide techniques. This description of our initial work may be useful to other researchers who are interested in the LRD for the analysis of productivity growth and/or the estimation of systems of factor equations, either because the specific results reported in this memo suggest that the data are of good quality, or because the nature of the tasks undertaken provides insight into issues that arise in the analysis of longitudinal establishment data.
-
Synthetic Data and Confidentiality Protection
September 2003
Working Paper Number:
tp-2003-10
-
Concentration, Diversity, and Manufacturing Performance
July 2010
Working Paper Number:
CES-10-14
Regional economist Benjamin Chinitz was one of the most successful proponents of the idea that regional industrial structure is an important determinant of economic performance. His influential article in the American Economic Review in 1961 prompted substantial research measuring industrial structure at the regional scale and examining its relationships to economic outcomes. A considerable portion of this work operationalized the concept of regional industrial structure as sectoral diversity, the degree to which the composition of an economy is spread across heterogeneous activities. Diversity is a relatively simple construct to measure and interpret, but does not capture the implications of Chinitz's ideas fully. The structure within regional industries may also influence the performance of business enterprises. In particular, regional intra-industry concentration--the extent to which an industry is dominated by a few relatively large firms in a locality--has not appeared in empirical work studying economic performance apart from individual case studies, principally because accurately measuring concentration within a regional industry requires firm-level information. Multiple establishments of varying sizes in a given locality may be part of the same firm. Therefore, secondary data sources on establishment size distributions (such as County Business Patterns or aggregated information from the Census of Manufactures) can yield only deceptive portrayals of the level of regional industrial concentration. This paper uses the Longitudinal Research Database, a confidential establishment-level dataset compiled by the United States Census Bureau, to compare the influences of industrial diversity and intra-industry concentration upon regional and firm-level economic outcomes. Manufacturing establishments are aggregated into firms and several indicators of regional industrial concentration are calculated at multiple levels of industrial aggregation.
These concentration indicators, along with a regional sectoral diversity measure, are related to employment change over time and incorporated into plant productivity estimations, in order to examine and distinguish the relationships between the differing aspects of regional industrial structure and economic performance. A better understanding of the particular links between regional industrial structure and economic performance can be used to improve economic development planning efforts. With continuing economic restructuring and associated workforce dislocation in the United States and worldwide, industrial concentration and over-specialization are separate mechanisms by which regions may 'lock in' to particular competencies and limit the capacity to adjust quickly and efficiently to changing markets and technologies. The most appropriate and effective policies for improving economic adaptability should reflect the structural characteristics that limit flexibility. This paper gauges the consequences of distinct facets of regional industrial structure, adding new depth to the study of regional industries by economic development planners and researchers.
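To illustrate why concentration measured on establishments can mislead, the sketch below aggregates establishments into firms before computing a Herfindahl-Hirschman index of employment shares. The abstract does not name the specific concentration indicators used in the paper, so the choice of this index, the function name `firm_level_hhi`, and the sample data are assumptions.

```python
from collections import defaultdict


def firm_level_hhi(establishments):
    """Herfindahl-Hirschman index of employment for one region-industry.

    `establishments` is an iterable of (firm_id, employment) pairs.
    Establishments sharing a firm_id are summed into one firm before
    shares are computed. Returns a value in [0, 1].
    """
    firm_size = defaultdict(float)
    for firm_id, employment in establishments:
        firm_size[firm_id] += employment

    total = sum(firm_size.values())
    if total == 0:
        return 0.0
    return sum((size / total) ** 2 for size in firm_size.values())


# Made-up example: firm A operates two plants, firm B operates one.
plants = [("A", 50), ("A", 50), ("B", 100)]

# Treating every plant as its own firm mimics establishment-level data.
as_establishments = [(index, emp) for index, (_, emp) in enumerate(plants)]

print(firm_level_hhi(plants))             # firm-level concentration
print(firm_level_hhi(as_establishments))  # establishment-level concentration
```

In this example the firm-level index is higher than the establishment-level one, because firm A's two plants are counted as a single employer once they are linked.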