Survey data underlie most empirical work in economics, yet economists typically have little familiarity with survey sample design and its effects on inference. This paper describes how sample designs depart from the simple random sampling model implicit in most econometrics textbooks, points out where the effects of this departure are likely to be greatest, and describes the relationship between design-based estimators developed by survey statisticians and related econometric methods for regression. Its intent is to provide empirical economists with enough background in survey methods to make informed use of design-based estimators. It emphasizes surveys of households (the source of most public-use files), but also considers how surveys of businesses differ. Examples from the National Longitudinal Survey of Youth 1979 (NLSY79) and the Current Population Survey illustrate practical aspects of design-based estimation.
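To make the core idea concrete, here is a minimal sketch, in Python, of how a design-based (Horvitz-Thompson style) weighted mean differs from the unweighted mean that a simple-random-sampling assumption would justify. The data and weights are simulated stand-ins for the inverse-probability weights shipped with files such as the NLSY79 or CPS.

```python
# Minimal sketch of a design-based (Horvitz-Thompson style) mean estimate
# versus the simple-random-sampling estimate. All data are simulated for
# illustration; `weights` stands in for inverse selection probabilities.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(50_000, 20_000, size=1_000)       # e.g., household income
weights = rng.uniform(500, 3_000, size=1_000)    # inverse selection probabilities

srs_mean = y.mean()                              # assumes equal-probability sampling
ht_mean = np.sum(weights * y) / np.sum(weights)  # weights undo unequal selection

print(f"unweighted (SRS) mean: {srs_mean:,.0f}")
print(f"design-weighted mean:  {ht_mean:,.0f}")
```

Under an informative design, the two estimates can diverge sharply; the weighted version remains approximately unbiased for the population mean.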
-
Estimation and Inference in Regression Discontinuity Designs with Clustered Sampling
August 2015
Working Paper Number:
carra-2015-06
Regression Discontinuity (RD) designs have become popular in empirical studies due to their attractive properties for estimating causal effects under transparent assumptions. Nonetheless, most popular procedures assume i.i.d. data, which is not reasonable in many common applications. To relax this assumption, we derive the properties of traditional non-parametric estimators in a setting that incorporates potential clustering at the level of the running variable, and propose an accompanying optimal-MSE bandwidth selection rule. Simulation results demonstrate that falsely assuming the data are i.i.d. when selecting the bandwidth may lead to bandwidths that are too small relative to the optimal-MSE bandwidth. Finally, we apply our procedure using person-level microdata that exhibit clustering at the census tract level to analyze the impact of the Low-Income Housing Tax Credit program on neighborhood characteristics and low-income housing supply.
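As a rough illustration of the estimation problem (not the paper's own procedure), the sketch below fits a local linear sharp-RD regression within a hand-picked bandwidth h and clusters standard errors at the level of the running variable; the bandwidth choice itself is exactly what the paper's optimal-MSE rule would replace.

```python
# Hedged sketch: sharp-RD local linear estimate with cluster-robust standard
# errors at the level of the running variable. Simulated data; `h` is a
# hand-picked bandwidth, not an MSE-optimal one.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_clusters, per = 200, 10
cluster = np.repeat(np.arange(n_clusters), per)
x = np.repeat(rng.uniform(-1, 1, n_clusters), per)  # running variable, constant within cluster
d = (x >= 0).astype(float)                          # treatment assigned at the cutoff
y = 0.5 * d + x + rng.normal(0, 1, n_clusters)[cluster] + rng.normal(0, 1, n_clusters * per)

h = 0.5                                             # bandwidth (hand-picked here)
keep = np.abs(x) <= h
X = sm.add_constant(np.column_stack([d, x, d * x])[keep])
fit = sm.OLS(y[keep], X).fit(cov_type="cluster", cov_kwds={"groups": cluster[keep]})
print(fit.params[1], fit.bse[1])                    # RD effect and its clustered SE
```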
-
Who are the people in my neighborhood? The 'contextual fallacy' of measuring individual context with census geographies
February 2018
Working Paper Number:
CES-18-11
Scholars deploy census-based measures of neighborhood context throughout the social sciences and epidemiology. Decades of research confirm that variation in how individuals are aggregated into geographic units to create variables that control for social, economic or political contexts can dramatically alter analyses. While most researchers are aware of the problem, they have lacked the tools to determine its magnitude in the literature and in their own projects. By using confidential access to the complete 2010 U.S. Decennial Census, we are able to construct, for all persons in the US, individual-specific contexts, which we group according to the Census-assigned block, block group, and tract. We compare these individual-specific measures to the published statistics at each scale, and we then determine the magnitude of variation in context for an individual with respect to the published measures using a simple statistic, the standard deviation of individual context (SDIC). For three key measures (percent Black, percent Hispanic, and Entropy, a measure of ethno-racial diversity), we find that block-level Census statistics frequently do not capture the actual context of individuals within them. More problematically, we uncover systematic spatial patterns in the contextual variables at all three scales. Finally, we show that within-unit variation is greater in some parts of the country than in others. We publish county-level estimates of the SDIC statistics that enable scholars to assess whether mis-specification in context variables is likely to alter analytic findings when measured at any of the three common Census units.
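A minimal sketch of the SDIC idea: with person-level data in hand, the published unit statistic is the within-unit mean of the individual-specific context measure, and SDIC is the within-unit standard deviation. Column names and data below are hypothetical.

```python
# Minimal sketch of the SDIC statistic: the within-unit standard deviation of
# an individual-specific context measure. `indiv_context` would come from
# person-centered neighborhoods; here it is simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "tract": rng.integers(0, 50, size=5_000),
    "indiv_context": rng.beta(2, 5, size=5_000),  # e.g., person-specific percent Black
})

published = df.groupby("tract")["indiv_context"].mean()  # analogue of the published statistic
sdic = df.groupby("tract")["indiv_context"].std()        # spread of individual contexts
print(pd.DataFrame({"published": published, "SDIC": sdic}).head())
```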
-
SYNTHETIC DATA FOR SMALL AREA ESTIMATION IN THE AMERICAN COMMUNITY SURVEY
April 2013
Working Paper Number:
CES-13-19
Small area estimates provide a critical source of information used to study local populations. Statistical agencies regularly collect data from small areas but are prevented from releasing detailed geographic identifiers in public-use data sets due to disclosure concerns. Alternative data dissemination methods used in practice include releasing summary/aggregate tables, suppressing detailed geographic information in public-use data sets, and accessing restricted data via Research Data Centers. This research examines an alternative method for disseminating microdata that contains more geographic detail than is currently released in public-use data files. Specifically, the method replaces the observed survey values with imputed, or synthetic, values simulated from a hierarchical Bayesian model. Confidentiality protection is enhanced because no actual values are released. The method is demonstrated using restricted data from the 2005-2009 American Community Survey. The analytic validity of the synthetic data is assessed by comparing small area estimates obtained from the synthetic data with those obtained from the observed data.
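The synthesis step can be illustrated schematically: fit a model that partially pools small-area estimates, then release draws from its predictive distribution instead of the observed values. The sketch below substitutes a simple normal-normal shrinkage (empirical Bayes) for the paper's hierarchical Bayesian model so that it stays self-contained.

```python
# Illustrative sketch of the synthesis step: replace observed small-area
# values with draws from a fitted model's predictive distribution. A crude
# empirical-Bayes shrinkage stands in for the paper's hierarchical model.
import numpy as np

rng = np.random.default_rng(3)
area = rng.integers(0, 30, size=3_000)
y = 10 + rng.normal(0, 2, 30)[area] + rng.normal(0, 5, size=3_000)

grand = y.mean()
area_means = np.array([y[area == a].mean() for a in range(30)])
n_a = np.bincount(area, minlength=30)
shrink = n_a / (n_a + 5.0)                          # crude partial-pooling factor
post_mean = shrink * area_means + (1 - shrink) * grand

sigma = y.std(ddof=1)
y_synth = post_mean[area] + rng.normal(0, sigma, size=y.size)  # synthetic microdata
print(y.mean(), y_synth.mean())                     # aggregates are preserved
```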
-
Evaluating the Use of Commercial Data to Improve Survey Estimates of Property Taxes
August 2016
Working Paper Number:
carra-2016-06
While commercial data sources offer promise to statistical agencies for use in the production of official statistics, challenges can arise because the data are not collected for statistical purposes. This paper evaluates the use of 2008-2010 property tax data from CoreLogic, Inc. (CoreLogic), aggregated from county and township governments around the country, to improve 2010 American Community Survey (ACS) estimates of property tax amounts for single-family homes. In particular, the research evaluates the potential to use CoreLogic to reduce respondent burden, to study survey response error, and to improve adjustments for survey nonresponse. The research found that the coverage of the CoreLogic data varies between counties, as does the correspondence between ACS and CoreLogic property taxes. This geographic variation implies that different approaches toward using CoreLogic are needed in different areas of the country. Further, large differences between CoreLogic and ACS property taxes in certain counties seem to be due to conceptual differences between what is collected in the two data sources. The research examines three counties, Clark County, NV; Philadelphia County, PA; and St. Louis County, MO, and compares how estimates would change with different approaches to using the CoreLogic data. Mean county property tax estimates are highly sensitive to whether ACS or CoreLogic data are used to construct estimates. Using CoreLogic data in imputation modeling for nonresponse adjustment of ACS estimates modestly improves the predictive power of imputation models, although estimates of county property taxes and property taxes by mortgage status are not very sensitive to the imputation method.
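A hedged sketch of the nonresponse-imputation idea: among respondents, regress the reported ACS property tax on the commercial amount, then predict for nonrespondents. All variable names and numbers below are hypothetical.

```python
# Sketch of using an auxiliary (CoreLogic-style) variable in a nonresponse
# imputation model: fit among respondents, predict for nonrespondents.
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
corelogic_tax = rng.gamma(4, 800, n)                       # commercial amount (simulated)
acs_tax = 200 + 0.9 * corelogic_tax + rng.normal(0, 300, n)
respond = rng.random(n) > 0.3                              # 30% item nonresponse

X = np.column_stack([np.ones(n), corelogic_tax])
beta, *_ = np.linalg.lstsq(X[respond], acs_tax[respond], rcond=None)
acs_tax_imputed = np.where(respond, acs_tax, X @ beta)     # fill in nonrespondents
print(acs_tax[respond].mean(), acs_tax_imputed.mean())
```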
-
Gradient Boosting to Address Statistical Problems Arising from Non-Linkage of Census Bureau Datasets
June 2024
Working Paper Number:
CES-24-27
This article introduces the twangRDC package, which contains functions to address non-linkage in US Census Bureau datasets. The Census Bureau's Person Identification Validation System facilitates data linkage by assigning unique person identifiers to federal, third-party, decennial census, and survey data. Not all records in these datasets can be linked to the reference file, and as such not all records will be assigned an identifier. This article is a tutorial for using twangRDC to generate nonresponse weights that account for non-linkage of person records across US Census Bureau datasets.
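twangRDC itself is an R package; the sketch below reproduces only the underlying idea in Python: model the probability that a record links to the reference file with gradient boosting, then use inverse predicted probabilities as non-linkage weights for the linked records. It is not the package's actual API, and the features are simulated.

```python
# Sketch of gradient-boosted non-linkage weights: predict the probability a
# record links, then weight linked records by the inverse probability.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(5_000, 4))                     # record characteristics (simulated)
p_link = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.8 * X[:, 1])))
linked = rng.random(5_000) < p_link                 # True = record received an identifier

model = GradientBoostingClassifier().fit(X, linked)
p_hat = model.predict_proba(X)[:, 1]
weights = np.where(linked, 1.0 / p_hat, 0.0)        # weights apply to linked records only
print(weights[linked].mean())
```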
-
MISCLASSIFICATION IN BINARY CHOICE MODELS
May 2013
Working Paper Number:
CES-13-27
We derive the asymptotic bias from misclassification of the dependent variable in binary choice models. Measurement error is necessarily non-classical in this case, which leads to bias in linear and non-linear models even if only the dependent variable is mismeasured. A Monte Carlo study and an application to food stamp receipt show that the bias formulas are useful for analyzing the sensitivity of substantive conclusions, for interpreting biased coefficients, and for identifying features of the estimates that are robust to misclassification. Using administrative records linked to survey data as validation data, we examine estimators that are consistent under misclassification. They can improve estimates if their assumptions hold, but can aggravate the problem if the assumptions are invalid. The estimators differ in their robustness to such violations, which can be improved by incorporating additional information. We propose tests for the presence and nature of misclassification that can help to choose an estimator.
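A small Monte Carlo in the spirit of the paper's exercise: generate a logit, misclassify the dependent variable with known error rates, and compare estimates. The model and error rates below are illustrative, not the paper's.

```python
# Misclassify the dependent variable of a logit with known error rates and
# observe the bias in the estimated slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.2 + 1.0 * x)))).astype(float)

a0, a1 = 0.05, 0.20                                 # P(report 1 | true 0), P(report 0 | true 1)
flip = np.where(y == 1, rng.random(n) < a1, rng.random(n) < a0)
y_obs = np.abs(y - flip)                            # flipped where misclassified

X = sm.add_constant(x)
print(sm.Logit(y, X).fit(disp=0).params)            # close to the true (0.2, 1.0)
print(sm.Logit(y_obs, X).fit(disp=0).params)        # slope attenuated by misclassification
```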
-
The Need to Account for Complex Sampling Features when Analyzing Establishment Survey Data: An Illustration using the 2013 Business Research and Development and Innovation Survey (BRDIS)
January 2017
Working Paper Number:
CES-17-62
The importance of correctly accounting for complex sampling features when generating finite population inferences from complex sample survey data has now been clearly established in a variety of fields, both statistical and non-statistical. Unfortunately, recent studies of analytic error suggest that many secondary analysts of survey data do not ultimately account for these sampling features when analyzing their data, for a variety of possible reasons (e.g., poor documentation, or a data producer may not provide the information in a public-use data set). The research in this area has focused exclusively on analyses of household survey data and individual respondents. No research to date has considered how analysts are approaching the data collected in establishment surveys, and whether published articles advancing science based on analyses of establishment behaviors and outcomes are correctly accounting for complex sampling features. This article presents alternative analyses of real data from the 2013 Business Research and Development and Innovation Survey (BRDIS), and shows that a failure to account for the complex design features of the sample underlying these data can lead to substantial differences in inferences about the target population of establishments for the BRDIS.
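For intuition, the sketch below contrasts a naive i.i.d. standard error with a Taylor-linearized, design-based standard error for a weighted mean computed over strata and primary sampling units. The design variables and outcome are simulated placeholders, not BRDIS fields.

```python
# Design-based (Taylor-linearized) variance of a weighted mean with strata
# and PSUs, versus the naive i.i.d. standard error. All data simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 4_000
df = pd.DataFrame({
    "stratum": rng.integers(0, 10, n),
    "psu": rng.integers(0, 4, n),                   # PSU id within stratum
    "w": rng.uniform(1, 50, n),                     # sampling weights
    "rd_spend": rng.lognormal(10, 1, n),            # e.g., R&D expenditure
})

wmean = np.average(df.rd_spend, weights=df.w)
df["z"] = df.w * (df.rd_spend - wmean) / df.w.sum() # linearized contributions

psu_tot = df.groupby(["stratum", "psu"])["z"].sum() # PSU totals within strata
var_design = sum(
    (len(g) / (len(g) - 1)) * ((g - g.mean()) ** 2).sum()
    for _, g in psu_tot.groupby(level="stratum")
)
se_naive = df.rd_spend.std(ddof=1) / np.sqrt(n)
print(wmean, np.sqrt(var_design), se_naive)         # design SE typically exceeds naive SE
```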
-
Public Use Microdata: Disclosure And Usefulness
September 1988
Working Paper Number:
CES-88-03
Official statistical agencies such as the Census Bureau and the Bureau of Labor Statistics collect enormous quantities of microdata in statistical surveys. These data are valuable for economic research and market and policy analysis. However, the data cannot be released to the public because of confidentiality commitments to individual respondents. These commitments, coupled with the strong research demand for microdata, have led the agencies to consider various proposals for releasing public-use microdata. Most proposals for public-use microdata call for the development of surrogate data that disguise the original data. Thus, they involve the addition of measurement errors to the data. In this paper, we examine disclosure issues and explore alternative masking methods for generating panels of useful economic microdata that can be released to researchers. While our analysis applies to all confidential microdata, applications using the Census Bureau's Longitudinal Research Data Base (LRD) are used for illustrative purposes throughout the discussion.
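One masking approach in this spirit, adding measurement error, can be sketched in a few lines: multiply each confidential value by mean-one noise before release. The noise parameters below are made up; choosing them is precisely the disclosure-versus-usefulness trade-off the paper examines.

```python
# Minimal sketch of noise-addition masking: mean-one multiplicative noise
# disguises individual records while roughly preserving aggregates.
import numpy as np

rng = np.random.default_rng(8)
payroll = rng.lognormal(12, 1.5, size=1_000)        # confidential establishment values
noise = rng.normal(1.0, 0.1, size=payroll.size)     # mean-one multiplicative noise
payroll_masked = payroll * noise                    # released, disguised values

# Aggregates survive masking far better than individual records do:
print(payroll.mean(), payroll_masked.mean())
```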
-
Neighborhood Effects on High-School Drop-Out Rates and Teenage Childbearing: Tests for Non-Linearities, Race-Specific Effects, Interactions with Family Characteristics, and Endogenous Causation using Geocoded California Census Microdata
May 2008
Working Paper Number:
CES-08-12
This paper examines the relationship between neighborhood characteristics and the likelihood that a youth will drop out of high school or have a child during the teenage years. Using a dataset that is uniquely well-suited to the study of neighborhood effects, the impact of the neighborhood poverty rate and the percentage of professionals in the local labor force on youth outcomes in California is examined. The first section of the paper tests for non-linearities in the relationship between indicators of neighborhood distress and youth outcomes. Some evidence is found for a breakpoint at low levels of poverty. Suggestive but inconclusive evidence is also found for a second breakpoint, at very high levels of poverty, for African-American youth only. The second part of the paper examines interactions between family background characteristics and neighborhood effects, and finds that White youth are most sensitive to neighborhood effects, while the effect of parental education depends on the neighborhood measure in question. Among White youth, those from single-parent households are more vulnerable to neighborhood conditions. The third section of the paper finds that for White youth and Hispanic youth, the relevant neighborhood variables appear to be the own-race poverty rates and the percentage of professionals of youths' own race. The final section of the paper estimates a tract fixed-effects model, using the results from the third section to define multiple relevant poverty rates within each tract. The fixed-effects specification suggests that for White and Hispanic youth in California, neighborhood effects remain significant even with the inclusion of controls for any unobserved family and neighborhood characteristics that are constant within tracts.
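A stylized sketch of the final section's estimator: a tract fixed-effects regression implemented by within-tract demeaning, where the own-race poverty rate varies across race groups within a tract and so survives the demeaning. The data and variable names are simulated placeholders.

```python
# Tract fixed-effects slope via within-tract demeaning. Simulated data; the
# own-race poverty rate is the regressor that varies within tracts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n = 10_000
df = pd.DataFrame({
    "tract": rng.integers(0, 500, n),
    "own_race_poverty": rng.beta(2, 8, n),
    "dropout": (rng.random(n) < 0.15).astype(float),
})

def demean(s: pd.Series) -> pd.Series:
    """Subtract the tract mean from each observation."""
    return s - s.groupby(df.tract).transform("mean")

y, x = demean(df.dropout), demean(df.own_race_poverty)
beta = (x * y).sum() / (x ** 2).sum()               # FE slope estimate
print(beta)
```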
-
A METHOD OF CORRECTING FOR MISREPORTING APPLIED TO THE FOOD STAMP PROGRAM
May 2013
Working Paper Number:
CES-13-28
Survey misreporting is known to be pervasive and to bias common statistical analyses. In this paper, I first use administrative data on SNAP receipt and amounts linked to American Community Survey data from New York State to show that survey data can misrepresent the program in important ways. For example, more than 1.4 billion dollars received are not reported in New York State alone. Forty-six percent of dollars received by households with annual income above the poverty line are not reported in the survey data, while only 19 percent are missing below the poverty line. Standard corrections for measurement error cannot remove these biases. I then develop a method to obtain consistent estimates by combining parameter estimates from the linked data with publicly available data. This conditional density method recovers the correct estimates using public-use data only, which solves the problem that access to linked administrative data is usually restricted. I examine the degree to which this approach can be used to extrapolate across time and geography, in order to address the problem that validation data are often based on a convenience sample. I present evidence from within New York State that the extent of heterogeneity is small enough for extrapolation to work well across both time and geography. Extrapolation to the entire U.S. yields substantive differences from survey data and reduces deviations from official aggregates by a factor of 4 to 9 compared to survey aggregates.
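A stylized sketch of the correction logic (the paper models full conditional densities; a simple logistic model for receipt stands in here): estimate the probability of true receipt given the report and covariates on linked validation data, then average the predictions over the public-use sample. All numbers are simulated.

```python
# Estimate P(true receipt | report, x) on a 'linked' validation sample, then
# average those probabilities to correct the reported participation rate.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 10_000
x = rng.normal(size=n)                              # covariate, e.g., log income
true = (rng.random(n) < 1 / (1 + np.exp(-(-1 + x)))).astype(float)
report = np.where(true == 1, rng.random(n) > 0.3, rng.random(n) < 0.02)  # underreporting

# Step 1: on the linked validation sample, model truth given report and x.
X = sm.add_constant(np.column_stack([report, x]))
fit = sm.Logit(true, X).fit(disp=0)

# Step 2: apply the model to public data (here, the same features) and average.
print("reported rate: ", report.mean())
print("corrected rate:", fit.predict(X).mean())
print("true rate:     ", true.mean())
```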