4.1. Individual Characteristics Files - Demographics
4.1.1. Overview
The Individual Characteristics Files (ICF) family contains person-specific demographic information. Characteristics include gender, age/date of birth (DOB), place of birth (POB), race, ethnicity, and education. Completed information is available for all individuals who appear in the wage data (workers).
Individual Characteristics File (ICF_US)
This is the core table with characteristics for all workers. There is one record for every person on the Employment History File (EHF). Characteristics have been completed using a combination of observed data and imputation.
Scope: National
Key: PIK
Date of Birth, Sex, and Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)
Education Implicates (ICF_US_IMPLICATES_EDUCATION)
Race and Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)
Workers with incomplete information in the observed data table receive imputed characteristics using a set of missing data models. When a variable is missing, multiple implicates (10 draws from the Posterior Predictive Distribution, or PPD) are provided. The IMPLICATES tables contain one record for each worker missing one or more of the characteristics provided on that table.
Scope: National
Key: PIK
4.1.2. User Guidance
Imputation Flagging
On the core ICF table (ICF_US), each demographic characteristic for a PIK is reported with its observed value if one is available and is otherwise imputed. When a characteristic is imputed, the value on ICF_US is the first implicate from the corresponding IMPLICATES table. Each characteristic has an associated impute flag ([char]_imputed) with three values: 1 means the value was observed, 2 means the value was imputed, and 3 means the reported information failed internal consistency checks and the value was imputed.
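The relationship between the core table and the IMPLICATES tables can be sketched as follows. This is a minimal illustration, not LEHD code: the rows are made-up in-memory dictionaries rather than SAS data tables, the helper `completed_values` is hypothetical, and it assumes that workers flagged 3 also have a record on the IMPLICATES table.

```python
# Hypothetical rows; the real tables are SAS data tables keyed by PIK.
icf_us = [
    {"PIK": "000000001", "RACE": "1", "RACE_IMPUTED": "1"},  # observed
    {"PIK": "000000002", "RACE": "2", "RACE_IMPUTED": "2"},  # imputed
    {"PIK": "000000003", "RACE": "4", "RACE_IMPUTED": "3"},  # imputed, report unreliable
]

# One record per worker with an imputed value, holding RACE1-RACE10.
race_implicates = {
    "000000002": {f"RACE{m}": v for m, v in enumerate("2121122212", start=1)},
    "000000003": {f"RACE{m}": v for m, v in enumerate("4441444144", start=1)},
}


def completed_values(core_rows, implicate_rows, var, m):
    """Return {PIK: value} for completed dataset m (1-10) of variable var."""
    out = {}
    for row in core_rows:
        if row[f"{var}_IMPUTED"] == "1":
            # Observed values are identical in every completed dataset.
            out[row["PIK"]] = row[var]
        else:
            # Flags 2 and 3 both mean the value comes from an implicate.
            out[row["PIK"]] = implicate_rows[row["PIK"]][f"{var}{m}"]
    return out


# Implicate 1 reproduces the values carried on ICF_US.
first = completed_values(icf_us, race_implicates, "RACE", 1)
print(first == {row["PIK"]: row["RACE"] for row in icf_us})
```

Looping `m` from 1 to 10 yields ten completed datasets, which is the input that multiple-imputation variance formulas expect.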
Estimated standard errors should be adjusted when using characteristics from the ICF. Please see McKinney et al. [2021] for a method of producing standard errors that account for the additional imputation variability.
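The method in McKinney et al. [2021] is not reproduced here. As a generic illustration of why the adjustment matters, the sketch below applies the standard multiple-imputation combining rules (often called Rubin's rules) to one estimate computed on each of the ten completed datasets. The function name `combine_implicates` and all numbers are hypothetical.

```python
import math
import statistics


def combine_implicates(estimates, variances):
    """Combine per-implicate estimates and their sampling variances.

    Returns the pooled point estimate and its standard error, where the
    total variance is the average within-implicate variance plus the
    between-implicate variance inflated by (1 + 1/M).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    u_bar = statistics.fmean(variances)   # within-implicate variance
    b = statistics.variance(estimates)    # between-implicate variance (divisor M-1)
    total_variance = u_bar + (1 + 1 / m) * b
    return q_bar, math.sqrt(total_variance)


# Made-up example: share of workers with a bachelor's degree, one estimate
# per completed dataset, each with the same naive sampling variance.
estimates = [0.31, 0.29, 0.30, 0.32, 0.30, 0.28, 0.31, 0.30, 0.29, 0.30]
variances = [0.0004] * 10
point, se = combine_implicates(estimates, variances)
print(point, se)
```

The pooled standard error is larger than the naive one from any single completed dataset whenever the implicates disagree, which is the additional imputation variability the text refers to.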
Observed Demographic Data
Demographic information is received by LEHD from the following data sources:
- Decennial Census
2000 short form
2000 long form
2010 short form
- Census Numident
Edited and collapsed longitudinal transaction data derived from the Social Security Administration (SSA) Numident
- American Community Survey (ACS)
2001-present
Observed data for sex, date of birth, place of birth, race, and ethnicity may come from any of the above sources. The level of education is provided only on the 2000 long form and the ACS. The coding systems are reconciled across data sources and over time. The observed data are unduplicated and linked using the person identifier.
Protected Identification Key (PIK) Assignment and Unduplication
While the underlying person identifier is the Social Security Number (SSN), each SSN is replaced with a different 9-digit Protected Identification Key (PIK) that maps one-to-one to an SSN. The Census Numident (as well as wage record data) uses the PIK as the person record identifier by default; all other data sources have a PIK attached to the person record using the Census Bureau’s Person Identification Validation System (PVS). If a record contains an SSN, name, and DOB, then the match is first attempted using those three pieces of information. If they are consistent with the information on the Numident, then the record is verified. However, the survey data sources (ACS and Decennial) typically do not have all three pieces of information required for verification. Instead, a PIK is typically assigned to the survey data record using validation. The primary method for validation is the GeoSearch method. Using administrative residence information attached to the Numident, the search for a name and DOB match is limited to SSN records residing at the same physical location as the survey record. This greatly reduces the number of match candidates, improving the quality of the match. A general name and DOB search is also performed for the residual records that do not match using the GeoSearch method. Overall match rates are high, with at least 90% of survey respondents receiving an assigned PIK. Note that assignment is not exclusive, and multiple survey records at a point in time may receive the same PIK. For more information on the PIK process, see Wagner and Layne [2014] and Mulrow et al. [2011].
LEHD runs an unduplication process using additional characteristics present on both the Numident and the Survey data, resulting in one record per PIK for each data source-year.
Imputation Models
Each data source (Numident, Decennial, ACS) is the primary input for at least one of the observed demographic characteristics on the ICF_US file. The table below shows the ICF_US variable and the data sources used for each variable. When multiple reports exist, the most recent information is preferred for all characteristics.
Characteristic | Source | Percent Missing | Model Stage |
---|---|---|---|
Gender | Numident | ~5% | A |
DOB | Numident | ~5% | A |
POB | Numident, Decennial (2000 and 2010), ACS | ~5% | A |
Race | Decennial (2000 and 2010), ACS | ~20% | B |
Ethnicity | Decennial (2000 and 2010), ACS | ~20% | B |
Education | Decennial (2000), ACS | ~85% | C |
Gender, DOB, and POB have the smallest amount of missing data (~5%), race and ethnicity have the next largest amount (~20%), and education has the largest amount (~85%). The missing data follow a monotone pattern: one set of records is missing all variables, the next set is missing race, ethnicity, and education, and the final set is missing only education. The monotone pattern allows us to complete the data in stages, starting with the characteristics that are least often missing and moving on to the next set of variables until all characteristics are complete.
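A monotone pattern can be checked record by record. The sketch below is illustrative only: the records are made up, `None` stands in for a characteristic with no observed value, and one representative variable is used per stage.

```python
# Hypothetical records: None marks a characteristic with no observed value.
records = [
    {"PIK": "000000001", "SEX": "F", "RACE": "1", "EDUC_C": "3"},     # fully observed
    {"PIK": "000000002", "SEX": "M", "RACE": "2", "EDUC_C": None},    # missing only education
    {"PIK": "000000003", "SEX": "F", "RACE": None, "EDUC_C": None},   # missing stages B and C
    {"PIK": "000000004", "SEX": None, "RACE": None, "EDUC_C": None},  # missing everything
]

# One representative variable per stage, ordered from least to most missing.
STAGE_VARS = ["SEX", "RACE", "EDUC_C"]


def first_missing_stage(record):
    """Return the index of the first stage with a missing value, or None."""
    for i, var in enumerate(STAGE_VARS):
        if record[var] is None:
            return i
    return None


def is_monotone(record):
    """True if every stage after the first missing one is also missing."""
    start = first_missing_stage(record)
    if start is None:
        return True
    return all(record[var] is None for var in STAGE_VARS[start:])


print([is_monotone(r) for r in records])
```

A record with an observed race but a missing sex would fail this check, and staged completion in the order A, B, C would no longer be enough to condition each stage on complete earlier-stage values.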
In stage A we complete gender, DOB, and POB; that is, we replace each missing value with an implicate (the actual value returned from an imputation). With these variables completed, we move on to stage B and impute race and ethnicity conditional on the DOB, gender, and POB values from stage A. In the final stage we complete education conditional on the variables imputed in both stage A and stage B. In addition to the demographic characteristics themselves, we also use information from a worker’s earnings history and information about the worker’s co-workers and co-residents.
We use a Classification and Regression Tree (CART) based approach to cluster groups of similar workers together into cells/bins. Our imputation models are defined by the cells we assign workers to. The more homogeneous the imputation variable (e.g. education) is within a cell, the more predictive power the model has. For example, if almost all of the workers within a cell with observed education have college degrees, the implicates for each worker with missing data would have low variability, resulting in a high-quality impute. Ideally, once workers are assigned to cells, the variability in education outcomes will be large across cells and small within cells.
The depth of the tree (number of cells) is limited by the amount of reported information for each characteristic. For example, characteristics such as gender, DOB, and POB have substantially more available information than education. Once the set of cells is determined at each stage, the records with reported data are assigned a cell ID. Within each cell a joint Kernel Density Estimate (KDE) is estimated for the variables that need to be completed at that stage (e.g. gender, DOB, and POB in stage A). The estimated joint densities are used to generate implicates for workers with missing data by sampling randomly from the density for the worker's cell ID. Implicit in our estimation strategy is the assumption that data are missing at random (MAR) within each cell ID.
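The within-cell sampling step can be illustrated with a simplified stand-in. The sketch below does not fit a CART tree or a joint KDE; it assumes the cells are already defined by a hypothetical function `cell_of`, handles a single categorical variable, and draws each implicate from the empirical distribution of observed values in the worker's cell. All names and data are made up.

```python
import random
from collections import defaultdict


def draw_implicates(records, cell_of, var, n_draws=10, seed=0):
    """Draw n_draws implicates of var for each record where var is missing.

    Donors are the records in the same cell with an observed value.
    Assumes every cell containing a missing value has at least one donor.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in records:
        if rec[var] is not None:
            donors[cell_of(rec)].append(rec[var])
    implicates = {}
    for rec in records:
        if rec[var] is None:
            pool = donors[cell_of(rec)]
            implicates[rec["PIK"]] = [rng.choice(pool) for _ in range(n_draws)]
    return implicates


# Hypothetical workers; EARN_Q is an earnings quartile used only to form cells.
workers = [
    {"PIK": "000000001", "SEX": "F", "EARN_Q": 4, "EDUC_C": "4"},
    {"PIK": "000000002", "SEX": "F", "EARN_Q": 4, "EDUC_C": "4"},
    {"PIK": "000000003", "SEX": "F", "EARN_Q": 4, "EDUC_C": None},
    {"PIK": "000000004", "SEX": "M", "EARN_Q": 1, "EDUC_C": "1"},
    {"PIK": "000000005", "SEX": "M", "EARN_Q": 1, "EDUC_C": "2"},
    {"PIK": "000000006", "SEX": "M", "EARN_Q": 1, "EDUC_C": None},
]


def cell_of(rec):
    """Hypothetical cell assignment: sex crossed with earnings quartile."""
    return (rec["SEX"], rec["EARN_Q"])


draws = draw_implicates(workers, cell_of, "EDUC_C")
print(draws["000000003"])  # homogeneous cell: every draw is the same value
```

The first cell mirrors the college-degree example in the text: all donors share one value, so the ten implicates do not vary. In the second cell the donors disagree, so the implicates can differ across draws.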
The imputation models for each stage are shown below:
- Stage A (Gender, DOB, POB; 5% missing data)
Stage A missing data differs from the missing data for workers who appear only in stages B and C. In principle, every worker with a formal employment arrangement should have an SSN. However, LEHD receives earnings records where the PIK does not exist on the Numident. The vast majority of these workers are likely to be persons residing in the U.S. without a legal right to work (i.e. illegal immigrants). These workers are clearly not MAR, so the characteristics of workers in the general population are unlikely to match the characteristics of the workers with missing data. To sidestep the issue, we assume that recent legal immigrants have characteristics similar to those of recent illegal immigrants (the workers with missing data), especially when the worker is employed at firms with a large share of recent legal immigrants.
- Cell Variables
Modal Non-Native: If the co-worker fraction native is >0.5 then the modal non-native is set to U.S., else the modal non-native is set to the modal value across all co-workers at all jobs for each worker.
Co-Worker Fraction Male: If the fraction male is >0.5 then high, else low.
New-Worker: 1 if entered data after the first quarter the state appears, else 0. A worker is more likely to be an immigrant if they first appear after the first year the state reports data.
- Stage B (Race, Ethnicity; 20% missing data)
Stage B reported data comes from survey data sources. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing race and ethnicity data are much more likely to be MAR than the workers in stage A.
- Cell Variables
Gender
DOB
Bestrace: Collapsed Numident race information
Earnings: Average annual earnings across all jobs (quartiles) if bestrace is missing
POB_race: Collapsed POB categories based on the modal race of each POB
Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups
- Stage C (Education (age 25+); 85% missing data)
Stage C reported data comes from survey data sources. Education data is only used/imputed for workers age 25+. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing education data are much more likely to be MAR than the workers in stage A. However, the share of missing data is significantly larger for stage C than for the earlier stages, leaving significantly less observed data available to build a model. For stage C we use two parallel approaches. The first uses a log-linear model to impute education; the second is the non-parametric approach used in stages A and B. Both condition on similar characteristics. However, the log-linear approach does not require fully interacted effects across all conditioning variables, providing additional modeling flexibility given the limited reported education data available. The additional flexibility is used to control for smaller geographic groups than the non-parametric approach allows.
- Cell Variables
Gender
DOB
Race/Ethnicity
POB_race_educ: Collapsed POB based on country specific immigrant education levels.
Earnings: Average annual earnings across all jobs (ventiles)
Earnings FB: Low and high earnings category for foreign born workers
Industry: Collapsed NAICS industry groups based on observed education
Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups
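Two of the Stage A cell variables listed above are simple threshold rules and can be sketched directly. The function names, the representation of quarters as (year, quarter) tuples, and the example inputs are assumptions made for illustration. The Modal Non-Native variable is omitted because it depends on co-worker POB data not shown here.

```python
def coworker_male_cell(fraction_male):
    """Return 'high' if more than half of co-workers are male, else 'low'."""
    return "high" if fraction_male > 0.5 else "low"


def new_worker(worker_first_quarter, state_first_quarter):
    """Return 1 if the worker enters after the state's first quarter, else 0.

    Quarters are (year, quarter) tuples, which compare chronologically.
    """
    return 1 if worker_first_quarter > state_first_quarter else 0


# Hypothetical worker: 62% male co-workers, first seen in 2004Q3 in a state
# whose data begin in 1995Q1.
cell_id = (coworker_male_cell(0.62), new_worker((2004, 3), (1995, 1)))
print(cell_id)  # ('high', 1)
```

Combining the derived values into a tuple gives a cell identifier of the kind that groups workers for within-cell imputation.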
4.1.3. Codebook: The ICF_US File
Table Metadata for Individual Characteristics File (ICF_US)
State Approval Required | IRS Approval Required | SSA Approval Required | |
---|---|---|---|
Access Requirements | X |
- Description
Demographics, place of birth, and education of workers.
- Scope
National
- Key
PIK
- Sort Order
PIK
- File Format
SAS Data Table
Variable Information
Variable Name | Type | Length | Description |
---|---|---|---|
PIK | char | 9 | Protected Identification Key |
DOB | num | 4 | Date of birth |
SEX | char | 1 | Gender (See details below) |
POB | char | 1 | Place of birth (See details below) |
RACE | char | 1 | Race (See details below) |
ETHNICITY | char | 1 | Ethnicity (See details below) |
EDUC_C | char | 1 | Highest educational attainment (age 25+) (See details below) |
SEX_IMPUTED | char | 1 | Imputation status for sex (See details below) |
RACE_IMPUTED | char | 1 | Imputation status for race (See details below) |
ETHNICITY_IMPUTED | char | 1 | Imputation status of ethnicity (See details below) |
EDUC_C_IMPUTED | char | 1 | Imputation status of educ_c (See details below) |
POB_IMPUTED | char | 1 | Imputation status for POB (See details below) |
DOB_IMPUTED | char | 1 | Imputation status for DOB (See details below) |
Details for variable SEX on ICF_US
- Description
Gender
- Codebook
Value | Label |
---|---|
F | Female |
M | Male |
Details for variable POB on ICF_US
- Description
Place of birth
- Codebook
Value | Label |
---|---|
1 | Central Asia |
2 | South East Asia |
3 | Middle East and North Africa |
4 | Caribbean |
5 | Central America |
6 | South America |
7 | Africa |
8 | Oceania |
9 | Not Specified |
A | US or territory |
B | Mexico |
C | Philippines |
D | Vietnam |
E | India |
F | Germany |
G | Puerto Rico |
H | El Salvador |
I | Cuba |
J | United Kingdom |
K | Canada |
L | China |
M | South Korea |
N | Taiwan |
O | Guatemala |
P | Japan |
Q | Haiti |
R | USSR Core |
S | Jamaica |
T | Colombia |
U | Poland |
V | Iran |
W | Dominican Republic |
X | Italy |
Y | Former Socialist Europe |
Z | Western Europe |
Details for variable RACE on ICF_US
- Description
Race
- Codebook
Value | Label |
---|---|
1 | White |
2 | Black |
3 | American Indian/Alaska Native |
4 | Asian |
5 | Native Hawaiian/Pacific Islander |
7 | Two or more races |
Details for variable ETHNICITY on ICF_US
- Description
Ethnicity
- Codebook
Value | Label |
---|---|
H | Hispanic |
N | Non-Hispanic |
Details for variable EDUC_C on ICF_US
- Description
Highest educational attainment (age 25+)
- Codebook
Value | Label |
---|---|
1 | Less than high school |
2 | High school |
3 | Some college |
4 | Bachelor's degree or higher |
Details for variables: SEX_IMPUTED, RACE_IMPUTED, ETHNICITY_IMPUTED, EDUC_C_IMPUTED, POB_IMPUTED, DOB_IMPUTED on ICF_US
- Description
SEX_IMPUTED: Imputation status for sex
RACE_IMPUTED: Imputation status for race
ETHNICITY_IMPUTED: Imputation status of ethnicity
EDUC_C_IMPUTED: Imputation status of educ_c
POB_IMPUTED: Imputation status for POB
DOB_IMPUTED: Imputation status for DOB
- Codebook
Value | Label |
---|---|
1 | Demographic characteristic observed |
2 | Demographic characteristic imputed |
3 | Demographic characteristic imputed (observed data unreliable) |
4.1.4. Codebook: The ICF_US_IMPLICATES_AGE_SEX_POB File
Table Metadata for Date of Birth/Sex/Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)
State Approval Required | IRS Approval Required | SSA Approval Required | |
---|---|---|---|
Access Requirements | X |
- Description
Multiply imputed date of birth, sex, and place of birth variables to complete missing information.
- Scope
National
- Key
PIK
- Sort Order
PIK
- File Format
SAS Data Table
Variable Information
Variable Name | Type | Length | Description |
---|---|---|---|
PIK | char | 9 | Protected Identification Key |
DOB1-DOB10 | num | 4 | Date of Birth Implicates |
SEX1-SEX10 | char | 1 | Gender Implicates (See details below) |
POB1-POB10 | char | 1 | Place of Birth Implicates (See details below) |
Details for variables SEX1-SEX10 on ICF_US_IMPLICATES_AGE_SEX_POB
- Description
Gender Implicates
- Codebook
Value | Label |
---|---|
F | Female |
M | Male |
Details for variables POB1-POB10 on ICF_US_IMPLICATES_AGE_SEX_POB
- Description
Place of Birth Implicates
- Codebook
Value | Label |
---|---|
1 | Central Asia |
2 | South East Asia |
3 | Middle East and North Africa |
4 | Caribbean |
5 | Central America |
6 | South America |
7 | Africa |
8 | Oceania |
9 | Not Specified |
A | US or territory |
B | Mexico |
C | Philippines |
D | Vietnam |
E | India |
F | Germany |
G | Puerto Rico |
H | El Salvador |
I | Cuba |
J | United Kingdom |
K | Canada |
L | China |
M | South Korea |
N | Taiwan |
O | Guatemala |
P | Japan |
Q | Haiti |
R | USSR Core |
S | Jamaica |
T | Colombia |
U | Poland |
V | Iran |
W | Dominican Republic |
X | Italy |
Y | Former Socialist Europe |
Z | Western Europe |
4.1.5. Codebook: The ICF_US_IMPLICATES_EDUCATION File
Table Metadata for Education Implicates (ICF_US_IMPLICATES_EDUCATION)
State Approval Required | IRS Approval Required | SSA Approval Required | |
---|---|---|---|
Access Requirements | X |
- Description
Multiply imputed education variables to complete missing information.
- Scope
National
- Key
PIK
- Sort Order
PIK
- File Format
SAS Data Table
Variable Information
Variable Name | Type | Length | Description |
---|---|---|---|
PIK | char | 9 | Protected Identification Key |
EDUC_C1-EDUC_C10 | char | 1 | Highest educational attainment (age 25+) Implicates (See details below) |
Details for variables EDUC_C1-EDUC_C10 on ICF_US_IMPLICATES_EDUCATION
- Description
Highest educational attainment (age 25+) Implicates
- Codebook
Value | Label |
---|---|
1 | Less than high school |
2 | High school |
3 | Some college |
4 | Bachelor's degree or higher |
4.1.6. Codebook: The ICF_US_IMPLICATES_RACE_ETHNICITY File
Table Metadata for Race/Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)
State Approval Required | IRS Approval Required | SSA Approval Required | |
---|---|---|---|
Access Requirements | X |
- Description
Multiply imputed race and ethnicity variables to complete missing information.
- Scope
National
- Key
PIK
- Sort Order
PIK
- File Format
SAS Data Table
Variable Information
Variable Name | Type | Length | Description |
---|---|---|---|
PIK | char | 9 | Protected Identification Key |
RACE1-RACE10 | char | 1 | Race Implicates (See details below) |
ETHNICITY1-ETHNICITY10 | char | 1 | Ethnicity Implicates (See details below) |
Details for variables RACE1-RACE10 on ICF_US_IMPLICATES_RACE_ETHNICITY
- Description
Race Implicates
- Codebook
Value | Label |
---|---|
1 | White |
2 | Black |
3 | American Indian/Alaska Native |
4 | Asian |
5 | Native Hawaiian/Pacific Islander |
7 | Two or more races |
Details for variables ETHNICITY1-ETHNICITY10 on ICF_US_IMPLICATES_RACE_ETHNICITY
- Description
Ethnicity Implicates
- Codebook
Value | Label |
---|---|
H | Hispanic |
N | Non-Hispanic |