4.1. Individual Characteristics Files - Demographics

4.1.1. Overview

The Individual Characteristics Files (ICF) family of files contains person-specific demographic information. Characteristics include gender, age/date of birth (DOB), place of birth (POB), race, ethnicity, and education. Completed information is available for all individuals who appear in the wage data (workers).

  • Individual Characteristics File (ICF_US)

    This is the core table with characteristics for all workers. There is one record for every person on the EHF. Characteristics have been completed using a combination of observed data and imputation.

    • Scope: National

    • Key: PIK

  • Date of Birth, Sex, and Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)

  • Education Implicates (ICF_US_IMPLICATES_EDUCATION)

  • Race and Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)

    Workers with incomplete information in the observed data table receive imputed characteristics using a set of missing data models. When a variable is missing, multiple implicates (10 draws from the Posterior Predictive Distribution, or PPD) are provided. The IMPLICATES tables contain one record for each worker missing one or more of the characteristics provided on that table.

    • Scope: National

    • Key: PIK

4.1.2. User Guidance

Imputation Flagging

On the core ICF table (ICF_US), each demographic characteristic for a PIK is reported with its observed value if one is available and with an imputed value otherwise. When a characteristic is imputed, the value on ICF_US is the first implicate from the corresponding IMPLICATES table. Each characteristic has an associated impute flag ([char]_imputed) that takes one of three values: 1 if the value was observed, 2 if it was imputed, and 3 if reported information failed internal consistency checks and was replaced with an imputed value.
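The relationship between ICF_US and the IMPLICATES tables can be sketched as follows. This is a minimal illustration in Python, assuming both tables have already been read into dictionaries for a single PIK; the record layout, the sample values, and the function name completed_value are hypothetical, while the variable names and flag codes follow the codebook sections of this chapter.

```python
# Flag codes from the ICF_US codebook: "1" observed, "2" imputed,
# "3" imputed because the reported value failed consistency checks.
OBSERVED = "1"


def completed_value(icf_row, implicates_row, var, m):
    """Return the value of `var` in the m-th completed data set (m = 1..10).

    icf_row        -- dict for one PIK from ICF_US (hypothetical layout)
    implicates_row -- dict for the same PIK from the matching IMPLICATES
                      table; only consulted when the value is not observed
    """
    if not 1 <= m <= 10:
        raise ValueError("implicate number must be between 1 and 10")
    if icf_row[f"{var}_IMPUTED"] == OBSERVED:
        return icf_row[var]              # observed: identical in every data set
    return implicates_row[f"{var}{m}"]   # imputed: take the m-th draw


# Made-up records for one worker whose sex was imputed.
icf = {"PIK": "000000001", "SEX": "F", "SEX_IMPUTED": "2"}
implicates = {"PIK": "000000001"}
implicates.update({f"SEX{i}": v for i, v in enumerate("FFMFFMFFFM", start=1)})

print(completed_value(icf, implicates, "SEX", 1))  # first implicate, as on ICF_US
print(completed_value(icf, implicates, "SEX", 3))  # third implicate
```

Looping m from 1 to 10 yields ten completed data sets that differ only in the imputed values.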

Estimated standard errors should be adjusted when using characteristics from the ICF. Please see McKinney et al. [2021] for a method of producing standard errors that account for the additional imputation variability.
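As background for this adjustment, the standard multiple-imputation combining rules (Rubin's rules) are sketched below. This is a generic textbook illustration and not necessarily the procedure described in McKinney et al. [2021]; the function name combine_implicates and the input numbers are made up for the example.

```python
import math
import statistics


def combine_implicates(estimates, variances):
    """Combine per-implicate results with Rubin's rules.

    estimates -- point estimate computed on each completed data set
    variances -- squared standard error of each of those estimates
    Returns (combined estimate, combined standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two implicates")
    q_bar = statistics.fmean(estimates)       # combined point estimate
    within = statistics.fmean(variances)      # average within-implicate variance
    between = statistics.variance(estimates)  # between-implicate variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, math.sqrt(total)


# Made-up example: a share estimated separately on each of 10 implicates.
est = [0.31, 0.33, 0.32, 0.30, 0.34, 0.32, 0.31, 0.33, 0.32, 0.32]
var = [0.0001] * 10
q, se = combine_implicates(est, var)
print(round(q, 3), round(se, 4))
```

The combined standard error exceeds the single-implicate standard error whenever the estimates vary across implicates, which is the additional imputation variability that a single completed data set ignores.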

Observed Demographic Data

Demographic information is received by LEHD from the following data sources:

  • Decennial Census
    • 2000 short form

    • 2000 long form

    • 2010 short form

  • Census Numident
    • Edited and collapsed longitudinal transaction data derived from the Social Security Administration (SSA) Numident

  • American Community Survey (ACS)
    • 2001-present

Observed data for sex, date of birth, place of birth, race, and ethnicity may come from any of the above sources. The level of education is provided only on the 2000 long form and the ACS. The coding systems are reconciled across the data sources and over time to account for changes in coding. The observed data are unduplicated and linked using the person identifier.

Protected Identification Key (PIK) Assignment and Unduplication

While the underlying person identifier is the Social Security Number (SSN), each SSN is replaced with a different 9-digit Protected Identification Key (PIK) that maps one-to-one to the SSN. The Census Numident (as well as wage record data) uses the PIK as the person record identifier by default; all other data sources have a PIK attached to the person record using the Census Bureau’s Person Identification Validation System (PVS). If a record contains an SSN, name, and DOB, then the match is first attempted using those three pieces of information. If the information is consistent with the information on the Numident, then the record is verified. However, the survey data sources (ACS and Decennial) typically do not have all three pieces of information required for verification. Instead, a PIK is typically assigned to the survey data record using validation. The primary method for validation is the GeoSearch method. Using administrative residence information attached to the Numident, the search for a name and DOB match is limited to SSN records residing at the same physical location as the survey record. This greatly reduces the number of match candidates, improving the quality of the match. A general name and DOB search is also performed for the residual records that do not match using the GeoSearch method. Overall match rates are high, with at least 90% of survey respondents receiving an assigned PIK. Note that assignment is not exclusive: multiple survey records at a point in time may receive the same PIK. For more information on the PIK process, see Wagner and Layne [2014] and Mulrow et al. [2011].

LEHD runs an unduplication process using additional characteristics present on both the Numident and the Survey data, resulting in one record per PIK for each data source-year.

Imputation Models

Each data source (Numident, Decennial, ACS) is the primary input for at least one of the observed demographic characteristics on the ICF_US file. The table below shows the ICF_US variable and the data sources used for each variable. When multiple reports exist, the most recent information is preferred for all characteristics.

Data Sources Used for Characteristics in the ICF_US

Characteristic  Source                                    Percent Missing  Model Stage
Gender          Numident                                  ~5%              A
DOB             Numident                                  ~5%              A
POB             Numident, Decennial (2000 and 2010), ACS  ~5%              A
Race            Decennial (2000 and 2010), ACS            ~20%             B
Ethnicity       Decennial (2000 and 2010), ACS            ~20%             B
Education       Decennial (2000), ACS                     ~85%             C
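The preference for the most recent report, stated before the table, can be sketched as follows. This is a minimal Python illustration under assumed inputs: the list-of-dicts layout, the 'year' and 'source' keys, the sample values, and the function name latest_report are all hypothetical and do not describe the production process.

```python
def latest_report(reports, var):
    """Pick the most recent non-missing value of `var` for one PIK.

    reports -- list of dicts, one per source-year record (hypothetical layout)
    Returns (value, source), or (None, None) when nothing was reported.
    """
    usable = [r for r in reports if r.get(var) is not None]
    if not usable:
        return None, None
    best = max(usable, key=lambda r: r["year"])  # latest source-year wins
    return best[var], best["source"]


# Made-up reports for a single PIK.
reports = [
    {"source": "Decennial 2000", "year": 2000, "RACE": "1", "EDUC_C": "2"},
    {"source": "Decennial 2010", "year": 2010, "RACE": "1", "EDUC_C": None},
    {"source": "ACS", "year": 2015, "RACE": "7", "EDUC_C": "3"},
]
print(latest_report(reports, "RACE"))
print(latest_report(reports[:2], "EDUC_C"))
```

In the second call the 2010 record carries no education value, so the 2000 report is used even though it is older.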

Gender, DOB, and POB have the smallest amount of missing data (~5%), race and ethnicity have the next highest amount (~20%), and education has the largest amount (~85%). The missing data follow a monotone pattern: one set of records is missing all variables, the next set is missing race, ethnicity, and education, and the final set is missing only education. The monotone pattern allows us to complete the data in stages, starting with the characteristics that are least often missing and then moving on to the next set of variables, continuing until all characteristics are complete.
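The monotone pattern can be made concrete with a small check. The sketch below is illustrative Python, assuming each record is a dictionary in which a missing characteristic is stored as None; that layout and the function name first_missing_stage are assumptions, and the variable names are taken from the ICF_US codebook.

```python
STAGES = [
    ("A", ("SEX", "DOB", "POB")),
    ("B", ("RACE", "ETHNICITY")),
    ("C", ("EDUC_C",)),
]


def first_missing_stage(record):
    """Return the first stage ('A', 'B', or 'C') at which a record needs
    imputation, or None if every characteristic is observed.

    Raises ValueError if the record breaks the monotone pattern.
    """
    missing = [any(record.get(v) is None for v in names) for _, names in STAGES]
    # Monotone: once a stage is missing, every later stage is missing too.
    for earlier, later in zip(missing, missing[1:]):
        if earlier and not later:
            raise ValueError("record does not follow the monotone pattern")
    for (stage, _), is_missing in zip(STAGES, missing):
        if is_missing:
            return stage
    return None


# Made-up record: stage A variables observed, race/ethnicity/education missing.
only_a_observed = {"SEX": "F", "DOB": 11000, "POB": "A",
                   "RACE": None, "ETHNICITY": None, "EDUC_C": None}
print(first_missing_stage(only_a_observed))
```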

In stage A we complete gender, DOB, and POB; that is, each missing value is replaced with an implicate (the value returned by an imputation). With these variables completed, we move on to stage B and impute race and ethnicity conditional on the DOB, gender, and POB values from stage A. In the final stage, stage C, we complete education conditional on the variables completed in stages A and B. In addition to the demographic characteristics themselves, we also use information from a worker’s earnings history and information about the worker’s co-workers and co-residents.

We use a Classification and Regression Tree (CART) based approach to cluster groups of similar workers together into cells/bins. Our imputation models are defined by the cells we assign workers to. The more homogeneous the imputation variable (e.g., education) is within a cell, the greater the predictive power of the model. For example, if almost all of the workers with observed education within a cell have college degrees, the implicates for each worker with missing data in that cell will have low variability, resulting in a high-quality impute. Ideally, once workers are assigned to cells, the variability in education outcomes will be large across cells and small within cells.
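One simple way to see what within-cell homogeneity means is the share of observed workers in a cell who hold the most common value. The sketch below is illustrative Python only; the measure, the function name modal_share, and the sample cells are assumptions for the example and are not the criterion used to build the tree.

```python
from collections import Counter


def modal_share(values):
    """Share of observed values in a cell equal to the most common value."""
    if not values:
        raise ValueError("cell has no observed values")
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


# Made-up observed EDUC_C codes in two cells.
homogeneous_cell = ["4"] * 19 + ["3"]   # nearly everyone has code 4
mixed_cell = ["1", "2", "3", "4"] * 5   # evenly spread across codes

print(modal_share(homogeneous_cell))
print(modal_share(mixed_cell))
```

Implicates drawn for a worker in the first cell would almost always agree with one another, while implicates drawn in the second cell would vary widely.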

The depth of the tree (number of cells) is limited by the amount of reported information for each characteristic. For example, characteristics such as gender, DOB, and POB have substantially more available information than education. Once the set of cells is determined at each stage, the records with reported data are assigned a cell ID. Within each cell, a joint Kernel Density Estimate (KDE) is estimated for the variables that need to be completed at that stage (e.g., gender, DOB, and POB in stage A). Implicates for workers with missing data are generated by sampling randomly from the estimated joint density for the worker’s cell ID. Implicit in our estimation strategy is the assumption that data are missing at random (MAR) within each cell ID.
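The within-cell sampling step can be illustrated with a simplified stand-in. The Python sketch below draws implicates from the empirical distribution of observed values in the same cell; it does not estimate a kernel density and handles one variable at a time rather than a joint density, so it should be read as an illustration of the idea only. The record layout, the 'cell' and 'id' keys, and the function name draw_implicates are hypothetical.

```python
import random
from collections import defaultdict


def draw_implicates(records, var, n_draws=10, seed=0):
    """Draw n_draws values of `var` for each record where it is missing,
    sampling at random from observed values in the same cell.

    Returns {record id: [draw_1, ..., draw_n]}.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)  # cell ID -> observed values in that cell
    for r in records:
        if r[var] is not None:
            donors[r["cell"]].append(r[var])
    implicates = {}
    for r in records:
        if r[var] is None:
            pool = donors.get(r["cell"])
            if not pool:
                raise ValueError(f"cell {r['cell']!r} has no observed donors")
            implicates[r["id"]] = [rng.choice(pool) for _ in range(n_draws)]
    return implicates


# Made-up records: cell 1 is homogeneous, cell 2 is mixed.
records = [
    {"id": "w1", "cell": 1, "EDUC_C": "4"},
    {"id": "w2", "cell": 1, "EDUC_C": "4"},
    {"id": "w3", "cell": 1, "EDUC_C": None},
    {"id": "w4", "cell": 2, "EDUC_C": "1"},
    {"id": "w5", "cell": 2, "EDUC_C": "2"},
    {"id": "w6", "cell": 2, "EDUC_C": None},
]
draws = draw_implicates(records, "EDUC_C")
print(draws["w3"])
print(draws["w6"])
```

Sampling only from donors in the same cell is where the MAR-within-cell assumption enters: a worker with missing data is treated as exchangeable with the observed workers in that cell.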

The imputation models for each stage are shown below:

Stage A (Gender, DOB, POB; 5% missing data)

The missing data in stage A differ in nature from the missing data for workers who appear only in stages B and C. In principle, every worker with a formal employment arrangement should have an SSN. However, LEHD receives earnings records where the PIK does not exist on the Numident. The vast majority of these workers are likely to be persons residing in the U.S. without a legal right to work (i.e., illegal immigrants). These workers are clearly not MAR; the characteristics of workers in the general population are therefore unlikely to match the characteristics of the workers with missing data. To sidestep the issue, we assume that recent legal immigrants have characteristics similar to those of recent illegal immigrants (the workers with missing data), especially when the worker is employed at firms with a large share of recent legal immigrants.

  • Cell Variables
    • Modal Non-Native: If the co-worker fraction native is >0.5 then the modal non-native is set to U.S., else the modal non-native is set to the modal value across all co-workers at all jobs for each worker.

    • Co-Worker Fraction Male: If the fraction male is >0.5 then high, else low.

    • New-Worker: 1 if the worker entered the data after the first quarter in which the state appears, else 0. A worker is more likely to be an immigrant if they first appear after the first year the state reports data.
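Two of the stage A cell variables listed above can be sketched directly from their definitions. The Python below is a minimal illustration, assuming co-worker sex codes are available as a list and that entry dates are (year, quarter) pairs; the input layout and the function name stage_a_cell are hypothetical, and the Modal Non-Native variable is omitted.

```python
def stage_a_cell(coworker_sexes, worker_first_qtr, state_first_qtr):
    """Return (co-worker fraction male category, new-worker flag).

    coworker_sexes   -- list of 'M'/'F' codes for the worker's co-workers
    worker_first_qtr -- (year, quarter) in which the worker first appears
    state_first_qtr  -- (year, quarter) in which the state first appears
    """
    if not coworker_sexes:
        raise ValueError("worker has no co-workers with known sex")
    frac_male = sum(s == "M" for s in coworker_sexes) / len(coworker_sexes)
    male_category = "high" if frac_male > 0.5 else "low"
    # Tuples compare year first, then quarter.
    new_worker = 1 if worker_first_qtr > state_first_qtr else 0
    return male_category, new_worker


print(stage_a_cell(["M", "M", "F"], (2005, 3), (1998, 1)))
print(stage_a_cell(["M", "F"], (1998, 1), (1998, 1)))
```

A fraction male of exactly 0.5 falls in the low category because the definition uses a strict inequality.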

Stage B (Race, Ethnicity; 20% missing data)

Stage B reported data comes from survey data sources. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing race and ethnicity data are much more likely to be MAR than the workers in stage A.

  • Cell Variables
    • Gender

    • DOB

    • Bestrace: Collapsed Numident race information

    • Earnings: Average annual earnings across all jobs (quartiles) if bestrace is missing

    • POB_race: Collapsed POB categories based on the modal race of each POB

    • Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups

Stage C (Education (age 25+); 85% missing data)

Stage C reported data comes from survey data sources. Education data is only used/imputed for workers age 25+. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing education data are much more likely to be MAR than the workers in stage A. However, the share of missing data is significantly larger for stage C than for the earlier stages, leaving significantly less observed data available to build a model. For stage C we use two parallel approaches. The first approach uses a log-linear model to impute education based on characteristics similar to those used in the second approach, which is the non-parametric approach used in stages A and B. Unlike the non-parametric approach, the log-linear approach does not require fully interacted effects across all conditioning variables, providing additional modeling flexibility given the limited reported education data available. The additional modeling flexibility is used to control for smaller geographic groups than the non-parametric approach can.

  • Cell Variables
    • Gender

    • DOB

    • Race/Ethnicity

    • POB_race_educ: Collapsed POB based on country specific immigrant education levels.

    • Earnings: Average annual earnings across all jobs (ventiles)

    • Earnings FB: Low and high earnings category for foreign born workers

    • Industry: Collapsed NAICS industry groups based on observed education

    • Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups
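The earnings cell variables group workers into quantile bins: quartiles in stage B and ventiles in stage C. The Python sketch below shows one way to form such bins with the standard library. The cut-point method (the default of statistics.quantiles), the handling of ties, and the function name earnings_bins are assumptions for illustration, not a description of the production code.

```python
import bisect
import statistics


def earnings_bins(earnings, n_bins):
    """Assign each earnings value to a quantile bin numbered 1..n_bins."""
    cuts = statistics.quantiles(earnings, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_left(cuts, e) + 1 for e in earnings]


# Made-up average annual earnings: 1,000 to 100,000 in steps of 1,000.
earnings = list(range(1000, 101000, 1000))

quartile = earnings_bins(earnings, 4)    # as in stage B
ventile = earnings_bins(earnings, 20)    # as in stage C
print(quartile[0], quartile[-1])
print(ventile[0], ventile[-1])
```

Ventiles give a much finer earnings gradient than quartiles, at the cost of fewer observed workers per bin.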

4.1.3. Codebook: The ICF_US File

Table Metadata for Individual Characteristics File (ICF_US)

Access Requirements for ICF_US
State Approval Required IRS Approval Required SSA Approval Required
Access Requirements X
Description: Demographics, place of birth, and education of workers.
Scope: National
Key: PIK
Sort Order: PIK
File Format: SAS Data Table

Variable Information

Variable Information for ICF_US
Variable Name Type Length Description
PIK char 9 Protected Identification Key
DOB num 4 Date of birth
SEX char 1 Gender (See details below)
POB char 1 Place of birth (See details below)
RACE char 1 Race (See details below)
ETHNICITY char 1 Ethnicity (See details below)
EDUC_C char 1 Highest educational attainment (age 25+) (See details below)
SEX_IMPUTED char 1 Imputation status for sex (See details below)
RACE_IMPUTED char 1 Imputation status for race (See details below)
ETHNICITY_IMPUTED char 1 Imputation status of ethnicity (See details below)
EDUC_C_IMPUTED char 1 Imputation status of educ_c (See details below)
POB_IMPUTED char 1 Imputation status for POB (See details below)
DOB_IMPUTED char 1 Imputation status for DOB (See details below)
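The codebook lists DOB as a numeric variable in a SAS data table but does not state its encoding. The Python sketch below assumes, without confirmation from this documentation, that DOB is stored as a SAS date value (days since 1 January 1960) and shows how a user working outside SAS might convert it and check the age 25+ condition that applies to EDUC_C. The function names and the reference date are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_date_to_date(days):
    """Convert a SAS date value to a Python date (assumed DOB encoding)."""
    return SAS_EPOCH + timedelta(days=days)


def age_on(dob, ref):
    """Age in completed years on the reference date."""
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)


dob = sas_date_to_date(11123)   # made-up DOB value
ref = date(2015, 6, 15)         # made-up reference date
print(dob.isoformat(), age_on(dob, ref), age_on(dob, ref) >= 25)
```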

Details for variable SEX on ICF_US

Description: Gender

Value Label
F Female
M Male

Details for variable POB on ICF_US

Description: Place of birth

Value Label
1 Central Asia
2 South East Asia
3 Middle East and North Africa
4 Caribbean
5 Central America
6 South America
7 Africa
8 Oceania
9 Not Specified
A US or territory
B Mexico
C Philippines
D Vietnam
E India
F Germany
G Puerto Rico
H El Salvador
I Cuba
J United Kingdom
K Canada
L China
M South Korea
N Taiwan
O Guatemala
P Japan
Q Haiti
R USSR Core
S Jamaica
T Colombia
U Poland
V Iran
W Dominican Republic
X Italy
Y Former Socialist Europe
Z Western Europe

Details for variable RACE on ICF_US

Description: Race

Value Label
1 White
2 Black
3 American Indian/Alaska Native
4 Asian
5 Native Hawaiian/Pacific Islander
7 Two or more races

Details for variable ETHNICITY on ICF_US

Description: Ethnicity

Value Label
H Hispanic
N Non-Hispanic

Details for variable EDUC_C on ICF_US

Description: Highest educational attainment (age 25+)

Value Label
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher

Details for variable SEX_IMPUTED on ICF_US

Description: Imputation status for sex

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

Details for variable RACE_IMPUTED on ICF_US

Description: Imputation status for race

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

Details for variable ETHNICITY_IMPUTED on ICF_US

Description: Imputation status of ethnicity

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

Details for variable EDUC_C_IMPUTED on ICF_US

Description: Imputation status of educ_c

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

Details for variable POB_IMPUTED on ICF_US

Description: Imputation status for POB

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

Details for variable DOB_IMPUTED on ICF_US

Description: Imputation status for DOB

Value Label
1 Demographic characteristic observed
2 Demographic characteristic imputed
3 Demographic characteristic imputed (observed data unreliable)

4.1.4. Codebook: The ICF_US_IMPLICATES_AGE_SEX_POB File

Table Metadata for Date of Birth/Sex/Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)

Access Requirements for ICF_US_IMPLICATES_AGE_SEX_POB
State Approval Required IRS Approval Required SSA Approval Required
Access Requirements X
Description: Multiply imputed date of birth, sex, and place of birth variables to complete missing information.
Scope: National
Key: PIK
Sort Order: PIK
File Format: SAS Data Table

Variable Information

Variable Information for ICF_US_IMPLICATES_AGE_SEX_POB
Variable Name Type Length Description
PIK char 9 Protected Identification Key
DOB1-DOB10 num 4 Date of Birth Implicates
SEX1-SEX10 char 1 Gender Implicates (See details below)
POB1-POB10 char 1 Place of Birth Implicates (See details below)

Details for variables SEX1-SEX10 on ICF_US_IMPLICATES_AGE_SEX_POB

Description: Gender Implicates

Value Label
F Female
M Male

Details for variables POB1-POB10 on ICF_US_IMPLICATES_AGE_SEX_POB

Description: Place of Birth Implicates

Value Label
1 Central Asia
2 South East Asia
3 Middle East and North Africa
4 Caribbean
5 Central America
6 South America
7 Africa
8 Oceania
9 Not Specified
A US or territory
B Mexico
C Philippines
D Vietnam
E India
F Germany
G Puerto Rico
H El Salvador
I Cuba
J United Kingdom
K Canada
L China
M South Korea
N Taiwan
O Guatemala
P Japan
Q Haiti
R USSR Core
S Jamaica
T Colombia
U Poland
V Iran
W Dominican Republic
X Italy
Y Former Socialist Europe
Z Western Europe

4.1.5. Codebook: The ICF_US_IMPLICATES_EDUCATION File

Table Metadata for Education Implicates (ICF_US_IMPLICATES_EDUCATION)

Access Requirements for ICF_US_IMPLICATES_EDUCATION
State Approval Required IRS Approval Required SSA Approval Required
Access Requirements X
Description: Multiply imputed education variables to complete missing information.
Scope: National
Key: PIK
Sort Order: PIK
File Format: SAS Data Table

Variable Information

Variable Information for ICF_US_IMPLICATES_EDUCATION
Variable Name Type Length Description
PIK char 9 Protected Identification Key
EDUC_C1-EDUC_C10 char 1 Highest educational attainment (age 25+) Implicates (See details below)

Details for variables EDUC_C1-EDUC_C10 on ICF_US_IMPLICATES_EDUCATION

Description: Highest educational attainment (age 25+) Implicates

Value Label
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher

4.1.6. Codebook: The ICF_US_IMPLICATES_RACE_ETHNICITY File

Table Metadata for Race/Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)

Access Requirements for ICF_US_IMPLICATES_RACE_ETHNICITY
State Approval Required IRS Approval Required SSA Approval Required
Access Requirements X
Description: Multiply imputed race and ethnicity variables to complete missing information.
Scope: National
Key: PIK
Sort Order: PIK
File Format: SAS Data Table

Variable Information

Variable Information for ICF_US_IMPLICATES_RACE_ETHNICITY
Variable Name Type Length Description
PIK char 9 Protected Identification Key
RACE1-RACE10 char 1 Race Implicates (See details below)
ETHNICITY1-ETHNICITY10 char 1 Ethnicity Implicates (See details below)

Details for variables RACE1-RACE10 on ICF_US_IMPLICATES_RACE_ETHNICITY

Description: Race Implicates

Value Label
1 White
2 Black
3 American Indian/Alaska Native
4 Asian
5 Native Hawaiian/Pacific Islander
7 Two or more races

Details for variables ETHNICITY1-ETHNICITY10 on ICF_US_IMPLICATES_RACE_ETHNICITY

Description: Ethnicity Implicates

Value Label
H Hispanic
N Non-Hispanic