4.1. Individual Characteristics Files - Demographics

4.1.1. Overview

The Individual Characteristics Files (ICF) family of files contains person-specific demographic information. Characteristics include sex, age/date of birth (DOB), place of birth (POB), race, ethnicity, and education. Completed information is available for all individuals who appear in the wage data (workers).

Individual Characteristics File (ICF_US)

This is the core table with characteristics for all workers. There is one record for every person on the Employment History File (EHF). Characteristics have been completed using a combination of observed data and imputation.
- Scope: National
- Key: PIK
Date of Birth, Sex, and Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)
Education Implicates (ICF_US_IMPLICATES_EDUCATION)
Race and Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)

Workers with incomplete information in the observed data table receive imputed characteristics using a set of missing data models. When a variable is missing, multiple implicates (10 draws from the Posterior Predictive Distribution, or PPD) are provided. The IMPLICATES tables contain one record for each worker missing one or more of the characteristics provided on that table.
- Scope: National
- Key: PIK

4.1.2. User Guidance

Imputation Flagging

On the core ICF table (ICF_US), each demographic characteristic for a PIK is reported with an observed value if one is available and is otherwise imputed. When a characteristic is imputed, the first implicate from the corresponding IMPLICATES table is used. Each characteristic has an associated impute flag ([char]_imputed) that shows whether the value was observed (1) or imputed (2). A flag value of 3 identifies characteristics where the reported information failed internal consistency checks and was imputed. In that case the imputed value is the one given on the ICF_US table.
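As a minimal sketch of how the flags might be used, the Python snippet below tabulates the impute flag for one characteristic and keeps only the observed values. The records are invented toy data; only the variable names (pik, sex, sex_imputed) and the flag codes come from this documentation.

```python
from collections import Counter

# Flag codes as documented for the [char]_imputed variables.
FLAG_LABELS = {
    "1": "observed",
    "2": "imputed",
    "3": "imputed (observed data unreliable)",
}

# Toy records, invented for illustration; real ICF_US rows are keyed by pik.
records = [
    {"pik": "000000001", "sex": "F", "sex_imputed": "1"},
    {"pik": "000000002", "sex": "M", "sex_imputed": "2"},
    {"pik": "000000003", "sex": "F", "sex_imputed": "3"},
    {"pik": "000000004", "sex": "M", "sex_imputed": "1"},
]


def flag_counts(rows, char):
    """Count records by imputation status for one characteristic."""
    return Counter(FLAG_LABELS[row[f"{char}_imputed"]] for row in rows)


def observed_only(rows, char):
    """Keep rows whose value for `char` was observed rather than imputed."""
    return [row for row in rows if row[f"{char}_imputed"] == "1"]


counts = flag_counts(records, "sex")
observed = observed_only(records, "sex")
print(counts)
print([row["pik"] for row in observed])
```

Restricting an analysis to observed values is only one possible use of the flags; whether it is appropriate depends on the missing data assumptions for the characteristic in question.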

Estimated standard errors should be adjusted when using characteristics from the ICF. Please see McKinney et al. [2021] for a method of producing standard errors that account for the additional imputation variability.

Observed Demographic Data

Demographic information is received by LEHD from the following data sources:

Decennial Census
- 2000 short form
- 2000 long form
- 2010 short form
Census Numident
- Edited and collapsed longitudinal transaction data derived from the Social Security Administration (SSA) Numident
American Community Survey (ACS)
- 2001-present

Observed data for sex, date of birth, place of birth, race, and ethnicity may come from any of the above sources. The level of education is provided only on the 2000 long form and the ACS. The coding systems are reconciled across the data sources and adjusted to account for changes over time. The observed data are unduplicated and linked using the person identifier.

Protected Identification Key (PIK) Assignment and Unduplication

While the underlying person identifier is the Social Security Number (SSN), each SSN is replaced with a different 9-digit Protected Identification Key (PIK) that maps one-to-one to an SSN. The Census Numident (as well as wage record data) uses the PIK as the person record identifier by default. All other data sources have a PIK attached to the person record using the Census Bureau’s Person Identification Validation System (PVS). If a record contains an SSN, name, and DOB, then the match is first attempted using those three pieces of information. If the information is consistent with the information on the Numident, then the record is verified. However, the survey data sources (ACS and Decennial) typically do not have all three pieces of information required for verification. Instead, a PIK is typically assigned to the survey data record using validation. The primary method for validation is the GeoSearch method. Using administrative residence information attached to the Numident, the search for a name and DOB match is limited to SSN records residing at the same physical location as the survey record. This greatly reduces the number of match candidates, improving the quality of the match. A general name and DOB search is also performed for the residual records that do not match using the GeoSearch method. Overall match rates are high, with at least 90% of survey respondents receiving an assigned PIK. Note that assignment is not exclusive, and multiple survey records at a point in time may receive the same PIK. For more information on the PIK process, see Wagner and Layne [2014] and Mulrow et al. [2011].
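The two-step search described above (a search restricted to the same physical location, followed by a general search for the residual records) can be illustrated with a toy sketch. This is not the PVS implementation: the exact-match rule, the field names (name, dob, address_id), and the records are all assumptions made for illustration.

```python
def normalize(name):
    """Crude name normalization: uppercase and keep letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def find_pik(survey_rec, reference):
    """Return a PIK for a survey record, or None if no unique match exists.

    Pass 1 restricts candidates to reference records at the same address
    (the geographic blocking step); pass 2 searches all reference records.
    """
    key = (normalize(survey_rec["name"]), survey_rec["dob"])
    same_address = [
        r for r in reference if r["address_id"] == survey_rec["address_id"]
    ]
    for candidates in (same_address, reference):
        hits = {
            r["pik"]
            for r in candidates
            if (normalize(r["name"]), r["dob"]) == key
        }
        if len(hits) == 1:
            return hits.pop()
    return None


# Toy reference file, invented for illustration.
reference = [
    {"pik": "000000001", "name": "Ana Diaz", "dob": "1980-02-01", "address_id": "A1"},
    {"pik": "000000002", "name": "Ana Diaz", "dob": "1980-02-01", "address_id": "A2"},
    {"pik": "000000003", "name": "Lee Wong", "dob": "1975-07-15", "address_id": "A3"},
]

# Two people share a name and DOB; the address restriction separates them.
blocked = find_pik(
    {"name": "ana diaz", "dob": "1980-02-01", "address_id": "A1"}, reference
)
# No reference record at this address; the general search finds a unique match.
general = find_pik(
    {"name": "Lee Wong", "dob": "1975-07-15", "address_id": "A9"}, reference
)
# Ambiguous in the general search and no address match, so no PIK is assigned.
unmatched = find_pik(
    {"name": "Ana Diaz", "dob": "1980-02-01", "address_id": "A9"}, reference
)
print(blocked, general, unmatched)
```

The first case shows why restricting the candidate set improves match quality: a name and DOB pair that is ambiguous nationally can be unique at a single address.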

LEHD runs an unduplication process using additional characteristics present on both the Numident and the survey data, resulting in one record per PIK for each data source-year.

Imputation Models

The LEHD demographic characteristics imputation system consists of a set of “full” information imputation models developed by the research staff in 2010, combined with a “limited” information production imputation model originally designed for the small set of new workers appearing each quarter. The imputed values for each worker (implicates) should have been updated on an annual basis. However, the full information imputation system was run only once, by the research staff in 2010, and is no longer operational. In the almost 15 years since that run, the quality of the imputed data has deteriorated as younger birth cohorts, who receive only the limited information impute, enter the data and older birth cohorts, who received the full information impute, exit. For the variables with relatively low missing data rates (sex, DOB, POB, race, and ethnicity), the quality impact of using the limited information imputation model is relatively mild, but for education, with a missing data rate of almost 80%, the quality impact is more severe. The impact is most severe for estimation cells in more recent earnings years that are composed primarily of younger workers. As a rule of thumb, the older the earnings year and/or the older the average age of the workers in an estimation cell, the higher the quality of the education data. A team of Census researchers is actively working to update and integrate the full information imputation system into regular production processing. This work is expected to be finished in 2025.

Data Sources and Missingness

Each data source (Numident, Decennial, ACS) is the primary input for at least one of the observed demographic characteristics on the ICF_US file. The table below shows the ICF_US variable and the data sources used for each variable. When multiple reports exist, the most recent information is preferred for all characteristics.

Data Sources Used for Characteristics in the ICF_US
Characteristic	Source	Percent Missing	Model Stage
Sex	Numident	~5%	A
DOB	Numident	~5%	A
POB	Numident, Decennial (2000 and 2010), ACS	~5%	A
Race	Decennial (2000 and 2010), ACS	~20%	B
Ethnicity	Decennial (2000 and 2010), ACS	~20%	B
Education	Decennial (2000), ACS	~80%	C

Sex, DOB, and POB have the smallest amount of missing data (~5%), race and ethnicity have the next highest amount of missing data (~20%), and education has the largest amount of missing data (~80%). The missing data follows a monotone missing data pattern, with one set of records missing all variables, the next set missing race/ethnicity/education, and the final set of records missing only education. The monotone missing data pattern allows us to complete the data in stages, starting from the least missing characteristics and then moving on to the next set of variables, continuing until all characteristics are complete.
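A monotone pattern can be checked record by record. The sketch below groups the variables into the three stages and tests whether, once a stage is missing, every later stage is missing as well. Representing a missing value as None and treating a stage as missing when any of its variables is missing are assumptions made for illustration.

```python
# Variables grouped by model stage (A, B, C), in order of increasing missingness.
STAGES = [("sex", "dob", "pob"), ("race", "ethnicity"), ("educ_c",)]


def missing_pattern(record):
    """One boolean per stage: True if any variable in the stage is None."""
    return tuple(any(record[v] is None for v in stage) for stage in STAGES)


def is_monotone(record):
    """True if every stage after the first missing stage is also missing."""
    pattern = missing_pattern(record)
    return all(later or not earlier for earlier, later in zip(pattern, pattern[1:]))


# Toy records, invented for illustration.
complete = {"sex": "F", "dob": "1980-01-01", "pob": "A",
            "race": "1", "ethnicity": "N", "educ_c": "3"}
educ_missing = {**complete, "educ_c": None}
race_eth_educ_missing = {**complete, "race": None, "ethnicity": None, "educ_c": None}
all_missing = dict.fromkeys(complete)      # every value is None
race_only_missing = {**complete, "race": None}  # breaks the monotone pattern

for rec in (complete, educ_missing, race_eth_educ_missing, all_missing, race_only_missing):
    print(missing_pattern(rec), is_monotone(rec))
```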

In stage A we complete sex, DOB, and POB, where “complete” means replacing a missing value with an implicate (the actual value returned from an imputation). With these variables completed we move on to stage B and impute race and ethnicity conditional on the DOB, sex, and POB values from stage A. In the final stage we complete education conditional on the variables imputed in both stage A and B. In addition to the demographic characteristics themselves we also use information from a worker’s earnings history and information about the worker’s co-workers/co-residents.

Full Information Imputation System

We use a Classification and Regression Tree (CART) based approach to cluster groups of similar workers together into cells/bins. Our imputation models are defined by the cells we assign workers to. The more homogeneous the imputation variable (e.g., education) is within a cell, the greater the predictive power of the model. For example, if almost all of the workers with observed education within a cell have college degrees, the implicates for each worker with missing data would have low variability, resulting in a high quality impute. Ideally, once workers are assigned to cells, the variability in education outcomes will be large across cells and small within cells.
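To make the idea of within-cell homogeneity concrete, the sketch below computes, for each cell, the share of observed workers in the modal education category. The modal share is just one simple homogeneity measure chosen for illustration; the cell IDs and values are invented, and this is not the criterion used to build the CART cells.

```python
from collections import Counter, defaultdict


def modal_share_by_cell(rows):
    """For each cell, the share of observed workers in the modal category.

    rows: iterable of (cell_id, educ_c) pairs for workers with observed education.
    """
    by_cell = defaultdict(list)
    for cell_id, educ in rows:
        by_cell[cell_id].append(educ)
    return {
        cell: Counter(values).most_common(1)[0][1] / len(values)
        for cell, values in by_cell.items()
    }


# Toy data: cell "A" is mostly code 4 (Bachelor's degree or higher),
# while cell "B" is spread evenly across all four categories.
rows = [("A", "4"), ("A", "4"), ("A", "4"), ("A", "4"), ("A", "3"),
        ("B", "1"), ("B", "2"), ("B", "3"), ("B", "4")]

shares = modal_share_by_cell(rows)
print(shares)
```

Implicates drawn for a worker in cell "A" would mostly agree with one another, while implicates drawn in cell "B" would vary widely, which corresponds to a lower quality impute.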

The depth of the tree (number of cells) is limited by the amount of reported information for each characteristic. For example, characteristics such as sex, DOB, and POB have substantially more available information than education. Once the set of cells is determined at each stage, the records with reported data are assigned a cell ID. Within each cell a joint Kernel Density Estimate (KDE) is estimated for the variables that need to be completed at that stage (e.g., sex, DOB, and POB in stage A). The estimated joint densities are used to generate implicates for workers with missing data by sampling randomly from the density for each worker’s cell ID. Implicit in our estimation strategy is the assumption that data is missing at random (MAR) within each cell ID.
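The within-cell draw can be sketched in a simplified form. The snippet below does not estimate a KDE; as a stand-in, it resamples with replacement from the observed values in the same cell (the empirical distribution), which conveys the idea of drawing 10 implicates conditional on cell ID. The record layout (pik, cell_id, value) and the data are hypothetical.

```python
import random
from collections import defaultdict

N_IMPLICATES = 10


def draw_implicates(rows, seed=0):
    """Return {pik: [10 draws]} for rows whose value is missing (None).

    Draws are taken with replacement from observed values in the same cell.
    Assumes every cell containing a missing value also has an observed value.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for row in rows:
        if row["value"] is not None:
            donors[row["cell_id"]].append(row["value"])
    return {
        row["pik"]: rng.choices(donors[row["cell_id"]], k=N_IMPLICATES)
        for row in rows
        if row["value"] is None
    }


# Toy records, invented for illustration.
rows = [
    {"pik": "000000001", "cell_id": 1, "value": "4"},
    {"pik": "000000002", "cell_id": 1, "value": "4"},
    {"pik": "000000003", "cell_id": 1, "value": "3"},
    {"pik": "000000004", "cell_id": 1, "value": None},
    {"pik": "000000005", "cell_id": 2, "value": "1"},
    {"pik": "000000006", "cell_id": 2, "value": None},
]

implicates = draw_implicates(rows)
for pik, draws in implicates.items():
    print(pik, draws)
```

Because draws use only donors from the same cell, the sketch relies on the same MAR-within-cell assumption stated above.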

The imputation models for each stage are shown below:

Stage A (Sex, DOB, POB; 5% missing data): Stage A missing data differs from the missing data for the workers that appear only in stages B and C. In principle, every worker with a formal employment arrangement should have an SSN. However, LEHD receives earnings records where the PIK does not exist on the Numident. The vast majority of these workers are likely to be persons residing in the U.S. without a legal right to work (i.e. illegal immigrants). These workers are clearly not MAR, so the characteristics of workers in the general population are unlikely to match the characteristics of the workers with missing data. To sidestep the issue, we assume that recent legal immigrants have characteristics similar to those of recent illegal immigrants (the workers with missing data), especially when the worker is employed at firms with a large share of recent legal immigrants.

Cell Variables
- Modal Non-Native: If the co-worker fraction native is >0.5 then the modal non-native is set to U.S., else the modal non-native is set to the modal value across all co-workers at all jobs for each worker.
- Co-Worker Fraction Male: If the fraction male is >0.5 then high, else low.
- New-Worker: 1 if the worker entered the data after the first quarter in which the state appears, else 0. A worker is more likely to be an immigrant if they first appear after the first year the state reports data.
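The three stage A cell variables listed above can be sketched as a small function. The input layout, the use of POB code A (US or territory) to identify natives, and the reading of “modal value” as the modal POB code among co-workers are our interpretation for illustration, not a specification of the production code.

```python
from collections import Counter


def stage_a_cell(coworkers, first_quarter_seen, state_first_quarter):
    """Build the stage A cell variables for one worker.

    coworkers: non-empty list of dicts with 'sex' ('F'/'M') and 'pob' (ICF
               place-of-birth code), pooled across all of the worker's jobs.
    Quarters are (year, quarter) tuples, so they compare chronologically.
    """
    n = len(coworkers)
    frac_native = sum(c["pob"] == "A" for c in coworkers) / n
    frac_male = sum(c["sex"] == "M" for c in coworkers) / n

    if frac_native > 0.5:
        modal_non_native = "A"  # set to U.S.
    else:
        # Assumed reading: the most common POB code among co-workers.
        modal_non_native = Counter(c["pob"] for c in coworkers).most_common(1)[0][0]

    frac_male_group = "high" if frac_male > 0.5 else "low"
    new_worker = 1 if first_quarter_seen > state_first_quarter else 0
    return modal_non_native, frac_male_group, new_worker


# Toy co-worker lists, invented for illustration.
mostly_foreign_born = [
    {"sex": "M", "pob": "B"}, {"sex": "M", "pob": "B"},
    {"sex": "F", "pob": "A"}, {"sex": "M", "pob": "L"},
]
mostly_native = [
    {"sex": "F", "pob": "A"}, {"sex": "M", "pob": "A"},
    {"sex": "F", "pob": "A"}, {"sex": "M", "pob": "B"},
]

cell_1 = stage_a_cell(mostly_foreign_born, (2005, 3), (1998, 1))
cell_2 = stage_a_cell(mostly_native, (1998, 1), (1998, 1))
print(cell_1, cell_2)
```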

Stage B (Race, Ethnicity; 20% missing data): Stage B reported data comes from survey data sources. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing race and ethnicity data are much more likely to be MAR than the workers in stage A.

Cell Variables
- Sex
- DOB
- Bestrace: Collapsed Numident race information
- Earnings: Average annual earnings across all jobs (quartiles) if bestrace is missing
- POB_race: Collapsed POB categories based on the modal race of each POB
- Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups

Stage C (Education (age 25+); 85% missing data): Stage C reported data comes from survey data sources. Education data is only used/imputed for workers age 25+. The workers with missing data are a combination of persons not sampled and non-respondents. The workers with missing education data are much more likely to be MAR than the workers in stage A. However, the share of missing data is significantly larger for stage C than for the earlier stages, leaving significantly less observed data available to build a model. For stage C we use two parallel approaches. The first approach uses a log-linear model to impute education based on characteristics similar to those used in the second approach, which is the non-parametric approach used above in stages A and B. The log-linear approach does not require fully interacted effects across all conditioning variables, providing additional modeling flexibility given the limited reported education data available. The additional modeling flexibility is used to control for smaller geographic groups than the non-parametric approach does.

Cell Variables
- Sex
- DOB
- Race/Ethnicity
- POB_race_educ: Collapsed POB based on country-specific immigrant education levels.
- Earnings: Average annual earnings across all jobs (ventiles)
- Earnings FB: Low and high earnings category for foreign born workers
- Industry: Collapsed NAICS industry groups based on observed education
- Co-Resident and Co-Worker Black/White, Hispanic/Not-Hispanic groups
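The earnings cell variables are quantile groups (quartiles in stage B, ventiles in stage C). As a rough sketch of how such bins can be assigned, the snippet below computes cut points with Python's statistics.quantiles and maps each value to a bin number. The earnings values are invented, and the production cut points and tie handling may differ.

```python
import bisect
from statistics import quantiles


def make_binner(values, n_bins):
    """Return a function mapping a value to its quantile bin (1..n_bins)."""
    cuts = quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return lambda x: bisect.bisect_right(cuts, x) + 1


# Toy average annual earnings (arbitrary units), invented for illustration.
earnings = list(range(1, 101))

ventile = make_binner(earnings, 20)   # 20 groups, as in stage C
quartile = make_binner(earnings, 4)   # 4 groups, as in stage B

ventiles = [ventile(x) for x in earnings]
print(ventile(1), ventile(100), quartile(26))
```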

See the appendix in McKinney et al. [2017] for more detailed methodological information.

Limited Information Imputation System

Workers in the Snapshot with positive UI earnings prior to 2009 and no observed education received the multiple stage full information imputation described in more detail above. Workers that entered the Snapshot from 2009 forward with no observed education received a lower quality “limited” information impute. This imputation model conditions on any observed characteristics (age, sex, place of birth, race, ethnicity) and draws missing characteristics using a set of posterior predictive distributions (PPD) provided in 2010 by the original demographic characteristics imputation system team. Most importantly for data quality, this imputation model does not condition on annual earnings or the observed characteristics of the worker’s co-workers and/or co-residents.
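A limited information draw can be pictured as a lookup into a table of category probabilities indexed by the observed characteristics, followed by random draws. The probability table, its keys (sex and an age group), and the values below are entirely hypothetical; the actual PPDs provided in 2010 are not reproduced in this documentation.

```python
import random

# Hypothetical PPD table: (sex, age group) -> probabilities for educ_c codes 1-4.
PPD = {
    ("F", "25-34"): [0.10, 0.25, 0.30, 0.35],
    ("M", "25-34"): [0.12, 0.30, 0.30, 0.28],
}
EDUC_CODES = ["1", "2", "3", "4"]


def limited_info_draws(sex, age_group, k=10, seed=0):
    """Draw k education implicates conditional on observed characteristics only."""
    rng = random.Random(seed)
    return rng.choices(EDUC_CODES, weights=PPD[(sex, age_group)], k=k)


draws = limited_info_draws("F", "25-34")
print(draws)
```

Nothing in the lookup key refers to earnings or to co-worker or co-resident characteristics, which is the limitation noted above.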

4.1.3. Codebook: The ICF_US File

Table Metadata for Individual Characteristics File (ICF_US)

Access Requirements for ICF_US
	State Approval Required	IRS Approval Required	SSA Approval Required
Access Requirements			X

Description: Demographics, place of birth, and education of workers.
Scope: National
Key: pik
Sort Order: pik
File Formats: SAS Data Table, Parquet (partitioned by PIK group)
Download Codebook: CSV

Variable Information

Variable Information for ICF_US
Variable Name	SAS Variable Type	SAS Variable Length	Parquet Variable Type	Description
pik	char	9	string	Protected Identification Key
dob	num	4	date	Date of birth
sex	char	1	string	Sex (See details below)
pob	char	1	string	Place of birth (See details below)
race	char	1	string	Race (See details below)
ethnicity	char	1	string	Ethnicity (See details below)
educ_c	char	1	string	Highest educational attainment (age 25+) (See details below)
sex_imputed	char	1	string	Imputation status for sex (See details below)
race_imputed	char	1	string	Imputation status for race (See details below)
ethnicity_imputed	char	1	string	Imputation status of ethnicity (See details below)
educ_c_imputed	char	1	string	Imputation status of educ_c (See details below)
pob_imputed	char	1	string	Imputation status for POB (See details below)
dob_imputed	char	1	string	Imputation status for DOB (See details below)

Details for variable sex on ICF_US

Description

Sex

Codebook

Code	Label
F	Female
M	Male

Details for variable pob on ICF_US

Description

Place of birth

Codebook

Code	Label
1	Central Asia
2	South East Asia
3	Middle East and North Africa
4	Caribbean
5	Central America
6	South America
7	Africa
8	Oceania
9	Not Specified
A	US or territory
B	Mexico
C	Philippines
D	Vietnam
E	India
F	Germany
G	Puerto Rico
H	El Salvador
I	Cuba
J	United Kingdom
K	Canada
L	China
M	South Korea
N	Taiwan
O	Guatemala
P	Japan
Q	Haiti
R	USSR Core
S	Jamaica
T	Colombia
U	Poland
V	Iran
W	Dominican Republic
X	Italy
Y	Former Socialist Europe
Z	Western Europe

Details for variable race on ICF_US

Description

Race

Codebook

Code	Label
1	White
2	Black
3	American Indian/Alaska Native
4	Asian
5	Native Hawaiian/Pacific Islander
7	Two or more races

Details for variable ethnicity on ICF_US

Description

Ethnicity

Codebook

Code	Label
H	Hispanic
N	Non-Hispanic

Details for variable educ_c on ICF_US

Description

Highest educational attainment (age 25+)

Codebook

Code	Label
1	Less than high school
2	High school
3	Some college
4	Bachelor's degree or higher

Details for variables: sex_imputed, race_imputed, ethnicity_imputed, educ_c_imputed, pob_imputed, dob_imputed on ICF_US

Description

sex_imputed: Imputation status for sex

race_imputed: Imputation status for race

ethnicity_imputed: Imputation status of ethnicity

educ_c_imputed: Imputation status of educ_c

pob_imputed: Imputation status for POB

dob_imputed: Imputation status for DOB

Codebook

Code	Label
1	Demographic characteristic observed
2	Demographic characteristic imputed
3	Demographic characteristic imputed (observed data unreliable)

4.1.4. Codebook: The ICF_US_IMPLICATES_AGE_SEX_POB File

Table Metadata for Date of Birth/Sex/Place of Birth Implicates (ICF_US_IMPLICATES_AGE_SEX_POB)

Access Requirements for ICF_US_IMPLICATES_AGE_SEX_POB
	State Approval Required	IRS Approval Required	SSA Approval Required
Access Requirements			X

Description: Multiply imputed date of birth, sex, and place of birth variables to complete missing information.
Scope: National
Key: pik
Sort Order: pik
File Formats: SAS Data Table, Parquet (partitioned by PIK group)
Download Codebook: CSV

Variable Information

Variable Information for ICF_US_IMPLICATES_AGE_SEX_POB
Variable Name	SAS Variable Type	SAS Variable Length	Parquet Variable Type	Description
pik	char	9	string	Protected Identification Key
dob1-dob10	num	4	date	Date of Birth Implicates
sex1-sex10	char	1	string	Sex Implicates (See details below)
pob1-pob10	char	1	string	Place of Birth Implicates (See details below)
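The implicate columns are numbered 1 through 10, so a completed data set m can be assembled by taking the observed value from ICF_US where the impute flag is 1 and implicate m otherwise. The sketch below shows this for sex with invented rows; it assumes that any worker whose flag is 2 or 3 has a record on the IMPLICATES table.

```python
def completed_value(icf_row, implicates_row, char, m):
    """Value of `char` in completed data set m (1-10) for one worker."""
    if icf_row[f"{char}_imputed"] == "1":
        return icf_row[char]          # observed value, identical in every data set
    return implicates_row[f"{char}{m}"]  # m-th implicate


# Toy rows, invented for illustration.
icf_row = {"pik": "000000002", "sex": "F", "sex_imputed": "2"}
implicates_row = {
    "pik": "000000002",
    **{f"sex{m}": v for m, v in enumerate("FMFFMFMFFM", start=1)},
}
observed_row = {"pik": "000000001", "sex": "M", "sex_imputed": "1"}

completed = [completed_value(icf_row, implicates_row, "sex", m) for m in range(1, 11)]
always_observed = [completed_value(observed_row, None, "sex", m) for m in range(1, 11)]
print(completed)
```

In this toy example the first implicate equals the value on ICF_US, matching the rule that ICF_US carries the first implicate for imputed characteristics.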

Details for variables sex1-sex10 on ICF_US_IMPLICATES_AGE_SEX_POB

Description

Sex Implicates

Codebook

Code	Label
F	Female
M	Male

Details for variables pob1-pob10 on ICF_US_IMPLICATES_AGE_SEX_POB

Description

Place of Birth Implicates

Codebook

Code	Label
1	Central Asia
2	South East Asia
3	Middle East and North Africa
4	Caribbean
5	Central America
6	South America
7	Africa
8	Oceania
9	Not Specified
A	US or territory
B	Mexico
C	Philippines
D	Vietnam
E	India
F	Germany
G	Puerto Rico
H	El Salvador
I	Cuba
J	United Kingdom
K	Canada
L	China
M	South Korea
N	Taiwan
O	Guatemala
P	Japan
Q	Haiti
R	USSR Core
S	Jamaica
T	Colombia
U	Poland
V	Iran
W	Dominican Republic
X	Italy
Y	Former Socialist Europe
Z	Western Europe

4.1.5. Codebook: The ICF_US_IMPLICATES_EDUCATION File

Table Metadata for Education Implicates (ICF_US_IMPLICATES_EDUCATION)

Access Requirements for ICF_US_IMPLICATES_EDUCATION
	State Approval Required	IRS Approval Required	SSA Approval Required
Access Requirements			X

Description: Multiply imputed education variables to complete missing information.
Scope: National
Key: pik
Sort Order: pik
File Formats: SAS Data Table, Parquet (partitioned by PIK group)
Download Codebook: CSV

Variable Information

Variable Information for ICF_US_IMPLICATES_EDUCATION
Variable Name	SAS Variable Type	SAS Variable Length	Parquet Variable Type	Description
pik	char	9	string	Protected Identification Key
educ_c1-educ_c10	char	1	string	Highest educational attainment (age 25+) Implicates (See details below)

Details for variables educ_c1-educ_c10 on ICF_US_IMPLICATES_EDUCATION

Description

Highest educational attainment (age 25+) Implicates

Codebook

Code	Label
1	Less than high school
2	High school
3	Some college
4	Bachelor's degree or higher

4.1.6. Codebook: The ICF_US_IMPLICATES_RACE_ETHNICITY File

Table Metadata for Race/Ethnicity Implicates (ICF_US_IMPLICATES_RACE_ETHNICITY)

Access Requirements for ICF_US_IMPLICATES_RACE_ETHNICITY
	State Approval Required	IRS Approval Required	SSA Approval Required
Access Requirements			X

Description: Multiply imputed race and ethnicity variables to complete missing information.
Scope: National
Key: pik
Sort Order: pik
File Formats: SAS Data Table, Parquet (partitioned by PIK group)
Download Codebook: CSV

Variable Information

Variable Information for ICF_US_IMPLICATES_RACE_ETHNICITY
Variable Name	SAS Variable Type	SAS Variable Length	Parquet Variable Type	Description
pik	char	9	string	Protected Identification Key
race1-race10	char	1	string	Race Implicates (See details below)
ethnicity1-ethnicity10	char	1	string	Ethnicity Implicates (See details below)

Details for variables race1-race10 on ICF_US_IMPLICATES_RACE_ETHNICITY

Description

Race Implicates

Codebook

Code	Label
1	White
2	Black
3	American Indian/Alaska Native
4	Asian
5	Native Hawaiian/Pacific Islander
7	Two or more races

Details for variables ethnicity1-ethnicity10 on ICF_US_IMPLICATES_RACE_ETHNICITY

Description

Ethnicity Implicates

Codebook

Code	Label
H	Hispanic
N	Non-Hispanic