1. LODES Code Building Blocks

1.1. Loading Data

There are four types of LODES data files, and all of them will be used in these samples. The types and their purposes/contents are briefly described here. For more information on the files, their organization, and naming convention, see the LODES Technical Documentation. Additionally, labels for labor market segments and worker/job characteristics can be found in the LODES Schema Files section in the Appendix.

  • Residence Area Characteristics (RAC): jobs are totaled by home census block.

  • Workplace Area Characteristics (WAC): jobs are totaled by work census block.

  • Origin-Destination (OD): job totals are associated with both a home census block and a work census block.

  • Geography Crosswalk: relationship file that establishes the hierarchical connection between each 2020 census tabulation block and all higher level geographic entities.

The naming conventions for the RAC, WAC, and OD files contain the following information:

  • Two letter state abbreviation (e.g., ‘ny’)

  • File type: RAC, WAC, or OD

  • Part of the file: “main” or “aux” (only for OD files)

  • Four character segment code (only for RAC and WAC files)

  • Four character job type

  • Four digit year
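Assembled in order and joined with underscores, these pieces produce the file names used throughout this section. A small sketch, with component values chosen for illustration:

```python
# Illustrative component values; see the LODES Technical Documentation for valid codes
state, segment, job_type, year = "ny", "S000", "JT00", 2021

# RAC/WAC names include a segment; OD names include a part ("main" or "aux") instead
rac_name = f"{state}_rac_{segment}_{job_type}_{year}.csv.gz"
od_name = f"{state}_od_main_{job_type}_{year}.csv.gz"

print(rac_name)  # ny_rac_S000_JT00_2021.csv.gz
print(od_name)   # ny_od_main_JT00_2021.csv.gz
```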

There are three easy ways to use LODES data in open source languages. You can (1) read directly from the online file structure, (2) download the raw data files directly, or (3) use output from the LED Extraction Tool.

1.1.1. Using the LODES Raw Data Files

Reading LODES directly from the online directory structure may be the simplest option for users, and the code examples in all other sections will read the data in this way. Because the file names are determined entirely by the attributes above, we can set variables equal to the desired values and use them to construct the correct URL/file name. The code below defines a function that can be used repeatedly and flexibly to read in OD, RAC, and WAC data (in fact, this function appears repeatedly in these samples). Beneath the function is an example of its use. This code loads the RAC, WAC, and OD data for job type “JT00” (all jobs) in 2004 for New Jersey. For the RAC and WAC files, the segment “SA01” (workers age 29 or younger) is selected. Note that for each file type, the geographic identifiers are read in specifically as a “string” datatype; this ensures consistency across file types, which will be important when merging files.

Note

These functions can also be used to read downloaded files, instead of reading them directly from the online file structure. Simply change the base argument to a location on your computer (i.e., the local filesystem) where the files are located.

import pandas as pd


def read_lodes_data(
    data_type,
    state,
    segment,
    job_type,
    year,
    main=True,
    base="https://lehd.ces.census.gov/data/lodes/LODES8/",
):
    """
    Reads LODES RAC, WAC, or OD data based on the specified type.

    Parameters:
    - data_type (str): One of 'rac', 'wac', or 'od' to specify the dataset.
    - state (str): Lowercase two-character state abbreviation (e.g., 'nj').
    - segment (str): Segment type (e.g., 'SA01'). Ignored if 'data_type' is 'od'.
    - job_type (str): Job type (e.g., 'JT00').
    - year (int): Year of data (e.g., 2004).
    - main (bool): Whether to read the 'main' file type. Only used if 'data_type' is 'od'.
    - base (str): Base file path to read data from. Can be URL or local file path.

    Returns:
    - pd.DataFrame: The requested LODES dataset as a pandas DataFrame.
    """
    if data_type not in ["rac", "wac", "od"]:
        raise ValueError("Invalid data type. Choose from 'rac', 'wac', or 'od'.")
    if data_type == "od":
        if main:
            file_url = f"{base}{state}/od/{state}_od_main_{job_type}_{year}.csv.gz"
        else:
            file_url = f"{base}{state}/od/{state}_od_aux_{job_type}_{year}.csv.gz"
        dtype_mapping = {"h_geocode": str, "w_geocode": str}
    else:
        file_url = f"{base}{state}/{data_type}/{state}_{data_type}_{segment}_{job_type}_{year}.csv.gz"
        dtype_mapping = {"h_geocode": str} if data_type == "rac" else {"w_geocode": str}
    return pd.read_csv(file_url, dtype=dtype_mapping)


# Example usage
rac_data = read_lodes_data("rac", "nj", "SA01", "JT00", 2004)
wac_data = read_lodes_data("wac", "nj", "SA01", "JT00", 2004)
od_data = read_lodes_data("od", "nj", None, "JT00", 2004)

print(od_data.head())
>>          w_geocode        h_geocode  S000  SA01  ...  SI01  SI02  SI03  createdate
>> 0  340010001001001  340010001001013     1     0  ...     0     0     1    20230321
>> 1  340010001001001  340010001002004     2     1  ...     0     0     2    20230321
>> 2  340010001001001  340010001002013     1     1  ...     0     0     1    20230321
>> 3  340010001001001  340010002002011     1     1  ...     0     0     1    20230321
>> 4  340010001001001  340010003001008     1     1  ...     0     0     1    20230321
>> 
>> [5 rows x 13 columns]
library(readr)

read_lodes_data <- function(data_type,
                            state,
                            segment,
                            job_type,
                            year,
                            main = TRUE,
                            base = "https://lehd.ces.census.gov/data/lodes/LODES8/") {
  #' Reads LODES RAC, WAC, or OD data based on the specified type.
  #'
  #' @param data_type One of 'rac', 'wac', or 'od' to specify the dataset.
  #' @param state Lowercase two-character state abbreviation (e.g., 'nj').
  #' @param segment Segment type (e.g., 'SA01'). Ignored if 'data_type' is 'od'.
  #' @param job_type Job type (e.g., 'JT00').
  #' @param year Year of data (e.g., 2004).
  #' @param main Whether to read the 'main' file type. Only used if 'data_type' is 'od'.
  #' @param base Base file path to read data from. Can be URL or local file path.
  #'
  #' @returns The requested LODES dataset.
  # Validate data type
  if (!(data_type %in% c("rac", "wac", "od"))) {
    stop("Invalid data type. Choose from 'rac', 'wac', or 'od'.")
  }

  # Construct file URL based on data type
  if (data_type == "od") {
    file_url <- if (main) {
      paste0(base, state, "/od/", state, "_od_main_", job_type, "_", year, ".csv.gz")
    } else {
      paste0(base, state, "/od/", state, "_od_aux_", job_type, "_", year, ".csv.gz")
    }
    col_types <- cols(h_geocode = col_character(), w_geocode = col_character())
  } else {
    file_url <- sprintf(
      "%s%s/%s/%s_%s_%s_%s_%d.csv.gz",
      base,
      state,
      data_type,
      state,
      data_type,
      segment,
      job_type,
      year
    )
    col_types <- if (data_type == "rac") {
      cols(h_geocode = col_character())
    } else {
      cols(w_geocode = col_character())
    }
  }

  # Read CSV with specified column types
  df <- read_csv(file_url, col_types = col_types)

  return(df)
}

# Example usage
rac_data <- read_lodes_data("rac", "nj", "SA01", "JT00", 2004)
wac_data <- read_lodes_data("wac", "nj", "SA01", "JT00", 2004)
od_data <- read_lodes_data("od", "nj", NULL, "JT00", 2004)

print(head(od_data))
>> # A tibble: 6 × 13
>>   w_geocode      h_geocode  S000  SA01  SA02  SA03  SE01  SE02  SE03  SI01  SI02
>>   <chr>          <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 3400100010010… 34001000…     1     0     1     0     0     0     1     0     0
>> 2 3400100010010… 34001000…     2     1     0     1     1     0     1     0     0
>> 3 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 4 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 5 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 6 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> # ℹ 2 more variables: SI03 <dbl>, createdate <dbl>
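Per the note above, the same function works against downloaded files. A Python sketch, where the local directory path is hypothetical and must mirror the online directory layout:

```python
# Hypothetical local mirror of the LODES8 directory tree (note the trailing slash)
local_base = "/data/lodes/LODES8/"

# read_lodes_data("rac", "nj", "SA01", "JT00", 2004, base=local_base) would read:
resolved = f"{local_base}nj/rac/nj_rac_SA01_JT00_2004.csv.gz"
print(resolved)  # /data/lodes/LODES8/nj/rac/nj_rac_SA01_JT00_2004.csv.gz
```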

The process for reading a Geography Crosswalk is also relatively simple. There is one crosswalk for each state, so we can write a function (which will also be reused throughout these examples) to read a crosswalk given the lowercase two-letter state abbreviation. The example code directly beneath the function reads the Geography Crosswalk for New Jersey.

Note

The Geography Crosswalks are very wide. They contain more than 15 different geography types, with names and codes for each geography. Therefore, it is recommended to read in only the columns that are important to the problem at hand. The function below provides this functionality.

import pandas as pd


def read_crosswalk(
    state, cols="all", base="https://lehd.ces.census.gov/data/lodes/LODES8/"
):
    """
    Reads LODES Geography Crosswalk given state.

    Parameters:
    - state (str): Lowercase two-character state abbreviation (e.g., 'nj').
    - cols (list): List of columns to read from the crosswalk in addition to tabblk2020 (Block ID). If "all", all columns will be read and returned.
    - base (str): Base file path to read data from. Can be URL or local file path.

    Returns:
    - pd.DataFrame: The requested LODES crosswalk as a pandas DataFrame.
    """
    file_url = f"{base}{state}/{state}_xwalk.csv.gz"
    dtype_mapping = str
    if cols == "all":
        return pd.read_csv(file_url, dtype=dtype_mapping)
    # Ensure tabblk2020 (Block ID) is included without mutating the caller's list
    cols = ["tabblk2020"] + [c for c in cols if c != "tabblk2020"]
    return pd.read_csv(file_url, dtype=dtype_mapping, usecols=cols)


crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])
print(crosswalk.head())
>>         tabblk2020    cty              ctyname
>> 0  340010001001000  34001  Atlantic County, NJ
>> 1  340010001001001  34001  Atlantic County, NJ
>> 2  340010001001002  34001  Atlantic County, NJ
>> 3  340010001001003  34001  Atlantic County, NJ
>> 4  340010001001004  34001  Atlantic County, NJ
library(readr)
library(dplyr)

read_crosswalk <- function(state, cols = "all", base = "https://lehd.ces.census.gov/data/lodes/LODES8/") {
  #' Reads LODES Geography Crosswalk given state.
  #'
  #' @param state Lowercase two-character state abbreviation (e.g., 'nj').
  #' @param cols List of columns to read from the crosswalk in addition to tabblk2020 (Block ID). If "all", all columns will be read and returned.
  #' @param base Base file path to read data from. Can be URL or local file path.
  #'
  #' @returns The requested LODES crosswalk.


  # Construct file URL
  file_url <- paste0(base, state, "/", state, "_xwalk.csv.gz")
  # Read the full dataset first if all columns are needed
  if (identical(cols, "all")) {
    return(read_csv(file_url, col_types = cols(tabblk2020 = col_character())))
  }
  # Ensure tabblk2020 (GEOID) is included in the selected columns
  cols <- unique(c("tabblk2020", cols))
  # Read only the specified columns
  return(read_csv(
    file_url,
    col_types = cols(.default = col_character()),
    col_select = all_of(cols)
  ))
}

# Example usage
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))
print(head(crosswalk))
>> # A tibble: 6 × 3
>>   tabblk2020      cty   ctyname            
>>   <chr>           <chr> <chr>              
>> 1 340010001001000 34001 Atlantic County, NJ
>> 2 340010001001001 34001 Atlantic County, NJ
>> 3 340010001001002 34001 Atlantic County, NJ
>> 4 340010001001003 34001 Atlantic County, NJ
>> 5 340010001001004 34001 Atlantic County, NJ
>> 6 340010001001005 34001 Atlantic County, NJ

1.1.2. Using Output from the LED Extraction Tool

Reading output from the LED Extraction Tool is very similar. The extraction tool allows users to get both broader and narrower sets of data than the raw files. For example, users can combine multiple geographies and years instead of having to read and combine multiple files. The selection of specific geographies means that users do not have to receive full state data if they only need a specific geography. In the example below, a user has selected to receive RAC and WAC data for Camden County, NJ and ZIP Code 10019 (in Manhattan). The data was requested for four columns (total jobs and three age categories), across two years, and for primary jobs. There are a few differences to note from the raw files (as can be seen from the code and output below):

  • The naming of the files is based on a unique character identifier, rather than the human-readable format of the raw data files. This identifier appears in the name of the folder, the data files, and the crosswalk.

  • The extraction tool path for area characteristics data produces the data for RAC and WAC, as well as a geographic crosswalk, which is clipped to contain only the blocks in the selected geographies.

  • The resulting data files have a different selection of columns when compared to the raw data. Here, there are only four characteristic columns, because only four were selected. On the other hand, there are two columns not found in the raw data, segment and year, which add clarity to this output, especially given that you can select multiple years of data.
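Because the extraction-tool output carries segment and year columns, multi-year comparisons need no extra file handling. A toy sketch, using fabricated values shaped like the RAC output shown below:

```python
import pandas as pd

# Fabricated rows in the shape of the extraction-tool RAC output (values invented)
rac_data = pd.DataFrame(
    {
        "h_geocode": [340076008001012, 340076008001012],
        "segment": ["S000", "S000"],
        "year": [2018, 2019],
        "C000": [8, 10],
    }
)

# Year-over-year totals come straight from a groupby on the year column
totals = rac_data.groupby("year")["C000"].sum()
print(totals[2019] - totals[2018])  # 2
```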

Note

If you would like to recreate the data in this code sample, you can recreate the query in the LED Extraction Tool using this settings file. The extraction tool provides an easy format for saving and uploading settings files. These files were downloaded by clicking the “CSVs & Metadata (ZIP)” link.

import pandas as pd

# read in wac and rac data
rac_data = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/rac_f5e832282bacfdea48d8690947888efb.csv"
)
wac_data = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/wac_f5e832282bacfdea48d8690947888efb.csv"
)

# read in selected crosswalk
crosswalk = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/xwalk_f5e832282bacfdea48d8690947888efb.csv"
)

# Sample 10 rows from rac_data with fixed seed for reproducibility
print(rac_data.sample(n=10, random_state=4334))
>>              h_geocode segment  year  C000  CA01  CA02  CA03
>> 1698   340076008001012    S000  2019    10     6     3     1
>> 493    340076012004006    S000  2018    11     3     6     2
>> 5824   340076039013007    S000  2019    10     0     6     4
>> 11415  340076103002026    S000  2018    10     5     4     1
>> 528    340076013002026    S000  2018    29    13    10     6
>> 12396  340076089011078    S000  2019    11     2     7     2
>> 7483   340076075053001    S000  2018     8     0     5     3
>> 12048  340076086002031    S000  2019    11     2     8     1
>> 6138   340076051001027    S000  2019    36     6    20    10
>> 13105  340076105002040    S000  2019    20     5     8     7
library(readr)
library(dplyr)

# Read in wac and rac data
rac_data <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/rac_f5e832282bacfdea48d8690947888efb.csv"
)
wac_data <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/wac_f5e832282bacfdea48d8690947888efb.csv"
)

# Read in selected crosswalk
crosswalk <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/xwalk_f5e832282bacfdea48d8690947888efb.csv"
)

# Sample 10 rows from rac_data with fixed seed for reproducibility
set.seed(4334)
print(sample_n(rac_data, 10))
>> # A tibble: 10 × 7
>>    h_geocode segment  year  C000  CA01  CA02  CA03
>>        <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
>>  1   3.40e14 S000     2019     5     2     1     2
>>  2   3.40e14 S000     2018    13     6     6     1
>>  3   3.40e14 S000     2018    12     1     4     7
>>  4   3.40e14 S000     2018    14     2     6     6
>>  5   3.40e14 S000     2019     5     0     0     5
>>  6   3.40e14 S000     2018    24     8    10     6
>>  7   3.40e14 S000     2018    18     5     9     4
>>  8   3.40e14 S000     2019    28     6    17     5
>>  9   3.40e14 S000     2018    19     3    10     6
>> 10   3.40e14 S000     2018    22     7     7     8

1.2. Combining and Aggregating Data

1.2.1. Combining Data Files

Given that LODES data files are heavily segmented, there are many instances where users may want to combine data files across different states, years, segments, and/or job types. Additionally, users may want to combine the “main” and “aux” portions of OD files. These kinds of operations are relatively simple, with two main steps: (1) adding identifiers to the portions of data and (2) stacking or binding the data together.

The code below combines RAC files for New Jersey between the years of 2017 and 2020, across two job types and two segments, into one data structure. The same approach applies to combining WAC and OD files. For convenience, we can use the function we wrote in Loading Data.

import pandas as pd

# Set variables
state = "nj"
segments = ["SA01", "SE03"]
job_types = ["JT00", "JT01"]
years = list(range(2017, 2021))

# Create empty list to store dataframes
all_data = []
for seg in segments:
    for job_type in job_types:
        for year in years:
            # Call the function to read LODES data
            rac = read_lodes_data("rac", state, seg, job_type, year)
            # Identify each portion of data
            rac["segment"] = seg
            rac["jobtype"] = job_type
            rac["year"] = year
            # Append to list
            all_data.append(rac)

# Combine all dataframes into one dataset
full_rac = pd.concat(all_data)
print(full_rac.head())
>>          h_geocode  C000  CA01  CA02  ...  createdate  segment  jobtype  year
>> 0  340010001001003     1     1     0  ...    20230321     SA01     JT00  2017
>> 1  340010001001005     5     5     0  ...    20230321     SA01     JT00  2017
>> 2  340010001001006     3     3     0  ...    20230321     SA01     JT00  2017
>> 3  340010001001009     4     4     0  ...    20230321     SA01     JT00  2017
>> 4  340010001001010     5     5     0  ...    20230321     SA01     JT00  2017
>> 
>> [5 rows x 46 columns]
library(dplyr)
library(purrr)

# Set variables
state <- "nj"
segments <- c("SA01", "SE03")
job_types <- c("JT00", "JT01")
years <- 2017:2020


# Use map functions to iterate over all combinations and bind rows
full_rac <- expand.grid(
  segment = segments,
  jobtype = job_types,
  year = years
) %>%
  pmap_dfr(~ {
    rac <- read_lodes_data("rac", state, ..1, ..2, ..3)
    rac <- mutate(rac,
      segment = ..1,
      jobtype = ..2,
      year = ..3
    )
    return(rac)
  })

print(head(full_rac))
>> # A tibble: 6 × 46
>>   h_geocode     C000  CA01  CA02  CA03  CE01  CE02  CE03 CNS01 CNS02 CNS03 CNS04
>>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 34001000100…     1     1     0     0     0     1     0     0     0     0     0
>> 2 34001000100…     5     5     0     0     1     3     1     0     0     0     0
>> 3 34001000100…     3     3     0     0     1     2     0     0     0     0     0
>> 4 34001000100…     4     4     0     0     1     1     2     0     0     0     0
>> 5 34001000100…     5     5     0     0     2     2     1     0     0     0     0
>> 6 34001000100…     4     4     0     0     3     1     0     0     0     0     0
>> # ℹ 34 more variables: CNS05 <dbl>, CNS06 <dbl>, CNS07 <dbl>, CNS08 <dbl>,
>> #   CNS09 <dbl>, CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>, CNS13 <dbl>,
>> #   CNS14 <dbl>, CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>, CNS18 <dbl>,
>> #   CNS19 <dbl>, CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>, CR04 <dbl>,
>> #   CR05 <dbl>, CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>, CD02 <dbl>,
>> #   CD03 <dbl>, CD04 <dbl>, CS01 <dbl>, CS02 <dbl>, createdate <dbl>,
>> #   segment <fct>, jobtype <fct>, year <int>

1.2.2. Combining Geographic Crosswalks

For combining Geography Crosswalks, the process is even simpler. The files can simply be stacked on top of each other, since all the geographic identifiers are unique. The code below loops through all US states and combines the crosswalks into one national crosswalk. Again, we can use the function we wrote in Loading Data.

import os

# Make sure DC is included in the states
os.environ["DC_STATEHOOD"] = "1"
from us import states
import pandas as pd

# Create an empty list to store crosswalks
crosswalks = []

# Loop through all states (and DC)
for state in states.STATES:
    # The Crosswalk file contains many different geographies
    # For this example we've selected state, county, and tract.
    cols = ["st", "cty", "trct"]
    crosswalk = read_crosswalk(state.abbr.lower(), cols)
    # Append to list
    crosswalks.append(crosswalk)

# Combine all crosswalks into one.
all_crosswalks = pd.concat(crosswalks)

print(all_crosswalks.head())
>>         tabblk2020  st    cty         trct
>> 0  010010201001000  01  01001  01001020100
>> 1  010010201001001  01  01001  01001020100
>> 2  010010201001002  01  01001  01001020100
>> 3  010010201001003  01  01001  01001020100
>> 4  010010201001004  01  01001  01001020100
library(dplyr)
library(purrr)

# List of all states plus DC
states <- c(state.abb, "DC")

# Call function across all states
crosswalks <- map(states, ~ read_crosswalk(tolower(.x), c("st", "cty", "trct")))

# Combine all crosswalks into one data frame
all_crosswalks <- bind_rows(crosswalks)

print(head(all_crosswalks))
>> # A tibble: 6 × 4
>>   tabblk2020      st    cty   trct       
>>   <chr>           <chr> <chr> <chr>      
>> 1 010010201001000 01    01001 01001020100
>> 2 010010201001001 01    01001 01001020100
>> 3 010010201001002 01    01001 01001020100
>> 4 010010201001003 01    01001 01001020100
>> 5 010010201001004 01    01001 01001020100
>> 6 010010201001005 01    01001 01001020100

1.2.3. Aggregating to Higher-Level Geographies

All LODES data files are provided at the census block level. However, for many analysis purposes, users may want the data aggregated to other geographies. This is the purpose of the Geography Crosswalk, which provides geographic hierarchies for many geographies in the most current year of geographic data. The primary key/identifier in the Geography Crosswalk is the 2020 Census Tabulation Block Code (tabblk2020). This code does not change between Decennial Censuses and can be used to link to the geocodes in the OD/RAC/WAC files. All of the other columns in the crosswalk file refer to higher level geographic entities (both by code and name) that are supported by the OnTheMap application. The crosswalks contain all 2020 Census Tabulation Blocks for each state, and thus can be quite large. Users are encouraged to read in only the columns of the crosswalk that correspond to geographies they are interested in. For a full list of geographies in the crosswalk, see the LODES Technical Documentation.

Warning

Because the geographic crosswalk is in the most current year of geographic data, all geographic analyses will be in terms of current boundaries, regardless of the year of LODES data. For example, if a user wishes to find the number of jobs in Bergen County, New Jersey in 2018 and 2021, the user can use the crosswalk to accomplish the county level aggregation, but the data will be in terms of the most recent geographic boundary of Bergen County for both 2018 and 2021. The same is true of analyses conducted in OnTheMap.

Case Study: Counts of Worker Residence by County in New Jersey

To find the number of workers residing in each county in New Jersey, the block level RAC data must be aggregated to the county level. To aggregate RAC or WAC data, the Geography Crosswalk must be joined to the data; here, we seek the county level job totals on the residence side. For loading the data, we can use the functions we wrote in Loading Data. We utilize a “left join”, so that only rows that are in the LODES data are included in the merge. Note the code assumes a one-to-one relationship between the two datasets, which is true in this case, but would not be true if a user were combining the crosswalk with multiple years or segments of data. The process for merging data with the crosswalk and performing an aggregation is the same for WAC data, with the only modification being that the left dataset should be joined on w_geocode.

Note

In this case study a left join is used. There are likely blocks in the Geographic Crosswalk that are not in the LODES data, because in the LODES production process, blocks that do not contain data (for example if no workers live in a certain block) are excluded. There are certain situations in which an outer join/right join (which are equivalent in this case) might be of use to a user.
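The behavioral difference between these joins is easy to see on toy data. A minimal sketch with hypothetical block IDs, where the crosswalk lists one block with no resident workers:

```python
import pandas as pd

# Hypothetical block IDs: the crosswalk lists a block ("b002") with no jobs
rac = pd.DataFrame({"h_geocode": ["b001"], "C000": [5]})
xwalk = pd.DataFrame({"tabblk2020": ["b001", "b002"], "cty": ["34001", "34001"]})

left = rac.merge(xwalk, how="left", left_on="h_geocode", right_on="tabblk2020")
right = rac.merge(xwalk, how="right", left_on="h_geocode", right_on="tabblk2020")

print(len(left))   # 1: only blocks present in the LODES data survive
print(len(right))  # 2: the zero-worker block is kept, with NaN job counts
```

With a right join, the NaN counts for empty blocks can be filled with zero (e.g., right["C000"].fillna(0)) before aggregating.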


rac = read_lodes_data("rac", "nj", "S000", "JT00", 2019)
crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])

# Conduct left merge of RAC data with crosswalk on block IDs
rac_w_crosswalk = rac.merge(
    crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="tabblk2020",
    validate="1:1",
)

# Drop non-numerical columns for aggregating
rac_w_crosswalk = rac_w_crosswalk.drop(
    ["h_geocode", "tabblk2020", "createdate"], axis=1
)

# Aggregate at county level
rac_county = rac_w_crosswalk.groupby(["cty", "ctyname"]).sum().reset_index()

print(rac_county.head())
>>      cty                ctyname    C000   CA01  ...    CD03    CD04    CS01    CS02
>> 0  34001    Atlantic County, NJ  129676  27741  ...   30981   28503   61999   67677
>> 1  34003      Bergen County, NJ  459959  84789  ...  101200  158152  229464  230495
>> 2  34005  Burlington County, NJ  234959  48625  ...   55876   66940  112377  122582
>> 3  34007      Camden County, NJ  252963  56857  ...   59068   63373  121146  131817
>> 4  34009    Cape May County, NJ   38129   7930  ...    9215    9390   18431   19698
>> 
>> [5 rows x 43 columns]
library(dplyr)
library(purrr)

rac <- read_lodes_data("rac", "nj", "S000", "JT00", 2019)
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))

# Merge RAC data with Crosswalk on block IDs
rac_w_crosswalk <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020"), relationship = "one-to-one")

# Drop non-numerical columns for aggregation
rac_w_crosswalk <- rac_w_crosswalk %>%
  select(-h_geocode, -createdate)

# Aggregate at county level
rac_county <- rac_w_crosswalk %>%
  group_by(cty, ctyname) %>%
  summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE)), .groups = "drop")

print(head(rac_county))
>> # A tibble: 6 × 43
>>   cty   ctyname   C000  CA01   CA02   CA03  CE01   CE02   CE03 CNS01 CNS02 CNS03
>>   <chr> <chr>    <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
>> 1 34001 Atlant… 129676 27741  65407  36528 31095  46383  52198   476    34   834
>> 2 34003 Bergen… 459959 84789 246613 128557 84519 102548 272892   292    87  2091
>> 3 34005 Burlin… 234959 48625 121643  64691 47992  58194 128773   460    73  1283
>> 4 34007 Camden… 252963 56857 133596  62510 55828  73012 124123   521    51  1150
>> 5 34009 Cape M…  38129  7930  18106  12093  9390  12088  16651   250    39   269
>> 6 34011 Cumber…  62080 14626  32737  14717 14021  21958  26101  1294   143   456
>> # ℹ 31 more variables: CNS04 <dbl>, CNS05 <dbl>, CNS06 <dbl>, CNS07 <dbl>,
>> #   CNS08 <dbl>, CNS09 <dbl>, CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>,
>> #   CNS13 <dbl>, CNS14 <dbl>, CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>,
>> #   CNS18 <dbl>, CNS19 <dbl>, CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>,
>> #   CR04 <dbl>, CR05 <dbl>, CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>,
>> #   CD02 <dbl>, CD03 <dbl>, CD04 <dbl>, CS01 <dbl>, CS02 <dbl>

Why Can’t I Use the GEOID for Aggregation?

A user looking at the Census Block Code (h_geocode or w_geocode) may notice that the GEOID appears to be various identifier codes combined into one 15-digit unique geocode. While this is not entirely incorrect, the Block Code should not be used on its own to aggregate to higher geographies. At the time of the Decennial Census, the block code is generally aligned with the higher level geographies (e.g., state, county, tract). But over time, these higher level boundaries can change. The code below shows that this strategy will not work and could lead to significant inaccuracies.


# Read in RAC and crosswalk
rac = read_lodes_data("rac", "oh", "S000", "JT00", 2022)
crosswalk = read_crosswalk("oh", cols=["cty"])

# Conduct left merge of RAC data with crosswalk on block IDs
rac_w_crosswalk = rac.merge(
    crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="tabblk2020",
    validate="1:1",
)

# Create 'fake' county ID from first 5 digits of Block ID
rac_w_crosswalk["fake_county"] = rac_w_crosswalk["h_geocode"].str[0:5]

# Find blocks where 'fake' doesn't match real
num_blocks = (rac_w_crosswalk["fake_county"] != rac_w_crosswalk["cty"]).sum()

print(
    f"There are {num_blocks} blocks for which the first 5 digits of the Block ID do not match the county"
)
>> There are 2 blocks for which the first 5 digits of the Block ID do not match the county
library(dplyr)

# Read in RAC and crosswalk data
rac <- read_lodes_data("rac", "oh", "S000", "JT00", 2022)
crosswalk <- read_crosswalk("oh", cols = c("cty"))

# Conduct left join on block IDs
rac_w_crosswalk <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020"))

# Create 'fake' county ID from first 5 digits of Block ID
rac_w_crosswalk$fake_county <- substr(rac_w_crosswalk$h_geocode, 1, 5)

# Find number of mismatched blocks
num_blocks <- sum(rac_w_crosswalk$fake_county != rac_w_crosswalk$cty, na.rm = TRUE)

cat("There are", num_blocks, "blocks for which the first 5 digits of the Block ID do not match the county\n")
>> There are 2 blocks for which the first 5 digits of the Block ID do not match the county

Case Study: County to County Workplace and Residence Pairs

The goal here is to find the number of jobs with a workplace in one county and a residence in another county, for each unique pair of counties. For this, we again need to aggregate files to the county level, but now for OD files. The process for merging Origin-Destination (OD) files with the Geography Crosswalk is similar to the process for RAC/WAC data, with the exception that the OD files have two unique geographic identifiers (residence and workplace), where the RAC and WAC files only have one. Therefore, for many purposes, aggregating OD files may require two merges with the Geography Crosswalk.

Note

One thing to always be aware of when using OD data is the distinction between “main” and “auxiliary” data. The “main” data files contain data on all jobs whose workplaces and residences are within the state selected, while the “auxiliary” data files contain jobs whose workplaces are in the state but whose residences are in another state. This means that if a user wanted data on jobs with residences within a specific state and work anywhere, they would need to collect the “main” file from the desired target state, and the “auxiliary” files from every other state.
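Following that logic, collecting all jobs held by residents of one state means reading that state's “main” file plus the “aux” file of every other state. A sketch that only enumerates the files needed (three other states shown for brevity; the reads themselves are left as comments):

```python
# Target state and a truncated list of other states (a full run would use all of them)
target = "nj"
other_states = ["ny", "pa", "de"]

# One "main" file for the target, plus one "aux" file per other state
files_needed = [(target, "main")] + [(st, "aux") for st in other_states]
print(files_needed)  # [('nj', 'main'), ('ny', 'aux'), ('pa', 'aux'), ('de', 'aux')]

# Each pair could then be read with read_lodes_data("od", state, None, "JT00", 2019,
# main=(part == "main")), and the aux rows filtered to h_geocode values beginning
# with the target state's FIPS code ("34" for New Jersey), since aux files contain
# residents of every other state.
```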


od = read_lodes_data("od", "nj", None, "JT00", 2019)
crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])

# Create crosswalk with home and workplace labels
h_crosswalk = crosswalk.rename(columns=lambda col: f"h_{col}")
w_crosswalk = crosswalk.rename(columns=lambda col: f"w_{col}")

# Merge on home labels
od_home = od.merge(
    h_crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="h_tabblk2020",
    validate="m:1",
    indicator=True,
)

# Sanity check, no rows should be one-sided merge
assert (od_home["_merge"] == "both").all()

# Drop non-numerical columns for aggregating
od_home = od_home.drop(["h_geocode", "h_tabblk2020", "_merge", "createdate"], axis=1)

# Merge on work labels
od_home_and_work = od_home.merge(
    w_crosswalk,
    how="left",
    left_on="w_geocode",
    right_on="w_tabblk2020",
    validate="m:1",
    indicator=True,
)

# Sanity check
assert (od_home_and_work["_merge"] == "both").all()

# Drop non-numerical columns for aggregating
od_home_and_work = od_home_and_work.drop(
    ["w_geocode", "w_tabblk2020", "_merge"], axis=1
)

# Aggregate at county level
od_county = (
    od_home_and_work.groupby(["h_cty", "h_ctyname", "w_cty", "w_ctyname"])
    .sum()
    .reset_index()
)

print(od_county[["h_ctyname", "w_ctyname", "S000"]].head())
>>              h_ctyname              w_ctyname   S000
>> 0  Atlantic County, NJ    Atlantic County, NJ  83828
>> 1  Atlantic County, NJ      Bergen County, NJ   1170
>> 2  Atlantic County, NJ  Burlington County, NJ   3841
>> 3  Atlantic County, NJ      Camden County, NJ   5695
>> 4  Atlantic County, NJ    Cape May County, NJ   4634
library(dplyr)
library(purrr)

od <- read_lodes_data("od", "nj", NULL, "JT00", 2019)
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))

# Create crosswalk with home and workplace labels
h_crosswalk <- crosswalk %>%
  rename_with(~ paste0("h_", .x))
w_crosswalk <- crosswalk %>%
  rename_with(~ paste0("w_", .x))

# Merge on home labels
od_home <- od %>%
  left_join(h_crosswalk,
    by = c("h_geocode" = "h_tabblk2020"),
    relationship = "many-to-one"
  ) %>%
  mutate(merge = ifelse(is.na(h_cty), "left_only", "both"))

# Sanity check
if (any(od_home$merge != "both")) {
  stop("Merge contains mismatches on home geocode!")
}

# Drop non-numerical columns for aggregation
od_home <- od_home %>%
  select(-h_geocode, -merge, -createdate)

# Merge on work labels
od_home_and_work <- od_home %>%
  left_join(w_crosswalk,
    by = c("w_geocode" = "w_tabblk2020"),
    relationship = "many-to-one"
  ) %>%
  mutate(merge = ifelse(is.na(w_cty), "left_only", "both"))

# Sanity check: Ensure all merged rows are "both"
if (any(od_home_and_work$merge != "both")) {
  stop("Merge contains mismatches on work geocode!")
}

# Drop non-numerical columns for aggregation
od_home_and_work <- od_home_and_work %>%
  select(-w_geocode, -merge)

# Aggregate at county level
od_county <- od_home_and_work %>%
  group_by(h_cty, h_ctyname, w_cty, w_ctyname) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

print(head(od_county[, c("h_ctyname", "w_ctyname", "S000")]))
>> # A tibble: 6 × 3
>>   h_ctyname           w_ctyname              S000
>>   <chr>               <chr>                 <dbl>
>> 1 Atlantic County, NJ Atlantic County, NJ   83828
>> 2 Atlantic County, NJ Bergen County, NJ      1170
>> 3 Atlantic County, NJ Burlington County, NJ  3841
>> 4 Atlantic County, NJ Camden County, NJ      5695
>> 5 Atlantic County, NJ Cape May County, NJ    4634
>> 6 Atlantic County, NJ Cumberland County, NJ  4465

1.3. Spatial and Geographic Analysis

LODES data can be treated as geographic in nature because its base geography, the census block, is so small. Users may wish to do spatial aggregations, merges, and/or other analyses using the geographic nature of LODES. The crosswalk files contain the latitude and longitude of an internal point for each block. This internal point is used by the OnTheMap application to assist in selecting blocks. While it is not a centroid, and the only guarantee is that the point lies inside the block, it is still useful for spatial analyses.

1.3.1. Case Study: Finding all the Jobs Within 5 KM of a Point

All Census Blocks within 5 KM of Point

This image shows the goal of the first part of the code below (finding all blocks within 5 kilometers of the specific point). For the Python code that generated this image, see Visualizing Blocks around a Point in New Jersey.

The code below finds all jobs in 2018 where workers are 29 years or younger, whose workplace is within 5 km of a specific point (Latitude: 40.736679° N, Longitude: 74.175975° W) and whose residence is in Camden County, NJ. The following steps are required:

  1. Read in the appropriate OD file and Geography Crosswalk. The OD file is needed because answering the question requires both residence and workplace. For this data question, the “main” OD file for New Jersey in 2018 for job type JT00 is used. The Geography Crosswalk used is the New Jersey crosswalk. Note that for the crosswalk, we include the longitude and latitude for each block.

  2. The crosswalk must be converted to a data format that can handle spatial calculations. This includes using a proper projection. In this example, the projection ESRI:102003 is used, which is the same projection used in the OnTheMap application.

  3. Limit the OD rows to those that have a workplace within 5 km of the specified point (find all blocks that have their internal points within this radius).

  4. Limit further the OD rows to those whose residence is in Camden county.

  5. Sum the column “SA01”, which is the segment of jobs whose worker is 29 years of age or younger.

Note

In dealing with spatial data and conducting spatial analyses, projections are incredibly important. In this example, projection ESRI:102003 is used, which is a projection that is optimized for the contiguous United States. However, if an analysis is localized to a specific smaller geographic area, a different projection may be more accurate. For more on projections and how to choose a projection, take a look at this resource.
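To see why distances should not be computed directly in latitude/longitude degrees, compare the ground distance covered by the same longitude offset at two different latitudes. The sketch below is pure Python using the standard spherical-Earth haversine formula; the first point matches the case study's reference coordinates, while the 60° N point is an arbitrary comparison.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a spherical Earth (radius ~6371 km)."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# 0.05 degrees of longitude at the case study's latitude (~40.74 N)
d_nj = haversine_km(40.736679, -74.175975, 40.736679, -74.125975)
# the same 0.05 degrees of longitude at 60 N
d_north = haversine_km(60.0, -74.175975, 60.0, -74.125975)

print(f"~{d_nj:.2f} km at 40.74 N vs ~{d_north:.2f} km at 60 N")
```

The identical offset in degrees spans roughly 4.2 km near the reference point but only about 2.8 km at 60° N, which is why buffering and distance calculations must be done in a projected coordinate system with meaningful units (meters), not in raw degrees.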

import pandas as pd
import geopandas as gpd
from shapely import Point

# Read the OD file
od = read_lodes_data("od", "nj", None, "JT00", 2018)

# Read the Crosswalk file
crosswalk = read_crosswalk("nj", cols=["cty", "blklatdd", "blklondd"])

# Convert Crosswalk to GeoDataFrame and convert to appropriate projection
crosswalk = gpd.GeoDataFrame(
    crosswalk,
    geometry=gpd.points_from_xy(
        x=crosswalk["blklondd"], y=crosswalk["blklatdd"], crs="EPSG:4326"
    ),
).to_crs("ESRI:102003")

# Create point for buffering and change to appropriate projection
reference_point = Point(-74.175975, 40.736679)
reference_point_projected = gpd.GeoSeries([reference_point], crs="EPSG:4326").to_crs(
    "ESRI:102003"
)

# Create 5 km buffer polygon around point
buffer_5km = reference_point_projected.buffer(5000)

# Find list of blocks within buffer
blocks_within_5km = crosswalk.loc[
    crosswalk.geometry.within(buffer_5km.iloc[0]), "tabblk2020"
].to_list()

# Limit OD to just blocks whose workplace are within buffer
nj_blocks_in_buffer = od[od["w_geocode"].isin(blocks_within_5km)]

# Find list of blocks in Camden County, NJ
camden_blocks = crosswalk.loc[crosswalk["cty"] == "34007", "tabblk2020"].to_list()

# Limit OD to just blocks whose residence are in Camden County
nj_buffer_camden = nj_blocks_in_buffer[
    nj_blocks_in_buffer["h_geocode"].isin(camden_blocks)
]
num_jobs = nj_buffer_camden["SA01"].sum()

print(f"There are {num_jobs} jobs meeting the selected criteria.")
>> There are 242 jobs meeting the selected criteria.
library(sf)
library(dplyr)
library(readr)


# Read the OD file
od <- read_lodes_data("od", "nj", NULL, "JT00", 2018)

# Read the Crosswalk file
crosswalk <- read_crosswalk("nj", cols = c("cty", "blklatdd", "blklondd"))

# Convert Crosswalk to sf object and transform to ESRI:102003
crosswalk_sf <- crosswalk %>%
  filter(!is.na(blklatdd) &
    !is.na(blklondd)) %>% # Remove missing lat/lon
  st_as_sf(coords = c("blklondd", "blklatdd"), crs = 4326) %>%
  st_transform(crs = "ESRI:102003")

# Create reference point and transform to ESRI:102003
reference_point <- st_sfc(st_point(c(-74.175975, 40.736679)), crs = 4326) %>%
  st_transform(crs = "ESRI:102003")

# Create 5 km buffer polygon around point
buffer_5km <- st_buffer(reference_point, 5000)

# Find list of blocks within buffer
blocks_within_5km <- crosswalk_sf %>%
  filter(as.logical(st_within(geometry, buffer_5km, sparse = FALSE))) %>%
  pull(tabblk2020)

# Limit OD to just blocks whose workplace is within buffer
nj_blocks_in_buffer <- od %>% filter(w_geocode %in% blocks_within_5km)

# Find list of blocks in Camden County, NJ (FIPS: 34007)
camden_blocks <- crosswalk_sf %>%
  filter(cty == "34007") %>%
  pull(tabblk2020)

# Limit OD to just blocks whose residence is in Camden County
nj_buffer_camden <- nj_blocks_in_buffer %>% filter(h_geocode %in% camden_blocks)

# Sum number of workers
num_jobs <- sum(nj_buffer_camden$SA01, na.rm = TRUE)

cat(sprintf("There are %d jobs meeting the selected criteria.\n", num_jobs))
>> There are 242 jobs meeting the selected criteria.

1.4. Combining Other Data with LODES

Users may desire to combine other data products from the Census Bureau or non-Census data products with LODES. In many cases, merging on FIPS code or GEOID may be feasible, both on Census and non-Census data. In other cases, spatial merges may be required, if the external data is in a spatial format and does not have a GEOID that maps with Census data.

1.4.1. American Community Survey (ACS) Data

First, let’s consider the example of merging LODES data with data from the American Community Survey (ACS). ACS data uses GEOIDs as identifiers, so the process can be relatively simple. The tricky part is aligning geography: if the desired ACS year uses the same geography as the current LODES crosswalk, the process is straightforward. This is the case with ACS 2023 data and LODES 2022 data, which both rely on 2023 TIGER/Line Shapefiles.
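When the geographies align, the merge key can often be built by simple prefixing, since 2020 census GEOIDs nest positionally: a 15-character block GEOID begins with the 11-character tract GEOID, which in turn begins with the 5-character county GEOID. A minimal sketch, using a hypothetical New Jersey block code:

```python
block_geoid = "340010001001001"  # hypothetical block: state+county+tract+block

tract_geoid = block_geoid[:11]   # state (2) + county (3) + tract (6)
county_geoid = block_geoid[:5]   # state (2) + county (3)

print(tract_geoid)   # 34001000100
print(county_geoid)  # 34001
```

This positional nesting is why the crosswalk's "trct" column and an ACS GEOID built from state + county + tract codes can be compared directly, as in the case study below.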

Case Study: Combine the Number of Residences with Median Household Income by Tract

The code below aggregates 2022 RAC data to the tract level and then merges it with 2023 ACS median household income data. In order to run this code, you should request an API key. Note that limited queries can be made without an API key (simply leave off the &key= portion of the request), but that for repeated queries, an API key should be used.

import pandas as pd
import requests

rac = read_lodes_data("rac", "nj", "S000", "JT00", 2022)

# Read the Crosswalk file
crosswalk = read_crosswalk("nj", cols=["trct"])

# Define your API Key (Get one from https://api.census.gov/data/key_signup.html)
API_KEY = "YOUR_API_KEY_HERE"
if API_KEY == "YOUR_API_KEY_HERE":
    raise ValueError("Please create and enter your own API Key before continuing.")


# Define the base URL for ACS 5-Year Estimates
BASE_URL = "https://api.census.gov/data/2023/acs/acs5"

# Define the variables to retrieve (median household income)
VARIABLES = ["B19013_001E"]

# Define the geography (Tract-Level Data in New Jersey)
STATE_FIPS = "34"
GEOGRAPHY = f"for=tract:*&in=state:{STATE_FIPS}"

# Construct the API request URL
url = f"{BASE_URL}?get={','.join(VARIABLES)}&{GEOGRAPHY}&key={API_KEY}"

# Make the request
response = requests.get(url)
data = response.json()

# Convert to DataFrame
acs = pd.DataFrame(data[1:], columns=data[0])

# Add a GEOID column for merging (State FIPS + County FIPS + Tract Code)
acs["GEOID"] = acs["state"] + acs["county"] + acs["tract"]

# Merge and aggregate RAC data
rac_tract = rac.merge(
    crosswalk, how="left", left_on="h_geocode", right_on="tabblk2020", validate="1:1"
)
rac_tract = rac_tract.drop(["h_geocode", "tabblk2020", "createdate"], axis=1)
rac_agg = rac_tract.groupby("trct").sum().reset_index()

# Combine ACS and RAC data
rac_acs = rac_agg.merge(
    acs, how="left", left_on="trct", right_on="GEOID", validate="1:1"
)

# Drop duplicate GEOID column
rac_acs = rac_acs.drop(["GEOID"], axis=1)

print(rac_acs.head())
>>           trct  C000  CA01  CA02  CA03  ...  CS02  B19013_001E  state  county   tract
>> 0  34001000100  1065   235   546   284  ...   530        44870     34     001  000100
>> 1  34001000200  1123   246   520   357  ...   569        54858     34     001  000200
>> 2  34001000300  1514   347   775   392  ...   694        53463     34     001  000300
>> 3  34001000400  1147   218   563   366  ...   542        34895     34     001  000400
>> 4  34001000500  1080   254   576   250  ...   529        50704     34     001  000500
>> 
>> [5 rows x 46 columns]
library(httr)
library(jsonlite)
library(dplyr)
library(readr)

# Read the RAC file
rac <- read_lodes_data("rac", "nj", "S000", "JT00", 2022)

# Read the Crosswalk file
crosswalk <- read_crosswalk("nj", cols = c("trct"))

# Define your API Key (replace with your actual API Key)
api_key <- "YOUR_API_KEY_HERE"
if (api_key == "YOUR_API_KEY_HERE") {
  stop("Please create and enter your own API Key before continuing.")
}

# Define the base URL for ACS 5-Year Estimates
BASE_URL <- "https://api.census.gov/data/2023/acs/acs5"

# Define the variables to retrieve (median household income)
VARIABLES <- "B19013_001E"

# Define the geography (Tract-Level Data in New Jersey)
STATE_FIPS <- "34"
GEOGRAPHY <- paste0("for=tract:*&in=state:", STATE_FIPS)

# Construct the API request URL
url <- paste0(BASE_URL, "?get=", VARIABLES, "&", GEOGRAPHY, "&key=", api_key)

# Make the request to the Census API
response <- GET(url)

# Parse the response into JSON
data <- fromJSON(content(response, "text"))

# Convert the response to a data frame
acs <- as.data.frame(data[-1, ], stringsAsFactors = FALSE)
colnames(acs) <- data[1, ]

# Add a GEOID column for merging (State FIPS + County FIPS + Tract Code)
acs$GEOID <- paste0(acs$state, acs$county, acs$tract)

# Merge RAC data with Crosswalk
rac_tract <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020")) %>%
  select(-h_geocode, -createdate)

# Aggregate RAC data by tract
rac_agg <- rac_tract %>%
  group_by(trct) %>%
  summarise(across(everything(), \(x) sum(x, na.rm = TRUE)))

# Merge ACS data with RAC data
rac_acs <- rac_agg %>%
  left_join(acs, by = c("trct" = "GEOID"))

print(head(rac_acs))
>> # A tibble: 6 × 46
>>   trct    C000  CA01  CA02  CA03  CE01  CE02  CE03 CNS01 CNS02 CNS03 CNS04 CNS05
>>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 34001…  1065   235   546   284   210   443   412     0     0     8    43    25
>> 2 34001…  1123   246   520   357   197   461   465     1     0     5    26    24
>> 3 34001…  1514   347   775   392   282   699   533     0     0     4    43    37
>> 4 34001…  1147   218   563   366   195   457   495     2     0    10    37    30
>> 5 34001…  1080   254   576   250   232   481   367     2     0     0    23    19
>> 6 34001…   721   161   380   180   180   327   214     1     0     6    12    10
>> # ℹ 33 more variables: CNS06 <dbl>, CNS07 <dbl>, CNS08 <dbl>, CNS09 <dbl>,
>> #   CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>, CNS13 <dbl>, CNS14 <dbl>,
>> #   CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>, CNS18 <dbl>, CNS19 <dbl>,
>> #   CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>, CR04 <dbl>, CR05 <dbl>,
>> #   CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>, CD02 <dbl>, CD03 <dbl>,
>> #   CD04 <dbl>, CS01 <dbl>, CS02 <dbl>, B19013_001E <chr>, state <chr>,
>> #   county <chr>, tract <chr>

However, the process of matching LODES and ACS data becomes much more complicated if the desired ACS year does not have the same geography as the current LODES crosswalk. This process would require using the TIGER/Line shapefiles.

1.4.2. TIGER/Line Shapefiles

The TIGER/Line Shapefiles are extracts of selected geographic and cartographic information published by the Census Bureau. The shapefiles include polygon boundaries of geographic areas and features, linear features including roads and hydrography, and point features. The TIGER/Line Shapefiles contain a standard geographic identifier (GEOID) for each entity that links to the GEOID in the data from censuses and surveys, but they do not contain demographic or other data themselves. Users of LODES data may be interested in TIGER/Line Shapefiles if they desire aggregations or tabulations at geographic levels that are not in the current LODES crosswalk.

Case Study: Aggregating LODES Data by Former Connecticut Counties

In 2022, the Census Bureau began adopting Connecticut’s nine planning regions as county-equivalents. Thus, those are the tabulations present in the current crosswalk. If a user desired the previous counties as a tabulation area, those are not published in the LODES geographic crosswalk. However, they are available in historical TIGER/Line Shapefiles. Thus, a user can spatially merge LODES data with that shapefile (by finding the blocks whose internal points are inside each former Connecticut county) and aggregate at that level.

The code below finds the number of primary jobs whose workplace location is in each former county in Connecticut in 2020. This requires a spatial merge of the crosswalk with the TIGER/Line Shapefile and then a column-based merge to the WAC data.

Note

There are several situations in which a user may want to use TIGER/Line Shapefiles or other shapefiles to access historic or different geographies. The process is essentially the same for any geographic file (as long as the projections are appropriate and matching). However, it should be noted that census block boundaries can change over time, and this can make historical comparisons challenging.

import pandas as pd
import geopandas as gpd

# Load counties from TIGER shapefiles
counties = gpd.read_file(
    "https://www2.census.gov/geo/tiger/TIGER2020/COUNTY/tl_2020_us_county.zip"
)

# Filter counties to just those in CT
ct_counties = counties[counties["STATEFP"] == "09"]

# Read WAC data for CT
ct_wac = read_lodes_data("wac", "ct", "S000", "JT01", 2020)

# Read crosswalk data for CT
ct_crosswalk = read_crosswalk("ct", cols=["blklatdd", "blklondd"])

# Convert crosswalk to geographic format
ct_crosswalk = gpd.GeoDataFrame(
    ct_crosswalk,
    geometry=gpd.points_from_xy(ct_crosswalk["blklondd"], ct_crosswalk["blklatdd"]),
    crs="EPSG:4326",
)

# Convert TIGER/Line data to same projection
ct_counties = ct_counties.to_crs("EPSG:4326")

# Spatial join: assign counties based on block internal points (points within polygons)
blocks_with_counties = gpd.sjoin(
    ct_crosswalk, ct_counties, how="left", predicate="within"
)

# Merge block-county with WAC on workplace block ID
merged_wac = ct_wac.merge(
    blocks_with_counties,
    left_on="w_geocode",
    right_on="tabblk2020",
    how="left",
    validate="1:1",
)

jobs_by_ct = (
    merged_wac[["COUNTYFP", "NAME", "C000"]]
    .groupby(["COUNTYFP", "NAME"])
    .sum()
    .reset_index()
)

print(jobs_by_ct)
>>   COUNTYFP        NAME    C000
>> 0      001   Fairfield  351266
>> 1      003    Hartford  460016
>> 2      005  Litchfield   50648
>> 3      007   Middlesex   54315
>> 4      009   New Haven  330795
>> 5      011  New London   95092
>> 6      013     Tolland   34619
>> 7      015     Windham   30589
library(sf)
library(dplyr)
library(tigris)
options(tigris_use_cache = TRUE)

# Load CT counties from TIGER shapefiles
ct_counties <- counties(state = "09", year = 2020, cb = FALSE) %>%
  st_transform(crs = 4326)

# Read WAC data for CT
ct_wac <- read_lodes_data("wac", "ct", "S000", "JT01", 2020)

# Read crosswalk data for CT
ct_crosswalk <- read_crosswalk("ct", cols = c("blklondd", "blklatdd"))

# Convert to sf (point geometry)
ct_crosswalk <- st_as_sf(ct_crosswalk, coords = c("blklondd", "blklatdd"), crs = 4326)

# Spatial join: assign counties based on block internal points (points within polygons)
blocks_with_counties <- st_join(ct_crosswalk, ct_counties, join = st_within, left = FALSE)

# Merge block-county with WAC on workplace block ID
merged_wac <- ct_wac %>%
  left_join(blocks_with_counties, by = c("w_geocode" = "tabblk2020"))

jobs_by_ct <- merged_wac %>%
  group_by(COUNTYFP, NAME) %>%
  summarise(C000 = sum(C000, na.rm = TRUE), .groups = "drop")

print(jobs_by_ct)
>> # A tibble: 8 × 3
>>   COUNTYFP NAME         C000
>>   <chr>    <chr>       <dbl>
>> 1 001      Fairfield  351266
>> 2 003      Hartford   460016
>> 3 005      Litchfield  50648
>> 4 007      Middlesex   54315
>> 5 009      New Haven  330795
>> 6 011      New London  95092
>> 7 013      Tolland     34619
>> 8 015      Windham     30589