1. LODES Code Building Blocks
1.1. Loading Data
There are four types of LODES data files, and all of them will be used in these samples. The types and their purposes/contents are briefly described here. For more information on the files, their organization, and naming convention, see the LODES Technical Documentation. Additionally, labels for labor market segments and worker/job characteristics can be found in the LODES Schema Files section in the Appendix.
Residence Area Characteristics (RAC): jobs are totaled by home census block.
Workplace Area Characteristics (WAC): jobs are totaled by work census block.
Origin-Destination (OD): job totals are associated with both a home census block and a work census block.
Geography Crosswalk: relationship file that establishes the hierarchical connection between each 2020 census tabulation block and all higher level geographic entities.
The naming conventions for the RAC, WAC, and OD files contain the following information:
Two letter state abbreviation (e.g., ‘ny’)
File type: RAC, WAC, or OD
Part of the file: “main” or “aux” (only for OD files)
Four character segment code (only for RAC and WAC files)
Four character job type
Four digit year
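Put together, these pieces fully determine a file name. As a quick illustration (matching the URL construction in the reader function shown below), the 2004 New Jersey RAC file for segment SA01 and job type JT00 would be assembled like this:

# Illustrative only: assembling a LODES file name from its components
state, segment, job_type, year = "nj", "SA01", "JT00", 2004
rac_name = f"{state}_rac_{segment}_{job_type}_{year}.csv.gz"
print(rac_name)  # nj_rac_SA01_JT00_2004.csv.gz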
There are three easy ways to use LODES data in open source languages. You can (1) read directly from the online file structure, (2) download the raw data files directly, or (3) use output from the LED Extraction Tool.
1.1.1. Using the LODES Raw Data Files
Reading LODES directly from the online directory structure may be the simplest option for users, and the code examples in all other sections will read the data in this way. Because the file types are determined entirely by the variables above, we can set variables equal to the desired values and use them to construct the correct URL/file name. The code below defines a function that can be used repeatedly and flexibly to read in OD, RAC, and WAC data (in fact, this function appears repeatedly in these samples). Beneath the function is an example of its use. This code loads the RAC, WAC, and OD data for job type “JT00” (all workers) in 2004 for New Jersey. For the RAC and WAC files, the segment “SA01” (workers age 29 or younger) is selected. Note that for each file type, the geographic identifiers are read in specifically as a “string” datatype; this ensures consistency across file types, which will be important when merging files.
Note
These functions can also be used to read downloaded files, instead of reading them directly from the online file structure. Simply change the argument base to be a location on your computer (i.e. local filesystem) where the files are located.
import pandas as pd


def read_lodes_data(
    data_type,
    state,
    segment,
    job_type,
    year,
    main=True,
    base="https://lehd.ces.census.gov/data/lodes/LODES8/",
):
    """
    Reads LODES RAC, WAC, or OD data based on the specified type.

    Parameters:
    - data_type (str): One of 'rac', 'wac', or 'od' to specify the dataset.
    - state (str): Lowercase two-character state abbreviation (e.g., 'nj').
    - segment (str): Segment type (e.g., 'SA01'). Ignored if 'data_type' is 'od'.
    - job_type (str): Job type (e.g., 'JT00').
    - year (int): Year of data (e.g., 2004).
    - main (bool): Whether to read the 'main' file type. Only used if 'data_type' is 'od'.
    - base (str): Base file path to read data from. Can be URL or local file path.

    Returns:
    - pd.DataFrame: The requested LODES dataset as a pandas DataFrame.
    """
    if data_type not in ["rac", "wac", "od"]:
        raise ValueError("Invalid data type. Choose from 'rac', 'wac', or 'od'.")

    if data_type == "od":
        if main:
            file_url = f"{base}{state}/od/{state}_od_main_{job_type}_{year}.csv.gz"
        else:
            file_url = f"{base}{state}/od/{state}_od_aux_{job_type}_{year}.csv.gz"
        dtype_mapping = {"h_geocode": str, "w_geocode": str}
    else:
        file_url = f"{base}{state}/{data_type}/{state}_{data_type}_{segment}_{job_type}_{year}.csv.gz"
        dtype_mapping = {"h_geocode": str} if data_type == "rac" else {"w_geocode": str}

    return pd.read_csv(file_url, dtype=dtype_mapping)


# Example usage
rac_data = read_lodes_data("rac", "nj", "SA01", "JT00", 2004)
wac_data = read_lodes_data("wac", "nj", "SA01", "JT00", 2004)
od_data = read_lodes_data("od", "nj", None, "JT00", 2004)

print(od_data.head())

>>          w_geocode        h_geocode  S000  SA01  ...  SI01  SI02  SI03  createdate
>> 0  340010001001001  340010001001013     1     0  ...     0     0     1    20230321
>> 1  340010001001001  340010001002004     2     1  ...     0     0     2    20230321
>> 2  340010001001001  340010001002013     1     1  ...     0     0     1    20230321
>> 3  340010001001001  340010002002011     1     1  ...     0     0     1    20230321
>> 4  340010001001001  340010003001008     1     1  ...     0     0     1    20230321
>>
>> [5 rows x 13 columns]
library(readr)

read_lodes_data <- function(data_type,
                            state,
                            segment,
                            job_type,
                            year,
                            main = TRUE,
                            base = "https://lehd.ces.census.gov/data/lodes/LODES8/") {
  #' Reads LODES RAC, WAC, or OD data based on the specified type.
  #'
  #' @param data_type One of 'rac', 'wac', or 'od' to specify the dataset.
  #' @param state Lowercase two-character state abbreviation (e.g., 'nj').
  #' @param segment Segment type (e.g., 'SA01'). Ignored if 'data_type' is 'od'.
  #' @param job_type Job type (e.g., 'JT00').
  #' @param year Year of data (e.g., 2004).
  #' @param main Whether to read the 'main' file type. Only used if 'data_type' is 'od'.
  #' @param base Base file path to read data from. Can be URL or local file path.
  #'
  #' @returns The requested LODES dataset.

  # Validate data type
  if (!(data_type %in% c("rac", "wac", "od"))) {
    stop("Invalid data type. Choose from 'rac', 'wac', or 'od'.")
  }

  # Construct file URL based on data type
  if (data_type == "od") {
    file_url <- if (main) {
      paste0(base, state, "/od/", state, "_od_main_", job_type, "_", year, ".csv.gz")
    } else {
      paste0(base, state, "/od/", state, "_od_aux_", job_type, "_", year, ".csv.gz")
    }
    col_types <- cols(h_geocode = col_character(), w_geocode = col_character())
  } else {
    file_url <- sprintf(
      "%s%s/%s/%s_%s_%s_%s_%d.csv.gz",
      base, state, data_type, state, data_type, segment, job_type, year
    )
    col_types <- if (data_type == "rac") {
      cols(h_geocode = col_character())
    } else {
      cols(w_geocode = col_character())
    }
  }

  # Read CSV with specified column types
  df <- read_csv(file_url, col_types = col_types)
  return(df)
}

# Example usage
rac_data <- read_lodes_data("rac", "nj", "SA01", "JT00", 2004)
wac_data <- read_lodes_data("wac", "nj", "SA01", "JT00", 2004)
od_data <- read_lodes_data("od", "nj", NULL, "JT00", 2004)

print(head(od_data))

>> # A tibble: 6 × 13
>>   w_geocode      h_geocode  S000  SA01  SA02  SA03  SE01  SE02  SE03  SI01  SI02
>>   <chr>          <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 3400100010010… 34001000…     1     0     1     0     0     0     1     0     0
>> 2 3400100010010… 34001000…     2     1     0     1     1     0     1     0     0
>> 3 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 4 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 5 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> 6 3400100010010… 34001000…     1     1     0     0     1     0     0     0     0
>> # ℹ 2 more variables: SI03 <dbl>, createdate <dbl>
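As the note above suggests, the same function can read files you have already downloaded. Here is a minimal sketch, assuming a hypothetical local directory /path/to/lodes/ that mirrors the online layout:

# Hypothetical local path; the directory must mirror the online structure
# (e.g., /path/to/lodes/nj/rac/nj_rac_SA01_JT00_2004.csv.gz)
local_rac = read_lodes_data(
    "rac", "nj", "SA01", "JT00", 2004, base="/path/to/lodes/"
)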
The process for reading a Geography Crosswalk is also relatively simple. There is one crosswalk for each state, so we can write a function (which will also be reused throughout these examples) to read a crosswalk given the lowercase two-letter state abbreviation. The example code directly beneath the function reads the Geography Crosswalk for New Jersey.
Note
The Geography Crosswalks are very wide. They contain more than 15 different geography types, with names and codes for each geography. Therefore, it is recommended to read in only the columns that are important to the problem at hand. The function below provides this functionality.
import pandas as pd


def read_crosswalk(
    state, cols="all", base="https://lehd.ces.census.gov/data/lodes/LODES8/"
):
    """
    Reads LODES Geography Crosswalk given state.

    Parameters:
    - state (str): Lowercase two-character state abbreviation (e.g., 'nj').
    - cols (list): List of columns to read from the crosswalk in addition to
      tabblk2020 (Block ID). If "all", all columns will be read and returned.
    - base (str): Base file path to read data from. Can be URL or local file path.

    Returns:
    - pd.DataFrame: The requested LODES crosswalk as a pandas DataFrame.
    """
    file_url = f"{base}{state}/{state}_xwalk.csv.gz"
    dtype_mapping = str

    if cols == "all":
        return pd.read_csv(file_url, dtype=dtype_mapping)

    # Ensure tabblk2020 (Block ID) is included, without mutating the caller's list
    cols = ["tabblk2020"] + [c for c in cols if c != "tabblk2020"]
    return pd.read_csv(file_url, dtype=dtype_mapping, usecols=cols)


crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])

print(crosswalk.head())

>>         tabblk2020    cty              ctyname
>> 0  340010001001000  34001  Atlantic County, NJ
>> 1  340010001001001  34001  Atlantic County, NJ
>> 2  340010001001002  34001  Atlantic County, NJ
>> 3  340010001001003  34001  Atlantic County, NJ
>> 4  340010001001004  34001  Atlantic County, NJ
library(readr)
library(dplyr)

read_crosswalk <- function(state,
                           cols = "all",
                           base = "https://lehd.ces.census.gov/data/lodes/LODES8/") {
  #' Reads LODES Geography Crosswalk given state.
  #'
  #' @param state Lowercase two-character state abbreviation (e.g., 'nj').
  #' @param cols List of columns to read from the crosswalk in addition to tabblk2020 (Block ID). If "all", all columns will be read and returned.
  #' @param base Base file path to read data from. Can be URL or local file path.
  #'
  #' @returns The requested LODES crosswalk.

  # Construct file URL
  file_url <- paste0(base, state, "/", state, "_xwalk.csv.gz")

  # Read the full dataset first if all columns are needed
  if (identical(cols, "all")) {
    return(read_csv(file_url, col_types = cols(tabblk2020 = col_character())))
  }

  # Ensure tabblk2020 (GEOID) is included in the selected columns
  cols <- unique(c("tabblk2020", cols))

  # Read only the specified columns
  return(read_csv(
    file_url,
    col_types = cols(.default = col_character()),
    col_select = all_of(cols)
  ))
}

# Example usage
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))

print(head(crosswalk))

>> # A tibble: 6 × 3
>>   tabblk2020      cty   ctyname
>>   <chr>           <chr> <chr>
>> 1 340010001001000 34001 Atlantic County, NJ
>> 2 340010001001001 34001 Atlantic County, NJ
>> 3 340010001001002 34001 Atlantic County, NJ
>> 4 340010001001003 34001 Atlantic County, NJ
>> 5 340010001001004 34001 Atlantic County, NJ
>> 6 340010001001005 34001 Atlantic County, NJ
1.1.2. Using Output from the LED Extraction Tool
Reading output from the LED Extraction Tool is very similar. The extraction tool allows users to get both broader and narrower sets of data than the raw files. For example, users can combine multiple geographies and years instead of having to read and combine multiple files, and because specific geographies can be selected, users do not have to receive full-state data if they only need a specific area. In the example below, a user has selected RAC and WAC data for Camden County, NJ and ZIP Code 10019 (in Manhattan). The data were requested for four columns (total jobs and three age categories), across two years, and for primary jobs. There are a few differences to note from the raw files (as can be seen from the code and output below):
The naming of the files is based on a unique character identifier rather than the human-readable format of the raw data files. This identifier appears in the name of the folder, the data files, and the crosswalk.
The extraction tool path for area characteristics data produces the data for RAC and WAC, as well as a geographic crosswalk clipped to contain only the blocks in the selected geographies.
The resulting data files have a different selection of columns when compared to the raw data. Here, there are only four characteristic columns, because only four were selected. On the other hand, there are two columns not found in the raw data, segment and year, which add clarity to this output, especially given that you can select multiple years of data.
Note
If you would like to recreate the data in this code sample, you can recreate the query in the LED Extraction Tool using this settings file. The extraction tool provides an easy format for saving and uploading settings files. These files were downloaded by clicking the “CSVs & Metadata (ZIP)” link.
import pandas as pd

# Read in wac and rac data
rac_data = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/rac_f5e832282bacfdea48d8690947888efb.csv"
)
wac_data = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/wac_f5e832282bacfdea48d8690947888efb.csv"
)

# Read in selected crosswalk
crosswalk = pd.read_csv(
    "data/ac_f5e832282bacfdea48d8690947888efb/xwalk_f5e832282bacfdea48d8690947888efb.csv"
)

# Sample 10 rows from rac_data with fixed seed for reproducibility
print(rac_data.sample(n=10, random_state=4334))

>>              h_geocode segment  year  C000  CA01  CA02  CA03
>> 1698   340076008001012    S000  2019    10     6     3     1
>> 493    340076012004006    S000  2018    11     3     6     2
>> 5824   340076039013007    S000  2019    10     0     6     4
>> 11415  340076103002026    S000  2018    10     5     4     1
>> 528    340076013002026    S000  2018    29    13    10     6
>> 12396  340076089011078    S000  2019    11     2     7     2
>> 7483   340076075053001    S000  2018     8     0     5     3
>> 12048  340076086002031    S000  2019    11     2     8     1
>> 6138   340076051001027    S000  2019    36     6    20    10
>> 13105  340076105002040    S000  2019    20     5     8     7
library(readr)
library(dplyr)

# Read in wac and rac data
rac_data <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/rac_f5e832282bacfdea48d8690947888efb.csv"
)
wac_data <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/wac_f5e832282bacfdea48d8690947888efb.csv"
)

# Read in selected crosswalk
crosswalk <- read_csv(
  "data/ac_f5e832282bacfdea48d8690947888efb/xwalk_f5e832282bacfdea48d8690947888efb.csv"
)

# Sample 10 rows from rac_data with fixed seed for reproducibility
set.seed(4334)
print(sample_n(rac_data, 10))

>> # A tibble: 10 × 7
>>    h_geocode segment  year  C000  CA01  CA02  CA03
>>        <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
>>  1   3.40e14 S000     2019     5     2     1     2
>>  2   3.40e14 S000     2018    13     6     6     1
>>  3   3.40e14 S000     2018    12     1     4     7
>>  4   3.40e14 S000     2018    14     2     6     6
>>  5   3.40e14 S000     2019     5     0     0     5
>>  6   3.40e14 S000     2018    24     8    10     6
>>  7   3.40e14 S000     2018    18     5     9     4
>>  8   3.40e14 S000     2019    28     6    17     5
>>  9   3.40e14 S000     2018    19     3    10     6
>> 10   3.40e14 S000     2018    22     7     7     8
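Because the extraction-tool output stacks multiple years (and potentially segments) in a single file, the added year and segment columns make subsetting straightforward. A small sketch using the rac_data frame read above:

# Split the stacked extraction-tool output by year
rac_2018 = rac_data[rac_data["year"] == 2018]
rac_2019 = rac_data[rac_data["year"] == 2019]
print(len(rac_2018), len(rac_2019))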
1.2. Combining and Aggregating Data
1.2.1. Combining Data Files
Because LODES data files are heavily segmented, there are many instances where users may want to combine data files, whether across states, years, segments, and/or job types. Users may also find it useful to combine the “main” and “aux” portions of OD files (a sketch of this appears after the example below). These operations are relatively simple, with two main steps: (1) adding identifiers to each portion of data and (2) stacking or binding the portions together.
The code below combines RAC files for New Jersey for the years 2017 through 2020, across two job types and two segments, into one data structure. The same approach applies to combining WAC and OD files. For convenience, we reuse the function we wrote in Loading Data.
import pandas as pd

# Set variables
state = "nj"
segments = ["SA01", "SE03"]
job_types = ["JT00", "JT01"]
years = list(range(2017, 2021))

# Create empty list to store dataframes
all_data = []

for seg in segments:
    for job_type in job_types:
        for year in years:
            # Call the function to read LODES data
            rac = read_lodes_data("rac", state, seg, job_type, year)

            # Identify each portion of data
            rac["segment"] = seg
            rac["jobtype"] = job_type
            rac["year"] = year

            # Append to list
            all_data.append(rac)

# Combine all dataframes into one dataset
full_rac = pd.concat(all_data)

print(full_rac.head())

>>          h_geocode  C000  CA01  CA02  ...  createdate  segment  jobtype  year
>> 0  340010001001003     1     1     0  ...    20230321     SA01     JT00  2017
>> 1  340010001001005     5     5     0  ...    20230321     SA01     JT00  2017
>> 2  340010001001006     3     3     0  ...    20230321     SA01     JT00  2017
>> 3  340010001001009     4     4     0  ...    20230321     SA01     JT00  2017
>> 4  340010001001010     5     5     0  ...    20230321     SA01     JT00  2017
>>
>> [5 rows x 46 columns]
library(dplyr)
library(purrr)

# Set variables
state <- "nj"
segments <- c("SA01", "SE03")
job_types <- c("JT00", "JT01")
years <- 2017:2020

# Use map functions to iterate over all combinations and bind rows
full_rac <- expand.grid(
  segment = segments,
  jobtype = job_types,
  year = years
) %>%
  pmap_dfr(~ {
    rac <- read_lodes_data("rac", state, ..1, ..2, ..3)
    rac <- mutate(rac,
      segment = ..1,
      jobtype = ..2,
      year = ..3
    )
    return(rac)
  })

print(head(full_rac))

>> # A tibble: 6 × 46
>>   h_geocode     C000  CA01  CA02  CA03  CE01  CE02  CE03 CNS01 CNS02 CNS03 CNS04
>>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 34001000100…     1     1     0     0     0     1     0     0     0     0     0
>> 2 34001000100…     5     5     0     0     1     3     1     0     0     0     0
>> 3 34001000100…     3     3     0     0     1     2     0     0     0     0     0
>> 4 34001000100…     4     4     0     0     1     1     2     0     0     0     0
>> 5 34001000100…     5     5     0     0     2     2     1     0     0     0     0
>> 6 34001000100…     4     4     0     0     3     1     0     0     0     0     0
>> # ℹ 34 more variables: CNS05 <dbl>, CNS06 <dbl>, CNS07 <dbl>, CNS08 <dbl>,
>> #   CNS09 <dbl>, CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>, CNS13 <dbl>,
>> #   CNS14 <dbl>, CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>, CNS18 <dbl>,
>> #   CNS19 <dbl>, CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>, CR04 <dbl>,
>> #   CR05 <dbl>, CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>, CD02 <dbl>,
>> #   CD03 <dbl>, CD04 <dbl>, CS01 <dbl>, CS02 <dbl>, createdate <dbl>,
>> #   segment <fct>, jobtype <fct>, year <int>
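The same pattern applies to combining the “main” and “aux” portions of an OD file, mentioned at the start of this section. A minimal Python sketch using the read_lodes_data function from Loading Data:

import pandas as pd

# Read both portions of the 2019 NJ OD file for all jobs (JT00)
od_main = read_lodes_data("od", "nj", None, "JT00", 2019, main=True)
od_aux = read_lodes_data("od", "nj", None, "JT00", 2019, main=False)

# Label each portion before stacking, mirroring the approach above
od_main["part"] = "main"
od_aux["part"] = "aux"
od_full = pd.concat([od_main, od_aux], ignore_index=True)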
1.2.2. Combining Geographic Crosswalks
For combining geographic crosswalks, the process is even simpler. The files can simply be stacked on each other, since all the geographic identifiers are unique. The code below loops through all US states and combines the crosswalks into one national crosswalk. Again, we can use the function we wrote in Loading Data.
import os

# Make sure DC is included in the states
os.environ["DC_STATEHOOD"] = "1"
from us import states

import pandas as pd

# Create an empty list to store crosswalks
crosswalks = []

# Loop through all states (and DC)
for state in states.STATES:
    # The crosswalk file contains many different geographies.
    # For this example we've selected state, county, and tract.
    cols = ["st", "cty", "trct"]
    crosswalk = read_crosswalk(state.abbr.lower(), cols)

    # Append to list
    crosswalks.append(crosswalk)

# Combine all crosswalks into one.
all_crosswalks = pd.concat(crosswalks)

print(all_crosswalks.head())

>>         tabblk2020  st    cty         trct
>> 0  010010201001000  01  01001  01001020100
>> 1  010010201001001  01  01001  01001020100
>> 2  010010201001002  01  01001  01001020100
>> 3  010010201001003  01  01001  01001020100
>> 4  010010201001004  01  01001  01001020100
library(dplyr)
library(purrr)

# List of all states plus DC
states <- c(state.abb, "DC")

# Call function across all states
crosswalks <- map(states, ~ read_crosswalk(tolower(.x), c("st", "cty", "trct")))

# Combine all crosswalks into one data frame
all_crosswalks <- bind_rows(crosswalks)

print(head(all_crosswalks))

>> # A tibble: 6 × 4
>>   tabblk2020      st    cty   trct
>>   <chr>           <chr> <chr> <chr>
>> 1 010010201001000 01    01001 01001020100
>> 2 010010201001001 01    01001 01001020100
>> 3 010010201001002 01    01001 01001020100
>> 4 010010201001003 01    01001 01001020100
>> 5 010010201001004 01    01001 01001020100
>> 6 010010201001005 01    01001 01001020100
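With a single national crosswalk, lookups no longer depend on which state a block belongs to. For example, a quick tally of blocks per state from the combined Python frame built above:

# Count 2020 tabulation blocks per state in the national crosswalk
blocks_per_state = all_crosswalks.groupby("st").size()
print(blocks_per_state.head())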
1.2.3. Aggregating to Higher-Level Geographies
All LODES data files are provided at the census block level. However, for many analyses, users may want the data aggregated to other geographies. This is the purpose of the Geography Crosswalk, which provides geographic hierarchies for many geographies in the most current year of geographic data. The primary key/identifier in the Geography Crosswalk is the 2020 Census Tabulation Block Code (tabblk2020). This code does not change between Decennial Censuses and can be used to link to the geocodes in the OD/RAC/WAC files. All of the other columns in the crosswalk file refer to higher-level geographic entities (both by code and name) that are supported by the OnTheMap application. The crosswalks contain all 2020 Census Tabulation Blocks for each state and thus can be quite large, so users are recommended to read in only the columns that correspond to the geographies they are interested in. For a full list of geographies in the crosswalk, see the LODES Technical Documentation.
Warning
Because the geographic crosswalk is in the most current year of geographic data, all geographic analyses will be in terms of current boundaries, regardless of the year of LODES data. For example, if a user wishes to find the number of jobs in Bergen County, New Jersey in 2018 and 2021, the user can use the crosswalk to accomplish the county level aggregation, but the data will be in terms of the most recent geographic boundary of Bergen County for both 2018 and 2021. The same is true of analyses conducted in OnTheMap.
Case Study: Counts of Worker Residence by County in New Jersey
To find the number of worker residences by county in New Jersey, the block-level RAC data must be aggregated to the county level. To aggregate RAC or WAC data, the Geography Crosswalk must be joined to the data. Here, we seek the county-level job totals on the residence side. For loading the data, we can use the functions we wrote in Loading Data. We utilize a “left join”, so only rows that are in the LODES data are included in the merge. Note that the code assumes a one-to-one relationship between the two datasets, which is true in this case but would not be if a user were combining the crosswalk with multiple years or segments of data. The process for merging data with the crosswalk and performing an aggregation is the same for WAC data, with the only modification being that the left dataset should be joined on w_geocode.
Note
In this case study a left join is used. There are likely blocks in the Geographic Crosswalk that are not in the LODES data, because in the LODES production process, blocks that do not contain data (for example if no workers live in a certain block) are excluded. There are certain situations in which an outer join/right join (which are equivalent in this case) might be of use to a user.
rac = read_lodes_data("rac", "nj", "S000", "JT00", 2019)
crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])

# Conduct left merge of RAC data with crosswalk on block IDs
rac_w_crosswalk = rac.merge(
    crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="tabblk2020",
    validate="1:1",
)

# Drop non-numerical columns for aggregating
rac_w_crosswalk = rac_w_crosswalk.drop(
    ["h_geocode", "tabblk2020", "createdate"], axis=1
)

# Aggregate at county level
rac_county = rac_w_crosswalk.groupby(["cty", "ctyname"]).sum().reset_index()

print(rac_county.head())

>>      cty                ctyname    C000   CA01  ...   CD03    CD04    CS01    CS02
>> 0  34001    Atlantic County, NJ  129676  27741  ...  30981   28503   61999   67677
>> 1  34003      Bergen County, NJ  459959  84789  ...  101200 158152  229464  230495
>> 2  34005  Burlington County, NJ  234959  48625  ...  55876   66940  112377  122582
>> 3  34007      Camden County, NJ  252963  56857  ...  59068   63373  121146  131817
>> 4  34009    Cape May County, NJ   38129   7930  ...  9215     9390   18431   19698
>>
>> [5 rows x 43 columns]
library(dplyr)
library(purrr)

rac <- read_lodes_data("rac", "nj", "S000", "JT00", 2019)
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))

# Merge RAC data with Crosswalk on block IDs
rac_w_crosswalk <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020"), relationship = "one-to-one")

# Drop non-numerical columns for aggregation
rac_w_crosswalk <- rac_w_crosswalk %>% select(-h_geocode, -createdate)

# Aggregate at county level
rac_county <- rac_w_crosswalk %>%
  group_by(cty, ctyname) %>%
  summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE)), .groups = "drop")

print(head(rac_county))

>> # A tibble: 6 × 43
>>   cty   ctyname   C000  CA01   CA02   CA03  CE01   CE02   CE03 CNS01 CNS02 CNS03
>>   <chr> <chr>    <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
>> 1 34001 Atlant… 129676 27741  65407  36528 31095  46383  52198   476    34   834
>> 2 34003 Bergen… 459959 84789 246613 128557 84519 102548 272892   292    87  2091
>> 3 34005 Burlin… 234959 48625 121643  64691 47992  58194 128773   460    73  1283
>> 4 34007 Camden… 252963 56857 133596  62510 55828  73012 124123   521    51  1150
>> 5 34009 Cape M…  38129  7930  18106  12093  9390  12088  16651   250    39   269
>> 6 34011 Cumber…  62080 14626  32737  14717 14021  21958  26101  1294   143   456
>> # ℹ 31 more variables: CNS04 <dbl>, CNS05 <dbl>, CNS06 <dbl>, CNS07 <dbl>,
>> #   CNS08 <dbl>, CNS09 <dbl>, CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>,
>> #   CNS13 <dbl>, CNS14 <dbl>, CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>,
>> #   CNS18 <dbl>, CNS19 <dbl>, CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>,
>> #   CR04 <dbl>, CR05 <dbl>, CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>,
>> #   CD02 <dbl>, CD03 <dbl>, CD04 <dbl>, CS01 <dbl>, CS02 <dbl>
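As the note above mentions, an outer or right join can be useful when every block should appear in the result, even blocks with no resident workers. A sketch of that variant, using the Python rac and crosswalk frames from this case study and filling the job counts for unmatched blocks with zero:

# Right join keeps every block in the crosswalk, including blocks
# absent from the RAC file because no workers live there
rac_all_blocks = rac.merge(
    crosswalk,
    how="right",
    left_on="h_geocode",
    right_on="tabblk2020",
    validate="1:1",
)

# Blocks with no LODES record get zero jobs rather than missing values
count_cols = [c for c in rac.columns if c not in ("h_geocode", "createdate")]
rac_all_blocks[count_cols] = rac_all_blocks[count_cols].fillna(0)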
Why Can’t I Use the GEOID for Aggregation?
A user looking at the Census Block Code (h_geocode or w_geocode) may notice that the GEOID appears to be various identifier codes combined into one 15-digit unique geocode. While this is not entirely incorrect, the Block Code should not be used on its own to aggregate to higher geographies. At the time of the Decennial Census, the block code is generally aligned with the higher-level geographies (i.e. state, county, tract, etc.), but over time these higher-level boundaries can change. The code below shows that this strategy will not always work and could lead to significant inaccuracies.
# Read in RAC and crosswalk
rac = read_lodes_data("rac", "oh", "S000", "JT00", 2022)
crosswalk = read_crosswalk("oh", cols=["cty"])

# Conduct left merge of RAC data with crosswalk on block IDs
rac_w_crosswalk = rac.merge(
    crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="tabblk2020",
    validate="1:1",
)

# Create 'fake' county ID from first 5 digits of Block ID
rac_w_crosswalk["fake_county"] = rac_w_crosswalk["h_geocode"].str[0:5]

# Find blocks where 'fake' doesn't match real
num_blocks = (rac_w_crosswalk["fake_county"] != rac_w_crosswalk["cty"]).sum()

print(
    f"There are {num_blocks} blocks for which the first 5 digits of the Block ID do not match the county"
)

>> There are 2 blocks for which the first 5 digits of the Block ID do not match the county
library(dplyr)

# Read in RAC and crosswalk data
rac <- read_lodes_data("rac", "oh", "S000", "JT00", 2022)
crosswalk <- read_crosswalk("oh", cols = c("cty"))

# Conduct left join on block IDs
rac_w_crosswalk <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020"))

# Create 'fake' county ID from first 5 digits of Block ID
rac_w_crosswalk$fake_county <- substr(rac_w_crosswalk$h_geocode, 1, 5)

# Find number of mismatched blocks
num_blocks <- sum(rac_w_crosswalk$fake_county != rac_w_crosswalk$cty, na.rm = TRUE)

cat("There are", num_blocks, "blocks for which the first 5 digits of the Block ID do not match the county\n")

>> There are 2 blocks for which the first 5 digits of the Block ID do not match the county
Case Study: County to County Workplace and Residence Pairs
The goal here is to find, for each unique pair of counties, the number of jobs with a workplace in one county and a residence in the other. For this, we again need to aggregate files to the county level, but now for OD files. The process for merging Origin-Destination (OD) files with the Geography Crosswalk is similar to the process for RAC/WAC data, with the exception that the OD files have two unique geographic identifiers (residence and workplace), where the RAC and WAC files only have one. Therefore, for many purposes, aggregating OD files may require two merges with the geographic crosswalk.
Note
One thing to always be aware of when using OD data is the distinction between “main” and “auxiliary” data. The “main” data files contain data on all jobs whose workplaces and residences are within the state selected, while the “auxiliary” data files contain jobs whose workplaces are in the state but whose residences are in another state. This means that if a user wanted data on jobs with residences within a specific state and work anywhere, they would need to collect the “main” file from the desired target state, and the “auxiliary” files from every other state.
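To make the note above concrete, here is a hedged sketch of how one might assemble all jobs held by New Jersey residents regardless of work state: New Jersey’s “main” file plus every other state’s “aux” file, filtered to home blocks whose geocode starts with New Jersey’s state FIPS code (34). Only a few neighboring states are listed for brevity; a full run would loop over all other states.

import pandas as pd

# Partial state list for illustration; extend to all other states in practice
other_states = ["ny", "pa", "de"]

parts = [read_lodes_data("od", "nj", None, "JT00", 2019, main=True)]
for st in other_states:
    aux = read_lodes_data("od", st, None, "JT00", 2019, main=False)
    # Keep only jobs whose residence block is in New Jersey (FIPS 34)
    parts.append(aux[aux["h_geocode"].str.startswith("34")])

nj_resident_jobs = pd.concat(parts, ignore_index=True)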
od = read_lodes_data("od", "nj", None, "JT00", 2019)
crosswalk = read_crosswalk("nj", cols=["cty", "ctyname"])

# Create crosswalk with home and workplace labels
h_crosswalk = crosswalk.rename(columns=lambda col: f"h_{col}")
w_crosswalk = crosswalk.rename(columns=lambda col: f"w_{col}")

# Merge on home labels
od_home = od.merge(
    h_crosswalk,
    how="left",
    left_on="h_geocode",
    right_on="h_tabblk2020",
    validate="m:1",
    indicator=True,
)

# Sanity check, no rows should be one-sided merge
assert (od_home["_merge"] == "both").all()

# Drop non-numerical columns for aggregating
od_home = od_home.drop(["h_geocode", "h_tabblk2020", "_merge", "createdate"], axis=1)

# Merge on work labels
od_home_and_work = od_home.merge(
    w_crosswalk,
    how="left",
    left_on="w_geocode",
    right_on="w_tabblk2020",
    validate="m:1",
    indicator=True,
)

# Sanity check
assert (od_home_and_work["_merge"] == "both").all()

# Drop non-numerical columns for aggregating
od_home_and_work = od_home_and_work.drop(
    ["w_geocode", "w_tabblk2020", "_merge"], axis=1
)

# Aggregate at county level
od_county = (
    od_home_and_work.groupby(["h_cty", "h_ctyname", "w_cty", "w_ctyname"])
    .sum()
    .reset_index()
)

print(od_county[["h_ctyname", "w_ctyname", "S000"]].head())

>>              h_ctyname              w_ctyname   S000
>> 0  Atlantic County, NJ    Atlantic County, NJ  83828
>> 1  Atlantic County, NJ      Bergen County, NJ   1170
>> 2  Atlantic County, NJ  Burlington County, NJ   3841
>> 3  Atlantic County, NJ      Camden County, NJ   5695
>> 4  Atlantic County, NJ    Cape May County, NJ   4634
library(dplyr)
library(purrr)

od <- read_lodes_data("od", "nj", NULL, "JT00", 2019)
crosswalk <- read_crosswalk("nj", cols = c("cty", "ctyname"))

# Create crosswalk with home and workplace labels
h_crosswalk <- crosswalk %>% rename_with(~ paste0("h_", .x))
w_crosswalk <- crosswalk %>% rename_with(~ paste0("w_", .x))

# Merge on home labels
od_home <- od %>%
  left_join(h_crosswalk,
    by = c("h_geocode" = "h_tabblk2020"),
    relationship = "many-to-one"
  ) %>%
  mutate(merge = ifelse(is.na(h_cty), "left_only", "both"))

# Sanity check
if (any(od_home$merge != "both")) {
  stop("Merge contains mismatches on home geocode!")
}

# Drop non-numerical columns for aggregation
od_home <- od_home %>% select(-h_geocode, -merge, -createdate)

# Merge on work labels
od_home_and_work <- od_home %>%
  left_join(w_crosswalk,
    by = c("w_geocode" = "w_tabblk2020"),
    relationship = "many-to-one"
  ) %>%
  mutate(merge = ifelse(is.na(w_cty), "left_only", "both"))

# Sanity check: Ensure all merged rows are "both"
if (any(od_home_and_work$merge != "both")) {
  stop("Merge contains mismatches on work geocode!")
}

# Drop non-numerical columns for aggregation
od_home_and_work <- od_home_and_work %>% select(-w_geocode, -merge)

# Aggregate at county level
od_county <- od_home_and_work %>%
  group_by(h_cty, h_ctyname, w_cty, w_ctyname) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

print(head(od_county[, c("h_ctyname", "w_ctyname", "S000")]))

>> # A tibble: 6 × 3
>>   h_ctyname           w_ctyname              S000
>>   <chr>               <chr>                 <dbl>
>> 1 Atlantic County, NJ Atlantic County, NJ   83828
>> 2 Atlantic County, NJ Bergen County, NJ      1170
>> 3 Atlantic County, NJ Burlington County, NJ  3841
>> 4 Atlantic County, NJ Camden County, NJ      5695
>> 5 Atlantic County, NJ Cape May County, NJ    4634
>> 6 Atlantic County, NJ Cumberland County, NJ  4465
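From the aggregated county pairs, simple summaries follow directly. For example, a sketch computing the share of these jobs whose home and work county coincide, using the Python od_county frame built above:

# Share of job flows where home county equals work county
same_county = od_county.loc[od_county["h_cty"] == od_county["w_cty"], "S000"].sum()
share = same_county / od_county["S000"].sum()
print(f"{share:.1%} of these jobs stay within the home county.")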
1.3. Spatial and Geographic Analysis
LODES data can be treated as geographic in nature because the base geography (census blocks) is so small. Users may wish to do spatial aggregations, merges, and/or other analyses using the geographic nature of LODES. The crosswalk files contain the latitude and longitude of an internal point for each block. This internal point is used by the OnTheMap application to assist in selecting blocks. While it is not a centroid, and the only guarantee is that the point falls inside the block, the point is nonetheless useful in spatial analyses.
1.3.1. Case Study: Finding all the Jobs Within 5 KM of a Point
[Figure: blocks within 5 km of the reference point in New Jersey]
This image shows the goal of the first part of the code below (finding all blocks within 5 kilometers of the specified point). For the Python code that generated this image, see Visualizing Blocks around a Point in New Jersey.
The code below finds all jobs in 2018 where workers are 29 years or younger, whose workplace is within 5 km of a specific point (Latitude: 40.736679° N, Longitude: 74.175975° W) and whose residence is in Camden County, NJ. The following steps are required:
Read in the appropriate OD file and Geography Crosswalk. The OD file is needed because the question involves both residence and workplace. For this question, the “main” OD file for New Jersey in 2018 for job type JT00 is required, along with the New Jersey Geography Crosswalk. Note that for the crosswalk, we include the longitude and latitude of each block.
The crosswalk must be converted to a data format that can handle spatial calculations. This includes using a proper projection. In this example, the projection ESRI:102003 is used, which is the same projection used in the OnTheMap application.
Limit the OD rows to those that have a workplace within 5 km of the specified point (find all blocks that have their internal points within this radius).
Further limit the OD rows to those whose residence is in Camden County.
Sum the column “SA01”, which is the segment of jobs whose worker is 29 years of age or younger.
Note
In dealing with spatial data and conducting spatial analyses, projections are incredibly important. In this example, projection ESRI:102003 is used, which is a projection that is optimized for the contiguous United States. However, if an analysis is localized to a specific smaller geographic area, a different projection may be more accurate. For more on projections and how to choose a projection, take a look at this resource.
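As a hedged illustration of the note above, a more localized analysis of New Jersey might use a state plane system instead. The sketch below assumes EPSG:3424 (NAD83 / New Jersey, US feet); since that CRS measures distance in feet, the 5 km buffer must be expressed in feet:

import geopandas as gpd
from shapely import Point

# Project the reference point to a New Jersey state plane CRS
point = gpd.GeoSeries([Point(-74.175975, 40.736679)], crs="EPSG:4326")
point_nj = point.to_crs("EPSG:3424")  # NAD83 / New Jersey (ftUS)

# 5 km expressed in feet (approximate conversion)
buffer_5km_ft = point_nj.buffer(5000 / 0.3048)

The rest of the workflow below is unchanged; only the CRS and buffer units differ.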
import pandas as pd
import geopandas as gpd
from shapely import Point

# Read the OD file
od = read_lodes_data("od", "nj", None, "JT00", 2018)

# Read the Crosswalk file
crosswalk = read_crosswalk("nj", cols=["cty", "blklatdd", "blklondd"])

# Convert Crosswalk to GeoDataFrame and convert to appropriate projection
crosswalk = gpd.GeoDataFrame(
    crosswalk,
    geometry=gpd.points_from_xy(
        x=crosswalk["blklondd"], y=crosswalk["blklatdd"], crs="EPSG:4326"
    ),
).to_crs("ESRI:102003")

# Create point for buffering and change to appropriate projection
reference_point = Point(-74.175975, 40.736679)
reference_point_projected = gpd.GeoSeries([reference_point], crs="EPSG:4326").to_crs(
    "ESRI:102003"
)

# Create 5 km buffer polygon around point
buffer_5km = reference_point_projected.buffer(5000)

# Find list of blocks within buffer
blocks_within_5km = crosswalk.loc[
    crosswalk.geometry.within(buffer_5km.iloc[0]), "tabblk2020"
].to_list()

# Limit OD to just blocks whose workplace is within the buffer
nj_blocks_in_buffer = od[od["w_geocode"].isin(blocks_within_5km)]

# Find list of blocks in Camden County, NJ
camden_blocks = crosswalk.loc[crosswalk["cty"] == "34007", "tabblk2020"].to_list()

# Limit OD to just blocks whose residence is in Camden County
nj_buffer_camden = nj_blocks_in_buffer[
    nj_blocks_in_buffer["h_geocode"].isin(camden_blocks)
]

num_jobs = nj_buffer_camden["SA01"].sum()
print(f"There are {num_jobs} jobs meeting the selected criteria.")

>> There are 242 jobs meeting the selected criteria.
library(sf)
library(dplyr)
library(readr)

# Read the OD file
od <- read_lodes_data("od", "nj", NULL, "JT00", 2018)

# Read the Crosswalk file
crosswalk <- read_crosswalk("nj", cols = c("cty", "blklatdd", "blklondd"))

# Convert Crosswalk to sf object and transform to ESRI:102003
crosswalk_sf <- crosswalk %>%
  filter(!is.na(blklatdd) & !is.na(blklondd)) %>% # Remove missing lat/lon
  st_as_sf(coords = c("blklondd", "blklatdd"), crs = 4326) %>%
  st_transform(crs = "ESRI:102003")

# Create reference point and transform to ESRI:102003
reference_point <- st_sfc(st_point(c(-74.175975, 40.736679)), crs = 4326) %>%
  st_transform(crs = "ESRI:102003")

# Create 5 km buffer polygon around point
buffer_5km <- st_buffer(reference_point, 5000)

# Find list of blocks within buffer
blocks_within_5km <- crosswalk_sf %>%
  filter(as.logical(st_within(geometry, buffer_5km, sparse = FALSE))) %>%
  pull(tabblk2020)

# Limit OD to just blocks whose workplace is within buffer
nj_blocks_in_buffer <- od %>% filter(w_geocode %in% blocks_within_5km)

# Find list of blocks in Camden County, NJ (FIPS: 34007)
camden_blocks <- crosswalk_sf %>%
  filter(cty == "34007") %>%
  pull(tabblk2020)

# Limit OD to just blocks whose residence is in Camden County
nj_buffer_camden <- nj_blocks_in_buffer %>% filter(h_geocode %in% camden_blocks)

# Sum number of workers
num_jobs <- sum(nj_buffer_camden$SA01, na.rm = TRUE)

cat(sprintf("There are %d jobs meeting the selected criteria.\n", num_jobs))

>> There are 242 jobs meeting the selected criteria.
1.4. Combining Other Data with LODES
Users may desire to combine other data products from the Census Bureau or non-Census data products with LODES. In many cases, merging on FIPS code or GEOID may be feasible, both on Census and non-Census data. In other cases, spatial merges may be required, if the external data is in a spatial format and does not have a GEOID that maps with Census data.
1.4.1. American Community Survey (ACS) Data
First, let’s consider the example of merging LODES data with data from the American Community Survey (ACS). ACS data uses GEOIDs as identifiers, so the process can be relatively simple. However, the tricky part is aligning geography. If the ACS year has the same geography as the current LODES crosswalk, then the process is straightforward. This is the case with 2023 ACS data and 2022 LODES data, which both rely on 2023 TIGER/Line Shapefiles.
Case Study: Combine the Number of Residences with Median Household Income by Tract
The code below aggregates 2022 RAC data to the tract level and then merges it with 2023 ACS median household income data. In order to run this code, you should request an API key. Limited queries can be made without an API key (simply leave off the &key= portion of the request), but for repeated queries an API key should be used.
import pandas as pd
import requests

rac = read_lodes_data("rac", "nj", "S000", "JT00", 2022)

# Read the Crosswalk file
crosswalk = read_crosswalk("nj", cols=["trct"])

# Define your API Key (Get one from https://api.census.gov/data/key_signup.html)
API_KEY = "YOUR_API_KEY_HERE"

if API_KEY == "YOUR_API_KEY_HERE":
    raise ValueError("Please create and enter your own API Key before continuing.")

# Define the base URL for ACS 5-Year Estimates
BASE_URL = "https://api.census.gov/data/2023/acs/acs5"

# Define the variables to retrieve (median household income)
VARIABLES = ["B19013_001E"]

# Define the geography (Tract-Level Data in New Jersey)
STATE_FIPS = "34"
GEOGRAPHY = f"for=tract:*&in=state:{STATE_FIPS}"

# Construct the API request URL
url = f"{BASE_URL}?get={','.join(VARIABLES)}&{GEOGRAPHY}&key={API_KEY}"

# Make the request
response = requests.get(url)
data = response.json()

# Convert to DataFrame
acs = pd.DataFrame(data[1:], columns=data[0])

# Add a GEOID column for merging (State FIPS + County FIPS + Tract Code)
acs["GEOID"] = acs["state"] + acs["county"] + acs["tract"]

# Merge and aggregate RAC data
rac_tract = rac.merge(
    crosswalk, how="left", left_on="h_geocode", right_on="tabblk2020", validate="1:1"
)
rac_tract = rac_tract.drop(["h_geocode", "tabblk2020", "createdate"], axis=1)
rac_agg = rac_tract.groupby("trct").sum().reset_index()

# Combine ACS and RAC data
rac_acs = rac_agg.merge(
    acs, how="left", left_on="trct", right_on="GEOID", validate="1:1"
)

# Drop duplicate GEOID column
rac_acs = rac_acs.drop(["GEOID"], axis=1)

print(rac_acs.head())

>>           trct  C000  CA01  CA02  CA03  ...  CS02  B19013_001E  state  county   tract
>> 0  34001000100  1065   235   546   284  ...   530        44870     34     001  000100
>> 1  34001000200  1123   246   520   357  ...   569        54858     34     001  000200
>> 2  34001000300  1514   347   775   392  ...   694        53463     34     001  000300
>> 3  34001000400  1147   218   563   366  ...   542        34895     34     001  000400
>> 4  34001000500  1080   254   576   250  ...   529        50704     34     001  000500
>>
>> [5 rows x 46 columns]
library(httr)
library(jsonlite)
library(dplyr)
library(readr)

# Read the RAC file
rac <- read_lodes_data("rac", "nj", "S000", "JT00", 2022)

# Read the Crosswalk file
crosswalk <- read_crosswalk("nj", cols = c("trct"))

# Define your API Key (replace with your actual API Key)
api_key <- "YOUR_API_KEY_HERE"

if (api_key == "YOUR_API_KEY_HERE") {
  stop("Please create and enter your own API Key before continuing.")
}

# Define the base URL for ACS 5-Year Estimates
BASE_URL <- "https://api.census.gov/data/2023/acs/acs5"

# Define the variables to retrieve (median household income)
VARIABLES <- "B19013_001E"

# Define the geography (Tract-Level Data in New Jersey)
STATE_FIPS <- "34"
GEOGRAPHY <- paste0("for=tract:*&in=state:", STATE_FIPS)

# Construct the API request URL
url <- paste0(BASE_URL, "?get=", VARIABLES, "&", GEOGRAPHY, "&key=", api_key)

# Make the request to the Census API
response <- GET(url)

# Parse the response into JSON
data <- fromJSON(content(response, "text"))

# Convert the response to a data frame
acs <- as.data.frame(data[-1, ], stringsAsFactors = FALSE)
colnames(acs) <- data[1, ]

# Add a GEOID column for merging (State FIPS + County FIPS + Tract Code)
acs$GEOID <- paste0(acs$state, acs$county, acs$tract)

# Merge RAC data with Crosswalk
rac_tract <- rac %>%
  left_join(crosswalk, by = c("h_geocode" = "tabblk2020")) %>%
  select(-h_geocode, -createdate)

# Aggregate RAC data by tract
rac_agg <- rac_tract %>%
  group_by(trct) %>%
  summarise(across(everything(), \(x) sum(x, na.rm = TRUE)))

# Merge ACS data with RAC data
rac_acs <- rac_agg %>%
  left_join(acs, by = c("trct" = "GEOID"))

print(head(rac_acs))

>> # A tibble: 6 × 46
>>   trct    C000  CA01  CA02  CA03  CE01  CE02  CE03 CNS01 CNS02 CNS03 CNS04 CNS05
>>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>> 1 34001…  1065   235   546   284   210   443   412     0     0     8    43    25
>> 2 34001…  1123   246   520   357   197   461   465     1     0     5    26    24
>> 3 34001…  1514   347   775   392   282   699   533     0     0     4    43    37
>> 4 34001…  1147   218   563   366   195   457   495     2     0    10    37    30
>> 5 34001…  1080   254   576   250   232   481   367     2     0     0    23    19
>> 6 34001…   721   161   380   180   180   327   214     1     0     6    12    10
>> # ℹ 33 more variables: CNS06 <dbl>, CNS07 <dbl>, CNS08 <dbl>, CNS09 <dbl>,
>> #   CNS10 <dbl>, CNS11 <dbl>, CNS12 <dbl>, CNS13 <dbl>, CNS14 <dbl>,
>> #   CNS15 <dbl>, CNS16 <dbl>, CNS17 <dbl>, CNS18 <dbl>, CNS19 <dbl>,
>> #   CNS20 <dbl>, CR01 <dbl>, CR02 <dbl>, CR03 <dbl>, CR04 <dbl>, CR05 <dbl>,
>> #   CR07 <dbl>, CT01 <dbl>, CT02 <dbl>, CD01 <dbl>, CD02 <dbl>, CD03 <dbl>,
>> #   CD04 <dbl>, CS01 <dbl>, CS02 <dbl>, B19013_001E <chr>, state <chr>,
>> #   county <chr>, tract <chr>
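One caveat worth flagging: the Census API returns all values as text, and ACS uses large negative sentinel values (such as -666666666) for estimates that cannot be computed. A short sketch cleaning the merged Python frame before any numeric analysis:

import pandas as pd

# The API returns text; convert income to numeric and mask sentinel
# values (e.g., -666666666 marks a median that could not be computed)
rac_acs["B19013_001E"] = pd.to_numeric(rac_acs["B19013_001E"], errors="coerce")
rac_acs["B19013_001E"] = rac_acs["B19013_001E"].mask(rac_acs["B19013_001E"] < 0)
print(rac_acs[["C000", "B19013_001E"]].describe())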
However, the process of matching LODES and ACS data becomes much more complicated if the desired ACS year does not have the same geography as the current LODES crosswalk. That process would require using the TIGER/Line Shapefiles.
1.4.2. TIGER/Line Shapefiles
The TIGER/Line Shapefiles are extracts of selected geographic and cartographic information published by the Census Bureau. The shapefiles include polygon boundaries of geographic areas and features, linear features including roads and hydrography, and point features. The TIGER/Line Shapefiles contain a standard geographic identifier (GEOID) for each entity that links to the GEOID in the data from censuses and surveys, but they do not contain demographic or other data themselves. Users of LODES data may be interested in TIGER/Line Shapefiles if they desire aggregations or tabulations at geographic levels that are not current.
Case Study: Aggregating LODES Data by Former Connecticut Counties
In 2022, the Census Bureau began adopting Connecticut’s nine planning regions as county-equivalents, so those are the tabulation areas present in the current crosswalk. If a user desires the previous counties as tabulation areas, those are not published in the LODES Geography Crosswalk; however, they are available in historical TIGER/Line Shapefiles. A user can therefore spatially merge LODES data with such a shapefile (by finding the blocks whose internal points fall inside each former Connecticut county) and aggregate at that level.
The code below finds the number of primary jobs whose workplace location is in each former county in Connecticut in 2020. This requires a spatial merge of the crosswalk with the TIGER/Line Shapefile and then a column-based merge to the WAC data.
Note
There are several situations in which a user may want to use TIGER/Line Shapefiles or other shapefiles to access historical or different geographies. The process is essentially the same for any geographic file (as long as the projections are appropriate and matching). However, it should be noted that census block boundaries can change over time, which can make historical comparisons challenging.
import pandas as pd
import geopandas as gpd

# Load counties from TIGER shapefiles
counties = gpd.read_file(
    "https://www2.census.gov/geo/tiger/TIGER2020/COUNTY/tl_2020_us_county.zip"
)

# Filter counties to just those in CT
ct_counties = counties[counties["STATEFP"] == "09"]

# Read WAC data for CT
ct_wac = read_lodes_data("wac", "ct", "S000", "JT01", 2020)

# Read crosswalk data for CT
ct_crosswalk = read_crosswalk("ct", cols=["blklatdd", "blklondd"])

# Convert crosswalk to geographic format
ct_crosswalk = gpd.GeoDataFrame(
    ct_crosswalk,
    geometry=gpd.points_from_xy(ct_crosswalk["blklondd"], ct_crosswalk["blklatdd"]),
    crs="EPSG:4326",
)

# Convert TIGER/Line data to same projection
ct_counties = ct_counties.to_crs("EPSG:4326")

# Spatial join: assign counties based on block internal points (points within polygons)
blocks_with_counties = gpd.sjoin(
    ct_crosswalk, ct_counties, how="left", predicate="within"
)

# Merge block-county with WAC on workplace block ID
merged_wac = ct_wac.merge(
    blocks_with_counties,
    left_on="w_geocode",
    right_on="tabblk2020",
    how="left",
    validate="1:1",
)

jobs_by_ct = (
    merged_wac[["COUNTYFP", "NAME", "C000"]]
    .groupby(["COUNTYFP", "NAME"])
    .sum()
    .reset_index()
)

print(jobs_by_ct)

>>   COUNTYFP        NAME    C000
>> 0      001   Fairfield  351266
>> 1      003    Hartford  460016
>> 2      005  Litchfield   50648
>> 3      007   Middlesex   54315
>> 4      009   New Haven  330795
>> 5      011  New London   95092
>> 6      013     Tolland   34619
>> 7      015     Windham   30589
library(sf)
library(dplyr)
library(tigris)

options(tigris_use_cache = TRUE)

# Load CT counties from TIGER shapefiles
ct_counties <- counties(state = "09", year = 2020, cb = FALSE) %>%
  st_transform(crs = 4326)

# Read WAC data for CT
ct_wac <- read_lodes_data("wac", "ct", "S000", "JT01", 2020)

# Read crosswalk data for CT
ct_crosswalk <- read_crosswalk("ct", cols = c("blklondd", "blklatdd"))

# Convert to sf (point geometry)
ct_crosswalk <- st_as_sf(ct_crosswalk, coords = c("blklondd", "blklatdd"), crs = 4326)

# Spatial join: assign counties based on block internal points (points within polygons)
blocks_with_counties <- st_join(ct_crosswalk, ct_counties, join = st_within, left = FALSE)

# Merge block-county with WAC on workplace block ID
merged_wac <- ct_wac %>%
  left_join(blocks_with_counties, by = c("w_geocode" = "tabblk2020"))

jobs_by_ct <- merged_wac %>%
  group_by(COUNTYFP, NAME) %>%
  summarise(C000 = sum(C000, na.rm = TRUE), .groups = "drop")

print(jobs_by_ct)

>> # A tibble: 8 × 3
>>   COUNTYFP NAME         C000
>>   <chr>    <chr>       <dbl>
>> 1 001      Fairfield  351266
>> 2 003      Hartford   460016
>> 3 005      Litchfield  50648
>> 4 007      Middlesex   54315
>> 5 009      New Haven  330795
>> 6 011      New London  95092
>> 7 013      Tolland     34619
>> 8 015      Windham     30589