Importing US Census Bureau Data into R

The US Census Bureau collects a vast array of demographic and economic data on a continual basis. Much of that data at various scales is made available through tables that can be downloaded from their data.census.gov data portal.

The challenge with using that data is that it is made available to the public in formats that require processing so they can be used in R.

This tutorial describes how to download USCB data, import it into R, and process it create choropleths of demographic data from existing layers in ArcGIS Onlin, and how to create new layers by joining US Census Bureau tables with TIGER/Line shapefiles for state, county, and census tract boundary polygons.

data.census.gov

Acquire the Data

Download the Data Table

Data from a variety of different programs is available on data.census.gov, and data from each program is sliced and diced across a variety of different tables.

Three ACS tables containing useful groups of data include:

The video below shows how to download the DP03 (Selected Economic Characteristics) table with county-level data for Illinois from data.census.gov.

  1. From the data.census.gov home page, search for DP03. The default table shows values for the entire USA.
  2. Click Customize Table.
  3. Click Geos to select the type of geographic area. For this example, we will use County, within Illinois, and All counties in Illinois.
  4. Close to update the displayed table.
  5. Click Download, make sure the correct data set is selected and that the File Type is CSV.
  6. Select the appropriate Product if needed. For this example we will use the most recent five-year estimates so that data is available for all areas.
  7. Save the .zip file to your local computer.
  8. Create a working directory for your data and script.
  9. Open the .zip archive, copy the file with a name containing the word "data" in it, and paste it into your working directory.
  10. If you wish to inspect the data table and see what is in it, you can open it in a spreadsheet program.
Downloading table data from data.census.gov

Downloading the Polygons

The data table contains names and identifier codes for counties, but no latitude / longitude information on where those counties are located. To be able to map and spatially analyze the counties, you need to add geospatial location information to the table.

One way to do that with census data in R is to download a shapefile of county polygons from the US Census Bureau and join (merge) the table with that polygon data.

Polygon data is provided in the shapefile, which is a file format developed by the GIS company ESRI in the late 1990s. While this file is old and has many limitations (including a ten-character limit on field names), it can be opened by most GIS software and is still commonly used to distribute geospatial data on the web.

The term shapefile is actually a misnomer since a single shapefile consists of five or more separate files that contain different parts of the geospatial information.

  1. Go to the TIGER Cartographic Boundary Files download page.
  2. Download appropriate file from the US Census Bureau's The 20m resolution files are good enough for a web map while being fairly small for quick upload. The 5m file may be better for a printable map.
  3. Open the .zip archive, copy all of the shapefile files, and paste them in your working directory.
Importing a TIGER Shapefile of Polygons

Processing the Data

Load the Table

You can read the CSV file into an R data frame using the read.csv() function.

Start R or R Studio and Session -> Set Working Directory to the working directory you created for your data files above.

> data = read.csv("ACSDP5Y2019.DP03_data_with_overlays_2021-06-10T141325.csv", skip=1, as.is=T)

Select Columns

As downloaded, data from data.cenus.gov often contains many more fields than you need, and the names are often somewhat cryptic. To make your data more manageable and to make your R scripts easier to understand, you should remove unneeded columns and rename columns with names that are cryptic or too long to be used conveniently.

You can list the column names using the names() function. The DP03 table has 550 columns.

> names(data)

 [1] "id"                                                                                      
 [2] "Geographic.Area.Name"                                                                    
 [3] "Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over"                               
 [4] "Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"                        
 [5] "Percent..EMPLOYMENT.STATUS..Population.16.years.and.over"                                
 [6] "Percent.Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"                
 [7] "Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"               
 [8] "Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"        
 [9] "Percent..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"                
[10] "Percent.Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"

(truncated)

Unless you have downloaded a small table, you will probably want to isolate specific columns and rename them.

If you need help searching through a long list of column names, you can use the grepl() function to find columns, and the which() function to give the column numbers:

> which(grepl("alone", names(data)))
[1] 75 76 77 78

> which(grepl("Median.household.income", names(data)))
[1] 247 248 249 250

For this example, we choose these variables:

Data frame rows and columns can be selected using square brackets: [row, column]

To select columns these columns, put the column numbers in a c() vector and use that vector as a column index in brackets. Note the comma before the variable name is important to indicate that we want all rows, but just the columns in the vector.

> colnumbers = c(1,3,4,77,78,247,248)

> data = data[,colnumbers]

> names(data)
[1] "id"
[3] "Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over"
[4] "Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"
[5] "Percent..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
[6] "Percent.Margin.of.Error..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
[7] "Estimate..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."
[8] "Margin.of.Error..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."

To make it easier to work with your data values, you should rename the columns to names that are more manageable. You can put the names in a vector and assign that to the names() of the data frame.

Note that you should use underscores (_) or periods rather than spaces or dashes in your column names.

> colnames = c("ID",
"Working.Age.Population", "Working.Age.Population.MOE",
"Drive.To.Work.Percent", "Drive.To.Work.MOE", 
"Median.Household.Income", "Median.Household.Income.MOE")

>names(data) = colnames

>names(data)
[1] "ID"                          "Name"                       
[3] "Working.Age.Population"      "Working.Age.Population.MOE" 
[5] "Drive.To.Work.Percent"       "Drive.To.Work.MOE"          
[7] "Median.Household.Income"     "Median.Household.Income.MOE"

Joining the Table and Polygons

Thematic maps indicate what is where. Data downloaded from data.census.gov data is the what. In order to map that data, you need where polygons that represent the areas that can be colored with the what data.

A join is an operation where two data sets are connected to form a single data set based on common key values shared between the data sets.

Attribute Join

The shapefile of county polygons downloaded above can be loaded into R with the st_read() function from the sf (simple features) library.

> library(sf)

> polygons = st_read("cb_2020_us_county_5m.shp", stringsAsFactors = F, quiet = T)

You can then join the table to the polygons using the merge() function.

> polygons = merge(polygons, data, by.x="AFFGEOID", by.y="ID")

> plot(polygons["Median.Household.Income"])
Default plot() of the merged data

Exporting Data

If you want to save a copy of your processed data for later use, you can use the st_write() function to create a variety of different types of geospatial data files.

One simple, open file format is GeoJSON

st_write(polygons, "2019-illinois-acs.geojson")

Visualization

There are three types of variables that you will find in geospatial census data, and the type of data indicates what kind of visualization you can use.

Single-Color Choropleth

Amount variables are commonly mapped as choropleths, where areas are colored based on the values of the variable in the different areas.

The palette of different colors and how those colors are organized from low to high is referred to as a color ramp.

This is an example of a choropleth that uses different shades of green (darker to lighter) for the percent of people driving to work alone in each county.

> plot(polygons["Drive.To.Work.Percent"], 
	pal = colorRampPalette(c("darkgreen", "green1")),
	border="lightgray", breaks="kmeans", key.pos=1, main="")

The map clearly indicates that most Illinois workers drive to work alone, although there are notable clusters of folks who use other means to get to work in Cook County (Chicago) and Champaign County (the University of Illinois).

2019 percent of Illinois workers who drive to work alone (2021 USCB)

Diverging Color Choropleth

Choropleths can also use diverging color ramps based on two or three different hues that make it easier to see the difference between areas of low or high values.

A diverging color ramp can be created simply by providing three different colors to the colorRampPalette().

> plot(polygons["Median.Household.Income"], 
	pal = colorRampPalette(c("red3", "white", "blue3")),
	border="lightgray", breaks="jenks", key.pos=1, main="")

This map shows that the highest income areas of the state is the suburbs around Chicago, with smaller clusters around Peoria, Springfield (the capitol), and the St. Louis suburbs.

2019 median household income in Illinois counties in dollars (2021 USCB)

Bubble Map Formatting

It is generally considered to be poor cartographic practice to map counts like population as choropleths. We tend to associate larger map objects with larger numeric values and when smaller areas have larger values (such as densely populated urban areas), it is easy to misinterpret a choropleth of counts.

An alternative to a choropleth is a graduated symbol map where symbols are overlaid on top of areas and sized based on a variable. A common symbol is a circle, giving rise to the name bubble map.

centroids = st_centroid(polygons)

categories = quantile(polygons$Working.Age.Population, na.rm=T)

category = cut(polygons$Working.Age.Population, categories)

sizes = c(0.5, 1, 1.5, 2)

size = sizes[category]

plot(st_geometry(polygons), border="lightgray")

plot(st_geometry(centroids), add=T, pch=16, cex=size, col="navy")

legend(x = "topright", pch=16, pt.cex=sizes, legend = round(categories)[1:4], col="navy")
2019 working-age population Illinois counties (2021 USCB)

Rank-order graph

If you want to see how a particular value ranks relative to the other values in the distribution, a rank-order graph can be useful. A rank-order graph shows bars for each value from low to high.

data = data[order(data$Median.Household.Income),]

color = ifelse(grepl("Champaign", data$Name), "red3", "navy")

barplot(data$Median.Household.Income, col=color,
	border="gray", fg="white", las=1,
	xaxs='i', yaxs='i')

grid(nx=NA, ny=NULL, col="#00000080", lty="solid")

abline(a=0, b=0, lwd=3)
2019 Champaign County median household income relative to other Illinois counties (2021 USCB)