Importing US Census Bureau Data into R
The US Census Bureau collects a vast array of demographic and economic data on a continual basis. Much of that data at various scales is made available through tables that can be downloaded from their data.census.gov data portal.
The challenge with using that data is that it is made available to the public in formats that require processing so they can be used in R.
This tutorial describes how to download USCB data, import it into R, process it, and create choropleths of demographic data by joining US Census Bureau tables with TIGER/Line shapefiles for state, county, and census tract boundary polygons.
Acquire the Data
Download the Data Table
Data from a variety of different programs is available on data.census.gov, and data from each program is sliced and diced across a variety of different tables.
Three ACS tables containing useful groups of data include:
- DP02 Selected Social Characteristics in the United States
- DP03 Selected Economic Characteristics in the United States
- DP04 Selected Housing Characteristics
The video below shows how to download the DP03 (Selected Economic Characteristics) table with county-level data for Illinois from data.census.gov.
- From the data.census.gov home page, search for DP03. The default table shows values for the entire USA.
- Click Customize Table.
- Click Geos to select the type of geographic area. For this example, we will use County, within Illinois, and All counties in Illinois.
- Click Close to update the displayed table.
- Click Download, make sure the correct data set is selected and that the File Type is CSV.
- Select the appropriate Product if needed. For this example we will use the most recent five-year estimates so that data is available for all areas.
- Save the .zip file to your local computer.
- Create a working directory for your data and script.
- Open the .zip archive, copy the file with a name containing the word "data" in it, and paste it into your working directory.
- If you wish to inspect the data table and see what is in it, you can open it in a spreadsheet program.
Downloading the Polygons
The data table contains names and identifier codes for counties, but no latitude / longitude information on where those counties are located. To be able to map and spatially analyze the counties, you need to add geospatial location information to the table.
One way to do that with census data in R is to download a shapefile of county polygons from the US Census Bureau and join (merge) the table with that polygon data.
Polygon data is provided in the shapefile, a file format developed by the GIS company ESRI in the late 1990s. While this format is old and has many limitations (including a ten-character limit on field names), it can be opened by most GIS software and is still commonly used to distribute geospatial data on the web.
The term shapefile is actually a misnomer since a single shapefile consists of five or more separate files that contain different parts of the geospatial information.
- Go to the TIGER Cartographic Boundary Files download page.
- Download the appropriate file. The 20m resolution files are good enough for a web map while being fairly small for quick download. The 5m file may be better for a printable map.
- Open the .zip archive, copy all of the shapefile files, and paste them in your working directory.
Processing the Data
Load the Table
You can read the CSV file into an R data frame using the read.csv() function.
Start R or RStudio and use Session -> Set Working Directory to choose the working directory you created for your data files above.
> data = read.csv("ACSDP5Y2019.DP03_data_with_overlays_2021-06-10T141325.csv", skip=1, as.is=T)
- The first parameter is a string with the name of the .csv table file. This name will vary depending on the file you extracted from the .zip archive.
- The skip=1 parameter is used to skip the top row of cryptic field codes.
- The as.is=T parameter brings strings in as text rather than factors, which can be hard to use with this type of data.
As downloaded, data from data.census.gov often contains many more fields than you need, and the names are often somewhat cryptic. To make your data more manageable and to make your R scripts easier to understand, you should remove unneeded columns and rename columns with names that are cryptic or too long to be used conveniently.
You can list the column names using the names() function. The DP03 table has 550 columns.
> names(data)
"id"
"Geographic.Area.Name"
"Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Percent..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Percent.Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"
"Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"
"Percent..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"
"Percent.Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over..In.labor.force"
(truncated)
Unless you have downloaded a small table, you will probably want to isolate specific columns and rename them.
If you need help searching through a long list of column names, you can use the grepl() function to find columns, and the which() function to give the column numbers:
> which(grepl("alone", names(data)))
75 76 77 78
> which(grepl("Median.household.income", names(data)))
247 248 249 250
For this example, we choose these variables:
-  "id" (this column must be included to use in the join below)
-  "Geographic.Area.Name"
-  "Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over"
-  "Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"
-  "Percent..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
-  "Percent.Margin.of.Error..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
-  "Estimate..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."
-  "Margin.of.Error..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."
Data frame rows and columns can be selected using square brackets: [row, column]
To select these columns, put the column numbers in a c() vector and use that vector as a column index in brackets. Note that the comma before the vector is important: it indicates that we want all rows, but just the columns in the vector.
> colnumbers = c(1, 2, 3, 4, 77, 78, 247, 248)
> data = data[,colnumbers]
> names(data)
"id"
"Geographic.Area.Name"
"Estimate..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Margin.of.Error..EMPLOYMENT.STATUS..Population.16.years.and.over"
"Percent..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
"Percent.Margin.of.Error..COMMUTING.TO.WORK..Workers.16.years.and.over..Car..truck..or.van....drove.alone"
"Estimate..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."
"Margin.of.Error..INCOME.AND.BENEFITS..IN.2019.INFLATION.ADJUSTED.DOLLARS...Total.households..Median.household.income..dollars."
To make it easier to work with your data values, you should rename the columns to names that are more manageable. You can put the names in a vector and assign that to the names() of the data frame.
Note that you should use underscores (_) or periods rather than spaces or dashes in your column names.
> colnames = c("ID", "Working.Age.Population", "Working.Age.Population.MOE", "Drive.To.Work.Percent", "Drive.To.Work.MOE", "Median.Household.Income", "Median.Household.Income.MOE") >names(data) = colnames >names(data)  "ID" "Name"  "Working.Age.Population" "Working.Age.Population.MOE"  "Drive.To.Work.Percent" "Drive.To.Work.MOE"  "Median.Household.Income" "Median.Household.Income.MOE"
Joining the Table and Polygons
Thematic maps indicate what is where. Data downloaded from data.census.gov is the what. In order to map that data, you need polygons representing the where: the areas that can be colored with the what data.
A join is an operation where two data sets are connected to form a single data set based on common key values shared between the data sets.
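The idea can be illustrated with a small hypothetical example (the data frames, key values, and income figures below are made up for demonstration; they are not from the downloaded table):

```r
# Two toy data frames that share a common key, stored under
# different column names (GEOID in one, ID in the other)
areas = data.frame(GEOID = c("17019", "17031"),
                   Name = c("Champaign", "Cook"))
values = data.frame(ID = c("17031", "17019"),
                    Income = c(64660, 58072))

# merge() matches rows where areas$GEOID equals values$ID,
# producing one combined data frame with columns from both
joined = merge(areas, values, by.x = "GEOID", by.y = "ID")
```

Rows whose key appears in only one of the two data sets are dropped by default; this is the same behavior you will rely on below when joining the census table to the county polygons.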
The shapefile of county polygons downloaded above can be loaded into R with the st_read() function from the sf (simple features) library.
> library(sf)
> polygons = st_read("cb_2020_us_county_5m.shp", stringsAsFactors = F, quiet = T)
You can then join the table to the polygons using the merge() function.
> polygons = merge(polygons, data, by.x="AFFGEOID", by.y="ID")
> plot(polygons["Median.Household.Income"])
- The by.x parameter indicates the name of the field from the polygon file that should be used as a key. An AFFGEOID is a type of code used by the US Census Bureau to uniquely identify locations. This name comes from the American FactFinder, which is what data.census.gov used to be called.
- The by.y parameter indicates the name of the field from the data table that should be used as the key for the join.
If you want to save a copy of your processed data for later use, you can use the st_write() function to create a variety of different types of geospatial data files.
One simple, open file format is GeoJSON.
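For example, the joined polygons could be written to a GeoJSON file with something like the sketch below (the output file name is arbitrary):

```r
library(sf)

# Write the joined polygons to a GeoJSON file in the working
# directory; delete_dsn = TRUE overwrites any existing file
# with the same name
st_write(polygons, "counties.geojson", delete_dsn = TRUE)
```

The saved file can later be read back into R with st_read(), or loaded directly by most GIS software and web mapping libraries.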
There are three types of variables that you will find in geospatial census data, and the type of data indicates what kind of visualization you can use.
- Nominal data: Text identifiers that can be used as labels
  - Location names: These can be used as labels, such as the county names in this example.
  - ID codes: These are usually only used for data manipulation, such as the AFFGEOID code used in the join.
- Count data: Numeric values that represent a number of things. In this example, working age population is a count variable.
- Amount data: Values that are measured or calculated
  - Central tendency: Values that indicate a middle value (like an average or median) that can be used to compare different areas. In this example, median household income is a central tendency amount variable.
  - Percentages: Values that indicate the ratio of the population or households that have a particular characteristic. In this example, percent of workers who drive to work alone is expressed as a percentage.
Amount variables are commonly mapped as choropleths, where areas are colored based on the values of the variable in the different areas.
The palette of different colors and how those colors are organized from low to high is referred to as a color ramp.
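You can see how a color ramp is built in R by calling colorRampPalette() on its own: it returns a function that generates any requested number of colors interpolated between the endpoints you give it.

```r
# colorRampPalette() takes a vector of colors and returns a
# *function* that interpolates between them
ramp = colorRampPalette(c("darkgreen", "green1"))

# Calling the returned function with a count produces that many
# hex color codes evenly spaced along the ramp
colors = ramp(4)
print(colors)
```

The plotting functions below use this mechanism internally: the pal parameter receives the ramp function and calls it with however many classes the break algorithm produces.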
This is an example of a choropleth that uses different shades of green (darker to lighter) for the percent of people driving to work alone in each county.
> plot(polygons["Drive.To.Work.Percent"], pal = colorRampPalette(c("darkgreen", "green1")), border="lightgray", breaks="kmeans", key.pos=1, main="")
- The shorthand to indicate which variable in a data frame to map is to include the quoted name in square brackets after the data frame object name.
- The pal (palette) parameter uses the colorRampPalette() function, which takes a vector of colors and stretches a color ramp between them. Colors can be specified using RGB hex codes or named colors. A list of R color names is available on Tian Zheng's website.
- The border parameter specifies the color used for polygon borders. Shades of gray give a less harsh look than the default of black.
- breaks="kmeans" uses a k-means clustering algorithm for assigning colors to values. This algorithm keeps clusters of values together so that the colors have a better chance of representing groups of similar characteristics.
- key.pos=1 puts the legend at the bottom.
- main="" turns off the title, since a chart like this when you plan on putting a descriptive caption under the map when you use it in a report or notebook.
The map clearly indicates that most Illinois workers drive to work alone, although there are notable clusters of folks who use other means to get to work in Cook County (Chicago) and Champaign County (the University of Illinois).
Diverging Color Choropleth
Choropleths can also use diverging color ramps based on two or three different hues that make it easier to see the difference between areas of low or high values.
A diverging color ramp can be created simply by providing three different colors to the colorRampPalette().
> plot(polygons["Median.Household.Income"], pal = colorRampPalette(c("red3", "white", "blue3")), border="lightgray", breaks="jenks", key.pos=1, main="")
This map shows that the highest income areas of the state are the suburbs around Chicago, with smaller clusters around Peoria, Springfield (the capital), and the St. Louis suburbs.
Bubble Map Formatting
It is generally considered to be poor cartographic practice to map counts like population as choropleths. We tend to associate larger map objects with larger numeric values and when smaller areas have larger values (such as densely populated urban areas), it is easy to misinterpret a choropleth of counts.
An alternative to a choropleth is a graduated symbol map where symbols are overlaid on top of areas and sized based on a variable. A common symbol is a circle, giving rise to the name bubble map.
centroids = st_centroid(polygons)
categories = quantile(polygons$Working.Age.Population, na.rm=T)
category = cut(polygons$Working.Age.Population, categories)
sizes = c(0.5, 1, 1.5, 2)
size = sizes[category]
plot(st_geometry(polygons), border="lightgray")
plot(st_geometry(centroids), add=T, pch=16, cex=size, col="navy")
legend(x = "topright", pch=16, pt.cex=sizes, legend = round(categories)[1:4], col="navy")
- The bubbles are drawn at the centers of polygons, which are created using the st_centroid() function.
- The different ranges of values for the four different sizes of bubbles are calculated from the variable using the quantile() function, which breaks the range of values into four even groups.
- The cut() function classifies each variable value into one of the categories.
- The sizes object is a list of four different bubble sizes. These can be adjusted if you need larger or smaller bubbles.
- The size object is used to keep track of the size of each bubble.
- The first plot() plots the st_geometry() outline of the county polygons, so we have context for knowing what area each bubble represents.
- The second plot() plots the bubbles. The add=T parameter indicates that the bubbles are drawn on top of the existing plot of county outlines.
- The legend() draws a legend indicating the values that the different bubble sizes represent.
If you want to see how a particular value ranks relative to the other values in the distribution, a rank-order graph can be useful. A rank-order graph shows bars for each value from low to high.
data = data[order(data$Median.Household.Income),]
color = ifelse(grepl("Champaign", data$Name), "red3", "navy")
barplot(data$Median.Household.Income, col=color, border="gray", fg="white", las=1, xaxs='i', yaxs='i')
grid(nx=NA, ny=NULL, col="#00000080", lty="solid")
abline(a=0, b=0, lwd=3)