Introduction to Geospatial Data in R
This tutorial will give a brief introduction on how to perform simple import and visualization of geospatial area data in R.
Installing and Using R
R is an open-source computer language and software environment for statistical analysis. R was first developed in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It based on the S statistical computing language that was originally developed at the legendary Bell Labs in 1976. R has become a popular package for data science and financial analysis.
You may download R for PC or Mac on the Comprehensive R Archive Network (CRAN) home page if you wish to use it on your own machine.
If you are a member of the University of Illinois community, R is available on most open lab computers and on computers you access remotely via UIUC AnyWare and can be found in the Start. Use the 64-bit version.
The R user interface displays three different types of screens:
- The console is where you can type and run R commands.
- Graph display windows will pop up when you plot() in R.
- The script editor is a simple text editor that you can use to prepare script files containing R commands. Scripts are helpful for preserving long sequences of complex commands that you can debug, run multiple times, or submit as classroom assignments.
Because you interact with R almost exclusively with typed commands, you need to be familiar with some basic language elements in order to use R. While this is an unfamiliar way of working for many users and involves a learning curve, it is also a very fast and precise way of working with software once you gain a modest level of mastery of the language.
Examples in this Tutorial
In the examples given in this tutorial, commands you type are displayed preceded by a > symbol that is a prompt from the program.
After you type a command you press the Enter or Return key to execute the command.
The examples show what R will display in response to your commands.
To get the full benefit of this tutorial, it is recommended that you actually type commands into R as you skim through the examples.
At it's simplest, you can use R as a calculator and it will display the value.
> 2+2  4
R expressions are similar to formulas in Excel and use the same mathematical symbols or operators: + - * /. The carat operator (^) is used for exponents.
> 2+2  4 > 10 + 2  12 > 10 - 2  8 > 10 * 2  20 > 10 / 2  5 > 10^2  100
To make it possible to use the values of calculations in subsequent formulas, you can assign values to named objects. Objects are also sometimes referred to as variables. These are symbolic names you can use in later formulas to save the effort of repeating calculations, or simply to make formulas easier to read.
To display the contents of a object at the console, you simply type the name of the object.
> x = 10 * 2 > x  20 > x + 15  35
R has a "<-" operator that is also used to assign values to objects and you should know about this in case you look at R examples shared on the web. This operator give R a distinctive look and has additional capabilites, but also makes the code more difficult for casual R-users to read, so this tutorial just uses the equals sign (=) common to most other computer languages.
> y <- 12 + 16 > y  28 > y = 12 + 16 > y  28
Object names must start with a letter and are case sensitive.
> hello = 12 > Hello = 15 > hello  12 > Hello  15
You should always try to make your object names meaningful so that you and other people can understand what your objects are. Rather than calling a object containing a standard deviation "s" you might call it "stdev". The extra time spent typing now will save you confusion later.
Object names cannot contain spaces. However there are techniques for representing multi-word object names that get around this issue:
- word_word (underscore)
- word.word (period)
- wordWord (camelback capitalization)
One of the most powerful features of R is that objects can contain many different types of data. Objects can also contain text. Segments of text are called strings of characters. You assign text by enclosing your text in quotation marks.
> x = "Hello" > Hello = 12 > x  "Hello" > Hello  12
In statistical calculations, we commonly deal with multiple numbers at the same time. One of the most powerful features of R is that it permits objects to contain multiple numbers at the same time. These collections of numbers are called vectors.
Vectors can be created by enclosing multiple numbers in the c() function (c is for combine).
> x = c(1,3,5,7,10) > x  1 3 5 7 10
Note that objects by default are vectors in R. If you assign a single value to an object, it is a vector of one element, which is why display of a single-valued object is preceded by a "" indicating that it is the first (and only) value in the vector:
> x=20 > x  20
You can then perform multiple operations of the same kind on a vector with a single expression.
> x  1 3 5 7 10 > x + 2  3 5 7 9 12 > x * 10  10 30 50 70 100 > x - 5  -4 -2 0 2 5 > x / 2  0.5 1.5 2.5 3.5 5.0 > x ^ 2  1 9 25 49 100
Functions in R are similar to functions in Excel. In an expression, you call a function with function name, an open parenthesis, a set of zero or more parameters separated by commas, and then a closing parenthesis. The function then returns and object based on the parameters.
name(parameter1, parameter2, ...)
R has dozens of functions available with a default installation, and hundreds more that can be brought in from libraries that permit many different types of mathematical operations to be performed on many different types of data.
The basic descriptive statistical functions are similar to those in Excel.
> x = c(2, 5, 13, 24, 35, 40, 35, 24, 13, 5,2) > x  2 5 13 24 35 40 35 24 13 5 > max(x)  40 > min(x)  2 > mean(x)  18 > median(x)  13 > sd(x) # standard deviation  14.26184
R has functions for most common statistical operations, so analysis of data in R often simply involves finding the proper functions to use and determining what parameters those functions need.
The plot() function allows you to create visualizations of objects. By default, when passed a single vector, the plot() function draws a point graph:
> x = c(2, 5, 13, 24, 35, 40, 35, 24, 13, 5,2) > plot(x)
You can change that to a line graph by adding the parameter type="l":
> x = c(2, 5, 13, 24, 35, 40, 35, 24, 13, 5,2) > plot(x, type="l")
Geospatial Data in R
The Simple Features Package
Geospatial data objects can be loaded from files and manipulated with functions from the Simple Features (sf) library. You may also see examples on the web using the sp package, although that package is old and hard to work with, and you should use sf for new code.
The first time you use sf on a new machine, you may need to install it first from inside R:
Then, you load it with the include() function. This should also be at the beginning of every script you create in R that uses the sf package:
Loading Geospatial Data
The st_read() function will read geospatial data from most established geospatial data file formats.
For this example, we will use a GeoJSON file (2020-electoral-counties.geojson) of county-level election data that you can download here.
> counties = st_read("2020-electoral-counties.geojson", stringsAsFactors=F) Reading layer `OGRGeoJSON' from data source `/home/www/michaelminn.net/tutorials/data/2020-electoral-counties.geojson' using driver `GeoJSON' Simple feature collection with 3139 features and 48 fields geometry type: MULTIPOLYGON dimension: XY bbox: xmin: -178.9921 ymin: 18.91747 xmax: -66.9499 ymax: 71.35256 epsg (SRID): 4269 proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
You can display the available geospatial attributes in the object using the names() command:
> names(counties)  "FIPS" "Name"  "ST" "Latitude"  "Longitude" "Land.Square.Miles"  "Water.Square.Miles" "Votes.Dem.2020"  "Votes.GOP.2020" "Votes.Total.2020"  "Votes.Dem.2016" "Votes.GOP.2016"  "Votes.Total.2016" "Votes.Dem.2012"  "Votes.GOP.2012" "Votes.Total.2012"  "Senator1.Name" "Senator1.Party"  "Senator1.First.Elect" "Senator1.Last.Elect"  "Senator1.Next.Elect" "Senator1.Votes"  "Senator1.Total.Votes" "Senator2.Name"  "Senator2.Party" "Senator2.First.Elect"  "Senator2.Last.Elect" "Senator2.Next.Elect"  "Senator2.Votes" "Senator2.Total.Votes"  "Winner.2012" "Percent.Dem.2012"  "Percent.GOP.2012" "Winner.2016"  "Percent.Dem.2016" "Percent.GOP.2016"  "Winner.2020" "Percent.Dem.2020"  "Percent.GOP.2020" "Median.HH.Income"  "Median.Age" "Percent.Under.18"  "Population.2019" "Voting.Age.Population"  "Percent.Bachelors.Degree" "Population.2010"  "Percent.Rural" "Population.Density"  "geometry"
Plotting Categorical Attributes
Choropleth maps can be created from sf polygon objects with the plot() command, although you will probably need to add additional parameters to get the kind of map you are looking for.
If you simply plot() the counties, the plot() function will create tiny, unreadable plots of as many of the object's fields as it can fit on a plot:
Therefore, you must select a specific field to plot() using the text name of the field to map enclosed in quotation marks (because it is text) and square brackets. In this example we use the Winner.2012 with the name of the 2012 presidential candidate that won each county. The key.pos=1 parameter puts the legend at the bottom rather than the side:
> plot(counties["Winner.2012"], key.pos=1)
To create a map of the counties using the standard red/blue media colors,
- We can pass a vector of color names to the pal (palette) parameter.
- We also pass NA (not available) for the border parameter so the black borders do not obscure the colors of small counties.
> plot(counties["Winner.2012"], key.pos=1, pal=c("navy", "red"), border=NA)
If you want a more-traditional box legend, you can use the legend() function:
- key.pos=NULL on the plot() command to turn off the default bar legend.
- x = "bottomleft" places the legend at the bottom left corner of the plot.
- legend = c("Obama", "Romney") gives the names listed in the legend.
- fill = c("navy", "red") gives the colors that will fill the legend color boxes.
> plot(counties["Winner.2012"], key.pos=NULL, pal=c("navy", "red"), border=NA) > legend(x = "bottomleft", legend=c("Obama", "Romney"), fill=c("navy", "red"))
Plotting Quantitative Attributes
The plot() function can be used to plot quantitative attributes, although, again, you will probably need additional parameters to get what you want.
The default plot uses a blue-purple-yellow color ramp:
> plot(counties["Percent.Dem.2012"], key.pos=1, border=NA)
You can specify a different color ramp by passing a vector of the desired colors to the colorRampPalette() function. This example uses a diverging color scheme with a color ramp that extends from red to white to navy:
> plot(counties["Percent.Dem.2012"], key.pos=1, border=NA, pal=colorRampPalette(c("red", "white", "navy")))
There will likely be situations where you only want to use a specific subset of a geospatial data set.
sf objects are R data frame objects that can be subsetted in the same way as other data frames.
Parts of a data frame can be defined using conditions specified inside square brackets () following the name of the data frame.
As shown above, when using only one value in the square brackets, you select specific columns (fields).
To select specific rows (features), you need two values inside the brackets, separated by a comma. The first value specifies which rows (features) to select, and the second value specifies which columns to select. If you want to select all columns, you can leave the value after the column blank.
In the example below, the first value in the square brackets specifies to only select features where the ST (state abbreviation) attribute is IL (Illinois). The dollar sign (counties$ST) indicates that the attribute should be drawn from the counties object.
Since we want to select all columns, we just put a comma and no value before the closing square bracket.
We assign this subsetted object to a new name that we can then plot.
> illinois = counties[counties$ST == "IL", ] > plot(illinois["Percent.Dem.2012"], key.pos=1, border=NA, pal=colorRampPalette(c("red", "white", "navy")))
By default, the plot() function plots the geospatial data using an equirectangular projection that may be undesirable depending on what part of the world you are mapping and what you are using the map for.
The st_transform() function can be used to reproject a geospatial object to a new projection. The crs parameter will accept an EPSG number or a PROJ.4 string that describes the desired projection. For this map of the US, we use a PROJ.4 string for a Lambert conformal conic projection centered on the continental US.
> lambert = "+proj=lcc +lon_0=-98 +lat_1=33 +lat_2=45" > counties = st_transform(counties, crs=lambert) > plot(counties["Percent.Dem.2012"], key.pos=1, border=NA, pal=colorRampPalette(c("red", "white", "navy")))
R is software for statistical computing, so you can perform statistical operations on geospatial data. While there are specialized functions for dealing with autocorrelation in spatial data, simple non-spatial functions can be used for data exploration.
For example, there is a field for median household income (Median.HH.Income) in the data. We might stereotypically hypothesize that counties with lower income might disproportionately vote for the GOP candidate. To examine that, we can plot an X/Y scatter chart between the two attributes to see if the plot shows a correlation.
For this call to plot(), we pass the two attributes. We use the dollar sign ($) to isolate individual attributes as vectors of values.
> plot(counties$Median.HH.Income, counties$Percent.Dem.2012)
What we see, at least for this election is no obvious correlation between the two attributes.
We can use the lm() (linear model) function to create a simple linear model and see how it fits. The tilde (~) in this function says to model one attribute in terms of the other.
R2 is the coefficient of determination that indicates the strength of correlation. The values range from zero to one, and higher values indicate stronger correlation.
The R2 value of 0.00838 indicates no correlation, thus failing to corroborate our hypothesis that poorer counties tend to be more Republican. As we can see on the X/Y scatter plot, there are poor counties that vote Democratic and rich counties that vote Republican.
> model = lm(Percent.Dem.2012 ~ Median.HH.Income, data=counties) > summary(model) Call: lm(formula = Percent.Dem.2012 ~ Median.HH.Income, data = counties) Residuals: Min 1Q Median 3Q Max -34.942 -10.391 -1.322 9.048 54.595 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.330e+01 1.022e+00 32.586 < 2e-16 *** Median.HH.Income 9.678e-05 1.850e-05 5.233 1.78e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.63 on 3121 degrees of freedom (16 observations deleted due to missingness) Multiple R-squared: 0.008697, Adjusted R-squared: 0.00838 F-statistic: 27.38 on 1 and 3121 DF, p-value: 1.78e-07
Plotting to PNG Files
If you want to export a plot() from R to a PNG file that you can include in a report, you can place a png() function call before the plot() and a dev.off() function call after the plot.
The width= and height= parameters determine the size of the output image in pixels.
> png("illinois.png", width=600, height=400) > plot(illinois["Percent.Dem.2012"], key.pos=1, border=NA, pal=colorRampPalette(c("red", "white", "navy"))) > dev.off()
Another powerful advantage of a command-line console is that you can type commands into script files and run them. This makes it possible to easily repeat complex sequences of operations. And if you find an error in your script, you can fix it and rerun the script without the labor of having to repeat long sequences of button clicks that you would need with a graphical user interface like ArcMap.
R on Windows contains a simple script editor. To create a script on the PC, select File -> New Script.
On the Mac you will need a plain text editor like TextWrangler.
This video demonstrates the creation of a script using snippets from the tutorial above:
Comments in scripts are lines that the program ignores. These lines are used for documenting the authorship of scripts and for adding comments that explain what is going on when you have complex sequences of expressions and function calls.
Comments start with a pound sign (#) and tell the R interpreter to ignore everything that follows on that line.
# Name of script (date) # This is a comment that explains what the line after it does x = 2 + 2
From within the R console, you can get documentation for most functions by typing a question mark followed by the name of the function. Type:
If you have only a vague idea of what you want, you can type a double question mark and a keyword. Type:
If you have any more complex questions about the language syntax or a cryptic error message issued by the program, there is a vast amount of R information on a variety of Internet websites. The Google is your tech support for R.