Representing Data in R

Data Types

Computers represent everything internally as numbers, but R, like most programming languages, provides a variety of fundamental data types to make the representation of data that a user sees closer to the way that data is used to represent reality.

Integers

Integers are whole numbers no decimal points. They are commonly used for counts (such as population) and for indexing elements in data structures (explained below).

> x = 1 + 1
> x
[1] 2

Real Numbers (Floating Point)

Real numbers are numbers that can have decimal points. R uses the type named double for real numbers. Real numbers are used for everything that can be represented quantitatively.

> x = 1.1 + 1.0
> x
[1] 2.1
> typeof(x)
[1] "double"

Character Strings

Character data is text data with letters and words. Character data in other languages are often called strings for strings of characters. Character data needs to be enclosed in double or single quotes to make it clear to the software where the text begins and ends.

> x = "Hello, world!"
> x
[1] "Hello, world!"
> typeof(x)
[1] "character"

Logical Values

Logical values can be either TRUE or FALSE. TRUE and FALSE can be abbreviated with the letters T or F.

Logical values are commonly created as the result of comparisons, and used to select items from structures or make decisions about program flow.

> 5 > 7
[1] FALSE
> 5 < 7
[1] TRUE

Factors

Factor data is a numeric representation of text data. Factors are commonly used internally by R for categorical data to make such data easier to process and compare.

> x = as.factor(c("Alpha", "Beta", "Gamma", "Delta", "Alpha"))
> x
[1] Alpha Beta  Gamma Delta Alpha
Levels: Alpha Beta Delta Gamma
> as.integer(x)
[1] 1 2 4 3 1
> typeof(x)
[1] "integer"

Dates

Dates are also represented internally by R as numbers. While dates can be represented as character data, if you want to be able to sort or order by date, use of the date data type is helpful. Date and time handling in R is a bit messy (because the way humans write dates and times are inconsistent), and you should consult this description from Berkeley for more information.

> x = as.Date("2017-01-01")
> x
[1] "2017-01-01"
> as.integer(x)
[1] 17167
> typeof(x)
[1] "double"
> class(x)
[1] "Date"

Data Structures

R is a program for statistical analysis, and most statistical data exists in collections like tables or matrices. R uses a variety of data structures to keep collections of data together in a useful manner.

Name Dimensions Data Types
Vector 1 1
Matrix 2 1
Array 1+ 1
List 1 Mixed
Data Frame 1+ 1+
Time Series 2 2+

Vectors

Vectors are one-dimensional collections of data of the same data type. Values that are not in other types of data structures are probably vectors. Variables with one value are vectors of length one.

Vector

Creation

Vectors can be created by enclosing individual values separated by commas as parameters to the c() combine function:

> x = c(1, 3, 5, 6, 7)
> x
[1] 1 3 5 6 7
> y = c('Joe', 'Frank', 'Tom')
> y
[1] "Joe"   "Frank" "Tom"  

Visualization

Numeric vectors can be visualized as line graphs with the plot() function:

> x = c(1, 3, 5, 6, 7, 8, 8, 8)
> plot(x, type='b')
Sample Plot of a Numeric Vector

Character vectors of categorical data can be summarized into tables with the table() function, and those tables can be visualized using the barplot() function.

> states = c("LA", "MS", "AL", "NY", "OH", "VA", "CA", "CO")
> obesity = c("High", "High", "High", "Med", "Med", "Med", "Low", "Low")
> names(obesity) = states
> obesity
    LA     MS     AL     NY     OH     VA     CA     CO 
"High" "High" "High"  "Med"  "Med"  "Med"  "Low"  "Low" 
> table(obesity)
obesity
High  Low  Med 
   3    2    3 
> barplot(table(obesity))
Sample Plot of a Categorical Vector

Access and Subsetting

The length of a vector is returned by the length() function.

Individual elements or groups of elements can be accessed with numeric indexes inside square brackets, with index 1 being the first element in the vector. Indexes can also be sequences or vectors.

> x = c('Vanilla', 'Chocolate', 'Strawberry', 'Avocado', 'Anchovy')
> length(x)
[1] 5
> x[1]
[1] "Vanilla"
> x[3:5]
[1] "Strawberry" "Avocado"    "Anchovy"   
> x[3:length(x)]
[1] "Strawberry" "Avocado"    "Anchovy"   
> x[c(1,3,4)]
[1] "Vanilla"    "Strawberry" "Avocado"

The elements of a vector can be named using the names() function, and the elements accessed with either those names or numeric indices.

> x = c('Vanilla', 'Chocolate', 'Strawberry', 'Avocado', 'Anchovy')
> x[1]
[1] "Vanilla"
> x[3:5]
[1] "Strawberry" "Avocado"    "Anchovy"   
> x[c(1,3,4)]
[1] "Vanilla"    "Strawberry" "Avocado"   
> names(x) = c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon')
> x
       Alpha         Beta        Gamma        Delta      Epsilon 
   "Vanilla"  "Chocolate" "Strawberry"    "Avocado"    "Anchovy" 
> x['Beta']
       Beta 
"Chocolate" 
> x[2]
       Beta 
"Chocolate" 

Appending and Deleting

Elements can be appended to the beginning or ending of a vector by including the vector within another c() function. Elements can be deleted by setting the vector to a subset.

> x = c('Joe', 'Frank', 'Tom')
> x
[1] "Joe"   "Frank" "Tom"  
> x = c('Mary', x)
> x
[1] "Mary"  "Joe"   "Frank" "Tom"  
> x = c(x, 'Hailey')
> x
[1] "Mary"   "Joe"    "Frank"  "Tom"    "Hailey"
> x = x[1:3]
> x
[1] "Mary"  "Joe"   "Frank"

Operations

One of the strengths of R is the ability to perform operations all elements of a vector (or other data structure) with single statements:

> x = c(1, 3, 5, 6, 7)
> y = c(100, 200, 300, 400, 500)
> x + y
[1] 101 203 305 406 507
> x * y
[1]  100  600 1500 2400 3500
> x + 1
[1] 2 4 6 7 8
> sum(x)
[1] 22

Data Frames

A data frame is a two-dimensional data structure arranged into rows and columns like a spreadsheet. The structure can contain multiple data types, but all elements in a single column must be the same data type.

Data Frame

Creation

Data frames can be created directly from vectors using the data.frame() function.

> names = c('LA', 'MS', 'AL', 'NY', 'OH', 'VA', 'CA', 'CO')
> obesity = c('High', 'High', 'High', 'Med', 'Med', 'Med', 'Low', 'Low')
> density = c(106, 64, 95, 416, 283, 207, 244, 50)
> states = data.frame(names, obesity, density)
> states
  names obesity density
1    LA    High     106
2    MS    High      64
3    AL    High      95
4    NY     Med     416
5    OH     Med     283
6    VA     Med     207
7    CA     Low     244
8    CO     Low      50

Import and Export

Data frames are commonly imported from comma-separated-value (CSV) files using the read.csv() function. CSV files can be created in and exported from spreadsheet programs like Excel. The as.is=T parameter prevents the function from converting character strings to factors, which makes them easier to display. The CSV file for this example can be downloaded HERE.

> states = read.csv('2017-state-data.csv', as.is=T)
> states
   ST                State WIN2012 DEM2012 PCDEM2012 GOP2012 PCGOP2012 WIN2016
1  AK               Alaska  Romney   91696      41.6  121234      55.0   Trump
2  AL              Alabama  Romney  793620      38.4 1252453      60.7   Trump
3  AR             Arkansas  Romney  391953      36.9  643717      60.6   Trump
4  AZ              Arizona  Romney  900081      44.1 1107130      54.2   Trump
5  CA           California   Obama 6241648      59.2 4046524      38.4 Clinton
6  CO             Colorado   Obama 1238490      51.2 1125391      46.5 Clinton
7  CT          Connecticut   Obama  912531      58.4  631432      40.4 Clinton
8  DC District of Columbia   Obama  222332      91.4   17337       7.1 Clinton
9  DE             Delaware   Obama  242547      58.6  165476      40.0 Clinton
...

Data frames can be written to CSV files using the analogous write.csv() function.

Visualization

The parallel cross-sectional or time-series data in a data frame permits a wide variety of different visualizations of relationships between columns, only a handful of which are covered here.

Single columns can be visualized using the same visualizations used with vectors.

> states = read.csv('2017-state-data.csv')
> hist(states$PCRURAL)
Histogram of Percentage of Residents in Rural Areas By State (USCB 2015)

The default plot() of a data frame with two numeric columns creates an X-Y scatter plot, with the leftmost column as the X-axis:

plot(states[,c('PCRURAL', 'PCDEM2012')])
X-Y Scatter Plot of Percent Rural Residents vs Percent Democratic Vote By State In 2012 (USCB 2015, AP via Politico 2012)

The default plot() of a data frame with a categorical and numeric column creates a boxplot giving the ranges of values in each category:

plot(states[,c('WIN2012', 'PCRURAL')])
X-Y Scatter Plot of Percent Rural Residents vs Presidential Winner By State In 2012 (USCB 2015, AP via Politico 2012)

The default plot() of a data frame with multiple numeric columns is a pairplot with X-Y scatter plots for all possible pairs of columns that can be used to investigate possible relationships between variables.

plot(states[,c('PCDEM2012', 'PCRURAL', 'AREASQMI')])
Example Pair Plot

Access and Subsetting

As with vectors, elements can be accessed with numeric indices inside square brackets, with the first number as the row number and the second number as the column number, starting with 1 as the top row or leftmost column. Numbers can be ranges of numbers or vectors of numbers. An entire row or all columns can be accessed by leaving the numeric index blank.

> names = c('LA', 'MS', 'AL', 'NY', 'OH', 'VA', 'CA', 'CO')
> obesity = c('High', 'High', 'High', 'Med', 'Med', 'Med', 'Low', 'Low')
> density = c(106, 64, 95, 416, 283, 207, 244, 50)
> states = data.frame(names, obesity, density)

> dim(states)
[1] 8 3

> states[1,1]
[1] LA

> states[1:5,1:2]
  names obesity
1    LA    High
2    MS    High
3    AL    High
4    NY     Med
5    OH     Med

> states[1:5,]
  names obesity density
1    LA    High     106
2    MS    High      64
3    AL    High      95
4    NY     Med     416
5    OH     Med     283

> states[,1:2]
  names obesity
1    LA    High
2    MS    High
3    AL    High
4    NY     Med
5    OH     Med
6    VA     Med
7    CA     Low
8    CO     Low

Columns in data frames are always named, and columns can be accessed by name. Column names can be displayed using the names() or colnames() functions. Individual columns can be accessed using the dollar sign ($) operator. Column names can also be used in place of column indices when using square brackets

> names(states)
[1] "names"   "obesity" "density"

> colnames(states)
[1] "names"   "obesity" "density"

> states$obesity
[1] High High High Med  Med  Med  Low  Low 
Levels: High Low Med

> states[1:2,'obesity']
[1] High High
Levels: High Low Med

> states[1:2,c('names','obesity')]
  names obesity
1    LA    High
2    MS    High

Comparisons with column names can be used to create vetors of logical values that can select specific rows from a data frame

> states == 'High'
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
>
> states[states$obesity == 'High',]
  names obesity density
1    LA    High     106
2    MS    High      64
3    AL    High      95

Appending and Deleting

New columns can be added to a data frame directly by setting a non-existent column

> pcdem2012 = c(40.6, 43.5, 38.4, 62.6, 50.1, 50.8, 59.2, 51.2)
> win2012 = c('Romney', 'Romney', 'Romney', 'Obama', 'Obama', 'Obama', 'Obama', 'Obama')
> states$pcdem2012 = pcdem2012
> states[,"win2012"] = win2012
> states
  names obesity density pcdem2012 win2012
1    LA    High     106      40.6  Romney
2    MS    High      64      43.5  Romney
3    AL    High      95      38.4  Romney
4    NY     Med     416      62.6   Obama
5    OH     Med     283      50.1   Obama
6    VA     Med     207      50.8   Obama
7    CA     Low     244      59.2   Obama
8    CO     Low      50      51.2   Obama

Two data frames can be combined using the cbind() function

> pcdem2012 = c(40.6, 43.5, 38.4, 62.6, 50.1, 50.8, 59.2, 51.2)
> win2012 = c('Romney', 'Romney', 'Romney', 'Obama', 'Obama', 'Obama', 'Obama', 'Obama')
> newcols = data.frame(pcdem2012, win2012)
> states = cbind(states, newcols)
> states
  names obesity density pcdem2012 win2012
1    LA    High     106      40.6  Romney
2    MS    High      64      43.5  Romney
3    AL    High      95      38.4  Romney
4    NY     Med     416      62.6   Obama
5    OH     Med     283      50.1   Obama
6    VA     Med     207      50.8   Obama
7    CA     Low     244      59.2   Obama
8    CO     Low      50      51.2   Obama
> newcols
  pcdem2012 win2012
1      40.6  Romney
2      43.5  Romney
3      38.4  Romney
4      62.6   Obama
5      50.1   Obama
6      50.8   Obama
7      59.2   Obama
8      51.2   Obama

Rows or columns can be deleted by setting the data frame to a subset of itself.

> states
  names obesity density pcdem2012 win2012
1    LA    High     106      40.6  Romney
2    MS    High      64      43.5  Romney
3    AL    High      95      38.4  Romney
4    NY     Med     416      62.6   Obama
5    OH     Med     283      50.1   Obama
6    VA     Med     207      50.8   Obama
7    CA     Low     244      59.2   Obama
8    CO     Low      50      51.2   Obama
> states = states[,c(1,4,5)]
> states
  names pcdem2012 win2012
1    LA      40.6  Romney
2    MS      43.5  Romney
3    AL      38.4  Romney
4    NY      62.6   Obama
5    OH      50.1   Obama
6    VA      50.8   Obama
7    CA      59.2   Obama
8    CO      51.2   Obama

Operations

Operations can be performed on individual columns of a data frame as if they are vectors. The results of these operations can be added as new columns to the data frame to keep all the rows together.

> city = c('Dallas', 'Fort Worth', 'Houston', 'San Antonio')
> votes = c(711612, 610890, 1188585, 520288)
> obama = c(405571, 253071, 587044, 264856)
> texas = data.frame(city, votes, obama)
> 
> texas$pcdem = round(100 * texas$obama / texas$votes, 2)
> texas
         city   votes  obama pcdem
1      Dallas  711612 405571 56.99
2  Fort Worth  610890 253071 41.43
3     Houston 1188585 587044 49.39
4 San Antonio  520288 264856 50.91

Lists

Lists are one-dimensions structures like vectors except individual members of a list can have different data types, including vectors or other lists. Data frames are actually lists of parallel vectors.

List

Creation

Lists can be created by combining elements with the list() function, which is similar to the c() function with vectors, except the elements combined into the list can be of different types.

> x = list("Alpha", "Beta", c("Gamma", "Delta"), list("Epsilon", 100))
> x
[[1]]
[1] "Alpha"

[[2]]
[1] "Beta"

[[3]]
[1] "Gamma" "Delta"

[[4]]
[[4]][[1]]
[1] "Epsilon"

[[4]][[2]]
[1] 100

Visualization

Lists can be passed to plotting functions if they have members named the same as the plotting function parameters.

> xy = list(x = 0:20, y = sin(pi * (0:20) / 10))
> plot(xy, type="b")
Example Plot of a List

Access and Subsetting

List elements are accessed as slices or references.

List slices are sublists, and are specified with single square braces. The result is a list rather than an individual member.

> x = list("Alpha", "Beta", c("Gamma", "Delta"), list("Epsilon", 100))
> x[1]
[[1]]
[1] "Alpha"

> x[3:4]
[[1]]
[1] "Gamma" "Delta"

[[2]]
[[2]][[1]]
[1] "Epsilon"

[[2]][[2]]
[1] 100

To access or change the contents individual member of a list, you should use double square brackets.

> x = list("Alpha", "Beta", c("Gamma", "Delta"), list("Epsilon", 100))
> x[[1]]
[1] "Alpha"
> x[[1]] = "Omega"
> x[[1]]
[1] "Omega"

Members of lists can be named and accessed via those names, either with double square brackets or with the dollar sign operator.

> x = list(1, 2, "Joe", "Frank", 3.2)
> names(x) = c('ID', 'Party', 'Name', 'Mentor', 'Score')
> x
$ID
[1] 1

$Party
[1] 2

$Name
[1] "Joe"

$Mentor
[1] "Frank"

$Score
[1] 3.2

> x[['Name']]
[1] "Joe"
> x$Party
[1] 2

Appending and Deleting

To add elements to a list, you can set a member that doesn't exist (indexed by number or name), or use the c() function.

> x = list("Alpha", "Beta")
> x
[[1]]
[1] "Alpha"

[[2]]
[1] "Beta"
> x[length(x)] = "Omega"
> x['Season'] = "Spring"
> x = c("First", x)
> x
[[1]]
[1] "First"

[[2]]
[1] "Alpha"

[[3]]
[1] "Omega"

$Season
[1] "Spring"

You can delete elements by setting a slice of the list (single square brackets) to NULL:

> x = list("Alpha", "Beta", "Gamma")
> x[1:2] = NULL
> x
[[1]]
[1] "Gamma"

Operations

The lapply() function should be used if you need to perform an operation on every member of a list. Because lists are complex, mMathematical operations cannot be performed directly on members in the same manner as vectors or matrices.

> x = list(1, 3, 5, 8)

> x * 10
Error in x * 10 : non-numeric argument to binary operator

> lapply(x, function(z) z + 10)
[[1]]
[1] 11

[[2]]
[1] 13

[[3]]
[1] 15

[[4]]
[1] 18

Arrays and Matrices

Matrices are two-dimensional collections of a single data type. Arrays are collections of a single data type that can have any number of dimensions. Matrices are commonly used in complex mathematical operations.

Matrices and arrays are commonly output by functions as a componet of analysis. Notably, Matrices are used to represent pixels in images and raster GIS data.

Matrix and Array

Creation

Matrices can be created directly from vectors of data using the matrix() function and specifying the number of rows and/or columns, and the order in which the data appears in the vector (byrow). Array creation with the array() function is similar, except the dimensions for each dimension is defined by the dim parameter, which is a vector with the lengths for each dimension.

> m = matrix(data = 1:12, ncol = 3, byrow = T)
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
> m = matrix(data = 1:12, ncol = 3, byrow = F)
> m
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> a = array(data = 1:12, dim = c(2,2,3))
> a
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12

Visualization

Matrices can be visualized as 2-D colored surfaces using the image() function.

> x = as.matrix(read.csv("matrix-file.csv"))
> image(x, col=gray(seq(0,1,1/256)))

They can also be visualized as 3-D surfaces using the persp3D() function from the plot3D library.

Example Image Visualization of a 2-D Matrix
> library(plot3D)
> x = as.matrix(read.csv("matrix-file.csv"))
> persp3D(z = x)
Example persp3D Visualization of a 2-D Matrix

Import and Export

Matrices can be imported from CSV files by converting the output of read.csv() using the as.matrix() function. The file in the example below is HERE.

> x = as.matrix(read.csv("matrix-file.csv"))
> x
       X V1  V2  V3  V4  V5  V6 V7 V8
 [1,]  1  2   8  17  25  28  26 16  0
 [2,]  2 21  43  63  74  74  60 35  0
 [3,]  3 39  77 106 120 116  93 52  0
 [4,]  4 56 107 145 162 154 121 67  0
 [5,]  5 70 134 179 197 186 145 80  0
 [6,]  6 82 155 205 225 211 163 89  0
 [7,]  7 91 170 224 244 227 175 95  0
 [8,]  8 96 179 234 254 235 180 97  0
 [9,]  9 97 180 235 254 234 179 96  0
[10,] 10 95 175 227 244 224 170 91  0
[11,] 11 89 163 211 225 205 155 82  0
[12,] 12 80 145 186 197 179 134 70  0
[13,] 13 67 121 154 162 145 107 56  0
[14,] 14 52  93 116 120 106  77 39  0
[15,] 15 35  60  74  74  63  43 21  0
[16,] 16 16  26  28  25  17   8  2  0

Matrices of pixels can also be read from a variety of image file formats. One lossless file format is the Portable Network Graphics (PNG) file. This PNG file in the example below is HERE.

> library(png)
> x = readPNG("tux-small-bw.png")
> class(x)
[1] "matrix"
> image(x, col=gray(seq(0,1,1/256)))
Example Matrix Imported From a PNG File

Access and Subsetting

Matrices can be accessed using square bracket array notation:

> m = as.matrix(read.csv("matrix-file.csv"))
> m
       X V1  V2  V3  V4  V5  V6 V7 V8
 [1,]  1  2   8  17  25  28  26 16  0
 [2,]  2 21  43  63  74  74  60 35  0
 [3,]  3 39  77 106 120 116  93 52  0
 [4,]  4 56 107 145 162 154 121 67  0
 [5,]  5 70 134 179 197 186 145 80  0
 [6,]  6 82 155 205 225 211 163 89  0
 [7,]  7 91 170 224 244 227 175 95  0
 [8,]  8 96 179 234 254 235 180 97  0
 [9,]  9 97 180 235 254 234 179 96  0
[10,] 10 95 175 227 244 224 170 91  0
[11,] 11 89 163 211 225 205 155 82  0
[12,] 12 80 145 186 197 179 134 70  0
[13,] 13 67 121 154 162 145 107 56  0
[14,] 14 52  93 116 120 106  77 39  0
[15,] 15 35  60  74  74  63  43 21  0
[16,] 16 16  26  28  25  17   8  2  0
> m[2,5]
V4 
74 
> m[1:4, 1:4]
     X V1  V2  V3
[1,] 1  2   8  17
[2,] 2 21  43  63
[3,] 3 39  77 106
[4,] 4 56 107 145

Appending and Deleting

Rows or columns can be appended to a matrix using the rbind() or cbind() functions, respectively.

Rows or columns can be deleted by setting the data frame to a subset of itself.

> m = matrix(1:16, ncol=4)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

> m = cbind(m, 101:104)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13  101
[2,]    2    6   10   14  102
[3,]    3    7   11   15  103
[4,]    4    8   12   16  104

> m = rbind(m, 201:205)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13  101
[2,]    2    6   10   14  102
[3,]    3    7   11   15  103
[4,]    4    8   12   16  104
[5,]  201  202  203  204  205

> m = m[2:4, 2:4]
> m
     [,1] [,2] [,3]
[1,]    6   10   14
[2,]    7   11   15
[3,]    8   12   16

Operations

Matrices and vectors can be used directly in mathematical operations:

> x = matrix(1:4, ncol=2)
> y = matrix(seq(10, 40, 10), ncol=2)

> x + 10
     [,1] [,2]
[1,]   11   13
[2,]   12   14

> x * 10
     [,1] [,2]
[1,]   10   30
[2,]   20   40

> x + y
     [,1] [,2]
[1,]   11   33
[2,]   22   44

> x * y
     [,1] [,2]
[1,]   10   90
[2,]   40  160

Note that these are direct element-by-element operations. Linear algebra matrix multiplication uses the %*% operator.

> x %*% y
     [,1] [,2]
[1,]   70  150
[2,]  100  220

> z = c(1000, 2000)
> x %*% z
      [,1]
[1,]  7000
[2,] 10000

Spatial Objects

Objects in R are complex data structures that contain slots accessed with the @ operator. A slot can be any data type, including vectors, data frames or lists.

Objects are used with spatial data to keep all the disparate variables together in one collection. This practice of encapsulating data sets into complex structures that are dealt with as single entities is called object-oriented programming.

The slotNames() function displays the slots in an object. The SpatialPointsDataFrame object has the following slots:

> library(rgdal)
> data = read.csv("largest-us-cities.csv", as.is=T)
> cities = SpatialPointsDataFrame(data[,c("LONG","LAT")], data = data, proj4string = wgs84)
> slotNames(cities)
[1] "data"        "coords.nrs"  "coords"      "bbox"        "proj4string"
SpatialPointsDataFrame Object

Creation

The sp library provides a set of related objects for representing the three fundamental geospatial vector data types: points, lines, polygons.

SpatialPointsDataFrame

SpatialPointsDataFrame objects are used to represent sets of specific locations on the surface of the earth.

Spatial Points

A spatial points object can be created with the SpatialPointsDataFrame() function, which minimally needs:

library(sp)
coords = matrix(data = c(
     -117.418692, 47.662245,
     -117.422896, 47.666241,
     -117.437009, 47.658582,
     -117.415427, 47.656382,
     -117.402272, 47.667440), ncol=2, byrow=T)

attributes = data.frame(name = c(
     'Riverfront Park',
     'Spokane Arena',
     'Peaceful Valley',
     'Spokane Amtrak',
     'Gonzaga University'), stringsAsFactors = F)

wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

points = SpatialPointsDataFrame(coords, data = attributes, proj4string = wgs84)
plot(points, pch=19, col="navy")
text(points@coords, labels = points$name, pos=3)
Example of Spatial Points - Landmarks in Spokane, WA

SpatialLinesDataFrame

SpatialLinesDataFrame objects are used to represent linear features like roads, streams, routes, etc.

Spatial Lines
main = matrix(byrow=T, ncol=2, data=c(
     -117.434318, 47.656684, 
     -117.426615, 47.659040, 
     -117.409202, 47.659105))

i90 = matrix(byrow=T, ncol=2, data=c(
     -117.379622, 47.653927, 
     -117.387614, 47.653876, 
     -117.395425, 47.652221, 
     -117.403112, 47.652682,
     -117.428378, 47.652735,
     -117.443118, 47.650017))

division = matrix(byrow=T, ncol=2, data=c(
     -117.411288, 47.649558, 
     -117.411121, 47.671755))

attributes = data.frame(
     name = c('Main Street', 'I-90', 'Division Street'),
     lanes = c(4, 6, 4))
lineslist = list(Lines(Line(main), ID=1),
     Lines(Line(i90), ID=2),
     Lines(Line(division), ID=3))

wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

streetlines = SpatialLines(lineslist, proj4string = wgs84)

streets = SpatialLinesDataFrame(streetlines, attributes)

plot(streets, col='gray', lwd=streets$lanes)
plot(points, pch=19, col='navy', add=T)
text(points@coords, labels = points$name, pos=3)
Example of Spatial Lines - Major Streets in Downtown Spokane, WA

SpatialPolygonsDataFrame

SpatialPolygonsDataFrame objects are used to represent areas. While polygons are simple collections of connected points, spatial polygons are more complex because of the way areas relate to each other. For example, an area like a state may be composed of multiple land areas (like islands), and may contain holes (like water features).

Spatial Points
coords = matrix(ncol=2, byrow=T, data=c(
     -117.434654, 47.656791, 
     -117.425077, 47.661058, 
     -117.403602, 47.661512, 
     -117.403271, 47.652620, 
     -117.434916, 47.652273))

attributes = data.frame(name='Downtown')

wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

polygon = Polygon(coords)
polygons = Polygons(list(polygon), ID=1)
sppolygons = SpatialPolygons(list(polygons), proj4string = wgs84)
downtown = SpatialPolygonsDataFrame(sppolygons, attributes)

plot(downtown, col='beige', border=NA, xlim=streets@bbox[1,], ylim=streets@bbox[2,])
plot(streets, col='gray', lwd=streets$lanes, add=T)
plot(points, pch=19, col='navy', add=T)
text(points@coords, labels = points$name, pos=3)
Example of Spatial Polygon - Downtown Spokane, WA

Import and Export

RGDAL

Spatial objects tend to be complex, and except for trivial cases, they are generally imported and exported to and from files rather than directly generated in R code.

Spatial points, lines and polygons can be imported from a number of different formats using the readOGR() function from the rgdal Geospatial Data Abstraction Library (GDAL/OGR).

Shapefiles are an old geospatial data format introduced by ESRI in 1998, that still remains a least-common-denominator (safe) standard for distributing geospatial data to the general public.

The following example imports a shapefile of US state data and displays it as a choropleth with states colored by the winner of the 2012 presidential election. This data is available HERE.

library(rgdal)
states = readOGR('.', '2017-state-data')
states = states[!(states$ST %in% c('AK', 'HI')),]

color = ifelse(states$WIN2012 == 'Obama', 'navy', 'darkred')

usa_albers = CRS("+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 
     +datum=NAD83 +units=m +no_defs")
states = spTransform(states, usa_albers)

plot(states, col=color)
legend('bottomleft', legend=c('Obama', 'Romney'), pch=15, col=c('navy', 'darkred'))
Results From the 2012 US Presidential Election (AP via Politico)

Projections

While the WGS 84 coordinate system used by GPS is perhaps the most common coordinate system, data from government sites will sometimes be in older coordinate systems, or coordinate systems (like one of the state plane coordinate systems) that are more-appropriate for the way that data is used by agencies and technicians in the field.

Maps are commonly created with layers of features overlaid on top of each other, and you should make sure that the coordinate systems match. The spTransform() function can be used to tranform data in one coordinate system to another.

Displaying geospatial data in WGS 84 tends to stretch distances horizontally the further you move from the equator. The US map above uses the coordinate system from the Albers Equal Area Conic projection so states display with approximately the same amount of area they actually have.

CSV Files of Points

Points can be imported from spreadsheet comma-separated variable (CSV) files. The CSV file used in the example below is available HERE.

library(rgdal)
data = read.csv("largest-us-cities.csv", as.is=T)
cities = SpatialPointsDataFrame(data[,c("LONG","LAT")], data = data, proj4string = wgs84)

usa_albers = CRS("+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 
     +datum=NAD83 +units=m +no_defs")
cities = spTransform(cities, usa_albers)

states = readOGR('.', '2017-state-data')
states = states[!(states$ST %in% c('AK', 'HI')),]
states = spTransform(states, usa_albers)

plot(states)
plot(cities, pch=19, col="navy", add=T)
text(cities, labels=cities$NAME, pos=3, font=2, cex=0.8, col="white")
text(cities, labels=cities$NAME, pos=3, font=2, cex=0.7)
10 Largest US Cities (USCB 2015)

Access and Subsetting

Individual shapes in Spatial* objects can be accessed using square brackets in the same manner as a data frame, while leaving the second (column) dimension empty.

Although the attribute data frame of a Spatial* object is in the @data slot, the $ operator can be used directly with the name of the object to access attribute values.

You can get a list of the column names (attribute / field names) with the names() command.

> library(rgdal)
> states = readOGR('.', '2017-state-data')
> names(states)
 [1] "ST"         "State"      "WIN2012"    "DEM2012"    "PCDEM2012" 
 [6] "GOP2012"    "PCGOP2012"  "WIN2016"    "DEM2016"    "PCDEM2016" 
[11] "GOP2016"    "PCGOP2016"  "CHANGEDEM"  "CHANGEGOP"  "ELECTORAL" 
[16] "PCELECTORA" "GOVERNOR"   "POP2014"    "PCPOP2014"  "WHITE2014" 
[21] "PCWHITE"    "AREASQMI"   "POPSQMI"    "POP2010"    "URBAN2010" 
[26] "RURAL2010"  "PCRURAL"   
> 
> states$GOVERNOR
 [1] DEM   GOP  GOP  GOP  GOP  GOP  GOP  DEM  GOP  GOP  DEM  DEM  DEM  DEM 
[16] GOP  GOP  DEM  GOP  IND  GOP  GOP  DEM  GOP  DEM  DEM  GOP  GOP  GOP  GOP 
[31] GOP  GOP  GOP  DEM  GOP  GOP  GOP  GOP  DEM  DEM  GOP  DEM  DEM  GOP  GOP 
[46] DEM  DEM  GOP  GOP  DEM  GOP 
Levels: DEM GOP IND
> 
> par(mfcol=c(1,2)) 
> plot(states[states$ST == 'WA', ])
> plot(states[3:4,])

Appending and Deleting

New features can be added to a Spatial* object using the rbind() function, although the new features should have the exact same

Adding new columns to the attribute table is somewhat easier, although you will need to exercise caution to not affect the order of rows in the @data table and lose the relationship between the attributes and their respective shapes.

> library(rgdal)
> states = readOGR('.', '2017-state-data')
> states = states[!(states$ST %in% c('AK', 'HI')),]
> usa_albers = CRS('+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 
+          +datum=NAD83 +units=m +no_defs')
> states = spTransform(states, usa_albers)
> 
> states$POPRANK = order(states$POP2014)
> spplot(states, zcol='POPRANK')
> states$POPCHANGE = states$POP2014 - states$POP2010
> states$POPCHANGE
 [1] 812964  32013 560482 220103  31882  38115  31761  43905  67677 114224
[11]   5384  79736  39149 216228  69269 105260 946472 184107  37706 169499
[21]  31118 168384  58609  17956  29276  20906 214922  23876    174 109662
[31]  17048  16955  67500 101909  20528  94226 174583    887  11625  19126
[41]    685  37942  32334  56350    617  29828  61033   4599  82480
> spplot(states, zcol='POPCHANGE')

A far more common operation is an attribute join, where values from a related table are added to the attribute table based on a join key that associates them with the appropriate shape. Joins are performed with the merge() function from the sp library, which has a similar syntax to the merge() function in base R.

The example below merges obesity data with the state shapefile to create a choropleth.

> library(rgdal)
> states = readOGR('.', '2017-state-data', stringsAsFactors=F)
> states = states[!(states$ST %in% c('AK', 'HI')),]
> usa_albers = CRS('+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 
+          +datum=NAD83 +units=m +no_defs')
> states = spTransform(states, usa_albers)
> 
> st = c( 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
+ 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
+ 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR',
+ 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY')
> 
> obesity = c(33.5, 29.7, 28.9, 35.9, 24.7, 21.3, 26.3, 30.7, 21.7, 26.2, 30.5, 22.1, 28.9,
+ 29.3, 32.7, 30.9, 31.3, 31.6, 34.9, 28.2, 29.6, 23.3, 30.7, 27.6, 35.5, 30.2,
+ 26.4, 30.2, 27.7, 27.4, 26.9, 28.4, 27, 29.7, 32.2, 32.6, 33, 27.9, 30.2, 27,
+ 32.1, 29.8, 31.2, 31.9, 25.7, 24.8, 28.5, 27.3, 35.7, 31.2, 29.5)
> 
> states = merge(states, data.frame(st, obesity), by.x = 'ST', by.y = 'st')
> spplot(states, zcol='obesity')
Obesity Rates by State (CDC 2015)

Visualization

Geospatial data is most commonly (but not always) visualized with maps, as shown in the examples above.

The plot() function is usually adequate for simple maps, as demonstrated above and below. A spplot() function is provided by the sp library to make quick thematic maps, although you can create more attractive maps using plot() and only a minimal number of additional functions and parameters.

Attribute data is stored in the @data slot in the Spatial* objects and can be accessed with the $ operator, that data can be analyzed and visualized just like data in any other data frame. For example, the X-Y scatter plot below shows the inverse relationship by state between percentage of residents in rural areas and the percentage of the vote for President Obama in 2012. This data can be downloaded HERE.

> states = readOGR('.', '2017-state-data')
> plot(states$PCRURAL, states$PCGOP2012)
> m = lm(PCDEM2012 ~ PCRURAL, data=states)
> abline(m, lwd=3, col="darkred")
Percent Rural Residents vs. Percent Democratic Vote by State (USCB 2015, AP via Politico 2012)

Spatial Raster Objects

The Raster* objects from the raster library are used to store raster geospatial data.

A raster is a (two-dimensional) matrix forming a regular grid of values over an area on the surface of the earth.

Rasters are commonly used with remotely-sensed data from satellites, which are usually images from various bands (frequencies) of light.

Full-color raster images need three overlapping raster layers representing red, green, and blue light frequencies, respectively.

Rasters can also be generated from aerial photography, or from point cloud data generated by aerial laser scanners (lidar).

Creation

A RasterLayer object is comprised of a single raster layer created with the raster() function.

A RasterStack is collection of multiple RasterLayer objects with the same spatial extent and resolution, created with the stack() function.

A RasterBrick is a multi-layer raster object created with the brick().