# Spatial Autocorrelation and Regression

Similar objects tend to cluster together in space,
something captured by Waldo Tobler's (1979) first "law,"
*Everything is related to everything else, but near things
are more related to each other.* When applied to
values measured at different points or in different areas,
this clustering results in **spatial autocorrelation**.

Regression models are built on the assumption that
the values of the dependent variable are **independent** from each other.
Variables that exhibit spatial autocorrelation violate this assumption, which means
regression models built using those variables will be biased and
statistically invalid. This can cause an understatement or overstatement of the
importance of the relationships between the dependent and independent variables.

This tutorial will give an overview of how to test for spatial autocorrelation and build regression models that adjust for spatial autocorrelation.

## Example Data

The examples will model
median monthly housing costs using 2015 American Community
Survey data from 2011-2014 for census tracts in the New York
metropolitan area (CBSA). That data is available as
a zipped shapefile HERE.
You can then load that data into R with the **readOGR()**
function from the **rgdal** library:

library(rgdal) metro = readOGR(dsn='.', layer='nyc-metro', stringsAsFactors=F)

*Moran's I*

** Moran's I** is a commonly-used measure of spatial autocorrelation,
based on work that
Patrick Alfred Pierce Moran published in 1950.
However, Li, Calder and Cressie
(2007)
point out that although the technique was fully developed
much later and named after Moran by
Andrew David Cliff and J.K. Ord (1981),
based on work by
James Durbin and Geoffrey Watson (1950).

The range of *Moran's I* is -1 to +1:

- -1 means all values are evenly dispersed
- 0 is perfect randomness (no autocorrelation)
- 1 has similar values all clustered together

# Create the x/y coordinates for the grid x = rep(1:10, 10) y = rep(1:10, each=10) # Create a grid neighbor list library(spdep) neighbors = cell2nb(10, 10, type="rook", torus=FALSE) # Display four example plots in a single window par(mfrow = c(2,2)) par(mar = c(1, 1, 2, 1)) # Drawing function draw_grid = function(z, title) { plot(neighbors, data.frame(x, y), points=F, col="#f0f0f0") mtext(title, line=0.5, font=2, cex=0.9) text(x, y, z, font=2, cex=0.9) moran.test(z, nb2listw(neighbors)) } # Create and draw examples z = c(rep(c(rep(1, 5), rep(2, 5)), 5), rep(c(rep(3, 5), rep(4, 5)), 5)) draw_grid(z, "Highly Clustered (I = 0.89)") z = rep(c(rep(0:1, 5), rep(1:0, 5)), 5) draw_grid(z, "Perfectly Dispersed (I = -1)") z = rep(0:1, 50) draw_grid(z, "Perfectly Random (I = 0)") set.seed(2) z = as.integer(runif(100, 1, 10)) draw_grid(z, "Highly Random (I = -0.05)")

## Testing for Spatial Autocorrelation Using Moran's I

Making a quick choropleth to view the spatial distribution of median household income, we see indication of spatial autocorrelation with clear clusters of both high and low income around the NYC metropolitan area.

spplot(metro, zcol='MEDHOUSING', col="transparent")

Before performing spatial autocorrelation analysis, you should remove
all shapes that have NA as the value for any of the variables you will be using.
Failure to do this will result in the *missing values in object* error.

metro = metro[!(is.na(metro$POPULATION) | is.na(metro$MEDAGE) | is.na(metro$MEDHHINC) | is.na(metro$MEDHOUSING) | is.na(metro$DENSITY) | is.na(metro$PCPREWAR)),]

To get the *Moran's I* statistic for this variable, we must
first answer the ancient question,
*Who is my neighbor*?

The **poly2nb()** function in the spdep package takes an object
of polygons, and for each polygon creates an **nb** list of the polygons
that are adjacent to that polygon.

The **nb2listw()** function then takes that nb list and creates a
weighted list. By default all neighbors are treated equally, but nb2listw()
provides the ability to treat some regions as more important than others,
such as when dealing with areas of uniquely important economic activity.

The **zero.policy=T** parameter indicates that locations with no neighbors (such as islands)
will be passed rather than throwing the *Empty neighbour sets found* error.

library(spdep) neighbors = poly2nb(metro) weighted_neighbors = nb2listw(neighbors, zero.policy=T)

That neigbor list can then be passed to the **moran.test()** function
along with the variable being analyzed to get the *Moran's I* statistic.
As indicated above, positive statistic value indicate autocorrelation:

> moran.test(metro$MEDHHINC, weighted_neighbors, zero.policy=T) Moran I test under randomisation data: metro$MEDHHINC weights: weighted_neighbors Moran I statistic standard deviate = 78.728, p-value < 2.2e-16 alternative hypothesis: greater sample estimates:Moran I statisticExpectation Variance0.6848731034-0.0002179124 0.0000757239

The **moran.plot()** function can be used to visualize
and diagnose the autocorrelation. The x axis is the values
and the y axis are the spatially-lagged (adjacent) values.
The fairly-clear upward pattern from left to right indicates
autocorrelation.

moran.plot(metro$MEDHOUSING, weighted_neighbors, zero.policy=T)