Spatial Autocorrelation and Regression

Median Monthly Housing Costs by Census Tract in the New York CBSA (USCB 2015)

Similar objects tend to cluster together in space, something captured by Waldo Tobler's (1970) first "law" of geography: Everything is related to everything else, but near things are more related than distant things. When applied to values measured at different points or in different areas, this clustering results in spatial autocorrelation.

Regression models are built on the assumption that the values of the dependent variable are independent of each other. Variables that exhibit spatial autocorrelation violate this assumption, so regression models built using those variables will be biased and statistically invalid. This can cause an understatement or overstatement of the strength of the relationships between the dependent and independent variables.

This tutorial will give an overview of how to test for spatial autocorrelation and build regression models that adjust for spatial autocorrelation.

Example Data

The examples will model median monthly housing costs using 2015 American Community Survey data from 2011-2014 for census tracts in the New York metropolitan area (CBSA). That data is available as a zipped shapefile HERE. You can then load that data into R with the readOGR() function from the rgdal library:

library(rgdal)
metro = readOGR(dsn='.', layer='nyc-metro', stringsAsFactors=F)

Moran's I

Moran's I is a commonly used measure of spatial autocorrelation, named for a 1950 paper by Patrick Alfred Pierce Moran. However, Li, Calder and Cressie (2007) point out that the technique was fully developed much later by Andrew David Cliff and J.K. Ord (1981), who named it after Moran, building on work by James Durbin and Geoffrey Watson (1950).

The range of Moran's I is -1 to +1: values near +1 indicate clustering of similar values, values near 0 indicate a spatially random pattern, and values near -1 indicate dispersion (dissimilar values adjacent to each other):

Moran's I Example Cases
# Create the x/y coordinates for the grid

x = rep(1:10, 10)
y = rep(1:10, each=10)

# Create a grid neighbor list

library(spdep)
neighbors = cell2nb(10, 10, type="rook", torus=FALSE)

# Display four example plots in a single window

par(mfrow = c(2,2))
par(mar = c(1, 1, 2, 1))

# Drawing function

draw_grid = function(z, title) {
	plot(neighbors, data.frame(x, y), points=F, col="#f0f0f0")
	mtext(title, line=0.5, font=2, cex=0.9)
	text(x, y, z, font=2, cex=0.9)
	moran.test(z, nb2listw(neighbors))
}

# Create and draw examples

z = c(rep(c(rep(1, 5), rep(2, 5)), 5), rep(c(rep(3, 5), rep(4, 5)), 5))
draw_grid(z, "Highly Clustered (I = 0.89)")

z = rep(c(rep(0:1, 5), rep(1:0, 5)), 5)
draw_grid(z, "Perfectly Dispersed (I = -1)")

z = rep(0:1, 50)
draw_grid(z, "Perfectly Random (I = 0)")

set.seed(2)
z = as.integer(runif(100, 1, 10))
draw_grid(z, "Highly Random (I = -0.05)")
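Under the hood, Moran's I is a ratio of spatial covariance to total variance. A minimal hand computation reproduces the perfectly dispersed case above; `morans_i` is a hypothetical helper written for illustration, not a spdep function:

```r
# Moran's I from its definition:
# I = (n / S0) * sum_ij w_ij (z_i - zbar)(z_j - zbar) / sum_i (z_i - zbar)^2
# where w is the spatial weights matrix and S0 is the sum of all weights
morans_i = function(z, w) {
	n = length(z)
	d = z - mean(z)
	(n / sum(w)) * sum(w * outer(d, d)) / sum(d ^ 2)
}

# A 2x2 checkerboard with rook adjacency is perfectly dispersed
w = matrix(c(0,1,1,0,
             1,0,0,1,
             1,0,0,1,
             0,1,1,0), nrow=4, byrow=TRUE)
morans_i(c(0, 1, 1, 0), w)   # -1
```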

Testing for Spatial Autocorrelation Using Moran's I

Making a quick choropleth to view the spatial distribution of median household income, we see indications of spatial autocorrelation, with clear clusters of both high and low income across the NYC metropolitan area.

spplot(metro, zcol='MEDHHINC', col="transparent")
Median Household Income (USCB 2015)

Before performing spatial autocorrelation analysis, you should remove all shapes that have NA as the value for any of the variables you will be using. Failure to do this will result in a "missing values in object" error.

metro = metro[!(is.na(metro$POPULATION) | is.na(metro$MEDAGE) | is.na(metro$MEDHHINC) |
	is.na(metro$MEDHOUSING) | is.na(metro$DENSITY) | is.na(metro$PCPREWAR)),]

To get the Moran's I statistic for this variable, we must first answer the ancient question, Who is my neighbor?

The poly2nb() function in the spdep package takes an object of polygons, and for each polygon creates an nb list of the polygons that are adjacent to that polygon.

The nb2listw() function then takes that nb list and creates a weighted list. By default all neighbors are treated equally, but nb2listw() provides the ability to treat some regions as more important than others, such as when dealing with areas of uniquely important economic activity.
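As a sketch of what that weighting means in practice, here is a comparison of the default row-standardized style "W" against the binary style "B" on a small synthetic grid (not the tutorial data):

```r
library(spdep)

# A 3x3 rook-adjacency grid; cell 1 is a corner with two neighbors
nb = cell2nb(3, 3, type="rook")

# Default style "W" row-standardizes: each region's weights sum to 1
lw = nb2listw(nb, style="W")
lw$weights[[1]]   # corner cell: two neighbors, each weighted 0.5

# Style "B" leaves the weights binary
lwb = nb2listw(nb, style="B")
lwb$weights[[1]]  # both neighbors weighted 1
```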

The zero.policy=T parameter indicates that locations with no neighbors (such as islands) will be passed rather than throwing the Empty neighbour sets found error.

library(spdep)
neighbors = poly2nb(metro)
weighted_neighbors = nb2listw(neighbors, zero.policy=T)

That weighted neighbor list can then be passed to the moran.test() function along with the variable being analyzed to get the Moran's I statistic. As indicated above, a positive statistic value indicates positive spatial autocorrelation:

> moran.test(metro$MEDHHINC, weighted_neighbors, zero.policy=T)

        Moran I test under randomisation

data:  metro$MEDHHINC  
weights: weighted_neighbors  

Moran I statistic standard deviate = 78.728, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.6848731034     -0.0002179124      0.0000757239 

The moran.plot() function can be used to visualize and diagnose the autocorrelation. The x axis shows the values and the y axis shows the spatially lagged (adjacent) values. The fairly clear upward pattern from left to right indicates autocorrelation.

moran.plot(metro$MEDHOUSING, weighted_neighbors, zero.policy=T)
Moran Plot of Median Monthly Housing Cost Showing Spatial Autocorrelation
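Conceptually, a Moran plot is just each value plotted against the average of its neighbors' values. A hand-rolled sketch on a synthetic grid (assuming only spdep, not the tutorial data) shows the same idea:

```r
library(spdep)

nb = cell2nb(10, 10, type="rook")
lw = nb2listw(nb)

set.seed(1)
z = rnorm(100)
z = z + 0.5 * lag.listw(lw, z)   # inject positive spatial dependence

# Each value against the mean of its neighbors' values
lagged = lag.listw(lw, z)
plot(z, lagged, xlab="value", ylab="spatially lagged value")
abline(lm(lagged ~ z), col="red")  # an upward slope indicates autocorrelation
```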

Moran's I can be used on its own to detect clustering, such as when examining prevalence of cancer in a community and searching for hot-spots indicative of environmental contamination.
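For that kind of hot-spot analysis, spdep's localmoran() returns one local I statistic per location rather than a single global value. A sketch on a synthetic grid with a known high-value cluster in one half:

```r
library(spdep)

nb = cell2nb(10, 10, type="rook")
lw = nb2listw(nb)

# One contiguous half of the grid high, the other low
z = c(rep(10, 50), rep(1, 50))

# One row per cell: local I, expectation, variance, z-score, p-value
local = localmoran(z, lw)
summary(local[, "Ii"])   # positive local I inside both clusters
```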

Linear Model

Before going to the trouble of building a spatial model, you should build a conventional linear model first, both to get a baseline for your spatial model and to determine whether a spatial model is even necessary.

In this case we will explore median household income (MEDHHINC), median age (MEDAGE), population density per square mile (DENSITY), and the percent of homes constructed before WW-II (PCPREWAR) as predictors for median monthly housing cost (MEDHOUSING).

X-Y scatters of the four independent variables vs the dependent variable show significant correlations:

par(mfrow=c(2,2))

cor = cor.test(metro$MEDHHINC, metro$MEDHOUSING)
plot(metro$MEDHHINC, metro$MEDHOUSING, col="#00000040", main=round(cor$estimate, 3))

cor = cor.test(metro$MEDAGE, metro$MEDHOUSING)
plot(metro$MEDAGE, metro$MEDHOUSING, col="#00000040", main=round(cor$estimate, 3))

cor = cor.test(metro$PCPREWAR, metro$MEDHOUSING)
plot(metro$PCPREWAR, metro$MEDHOUSING, col="#00000040", main=round(cor$estimate, 3))

cor = cor.test(log(metro$DENSITY), metro$MEDHOUSING)
plot(log(metro$DENSITY), metro$MEDHOUSING, col="#00000040", main=round(cor$estimate, 3))
X-Y Scatters and Correlations with Median Monthly Housing Costs (USCB 2015)

Use lm() to build a linear model:

nonspatial = lm(MEDHOUSING ~ MEDHHINC + MEDAGE + DENSITY + PCPREWAR, data = metro@data)

In the summary(), the R-squared of 0.80 shows that the model fits well, and the low Pr(>|t|) values indicate that all predictors except PCPREWAR are significant:

> summary(nonspatial)

Call:
lm(formula = MEDHOUSING ~ MEDHHINC + MEDAGE + DENSITY + PCPREWAR, 
    data = metro@data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2127.15  -149.18    -0.93   142.96  1715.53 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.039e+03  2.438e+01  42.596   <2e-16 ***
MEDHHINC     1.431e-02  1.173e-04 122.018   <2e-16 ***
MEDAGE      -1.055e+01  5.921e-01 -17.825   <2e-16 ***
DENSITY     -1.197e-03  1.319e-04  -9.071   <2e-16 ***
PCPREWAR    -8.452e-02  1.894e-01  -0.446    0.655    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 246.9 on 4588 degrees of freedom
Multiple R-squared:  0.8023,    Adjusted R-squared:  0.8022
F-statistic:  4656 on 4 and 4588 DF,  p-value: < 2.2e-16

Creating a rank-order graph of the fitted values displayed over the actual values visualizes that strong fit:

x = data.frame(actual = nonspatial$fitted.values + nonspatial$residuals,
	fitted = nonspatial$fitted.values)
x = x[order(x$fitted),]
plot(x$actual, col='#00000040', ylab="MEDHOUSING", las=1)
lines(x$fitted, col='#c00000', lwd=3)
Non-spatial Model Fitted vs. Actual

However, mapping the residuals (the amount that each fitted value deviates from the actual value) from the model shows clusters of errors, indicating spatial autocorrelation that likely biases the model. In cases like this, you will probably get a better model by adjusting for autocorrelation using a spatial lag or spatial error model.

metro$RESIDUALS = nonspatial$residuals
spplot(metro, zcol='RESIDUALS', col="transparent")
Residuals From the Nonspatial Model Indicating Autocorrelation

The Moran's I statistic of 0.27, along with the Moran plot of the residuals showing a slight upward trend from left to right, confirms this observation of spatial autocorrelation.

> library(spdep)
> neighbors = poly2nb(metro)
> weighted_neighbors = nb2listw(neighbors, zero.policy=T)

> moran.plot(nonspatial$residuals, weighted_neighbors, zero.policy=T)
> moran.test(nonspatial$residuals, weighted_neighbors, zero.policy=T)

        Moran I test under randomisation

data:  nonspatial$residuals  
weights: weighted_neighbors  

Moran I statistic standard deviate = 31.352, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     2.725418e-01     -2.179124e-04      7.568995e-05 
Moran Plot of Residuals From the Nonspatial Model Indicating Autocorrelation

Spatial Lag Regression

If your residuals are not spatially autocorrelated, you can stop here and use the results of your non-spatial model with confidence.

However, if your residuals exhibit spatial autocorrelation, you need to use a model that adjusts for that autocorrelation if you want unbiased results.
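One way to decide which adjustment to use is the Lagrange multiplier diagnostics in spdep's lm.LMtests(), which test a non-spatial model's residuals for lag-type and error-type dependence. A sketch on synthetic data with spatial structure deliberately built into the error term (not the tutorial data):

```r
library(spdep)

nb = cell2nb(10, 10, type="rook")
lw = nb2listw(nb)

set.seed(1)
x = runif(100)
e = rnorm(100)
y = 2 * x + e + lag.listw(lw, e)   # error term with spatial structure

fit = lm(y ~ x)

# Significant LMerr/RLMerr relative to LMlag/RLMlag points toward an
# error model, and vice versa
lm.LMtests(fit, lw, test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
```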

The spatial lag model adjusts for a dependent variable where values are affected by nearby locations, resulting in dependencies between values.

Note that this spatial lag adjustment does not have to represent any identifiable structural feature of reality. It is simply used to create the mathematical conditions needed for valid linear regression with unbiased results.

The lagsarlm() function in the spdep library performs spatial lag regression (in newer versions of spdep, the model-fitting functions have moved to the spatialreg package). We use the same formula that we used with the non-spatial lm() function.

library(spdep)
neighbors = poly2nb(metro)
weighted_neighbors = nb2listw(neighbors, zero.policy=T)

lag = lagsarlm(MEDHOUSING ~ MEDHHINC + MEDAGE + PCPREWAR + DENSITY, 
	data=metro, listw = weighted_neighbors,
	tol.solve=1.0e-30, zero.policy=T)

The coefficient values for MEDHHINC and MEDAGE are only slightly lower than in the non-spatial model, reflecting the aspatial importance of those variables. PCPREWAR remains insignificant (p ≈ 0.07), and while DENSITY (residents per square mile) is still significant, its coefficient shrinks by about a quarter, likely reflecting the autocorrelation caused by the clustering of high residential density in and adjacent to Manhattan.

> summary(lag)
Call:lagsarlm(formula = MEDHOUSING ~ MEDHHINC + MEDAGE + PCPREWAR + 
    DENSITY, data = metro, listw = weighted_neighbors, zero.policy = T, 
    tol.solve = 1e-30)

Residuals:
       Min         1Q     Median         3Q        Max 
-1898.4838  -138.1129     2.4176   135.1403  1775.3589 

Type: lag 
Regions with no neighbours included:
 1086 1377 1908 
Coefficients: (asymptotic standard errors) 
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  7.7989e+02  2.6403e+01  29.5380 < 2.2e-16
MEDHHINC     1.2218e-02  1.5900e-04  76.8434 < 2.2e-16
MEDAGE      -1.0418e+01  5.6426e-01 -18.4627 < 2.2e-16
PCPREWAR     3.2840e-01  1.8107e-01   1.8137   0.06973
DENSITY     -8.7588e-04  1.2691e-04  -6.9016 5.141e-12

Rho: 0.23251, LR test value: 393.9, p-value: < 2.22e-16
Asymptotic standard error: 0.011994
    z-value: 19.386, p-value: < 2.22e-16
Wald statistic: 375.81, p-value: < 2.22e-16

Log likelihood: -31620.21 for lag model
ML residual variance (sigma squared): 55352, (sigma: 235.27)
Number of observations: 4593 
Number of parameters estimated: 7 
AIC: 63254, (AIC for lm: 63646)
LM test for residual autocorrelation
test value: 273.58, p-value: < 2.22e-16

The Akaike information criterion (AIC) is a relative measure of model fit; lower values are better. Because it is a relative measure, only AIC values for models fit to the same data can be compared. You cannot say that a model of health data with an AIC of 63254 is better or worse than an unrelated model of income data with an AIC of 63646.
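As a self-contained illustration of how an AIC comparison works (synthetic data and ordinary lm() models rather than the tutorial's spatial models):

```r
set.seed(1)
x = runif(100)
y = 3 * x + rnorm(100)

with_predictor = lm(y ~ x)   # includes the real predictor
intercept_only = lm(y ~ 1)   # omits it

# AIC() compares candidates fit to the same data; the model with the
# real predictor should have the lower (better) AIC
AIC(with_predictor, intercept_only)
```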

The slightly lower AIC value for the spatial lag model relative to the non-spatial model indicates that the spatial lag model has a slightly better fit. However, the confirmation that PCPREWAR contributes little to the model is probably a more significant conclusion of this analysis than the minor improvement in model fit.

As with the non-spatial model, a rank-order graph of the fitted values over the actual values visualizes the fit:

x = data.frame(actual = lag$fitted.values + lag$residuals, 
	fitted = lag$fitted.values)
x = x[order(x$fitted),]
plot(x$actual, col='#00000040', ylab="MEDHOUSING", las=1)
lines(x$fitted, col='#c00000', lwd=3)
Spatial Lag Model Fitted vs Actual Values

The reduction in the upward slope of the Moran plot of the residuals, compared to the non-spatial model, shows that the spatial lag model did its job:

moran.plot(lag$residuals, weighted_neighbors, zero.policy=T)
Residuals From the Spatial Lag Model With No Autocorrelation

That absence of spatial autocorrelation is also visible in a choropleth of the residuals:

metro$RESIDUALS = lag$residuals
spplot(metro, zcol='RESIDUALS', col="transparent")
Mapped Residuals From the Spatial Lag Model With No Autocorrelation

Spatial Error Regression

While the spatial lag model accounts for autocorrelation in the dependent variable, a spatial error model accounts for autocorrelation in the error term.
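In standard matrix notation (not taken from this tutorial; W is the spatial weights matrix, and ρ and λ are the spatial parameters reported as Rho and Lambda in the model summaries), the two specifications place the spatial term differently:

```latex
\begin{aligned}
\text{Spatial lag:}   \quad & y = \rho W y + X\beta + \varepsilon \\
\text{Spatial error:} \quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
```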

library(spdep)
neighbors = poly2nb(metro)
weighted_neighbors = nb2listw(neighbors, zero.policy=T)

err = errorsarlm(MEDHOUSING ~ MEDHHINC + MEDAGE + PCPREWAR + DENSITY, 
	data=metro, listw = weighted_neighbors,
	tol.solve=1.0e-30, zero.policy=T)

The results displayed in summary() are similar to those from the spatial lag model, although the AIC is lower than both the spatial lag and non-spatial model, indicating a slightly better fit:

> summary(err)
Call:errorsarlm(formula = MEDHOUSING ~ MEDHHINC + MEDAGE + PCPREWAR + 
    DENSITY, data = metro, listw = weighted_neighbors, zero.policy = T, 
    tol.solve = 1e-30)

Residuals:
        Min          1Q      Median          3Q         Max 
-1726.46062  -130.81518     0.15949   126.75311  1736.43500 

Type: error 
Regions with no neighbours included:
 1086 1377 1908 
Coefficients: (asymptotic standard errors) 
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  1.0014e+03  2.8253e+01  35.4450 < 2.2e-16
MEDHHINC     1.3829e-02  1.4087e-04  98.1750 < 2.2e-16
MEDAGE      -9.0279e+00  6.3733e-01 -14.1652 < 2.2e-16
PCPREWAR     4.3360e-01  2.1999e-01   1.9710   0.04872
DENSITY     -1.3408e-03  1.6492e-04  -8.1301 4.441e-16


Lambda: 0.50055, LR test value: 697.2, p-value: < 2.22e-16
Asymptotic standard error: 0.018105
    z-value: 27.648, p-value: < 2.22e-16
Wald statistic: 764.41, p-value: < 2.22e-16

Log likelihood: -31468.56 for error model
ML residual variance (sigma squared): 49794, (sigma: 223.14)
Number of observations: 4593 
Number of parameters estimated: 7 
AIC: 62951, (AIC for lm: 63646)

As before, a rank-order graph of fitted over actual values and a Moran plot of the residuals confirm the fit and the absence of residual autocorrelation:

x = data.frame(actual = err$fitted.values + err$residuals, fitted = err$fitted.values)
x = x[order(x$fitted),]
plot(x$actual, col='#00000040')
lines(x$fitted, col='#c00000', lwd=3)
Spatial Error Model Fitted vs Actual Values

moran.plot(err$residuals, weighted_neighbors, zero.policy=T)
Residuals From the Spatial Error Model With No Autocorrelation

Making a Better Choropleth

Maps created with spplot() are useful because they are quick and easy. However, if you are sharing your results on the web or in a printed report, you might want a more cartographically attractive map.

The script below uses OpenStreetMap tiles to give context, then overlays colored polygons of the mapped variable. The mapping is implemented in a function to make it simpler to create choropleths of different variables.

# Base map
library(OpenStreetMap)
basemap = openmap(upperLeft = c(42.1, -75.5), lowerRight = c(39.5, -71.8), type="osm")

# Thematic maps
choropleth = function(variable, equal_interval = F) {
	x = metro@data[,variable]
	if (equal_interval)
		breaks = qunif(seq(0, 1, 0.2), min = min(x), max = max(x))
	else
		breaks = quantile(x, probs = seq(0, 1, 0.2), na.rm=T)

	categories = as.numeric(cut(x, breaks, include.lowest=T))
	palette = colorRampPalette(c('#FF0000C0', '#0000FFC0'), alpha=T)
	colors = palette(5)[categories]

	par(mfrow = c(1, 1))
	par(mar = c(2, 2, 2, 2))

	plot(basemap)
	metro = spTransform(metro, osm())
	plot(metro, col=colors, border=NA, add=T)

	bin_names = sapply(1:5, function(z) paste(round(breaks[z]), '-', round(breaks[z+1])))
	legend(x=-8107794, y=4914624, legend = bin_names, title = variable, pch = 15, col = palette(5), bg="white")
}

metro$FITTED = lag$fitted.values
choropleth('FITTED', T)

# Optional choropleths for the variables
# choropleth('MEDHOUSING', T)
# choropleth('MEDHHINC')
# choropleth('MEDAGE')
# choropleth('MEDYRBUILT')
# choropleth('PCPREWAR')
# choropleth('DENSITY')
Median Monthly Housing Costs by Census Tract in the New York CBSA (USCB 2015)
Spatial Lag Fitted Model For Median Monthly Housing Costs by Census Tract in the New York CBSA (USCB 2015)