# Spatial Autocorrelation and Regression Median Monthly Housing Costs by Census Tract in the New York CBSA (USCB 2015)

Similar objects tend to cluster together in space, something captured by Waldo Tobler's (1979) first "law," Everything is related to everything else, but near things are more related to each other. When applied to values measured at different points or in different areas, this clustering results in spatial autocorrelation.

Regression models are built on the assumption that the values of the dependent variable are independent from each other. Variables that exhibit spatial autocorrelation violate this assumption, which means regression models built using those variables will be biased and statistically invalid. This can cause an understatement or overstatement of the importance of the relationships between the dependent and independent variables.

This tutorial will give an overview of how to test for spatial autocorrelation and build regression models that adjust for spatial autocorrelation.

## Example Data

The examples will model median monthly housing costs using 2015 American Community Survey data from 2011-2014 for census tracts in the New York metropolitan area (CBSA). That data is available as a zipped shapefile HERE. You can then load that data into R with the readOGR() function from the rgdal library:

```library(rgdal)
metro = readOGR(dsn='.', layer='nyc-metro', stringsAsFactors=F)
```

## Moran's I

Moran's I is a commonly-used measure of spatial autocorrelation, based on work that Patrick Alfred Pierce Moran published in 1950. However, Li, Calder and Cressie (2007) point out that although the technique was fully developed much later and named after Moran by Andrew David Cliff and J.K. Ord (1981), based on work by James Durbin and Geoffrey Watson (1950).

The range of Moran's I is -1 to +1:

• -1 means all values are evenly dispersed
• 0 is perfect randomness (no autocorrelation)
• 1 has similar values all clustered together Moran's I Example Cases
```# Create the x/y coordinates for the grid

x = rep(1:10, 10)
y = rep(1:10, each=10)

# Create a grid neighbor list

library(spdep)
neighbors = cell2nb(10, 10, type="rook", torus=FALSE)

# Display four example plots in a single window

par(mfrow = c(2,2))
par(mar = c(1, 1, 2, 1))

# Drawing function

draw_grid = function(z, title) {
plot(neighbors, data.frame(x, y), points=F, col="#f0f0f0")
mtext(title, line=0.5, font=2, cex=0.9)
text(x, y, z, font=2, cex=0.9)
moran.test(z, nb2listw(neighbors)) }

# Create and draw examples

z = c(rep(c(rep(1, 5), rep(2, 5)), 5), rep(c(rep(3, 5), rep(4, 5)), 5))
draw_grid(z, "Highly Clustered (I = 0.89)")

z = rep(c(rep(0:1, 5), rep(1:0, 5)), 5)
draw_grid(z, "Perfectly Dispersed (I = -1)")

z = rep(0:1, 50)
draw_grid(z, "Perfectly Random (I = 0)")

set.seed(2)
z = as.integer(runif(100, 1, 10))
draw_grid(z, "Highly Random (I = -0.05)")
```

## Testing for Spatial Autocorrelation Using Moran's I

Making a quick choropleth to view the spatial distribution of median household income, we see indication of spatial autocorrelation with clear clusters of both high and low income around the NYC metropolitan area.

```spplot(metro, zcol='MEDHOUSING', col="transparent")
``` Median Household Income (USCB 2015)

Before performing spatial autocorrelation analysis, you should remove all shapes that have NA as the value for any of the variables you will be using. Failure to do this will result in the missing values in object error.

```metro = metro[!(is.na(metro\$POPULATION) | is.na(metro\$MEDAGE) | is.na(metro\$MEDHHINC) |
is.na(metro\$MEDHOUSING) | is.na(metro\$DENSITY) | is.na(metro\$PCPREWAR)),]
```

To get the Moran's I statistic for this variable, we must first answer the ancient question, Who is my neighbor?

The poly2nb() function in the spdep package takes an object of polygons, and for each polygon creates an nb list of the polygons that are adjacent to that polygon.

The nb2listw() function then takes that nb list and creates a weighted list. By default all neighbors are treated equally, but nb2listw() provides the ability to treat some regions as more important than others, such as when dealing with areas of uniquely important economic activity.

The zero.policy=T parameter indicates that locations with no neighbors (such as islands) will be passed rather than throwing the Empty neighbour sets found error.

```library(spdep)
neighbors = poly2nb(metro)
weighted_neighbors = nb2listw(neighbors, zero.policy=T)
```

That neigbor list can then be passed to the moran.test() function along with the variable being analyzed to get the Moran's I statistic. As indicated above, positive statistic value indicate autocorrelation:

```> moran.test(metro\$MEDHHINC, weighted_neighbors, zero.policy=T)

Moran I test under randomisation

data:  metro\$MEDHHINC
weights: weighted_neighbors

Moran I statistic standard deviate = 78.728, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
0.6848731034     -0.0002179124      0.0000757239
```

The moran.plot() function can be used to visualize and diagnose the autocorrelation. The x axis is the values and the y axis are the spatially-lagged (adjacent) values. The fairly-clear upward pattern from left to right indicates autocorrelation.

```moran.plot(metro\$MEDHOUSING, weighted_neighbors, zero.policy=T)
``` Moran Plot of Median Monthly Housing Cost Showing Spatial Autocorrelation