Landsat Data in R

This tutorial covers the basic acquisition, import, and visualization of Landsat data in R.

Landsat is the name of a joint NASA / USGS program that provides repetitive, calibrated, satellite-based observations of the earth at a spatial resolution that enables analysis of man's interaction with the environment. (NASA 2021).

The Landsat 1 satellite was launched on 23 July 1972 and subsequent satellites have provided continuous satellite imagery of the Earth. This is arguably one of the most important scientific enterprises of our time, and if you work with remote sensing, you will probably use Landsat data on multiple occasions.

Landsat 7 (launched 1999) and Landsat 8 (launched 2013) are operational as of this writing. Landsat 9 is scheduled for launch in 2021 and will replace Landsat 7 with a satellite that is functionally similar to Landsat 8 (NASA 2020).

Landsat 7 (NASA 1999)

Landsat satellites complete just over 14 orbits a day, covering the entire earth every 16 days. Landsat satellites travel in a near-polar, sun-synchronous orbit so they spend as much time as possible looking at sunlit area.

Landsat Orbit Coverage (NASA 2011)

Acquire the Data

The USGS makes recent and historic Landsat data freely available via the EarthExplorer website.

Cloudy Landsat Scene

Registration

In order to download data, you need to register (for free). Click the Login button and select Create New Account.

Registering On USGS Earth Explorer

Scene Downloads

For the most recent scenes, you should use Landsat 8 data. In this example we download both a summer and a winter scene.

  1. Log in
  2. To get data for a specific location, search with a Feature Name and State and select a location that you want to explore.
  3. Select the appropriate collection and sensor.
    • If you want the most recent scenes, use: Landsat -> Landsat Collection 2 Level-1 -> Landsat 8 OLTI/TIRS C1 Level-1.
    • If you want historical scenes, use: Landsat -> Landsat Collection 2 Level-1 -> Landsat 4 - 5 TM C1 Level-1.
  4. View scene footprints that are in the time frame of interest, that cover your area of interest, and are not covered in clouds. Be aware that for vegetation analysis, summer scenes are better.
  5. Click Download and you will be given multiple options.
    • If you just need visible light imagery, the Full Resulotion Browse (Natural Color) GeoTIFF is the simplest (single file) option.
    • If you are doing some kind of multi-band analysis (such as NDVI analysis), dowload the full (and large) Landsat Collection 2 Level-1 Product Bundle.
  6. Create a new working directory and uncompress the data files into that directory.
    • The Natural Color image files are made available directly as GeoTIFF (.tif) files that require no additional processing.
    • The Product Bundle data is a .tar archive contianing GeoTIFF files for individual bands that can be extracted using a program like 7-zip.
Downloading Landsat 8 data from EarthExplorer

Band Files

Individual Landsat bands for a scene are placed in separate GeoTIFF files that are distinguished by the ends of their names. For example, for Landsat 8 level 1 data (LC08_L1TP), path 23 and row 32 (023032) on 6 June 2021 (20210622), the following files are data files for the 11 different bands

Additional GeoTIFF files provide more-detailed information about the data and data processing.

Memory and Bandwidth Considerations

Note that rasters are often large files that take a long time to load and consume considerable disk space, memory, and connection bandwith.

If your machine has inadequate memory to keep all the data in memory, your machine may begin thrashing data back and forth between RAM and your hard disk (virtual memory). This will result in very slow performance from all apps running on your machine, and you will likely need to kill R to regain control of your computer.

If you have inadequate memory, you will need to find a machine with more RAM (such as a lab machine on UIUC AnyWare), you will need to add memory to your machine, or you may need to acquire a more powerful machine.

An unrelated source of performance degradation can occur if you are working on a lab machine that stores data on a computer network. Having to constantly transfer your data across a network can consume a considerable amount of unnecessary time. This can be especially serious if you are connected to the internet through a poor Wi-Fi connection rather than a fast wired connection.

If you are storing your data on a network drive, you should consider downloading and performing your work on your local hard drive, and then saving your work at completion to the network drive.

Landsat 8 RGB Visualization

The raster Package

Raster data in R is commonly handled with the raster package, which you can install with install.packages() before you use it the first time.

install.packages("raster")

The raster package provides three different raster structures that you choose from depending on how many bands you need for a scene, and how many files they come from:

Stacks

To create a RasterStack for a visible light RGB scene:

Because the Landsat sensors have a wide dynamic range to be able to handle a variety of different terrains, the values for a scene may be in the low range of potential values, which will result in a very dark visualization.

library(raster)

blue = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B2.TIF")
green = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B3.TIF")
red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")

rgb = stack(red, green, blue)

plotRGB(rgb, scale=65535)
Dark, unprocessed Landsat RGB scene of Champaign-Urbana, IL

Cropping

Landsat scenes cover a land area of over 200 square kilometers, which, along with the dark rendering, can make it difficult to distinguish what locations are being displayed.

If your focus is on a fairly small area within the scene, cropping the scene will both make it easier to see where you are, and will reduce the memory needed to process the scene further.

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)

boundary = projectExtent(boundary, rgb@crs)

champaign = crop(rgb, boundary)

plotRGB(champaign)
Cropped unprocessed Landsat RGB scene

Stretch

In some cases, the stretch option on plotRGB() can improve the rendering of the scene. lin stretches the values evenly from the minimum to the minimum, and hist distributes them more evenly across the range.

In both cases with this particular scene, the results make features more visible, but the visual contrast is harsh and, arguably, unappealing.

plotRGB(champaign, stretch="lin")
Landsat RGB scene with linear stretching
plotRGB(champaign, stretch="hist")
Landsat RGB scene with histogram stretching

Custom Stretch

Using hist() to show a histogram of the distributions of values across the different bands shows the heavy skewing of the distributions so that most values are in the (dark) low end of the distribution in all bands. This explains why the default plotRGB() is so dark.

This function reads all values from the raster files and may take a few seconds to complete.

hist(rgb)
Landsat RGB scene histogram of values in the three bands

Given that knowledge of the distribution, you can use the zlim and scale parameters to more clearly define range of values that will be displayed. The values that look best will depend on the content of your specific scene.

plotRGB(champaign - 5000, scale=12000, zlim=c(0, 12000))
Landsat RGB scene with customized stretching

Levels

The custom stretch performs the same operation on all values, which can imbalance the colors and gives the scene a slight tinting and or haze. The channel rasters can be individually adjusted for leveling and gamma shift.

blue = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B2.TIF")
green = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B3.TIF")
red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")

blue = crop(blue, boundary)
green = crop(green, boundary)
red = crop(red, boundary)

blue = ((blue - 7000) / 10000)
green = ((green - 7000) / 10000)
red = ((red - 7000) / 10000)

blue[blue < 0] = 0
green[green < 0] = 0
red[red < 0] = 0

blue = blue^0.95
green = green^0.5
red = red^0.4

rgb = stack(red, green, blue)
plotRGB(rgb, scale=1, zlim=c(0, 1))
Landsat RGB scene with customized levels

Landsat 5 RGB Visualization

Because Landsat data has been collected for almost half a century, it can be used to detect broad land use changes. The process of visualizing Landsat 5 RGB data is largely the same as with Landsat 8, although the differences in band numbering and bit depth necessitate different parameters.

This Landsat 5 scene is from 28 August 1984. Note that the Landsat 5 Thematic Mapper band numbering is different from the newer Landsat 8 band numbering.

library(raster)

blue = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B1.TIF")
green = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B2.TIF")
red = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B3.TIF")

rgb = stack(red, green, blue)

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)

boundary = projectExtent(boundary, rgb@crs)

champaign = crop(rgb, boundary)

plotRGB(champaign, stretch="lin")
Landsat 5 (1984) RGB scene with stretched levels

Levels

As with the Landsat 8 scene, linear stretching creates harsh contrast that can be scaled and stretched to give a more pleasing visualization.

blue = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B1.TIF")
green = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B2.TIF")
red = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B3.TIF")

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)
boundary = projectExtent(boundary, red@crs)

blue = crop(blue, boundary)
green = crop(green, boundary)
red = crop(red, boundary)

blue = ((blue - 30) / 120)
green = ((green - 30) / 70)
red = ((red - 30) / 70)

blue[blue < 0] = 0
green[green < 0] = 0
red[red < 0] = 0

blue = blue^1.7
green = green^0.3
red = red^0.2

rgb = stack(red, green, blue)
plotRGB(rgb, scale=1, zlim=c(0, 1))
Landsat 5 (1984) RGB scene with custom levels

Natural Color Images

USGS also publishes preprocessed three-band composite RGB files that can be downloaded directly without having to combined band files or perform additional correction. These are the Full-Resolution Browse (Natural Color) GeoTIFF option on the download dialog in EarthExplorer. These are actually false-color images constructed with the infrared and red bands, but in they often render well in RGB and require much less work to use than stacks constructed from band files.

Natural color download

Because the natural color scenes are in single composite files, use the brick() function to load the image into memory and the plotRGB() function to display the brick as a single image. Optionally, you can also crop the image as described above.

rgb = brick("LC08_L1TP_023031_20200822_20200905_02_T1_refl.tif")

boundary = raster(ymx=41.922, xmn=-87.72, ymn=41.838, xmx=-87.552)

boundary = projectExtent(boundary, rgb@crs)

rgb = crop(rgb, boundary)

plotRGB(rgb)
Landsat 8 natural color image of downtown Chicago, IL on 22 August 2020

Other Band Combinations

Although RGB is the most obvious band combination for Landsat data, the comparatively low resolution (30 meters) severely limits visible light image quality compared to other contemporary high-resolution sources, especially when focusing on localities.

However, data from the eight other infrared and ultraviolet bands can be combined with the visible light bands in a variety of ways to create useful visualizations for biogeography and agriculture.

Landsat 8 bands 1 - 9

Color Infrared

This visualization colors near-infrared (band 5) as red, red (band 4) as green, and green (band 3) as blue.

Healthy photosynthetic plants reflect infrared light, so helthy vegetation is red, bare or urbanized areas are white, and dark areas are water.

green = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B3.TIF")
red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
near.infrared = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B5.TIF")

rgb = stack(near.infrared, red, green)

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)
boundary = projectExtent(boundary, rgb@crs)
rgb = crop(rgb, boundary)

plotRGB(rgb, scale=65535, stretch="lin")
Color infrared visualization of Champaign-Urbana, IL on 22 June 2021

Short-Wave Infrared

This alternative vegetation visualization colors short-wave infrared 2 (band 7) as red, short-wave infrared 1 (band 6) as green, and red(band 4) as blue.

Density of vegetation is indicated by the darkness of the shades of green. Urban areas are blue and soil is brown.

red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
sw.infrared.1 = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B6.TIF")
sw.infrared.2 = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B7.TIF")

rgb = stack(sw.infrared.2, sw.infrared.1, red)

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)
boundary = projectExtent(boundary, rgb@crs)
rgb = crop(rgb, boundary)

plotRGB(rgb, scale=65535, stretch="lin")
Short-wave infrared visualization of Champaign-Urbana, IL on 22 June 2021

NDVI

The Landsat multispectral data is of considerable value for the remote sensing of the environment.

One commonly used technique is the calculation of Normalized Difference Vegetation Index (NDVI), which can be used to detect the presence of and changes in photosynthetic vegetation.

NDVI = (infrared - red) / (infrared + red)

NDVI is based on the observation that green plants reflect infrared light (to avoid overheating), but absorb visible light (to power photosynthesis). This gives green plants a light reflection signature that can distinguish them from objects that absorb all spectra of light or different spectra of light from plants.

While you could use the level of green light to detect green (photosynthetic) plants, the level of green light depends on the intensity and angle of the sun relative to the terrain. The use of this combination of red and near-infrared gives measurements that are normalized to be the consistent across time and space regardless of terrain or level of lighting.

How NDVI works

Creating an NDVI Raster

Rasters with the same extents and pixel dimensions can be used in mathematical operations like vectors or matrices. This is called raster algebra.

In this example:

red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
near.infrared = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B5.TIF")

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)
boundary = projectExtent(boundary, red@crs)

red = crop(red, boundary)
near.infrared = crop(near.infrared, boundary)

ndvi = (near.infrared - red) / (near.infrared + red)

colors = colorRampPalette(c("red3", "white", "darkcyan"))(255)

plot(ndvi, zlim=c(0, 0.6), col=colors)
Landsat 8 NDVI in Champaign-Urbana, IL on 22 June 2021

The range of NDVI is -1 to +1, although NDVI values will generally be in a small area in the middle of the full potential range of -1 to +1. You can use this information to adjust the zlim parameter to distribute the color values and make them legible.

hist(ndvi)
Landsat NDVI histogram

Blending with a Base Map For Context

Since the objective of NDVI analysis is usually to know what is going on at a specific location or set of locations, it can be helpful to plot NDVI with a base map that gives geographic context to the data:

library(OpenStreetMap)

upperLeft = c(40.1171, -88.26)
lowerRight = c(40.0817, -88.19)
basemap = openmap(upperLeft, lowerRight, zoom=13)
basemap = raster(basemap)

red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
near.infrared = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B5.TIF")

red = projectRaster(red, basemap)
near.infrared = projectRaster(near.infrared, basemap)

ndvi = (near.infrared - red) / (near.infrared + red)

colors = colorRampPalette(c("red3", "white", "darkcyan"))(255)

ndvi = RGB(ndvi, col=colors)

blended = (basemap / 255.0) * (ndvi / 255.0)

plotRGB(blended, scale=1, interpolate=1)
Landsat NDVI blended with a base map for context

Analyzing NDVI Change

The same type of raster algebra used to calculate NDVI can be used to calculate changes in NDVI between scenes.

When looking for a scene for for comparison a different date, it is important that you use an image from the same path and row. Although Landsat scenes overlap, using the same path and row will maximize the amount of common area and, therefore the amount of area covered in the analysis.

This example displays changes in NDVI in Champaign-Urbana, IL between the 1984 Landsat 5 scene and the 2021 Landsat 8 scene.

While the urbanized areas seem largely unchanged, the surrounding area has noticeably lower NDVI, which reflects both the construction of residential developments on the edge of the city (sprawl), and may reflect decreased agricultural intensity in the areas around Champaign-Urbana, which is noticeable by comparing the visible light visualizations above. However, this may also reflect differences in the sensor hardware.

red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
near.infrared = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B5.TIF")

boundary = raster(ymx=40.17, xmn=-88.38, ymn=40.03, xmx=-88.10)
boundary = projectExtent(boundary, red@crs)

red = crop(red, boundary)
near.infrared = crop(near.infrared, boundary)

ndvi.2021 = (near.infrared - red) / (near.infrared + red)

red = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B3.TIF")
near.infrared = raster("LT05_L1TP_022032_19840828_20200918_02_T1_B4.TIF")

red = projectRaster(red, ndvi.2021)
near.infrared = projectRaster(near.infrared, ndvi.2021)

ndvi.1984 = (near.infrared - red) / (near.infrared + red)

ndvi.difference = ndvi.2021 - ndvi.1984

colors = colorRampPalette(c("red3", "white", "darkcyan"))(255)

plot(ndvi.difference, col=colors, zlim=c(-1, 1))
NDVI change between 1984 and 2021 in Champaign-Urbana, IL

Interactive Display with Leaflet

When exploring data, it can be helpful to have an interactive visualization that you can zoom and pan. You can either export your data to a web service like ArcGIS Online or use an library like Leaflet for R that provides a similar level of interactivity on your desktop.

If you haven't already, you can install the Leaflet for R library with install.packages().

install.packages("leaflet")

The following script will create an interactive map using the NDVI change raster and open it in a new browser window. Note that unless you publish this map to a web server, it will only be available to you on your personal machine, and cannot be embedded into a notebook.

library(leaflet)

map = leaflet()

map = addTiles(map)

palette = colorBin(c("red3", "white", "darkcyan"), values(ndvi.difference))

map = addRasterImage(map, ndvi.difference, palette, opacity=0.7)

map = addLegend(map, pal=palette, values=values(ndvi.difference), title="NDVI")

map
Negative NDVI change in a new residential development in southeast Urbana (40.093, -88.177)

NDVI vs. Demographics

Vegetation (especially trees) is often associated with residential areas that are both more attractive and more healthful places to live (Visram 2021). Tree cover reduces summer street temperatures and promotes the presence of birds (Rubiano 2020).

There is a commonly cited relationship between neighborhood demographics and levels of vegetation that has been referred to as a luxury effect (Leong, Dunn, and Trautwein 2018) or tree gap (Visram 2021). Residential tree cover is considered a matter of environmental justice (Leahy and Serkez 2021), and American Forests' Tree Equity Score tracks the tree cover as critical infrastructure.

While NDVI does not clearly distinguish trees from grasses, levels of different types of vegetation tend to correlate in space, so NDVI can be used to assess the relationship between neighborhood greenness and demographics.

Load County Census Tracts

For this example, we will use demographic data from the US Census Bureau's 2015-2019 American Community Survey five-year estimates. A GeoJSON file of that data for census tracts in Illinois is available here.

library(sf)

tracts = st_read("2015-2019-acs-il_tracts.geojson", stringsAsFactors=F, quiet=T)

tracts = tracts[grepl("^17019", tracts$FIPS),]

plot(tracts["Median.Household.Income"], breaks="jenks",
	pal=colorRampPalette(c("red3", "white", "darkcyan")))
Median household income by census tract in Champaign County, IL 2015-2019 ACS five-year estimates

To filter other counties in Illinois, below are the FIPS codes for the top 14 Illinois counties by population.

Calculate County NDVI

NDVI is calculated using the same technique described above. For this example, we crop the rasters to the area occupied by all Champaign County tracts.

library(raster)

red = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B4.TIF")
near.infrared = raster("LC08_L1TP_023032_20210622_20210629_02_T1_B5.TIF")

boundary = raster(as_Spatial(tracts))
boundary = projectExtent(boundary, red@crs)

red = crop(red, boundary)
near.infrared = crop(near.infrared, boundary)

ndvi = (near.infrared - red) / (near.infrared + red)

tracts = st_transform(tracts, st_crs(red))

colors = colorRampPalette(c("red3", "white", "darkcyan"))(255)

plot(ndvi, zlim=c(0, 0.6), col=colors, interpolate=T, axes=F, box=F)

plot(st_geometry(tracts), add=T)
NDVI in Champaign County, IL on 22 June 2021

Extract Mean NDVI for Each Tract

When used with a set of polygons like the tract polygons, the extract() function from the raster package finds all raster values covered by each polygon and combines them using the function specified with the fun parameter. For this example we will take the mean of all NDVI values within each tract.

tracts$NDVI = extract(ndvi, as_Spatial(tracts), fun=mean)

plot(tracts["NDVI"], breaks="jenks", 
	pal=colorRampPalette(c("red3", "white", "darkcyan")))
Mean NDVI by census tract in Champaign County, IL on 22 June 2021

Save Tract Data

In order to preserve your processed data for future analysis, you may want to save it to a new GeoJSON file.

st_write(tracts, "2021-minn-champaign-ndvi.geojson", delete_dsn=T)

Correlate NDVI and Median Household Income

Plotting income against NDVI, we see a fairly strong luxury effect in Champaign County, IL.

plot(tracts$NDVI, tracts$Median.Household.Income)

model = lm(Median.Household.Income ~ NDVI, weights=Total.Population, data=tracts)

abline(model, lwd=3, col="navy")

summary(model)

	Weighted Residuals:
	     Min       1Q   Median       3Q      Max 
	-3131216  -956475  -120820   905910  3644378 

	Coefficients:
		    Estimate Std. Error t value Pr(>|t|)    
	(Intercept)   -31924      14138  -2.258   0.0293 *  
	NDVI          257878      40223   6.411 1.13e-07 ***
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 1427000 on 41 degrees of freedom
	Multiple R-squared:  0.5006,    Adjusted R-squared:  0.4885 
	F-statistic:  41.1 on 1 and 41 DF,  p-value: 1.125e-07
Mean NDVI by census tract vs. median household income in Champaign County, IL on 22 June 2021

Visualize Tracts

To get geographic context for the low NDVI tract, we order() the tracts by mean NDVI, take the first feature, create an OpenStreetMap base map for the area covered by the bounding box, and overlay the outline of the tract.

This low NDVI tract is the barren student ghetto (Campustown) to the west of the University of Illinois campus.

tracts = tracts[order(tracts$NDVI),]

low.tract = tracts[1,]

library(OpenStreetMap)

bbox = st_bbox(st_buffer(st_transform(low.tract, 4326), 0.001))
upperLeft = bbox[c("ymax", "xmin")]
lowerRight = bbox[c("ymin", "xmax")]

basemap = openmap(upperLeft, lowerRight)

plot(basemap)

plot(st_geometry(st_transform(low.tract, 3857)), lwd=3, add=T)
The Campustown low NDVI tract 4.01

The high NDVI tract can be visualized by reversing the sort. In this case, we use an aerial imagery from Bing as the basemap.

As might be expected, this is a sparsely-populated rural agricultural area in the southeast corner of the county.

tracts = tracts[order(tracts$NDVI, decreasing=T),]

high.tract = tracts[1,]

library(OpenStreetMap)

bbox = st_bbox(st_buffer(st_transform(high.tract, 4326), 0.001))
upperLeft = bbox[c("ymax", "xmin")]
lowerRight = bbox[c("ymin", "xmax")]

basemap = openmap(upperLeft, lowerRight, type="bing")

plot(basemap)

plot(st_geometry(st_transform(high.tract, 3857)), lwd=3, border="red", add=T)
High NDVI tract 108 in the rural southeast corner of Champaign County

Multiple Regression

A question we might ask would be, what mechanism ties income and neighborhood vegetation together?

To examine these questions, we can create a multiple regression model which considers additional variables that we hypothesize may be significant.

The results show that when population density (Pop.per.Square.Mile) is considered, it becomes significant, while median household income falls to the wayside. The percentage of homeowners is also presumably related since home ownership in this part of the state usually means a single family home on a lot with a lawn and trees, compared to renters (especially students) who live in dense, multi-family housing units with less surrounding vegetation.

This hints that the correlation between income and NDVI (at least in this part of the state) is indirect. The more money you have, the more space you can afford to occupy, and the more space you occupy, the more you will be surrounded by vegetation. However, this relationship is confounded in a largely low density area like Champaign County, poor people can also afford homes, which makes the income/vegetation relationship less clear than it might be in a more densely populated area.

model = lm(NDVI ~ Median.Household.Income + Pop.per.Square.Mile + 
	Median.Age + Percent.Pre.War.Units + Percent.Homeowners +
	Percent.Bachelors.Degree, weights=Total.Population, data=tracts)

summary(model)

	Coefficients:
				   Estimate Std. Error t value Pr(>|t|)    
	(Intercept)               2.678e-01  4.127e-02   6.488 1.55e-07 ***
	Median.Household.Income  -8.987e-07  6.107e-07  -1.472 0.149795    
	Pop.per.Square.Mile      -4.081e-06  8.961e-07  -4.555 5.80e-05 ***
	Median.Age               -3.266e-04  1.413e-03  -0.231 0.818572    
	Percent.Pre.War.Units    -1.750e-04  4.854e-04  -0.360 0.720592    
	Percent.Homeowners        2.324e-03  6.482e-04   3.585 0.000992 ***
	Percent.Bachelors.Degree  1.775e-03  1.147e-03   1.548 0.130356    
	
---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 2.767 on 36 degrees of freedom
	Multiple R-squared:  0.7811,    Adjusted R-squared:  0.7446 
	F-statistic: 21.41 on 6 and 36 DF,  p-value: 1.592e-10

Multicollinearity

An issue with the aforementioned interpretation of the model results is that multiple regression assumes the independent variables are, indeed, independent and not correlated. The presence of correlation in explanatory variables is called multicollinearity.

While multicollinearity will not affect model fit, it can make model parameters ambiguous and cause coefficients to be marked as statistically insignificant even though their presence improves model fit (Ghosh 2017).

The imcdiag() function from the mctest library can be used to perform multiple tests on the model to identify multicollinearity.

library(mctest)

model = lm(NDVI ~ Median.Household.Income + Pop.per.Square.Mile +
        Median.Age + Percent.Pre.War.Units + Percent.Homeowners +
        Percent.Bachelors.Degree, weights=Total.Population, data=tracts)

imcdiag(model)

	All Individual Multicollinearity Diagnostics Result

				    VIF    TOL      Wi      Fi Leamer    CVIF Klein
	Median.Household.Income  7.7600 0.1289 50.0239 64.2198 0.3590 -1.6275     1
	Pop.per.Square.Mile      2.1322 0.4690  8.3781 10.7557 0.6848 -0.4472     0
	Median.Age               3.9510 0.2531 21.8375 28.0347 0.5031 -0.8286     0
	Percent.Pre.War.Units    1.1029 0.9067  0.7618  0.9779 0.9522 -0.2313     0
	Percent.Homeowners       8.8169 0.1134 57.8449 74.2603 0.3368 -1.8491     1
	Percent.Bachelors.Degree 1.9967 0.5008  7.3758  9.4689 0.7077 -0.4188     0
				   IND1   IND2
	Median.Household.Income  0.0174 1.4406
	Pop.per.Square.Mile      0.0634 0.8781
	Median.Age               0.0342 1.2352
	Percent.Pre.War.Units    0.1225 0.1543
	Percent.Homeowners       0.0153 1.4662
	Percent.Bachelors.Degree 0.0677 0.8255

	1 --> COLLINEARITY is detected by the test 
	0 --> COLLINEARITY is not detected by the test

	Median.Household.Income , Median.Age , Percent.Pre.War.Units ,
	Percent.Bachelors.Degree , coefficient(s) are non-significant may be due to
	multicollinearity

If you do not know how to interpret the different test statistics, the easiest approach is to look for a one in the Klein's rule column (Imdadullah, Aslam, and Altaf 2016; Klein 1962).

In this case Median.Household.Income and Percent.Homeowners are flagged as collinear. While more complex techniques like factor analysis can be used to mitigate the multicollinearity, the simplest approach is to remove the flagged variables.

model = lm(NDVI ~ Pop.per.Square.Mile + Median.Age + Percent.Pre.War.Units + 
	Percent.Bachelors.Degree, weights=Total.Population, data=tracts)

imcdiag(model)

	All Individual Multicollinearity Diagnostics Result

				    VIF    TOL     Wi      Fi Leamer     CVIF Klein
	Pop.per.Square.Mile      1.5860 0.6305 7.6174 11.7191 0.7941 -10.7487     0
	Median.Age               1.3766 0.7264 4.8957  7.5319 0.8523  -9.3298     0
	Percent.Pre.War.Units    1.0204 0.9800 0.2648  0.4074 0.9900  -6.9155     0
	Percent.Bachelors.Degree 1.1878 0.8419 2.4409  3.7552 0.9176  -8.0500     0
				   IND1   IND2
	Pop.per.Square.Mile      0.0485 1.7999
	Median.Age               0.0559 1.3327
	Percent.Pre.War.Units    0.0754 0.0972
	Percent.Bachelors.Degree 0.0648 0.7701

	1 --> COLLINEARITY is detected by the test 
	0 --> COLLINEARITY is not detected by the test

	Percent.Pre.War.Units , Percent.Bachelors.Degree , coefficient(s) are
	non-significant may be due to multicollinearity

	R-square of y on all x: 0.6755 

summary(model)

	Weighted Residuals:
	    Min      1Q  Median      3Q     Max 
	-5.8128 -2.4011  0.2137  2.3086  5.3079 

	Coefficients:
				   Estimate Std. Error t value Pr(>|t|)    
	(Intercept)               2.292e-01  4.401e-02   5.207 6.93e-06 ***
	Pop.per.Square.Mile      -4.680e-06  8.759e-07  -5.343 4.53e-06 ***
	Median.Age                3.382e-03  1.012e-03   3.342  0.00188 ** 
	Percent.Pre.War.Units     1.931e-04  5.380e-04   0.359  0.72159    
	Percent.Bachelors.Degree  9.499e-04  1.036e-03   0.917  0.36511    
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 3.216 on 38 degrees of freedom
	Multiple R-squared:  0.6879,    Adjusted R-squared:  0.655 
	F-statistic: 20.94 on 4 and 38 DF,  p-value: 3.468e-09

Removing the statistically insigificant variables leaves only population density and median age in the model, which still has an R2 indicating a fairly good fit.

model = lm(NDVI ~ Pop.per.Square.Mile + Median.Age, 
	weights=Total.Population, data=tracts)

imcdiag(model)

	All Individual Multicollinearity Diagnostics Result

			       VIF    TOL      Wi  Fi Leamer      CVIF Klein   IND1
	Pop.per.Square.Mile 1.3646 0.7328 14.9482 Inf  0.856 -137.1094     0 0.0179
	Median.Age          1.3646 0.7328 14.9482 Inf  0.856 -137.1094     0 0.0179
			    IND2
	Pop.per.Square.Mile    1
	Median.Age             1

	1 --> COLLINEARITY is detected by the test 
	0 --> COLLINEARITY is not detected by the test

	* all coefficients have significant t-ratios

	R-square of y on all x: 0.6686 

Spatial Lag Regression

A second issue is that multiple regression assumes that variables are not autocorrelated with themselves. Because all of these variables are likely spatially autocorrelated (clustered in space), running the same variables in an spatial lag model can consider whether spatial autocorrelation affects the relative importance of the relationships.

In the model output, Rho is the parameter for autoregression. The p-values for the Liklihood Ratio (LR) test and Asymptotic standard error t-test indicate the significance of this parameter, with values under 0.05 indicating statistical significance. The low p-values indicate that autocorrelation is significant consideration with this data. (Anselin 2003).

However, the most notable change is that median age no longer significant in the spatial lag model. This leaves only population density as a significant predictor of average NDVI in a census tract.

library(spdep)

neighbors = poly2nb(as_Spatial(tracts))

weights = nb2listw(neighbors, zero.policy=T)

model = lagsarlm(NDVI ~ Pop.per.Square.Mile + Median.Age,
	data=tracts, listw=weights, tol.solve=1.0e-30, zero.policy=T)

summary(model)

	Coefficients: (asymptotic standard errors) 
			       Estimate  Std. Error z value  Pr(>|z|)
	(Intercept)          8.1738e-02  4.2607e-02  1.9184 0.0550577
	Pop.per.Square.Mile -2.6416e-06  7.0301e-07 -3.7576 0.0001716
	Median.Age           1.3295e-03  7.4975e-04  1.7732 0.0761953

	Rho: 0.66608, LR test value: 16.651, p-value: 4.4935e-05
	Asymptotic standard error: 0.1169
	    z-value: 5.698, p-value: 1.212e-08
	Wald statistic: 32.468, p-value: 1.212e-08

	Log likelihood: 80.25075 for lag model
	ML residual variance (sigma squared): 0.0012555, (sigma: 0.035433)
	Number of observations: 43 
	Number of parameters estimated: 5 
	AIC: -150.5, (AIC for lm: -135.85)
	LM test for residual autocorrelation
	test value: 0.15444, p-value: 0.69432

Principal Component Analysis

In dealing with multicollinearity above, our approach was to simply remove correlated variables from our model. However, in eliminating variables, we may have also removed important information. We also cannot know whether variables we never chose not to include could have been significant.

One technique for addressing those issues is principal component analysis (PCA), which uses linear algebra to reduce a matrix of all the possible variables to a smaller set of principal component variables. When used as part of factor analysis, PCA is used to find relationships between observed, correlated variables that point to unobserved variables (called factors or latent variables) that explain the broader relationships (Majumdar 2018).

This reduction in the number of variables (dimensions) needed for analysis is called dimensionality reduction. Because this technique makes a determination of the importance of different variables without direct human intervention or judgement, PCA can also be considered a form of unsupervised machine learning.

For this example, we use a set of census tract demographic variables that we hypothesize may affect NDVI, but which we have observed above as being correlated. These variables are passed as the right side of a formula to the prcomp() function.

The print() of the output of the prcomp() function lists the calculated principal components along the top (PC1 - PC6) along with the percent of each variable that contributes to that component. These can be interpreted to infer patterns of relationships that hint at latent variables.

pca = prcomp(~ Median.Household.Income + Pop.per.Square.Mile + 
	Median.Age + Percent.Pre.War.Units + Percent.Homeowners +
	Percent.Bachelors.Degree, data=tracts, scale=T)

print(pca)

	Standard deviations (1, .., p=6):
	[1] 1.7817884 1.1120335 0.9529559 0.6684303 0.4106193 0.2551065

	Rotation (n x k) = (6 x 6):
					 PC1         PC2        PC3         PC4
	Median.Household.Income   0.50453008 -0.29820975 0.11966764 -0.09439929
	Pop.per.Square.Mile      -0.42854455 -0.27840409 0.16474412  0.79941428
	Median.Age                0.50489392 -0.12534918 0.03310023  0.42237713
	Percent.Pre.War.Units     0.03034971  0.53816646 0.83838676  0.02159398
	Percent.Homeowners        0.53538904 -0.04247929 0.07006712  0.24128032
	Percent.Bachelors.Degree -0.13899124 -0.72554949 0.49964152 -0.33902899
					  PC5         PC6
	Median.Household.Income  -0.478345209  0.63598320
	Pop.per.Square.Mile      -0.244642283  0.11307993
	Median.Age                0.727370616  0.14423463
	Percent.Pre.War.Units     0.005540119  0.07788716
	Percent.Homeowners       -0.381645934 -0.70906620
	Percent.Bachelors.Degree  0.191268633 -0.23042000

We can test the statistical significance of these principal components as predictors of NDVI by creating a multiple regression model using the lm() function.

The listing of coefficients indicates that the wealth component is the strongest, followed by high income renters, and population density.

	pca.data = as.data.frame(predict(pca, st_drop_geometry(tracts)))

	pca.data$NDVI = tracts$NDVI

	model = lm(NDVI ~ ., data=pca.data)

	summary(model)

	Residuals:
	      Min        1Q    Median        3Q       Max 
	-0.074877 -0.026456 -0.002282  0.024347  0.107067 

	Coefficients:
		      Estimate Std. Error t value Pr(>|t|)    
	(Intercept)  0.3385591  0.0063652  53.189  < 2e-16 ***
	PC1          0.0363484  0.0036147  10.056 5.35e-12 ***
	PC2          0.0049857  0.0057917   0.861  0.39502    
	PC3         -0.0002518  0.0067585  -0.037  0.97048    
	PC4         -0.0241218  0.0096353  -2.503  0.01697 *  
	PC5          0.0014034  0.0156850   0.089  0.92920    
	PC6         -0.0781579  0.0252465  -3.096  0.00379 ** 
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 0.04174 on 36 degrees of freedom
	Multiple R-squared:  0.7658,    Adjusted R-squared:  0.7268 
	F-statistic: 19.62 on 6 and 36 DF,  p-value: 5.166e-10

Since the principal components still represent spatial data, we should try spatial lag regression to verify that those variables are still significant when compensating for spatial autocorrelation.

Although the p-values for the Liklihood Ratio (LR) test and Asymptotic standard error t-test indicate autocorrelation is significant (both below 0.05), the p-values on the components largely match the significance observed above.

library(spdep)

neighbors = poly2nb(as_Spatial(tracts))

weights = nb2listw(neighbors, zero.policy=T)

model = lagsarlm(NDVI ~ ., data=pca.data, listw=weights, tol.solve=1.0e-30, zero.policy=T)

summary(model)

	Type: lag 
	Coefficients: (asymptotic standard errors) 
		      Estimate Std. Error z value  Pr(<|z|)
	(Intercept)  0.1431644  0.0441071  3.2458  0.001171
	PC1          0.0207942  0.0039552  5.2575 1.461e-07
	PC2          0.0055409  0.0044463  1.2462  0.212693
	PC3          0.0026783  0.0052533  0.5098  0.610175
	PC4         -0.0172517  0.0074813 -2.3060  0.021111
	PC5          0.0052327  0.0120898  0.4328  0.665145
	PC6         -0.0593586  0.0193892 -3.0614  0.002203

	Rho: 0.57636, LR test value: 11.936, p-value: 0.00055063
	Asymptotic standard error: 0.12802
	    z-value: 4.5021, p-value: 6.7284e-06
	Wald statistic: 20.269, p-value: 6.7284e-06

	Log likelihood: 85.35492 for lag model
	ML residual variance (sigma squared): 0.001023, (sigma: 0.031984)
	Number of observations: 43 
	Number of parameters estimated: 9 
	AIC: -152.71, (AIC for lm: -142.77)
	LM test for residual autocorrelation
	test value: 10.882, p-value: 0.00097091