Regression Analysis in ArcGIS Pro

Regression is "a functional relationship between two or more correlated variables that is often empirically determined from data and is used especially to predict values of one variable when given values of the others" (Merriam-Webster 2022).

A variety of different regression techniques are used in statistical analysis. Regression with geospatial data is used to analyze and model the relationships between different social and/or environmental characteristics across different locations.

There are two broad types of motivation for regression modeling (Sainani 2014):

  1. Explanatory modeling seeks to find meaningful relationships that can improve understanding of an analyzed phenomenon and inform decision making.
  2. Predictive modeling seeks to find a combination of factors that can predict a phenomenon in a different location or at a time in the future.

One way to evaluate the relationship of multiple factors with an effect is the use of multiple regression, which creates a mathematical model that combines multiple independent variables in a simple linear formula to model the values of a dependent variable (Hayes 2022).

Adding variables beyond simple bivariate correlation can improve both the explanatory and predictive value of a model.

y = β0 + β1x1 + β2x2 + ... + ε

Regression formulas are often written in matrix form to simplify the notation.

y = βX + ε

This tutorial covers regression analysis of geospatial data in ArcGIS Pro using the mmregression tool.

Data Preparation

Model Variables

The first step in performing regression is selecting a dependent variable that represents the phenomenon you are analyzing, and independent variables that represent the factors that influence that phenomenon.

A common challenge with any kind of empirical research is finding available data from a trusted, authoritative source.

The example multiple regression analysis used in this tutorial will investigate the energy and economic indicators associated with levels of democracy.

Deductive research methodology commonly starts with a literature review of past research to inform variable choices.

Accordingly, a set of independent variables that could be hypothesized to be related to levels of democracy could include:

Import the Data

Data from external sources should be imported into the project geodatabase, which is a default file database that is a part of every ArcGIS Pro project.

Storing data in the project geodatabase rather than accessing data from external shapefiles or feature services will improve the speed of your analysis, ensure reliable access to needed data, and make the analysis reproducible.

  1. Give the default map a meaningful name (Dependent Variable Map)
  2. Use the Export Features tool to copy the data from the feature service into a new feature class in the project geodatabase so the data is stable, available, and suitable for analysis (Energy).
  3. Symbolize by the dependent variable (Democracy_Index).
Importing the world energy indicators data into the project geodatabase

Examine the Data

You can view the available data and fields in the Attribute Table to find out what is in your data.

Fields View offers data type information for each field.

Viewing the attribute table and field information

Join and Subset the Data

If your data comes from multiple sources, you will need to join the data into a single feature class for analysis with the regression tools. Joins are described in the tutorial Joins in ArcGIS Pro.

If you need to isolate data for a specific area or characteristic, you may need to subset your data. Subsetting is described in the tutorial Creating Subsets of Data in ArcGIS Pro.

Normalization

When performing regression, you should generally make sure that all your variables are either derived or fundamental variables. In this tutorial example, the dependent variable is an index (derived) so derived variables like percentages or per capita should be used rather than fundamental values like population or energy totals.

Normalization is "the process of taking a count and dividing it by something else in order to make a number more comparable or to put it in context" (Funk 2024). Normalization converts fundamental geospatial variables into derived variables that can be compared across areas and used in regression when the dependent variable is also derived.

For this example, if you want to use a fundamental variable like Oil_Production_Mbd (average petroleum production rate in millions of barrels per day) or Oil_Reserves_B_Barrels (total known commercially viable resource capacity in billions of barrels) you can normalize the value by population to get a per capita value, or by area to get a per square kilometer value. Because this is social analysis, normalizing by Population is probably most appropriate.

The Calculate Field tool can be used to calculate a normalized field. This example creates an Oil_Production_per_Capita field by dividing Oil_Production_Mbd by Population and multiplying by 1,000,000 to convert millions of barrels to barrels so the number is more readable.

Creating normalized variables
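
The per capita calculation described above can be sketched in plain Python. This is an illustration of the arithmetic only, not the tool's implementation: the function name and sample numbers are hypothetical, and the result is in barrels per day per person. In the Calculate Field tool, an equivalent Python 3 expression would probably look like !Oil_Production_Mbd! * 1000000 / !Population!.

```python
def per_capita_barrels(oil_production_mbd, population):
    """Convert production in millions of barrels per day to barrels per day per person."""
    if oil_production_mbd is None or population is None or population <= 0:
        return None  # no meaningful per capita value without valid inputs
    return oil_production_mbd * 1_000_000 / population


# Hypothetical values for illustration only (not taken from the tutorial data).
print(per_capita_barrels(2.0, 40_000_000))  # 0.05
```

Returning None for missing or zero population keeps invalid rows from producing division errors, which mirrors how null values are usually left null in a calculated field.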

Multiple Regression

mmregression

This tutorial uses a Python tool called mmregression that conveniently incorporates common spatial regression operations into a single tool. While ArcGIS Pro provides native tools to perform most of these operations, chaining the tools together is cumbersome, and ArcGIS Pro has no tools that can perform spatial lag or spatial error regression.

mmregression has seven parameters:

To incorporate mmregression into a project:

  1. Download the mmregression.atbx toolbox file.
  2. Move the toolbox file into your project directory. The exact location on your machine will vary depending on your configuration, but project directories are usually located in your Documents\ArcGIS\Projects folder.
  3. In your ArcGIS Pro Catalog pane, right click on Toolboxes, select Add Toolbox and navigate to the toolbox you copied into your project directory.
  4. Expand the contents of the new toolbox and double click on the mmregression tool to open it.
  5. You can drag the tool into a ModelBuilder diagram so that you can preserve your tool parameters if you want to re-run or change them in the future.
Adding mmregression to a project

Transformation

One of the assumptions of linear regression is that the variables are normally distributed, and running linear regression on non-normal variables can result in coefficients that are unreliable.

Skew is the direction that the distribution is shifted away from normal. The direction left or right is determined by the side that has the longer tail.

Types of skew
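
As a numeric complement to reading a histogram, skew can be summarized with a moment-based skewness coefficient. The sketch below assumes the population (Fisher-Pearson) definition; the function name and sample lists are hypothetical. A positive result indicates a longer right tail and a negative result indicates a longer left tail.

```python
from statistics import mean


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # variance
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5


right_skewed = [1, 1, 2, 2, 3, 20]         # long tail to the right
left_skewed = [-v for v in right_skewed]   # mirror image, long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0, skewness(left_skewed) < 0, skewness(symmetric))
```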

A transformation is a modification to variable values that brings the variable distribution closer to normality.

Skewed vs. transformed variable

The Box-Cox transformation is a general-purpose transformation that can be used with left-skewed, right-skewed, or unskewed data.

Box Cox transformation
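
The standard Box-Cox formula is (x^λ - 1) / λ when λ is not zero, and ln(x) when λ is zero. The sketch below applies that formula with a λ chosen by hand; it is an assumption that the tools used in this tutorial estimate λ automatically, and the function name is hypothetical. Box-Cox is only defined for strictly positive values.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transformation with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]        # limiting case as lambda approaches 0
    return [(v ** lam - 1) / lam for v in values]   # power transformation


# Hypothetical right-skewed values; lambda = 0 (natural log) pulls in the long right tail.
print(box_cox([1.0, 10.0, 100.0, 1000.0], 0))
```

Values of λ below 1 compress a long right tail, λ = 1 only shifts the data, and values above 1 compress a long left tail.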

You can view a histogram chart for a variable to visually assess skew.

Comparing histograms for transformed and untransformed distributions

Standardization

Standardization involves converting different variables to a common scale so that coefficients for variables measuring different phenomena with different units can then be compared to each other.

One common standardization scale is z-score, which is the number of standard deviations that each individual value differs from the mean for the variable.

If you check the Standardize box, mmregression will standardize all variables to z-scores. If you transform your variables, you will probably also want to standardize them since the ranges of the transformed values may bear little resemblance to the units used for the untransformed variables.

Z-score standardization
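
The z-score calculation can be sketched as follows. The example assumes the population standard deviation; whether mmregression uses the population or sample form is not stated here, and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as the number of standard deviations it lies from the mean."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        return [0.0 for _ in values]  # constant variable: every value sits at the mean
    return [(v - m) / s for v in values]


# Hypothetical values with mean 5 and standard deviation 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```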

Model Evaluation

After you run the mmregression tool, the messages contain summary information about the created models for evaluation.

The first summary in the output is for a non-spatial ordinary least squares (OLS) model.

Ordinary least squares regression is a multiple regression technique that estimates regression coefficients by minimizing the sum of the squares of the differences between dependent variable values and the results of the regression model (Wikipedia 2023).
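
To make the least-squares idea concrete, the sketch below estimates coefficients by solving the normal equations (X'X)b = X'y with Gauss-Jordan elimination. It is a teaching sketch using only the standard library, not the method used internally by mmregression or ArcGIS Pro; the function names and the toy data are hypothetical.

```python
def ols_fit(rows, y):
    """Return [b0, b1, ...] minimizing the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for the intercept
    k = len(X[0])
    # Augmented matrix [X'X | X'y] for the normal equations.
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("independent variables are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def sum_squared_residuals(rows, y, coeffs):
    """Sum of squared differences between actual and modeled values."""
    modeled = [coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], r)) for r in rows]
    return sum((yi - mi) ** 2 for yi, mi in zip(y, modeled))


# Toy data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, -2, 0, 2]
coeffs = ols_fit(rows, y)
print(coeffs, sum_squared_residuals(rows, y, coeffs))
```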

Models can be evaluated based on a combination of the following criteria. These criteria are especially relevant when evaluating the candidate models created with exploratory regression.

Model Fit

Model fit is evaluated with adjusted R-squared (R²).

The adjusted value (Adj. R-squared) is preferred to the unadjusted value (R-squared) because the adjusted value compensates for improvements in mathematical fit that will occur with the addition of more variables, even when the additional variables add no meaningful improvement to explanatory or predictive value of the model.

In the OLS model output below, the adjusted R-squared value of 0.448 indicates a fairly strong fit for this social science model.

Adjusted R-squared
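
The adjustment can be written as 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of features and k is the number of independent variables. The sketch below uses hypothetical numbers (not the tutorial's model) to show that adding a variable without improving R² lowers the adjusted value.

```python
def r_squared(actual, modeled):
    """Share of the variance in the actual values that the model reproduces."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - m) ** 2 for a, m in zip(actual, modeled))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of independent variables k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: the same R-squared of 0.5 with two versus three variables.
print(adjusted_r_squared(0.5, 11, 2))  # 0.375
print(adjusted_r_squared(0.5, 11, 3) < adjusted_r_squared(0.5, 11, 2))  # True
```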

Information Loss

Models are simplifications that result in the loss of information available in the variables.

AIC
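
The Akaike information criterion (AIC) is generally defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; lower values indicate less information loss. The sketch below assumes a least-squares model with normally distributed errors. Software packages differ in which parameters they count and which constants they keep, so these numbers should not be expected to match the mmregression output exactly, and the function names are hypothetical.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a least-squares fit with normal errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(log_likelihood, n_params):
    """Akaike information criterion: fit is rewarded, extra parameters are penalized."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical comparison: same data size and parameters, smaller residual sum of squares.
better = aic(gaussian_log_likelihood(10.0, 50), 3)
worse = aic(gaussian_log_likelihood(20.0, 50), 3)
print(better < worse)  # True
```

AIC values are only meaningful relative to each other for models fit to the same dependent variable and the same set of features.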

Multicollinearity

Multicollinearity is when two or more independent variables in a regression model are correlated with each other in a way that biases the model coefficients and makes them unreliable.

While multicollinearity will not affect model fit, it can make model parameters ambiguous and cause coefficients to be marked as statistically insignificant even though their presence improves model fit (Ghosh 2017).

Variance inflation factor (VIF) is a value that can be used to identify independent variables that are correlated with each other.

In the model output below, both GDP per capita and agriculture as a percent of GDP have high VIF values, indicating that they are (negatively) correlated. Countries that are highly dependent on agriculture tend to be poorer since industrial and knowledge-based activities are generally more highly valued financially.

You would probably want to remove the agriculture variable since the primary concern is the effect of affluence on levels of democracy.

Multicollinearity analysis with VIF
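
The VIF for an independent variable is 1 / (1 - R²), where R² comes from regressing that variable on the other independent variables. With only two independent variables, that R² is the squared Pearson correlation between them, which keeps the sketch below short. The function names and sample values are hypothetical, and a model with more variables would need a full regression for each variable.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both variables in a model with exactly two independent variables."""
    r2 = pearson_r(x1, x2) ** 2
    if 1 - r2 <= 0:
        return math.inf  # perfectly collinear variables
    return 1 / (1 - r2)


# Hypothetical values: nearly collinear versus uncorrelated pairs.
print(vif_two_predictors([1, 2, 3, 4], [2, 4, 6, 9]))
print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))  # 1.0
```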

Coefficients

Model coefficients indicate how differences in independent variable values are reflected in changes in the modeled dependent variable.

When using transformed z-scores (the default in mmregression), the coefficients indicate the relative importance of each factor to the model.

The probability (P>|t|) values beside each explanatory variable indicate the probability that any effect of this variable on the response could be attributed to randomness.

In the example below:

OLS results

Residuals

Residuals are the differences between modeled values and actual values.

Residuals = Actual - Modeled

Model misspecification is when the variables provided for the model are inadequate or inappropriate for meaningfully modeling the analyzed phenomena. With regression models, residuals are useful for identifying misspecification.

When working with geospatial data, you can review the map of the residuals to identify areas with high or low residuals that may indicate what variable(s) are missing.

The output feature class created by mmregression contains an olsresiduals field with the residuals from the non-spatial OLS regression model.

In this case, the low residuals in Russia, Eastern Europe, and the Middle East (overprediction of democracy) and the high residuals (underprediction of democracy) in the Americas hint at social characteristics or historical legacies in those areas that are not captured in the specified independent variables.

Mapping residuals

Autocorrelation

Another assumption of linear regression is that the observations are independent of each other, so the model residuals should not be autocorrelated.

You can detect whether spatial autocorrelation is an issue with your model by using the Spatial Autocorrelation (Global Moran's I) tool to identify spatial autocorrelation in model residuals. This statistic was originally developed by the Australian statistician Patrick Alfred Pierce Moran (1950).

To evaluate autocorrelation using Moran's I in ArcGIS Pro:

The report will be an HTML file placed in the project directory.

In this example, the high Moran's I value (0.22) and very high Z-score (7.86) confirm the autocorrelation visible in the residuals map and indicate that the model coefficients are unreliable.

Examining residuals autocorrelation
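
For intuition about what the tool reports, global Moran's I can be computed as (n / S0) × Σ w_ij (x_i - mean)(x_j - mean) / Σ (x_i - mean)², where S0 is the sum of all weights. The sketch below assumes binary adjacency weights stored in a dictionary and omits the Z-score and p-value that the ArcGIS Pro tool also reports; the function name and the four-feature example are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I. weights maps (i, j) index pairs to the weight w_ij."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / s0) * cross / sum(d * d for d in dev)


# Four hypothetical features in a row; each is a neighbor of the adjacent features.
chain = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
clustered = morans_i([1, 1, 5, 5], chain)     # similar values sit together
alternating = morans_i([1, 5, 1, 5], chain)   # neighbors always differ
print(clustered, alternating)
```

Positive values indicate clustering of similar values, negative values indicate that neighbors tend to differ, and values near zero suggest no spatial pattern.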

Spatial Regression

Spatial regression models incorporate variables that compensate for spatial autocorrelation so the model coefficients and outputs are more trustworthy.

Neighbors

To address autocorrelation in a model and its outputs, you must first define which features count as neighbors, so that each feature's values can be compared with the values of its neighbors. Spatial lag is the autocorrelation of values across groups of neighboring features.

k = 5 nearest neighbors
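
A k-nearest-neighbors definition and the resulting lagged values can be sketched as follows. The example assumes planar (projected) point coordinates, straight-line distance, and row-standardized weights, so each lagged value is simply the mean of the neighbors' values; the function names, coordinates, and k = 2 are hypothetical choices for a tiny data set.

```python
import math


def k_nearest_neighbors(points, k):
    """For each point, return the indices of its k closest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors


def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: the mean value of each feature's neighbors."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]


# Hypothetical feature centroids and values.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
values = [10, 20, 30, 100]
neighbors = k_nearest_neighbors(points, k=2)
print(neighbors)
print(spatial_lag(values, neighbors))
```

Unlike a distance threshold, k nearest neighbors guarantees that isolated features such as islands still have neighbors.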

Spatial Lag Regression

Spatial lag regression performs multiple regression modeling by adding a model term that considers spatial lag of the dependent variable across neighboring areas (Sparks 2015).

y = βX + ρWy + ε

In this case the AIC value for the spatial lag model (336.5) is lower than the AIC for the non-spatial OLS model (375.1), indicating that the spatial lag model is a better choice than the non-spatial OLS model.

Spatial lag model output

The lagged dependent variable (ρWy) is included in the output feature class as the LAGVAR field, and you can map it to see the smoothed lag effect on the spatial distribution.

Spatially lagged dependent variable

Notably, the spatial lag model significantly reduces autocorrelation in the residuals (lagresiduals).

Spatial lag Moran's I

Spatial Error Regression

In contrast to the focus of spatial lag regression on autocorrelation in the dependent variable, spatial error regression models spatial interactions in the independent variables that are reflected in the residuals (Eilers 2019).

y = βX + λWu + ε

In this case the AIC value for the spatial error model (329.8) is lower than the AIC for the non-spatial OLS model (375.1) or the spatial lag model (336.5), indicating that the spatial error model is a better choice than either of the other two models.

Spatial error model output

The LAGERR field with the lagged residuals is included in the output feature class, and you can map it to see the smoothed lag effect on the spatial distribution of the olsresiduals field.

Spatially lagged residuals

The Moran's I of the residuals indicates an absence of spatial autocorrelation, which makes the spatial error model acceptable.

Spatial error model residuals Moran's I

Exploratory Regression

Two common approaches for choosing from a pool of possible variables, as codified by Hocking (1976), are:

Both forward selection and backward elimination can be tedious to perform manually.

An automated approach to variable selection and elimination is exploratory regression, which involves trying all possible combinations of explanatory variables to find the best models.
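
The number of candidate models grows quickly with the number of explanatory variables, which is why the search is automated. The sketch below only enumerates the combinations; it does not fit or rank the models, and it is not how the ArcGIS Pro tool is implemented. The function name and the field names other than Oil_Production_per_Capita are hypothetical.

```python
from itertools import combinations


def candidate_models(variables, min_size=1, max_size=None):
    """Yield every combination of explanatory variables within the size limits."""
    max_size = len(variables) if max_size is None else max_size
    for size in range(min_size, max_size + 1):
        for combo in combinations(variables, size):
            yield combo


# Hypothetical field names for illustration.
fields = [
    "Oil_Production_per_Capita",
    "GDP_per_Capita",
    "Agriculture_Percent_GDP",
    "Renewable_Percent",
]
models = list(candidate_models(fields))
print(len(models))  # 15
```

With n variables there are 2^n - 1 non-empty combinations, so limiting the maximum number of variables per model keeps run times manageable.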

Exploratory regression violates the philosophy behind the (deductive) classical scientific method where you begin with a hypothesis and then use your models to test your hypothesis. Finding a set of variables that fit a particular data set but do not model other data sets well (overfitting) leads away from analysis of fundamental processes you are trying to understand.

However, in situations where those fundamental processes are not well understood, inductive analysis with tools like exploratory regression can be useful for giving new insights that inform the development of hypotheses that can then be tested on other data sets (ESRI 2023).

Exploratory regression is implemented in the Exploratory Regression tool.

This tool may take a few minutes to run depending on the size of your data set and the number of independent variables you are exploring.

Find your report file and open it. This is a text file that by default will open in Notepad. The explored models are listed from the top in order of the number of variables explored.

Exploratory regression in ModelBuilder
Exploratory regression results

Theoretical Logic

With exploratory regression we are seeking independent variables that offer some theoretical logic that explains the phenomenon represented by the dependent variable.

Following from the literature review and the quantitative criteria above, the four-variable model from the exploratory regression that best matches is:

All of these indicators are proxies for post-industrial development, consistent with the literature asserting that democracy and development go hand in hand.

Variable Significance

A table at the bottom of the exploratory regression results lists the percentage of explored models where each of the explored variables was found to be significant. These results can be used when deciding which variables to consider important or unimportant as you explore your data further.

Figure
Exploratory regression variable significance table

Geographically Weighted Regression

A second major challenge with spatial data is spatial heterogeneity (also called regional variation or nonstationarity), where the processes you are analyzing vary in different parts of the study area.

One exploratory technique for addressing this issue is geographically weighted regression, which builds multiple small models across the features that consider neighboring values in the regression.

In ArcGIS Pro, the Geographically Weighted Regression (GWR) tool can be used to perform geographically weighted multiple regression.

  1. Input Feature Class: Energy
  2. Dependent Variable: Democracy_Index_Z_Score
  3. Model Type: Continuous (Gaussian)
  4. Explanatory Variables: See above
  5. Output Features: Energy_GWR
  6. Neighborhood Type: Distance band
  7. Neighborhood Selection Method: Golden search
Geographically weighted regression

Viewing the coefficients in the output features shows how coefficients vary over space for the different independent variables.

Appendix: Regression with Native ArcGIS Pro Tools

ArcGIS Pro has native tools for performing standardization, transformation, and OLS regression. However, these tools are cumbersome to chain together and ArcGIS Pro does not currently have tools for performing spatial lag or spatial error regression.

ModelBuilder

While you can run the tools needed for data preparation and regression independently, it can be helpful to use a ModelBuilder diagram that will save you from repeated typing as you iterate through versions of the model. Use of ModelBuilder also preserves your workflow so that you can debug and reproduce your analysis in the future.

One unfortunate aspect of using ModelBuilder is that some of these tools modify feature classes rather than producing new feature classes which can be simply daisy chained. This will require additional effort to build the diagrams iteratively.

Creating a blank ModelBuilder diagram

Transform Field

You can use the Transform Field tool to transform skewed variables.

Transformation to approximate normality

Standardize Field

The Standardize Field tool can be used to standardize model variables.

Standardization

Unique ID Field

The OLS regression tool requires that each feature have a unique integer ID number, but ArcGIS Pro for some inexplicable reason hides the OBJECTID field available in all feature classes.

To copy the OBJECTID field to a new long integer FEATUREID field, you will need to use the Calculate Field tool.

Adding a FEATUREID field

OLS Regression Tool

  1. Under Analysis and Tools, add the Ordinary Least Squares tool to your model.
  2. Right click on the Output Feature Class and select Add to Display so the tool will add a new layer symbolized by the regression residuals.
  3. In the Catalog Pane, duplicate your dependent variable map and rename it (Residuals Map).
Ordinary least squares regression