Regression Analysis in ArcGIS Pro
Regression is "a functional relationship between two or more correlated variables that is often empirically determined from data and is used especially to predict values of one variable when given values of the others" (Merriam-Webster 2022).
A variety of different regression techniques are used in statistical analysis. Regression with geospatial data is used to analyze and model the relationships between different social and/or environmental characteristics across different locations.
There are two broad types of motivation for regression modeling (Sainani 2014):
- Explanatory modeling seeks to find meaningful relationships that can improve understanding of an analyzed phenomenon and inform decision making.
- Predictive modeling seeks to find a combination of factors that can predict a phenomenon in a different location or at a time in the future.
One way to evaluate the relationship of multiple factors with an effect is the use of multiple regression, which creates a mathematical model that combines multiple independent variables in a simple linear formula to model the values of a dependent variable (Hayes 2022).
Adding additional variables beyond simple bivariate correlation can improve both the explanatory and predictive value of a model.
y = β0 + β1x1 + β2x2 + ... + βnxn + ε
- y is the dependent variable.
- xn are the independent variables.
- βn are the regression coefficients estimated from the data.
- ε is the error term (residuals).
Regression formulas are often written in matrix form to simplify the notation.
y = Xβ + ε
- y is the dependent variable.
- β is a vector of regression coefficients estimated from the data.
- X is a matrix of independent variables.
- ε is the error term (residuals).
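The matrix formula can be illustrated with a short NumPy sketch on synthetic data (the values here are invented for illustration; mmregression performs this estimation for you):

```python
import numpy as np

# Synthetic data: y = 2 + 0.5*x1 - 1.5*x2 + noise
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix X with a leading column of ones for the intercept (beta0)
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimate of the coefficient vector beta
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals (the error term epsilon)
residuals = y - X @ beta

print(beta)  # approximately [2.0, 0.5, -1.5]
```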
This tutorial covers regression analysis of geospatial data in ArcGIS Pro using the mmregression tool.
Data Preparation
Model Variables
The first step in performing regression is selecting a dependent variable that represents the phenomenon you are analyzing, and independent variables representing the factors that influence that phenomenon.
A common challenge with any kind of empirical research is finding available data from a trusted, authoritative source.
The example multiple regression analysis used in this tutorial will investigate the energy and economic indicators associated with levels of democracy.
- The data used for this tutorial is country-level energy and socioeconomic data collected by the U.S. Energy Information Administration (EIA), The World Bank, and The V-Dem Institute.
- The data is available as the Minn 2019 World Energy Indicators feature service from the U of I ArcGIS Online organization. A GeoJSON file and metadata are also available here.
- The dependent variable for this example will be the level of democratic governance as quantified by the V-Dem liberal democracy index which ranges from zero (autocracy) to one (liberal democracy).
- The appropriate approach to selecting which independent variables to include in a multiple regression model is contested. The pool of possible variables is often created based on the rationally hypothesized contributions that they could make to the model, which mitigates the possibility of spurious correlations and overfitting.
Deductive research methodology commonly starts with a literature review of past research to inform variable choices.
- There is a long-observed association between economic development and democracy (Lipset 1959, Burkhart and Lewis-Beck 1994, Dahl 1989).
- There is an observed connection between GDP, energy use, and trade (Dedeoğlu and Kaya 2013)
- Energy intensity is inversely related to economic growth (Deichmann, Reuter, Vollmer, and Zhang 2019).
- The resource curse hypothesis asserts that countries with an abundance of oil, mineral or other natural resource wealth often have poorer economic and political environments than countries with fewer natural resources (Ross 1999).
- The relationship between agricultural productivity and democracy is nonlinear (Ang, Fredriksson, and Gupta 2018).
Accordingly, a set of independent variables that could be hypothesized to be related to levels of democracy could include:
- GDP_per_Capita_PPP_Dollars
- MJ_per_Dollar_GDP
- Resource_Rent_Percent_GDP
- Agriculture_Percent_GDP
Import the Data
Data from external sources should be imported into the project geodatabase, the default file geodatabase that is part of every ArcGIS Pro project.
Storing data in the project geodatabase rather than accessing data from external shapefiles or feature services will improve the speed of your analysis, assure reliable access to needed data, and make the analysis reproducible.
- Give the default map a meaningful name (Dependent Variable Map)
- Use the Export Features tool to copy the data from the feature service into a new feature class in the project geodatabase so the data is stable, available, and suitable for analysis (Energy).
- Symbolize by the dependent variable (Democracy_Index).
Examine the Data
You can view the available data and fields in the Attribute Table to find out what is in your data.
Fields View offers data type information for each field.
Join and Subset the Data
If your data comes from multiple sources, you will need to join the data into a single feature class for analysis with the regression tools. Joins are described in the tutorial Joins in ArcGIS Pro.
If you need to isolate a data for a specific area or characteristic, you may need to subset your data. Subsetting is described in the tutorial Creating Subsets of Data in ArcGIS Pro.
Normalization
When performing regression, you should generally make sure that all your variables are either derived or fundamental variables. In this tutorial example, the dependent variable is an index (derived) so derived variables like percentages or per capita should be used rather than fundamental values like population or energy totals.
Normalization is "the process of taking a count and dividing it by something else in order to make a number more comparable or to put it in context" (Funk 2024). Normalization converts fundamental geospatial variables into derived variables that can be compared across areas and used in regression when the dependent variable is also derived.
For this example, if you want to use a fundamental variable like Oil_Production_Mbd (average petroleum production rate in millions of barrels per day) or Oil_Reserves_B_Barrels (total known commercially viable resource capacity in billions of barrels) you can normalize the value by population to get a per capita value, or by area to get a per square kilometer value. Because this is social analysis, normalizing by Population is probably most appropriate.
The Calculate Field tool can be used to calculate a normalized field. This example creates an Oil_Production_per_Capita field by dividing Oil_Production_Mbd by Population and multiplying by 1,000,000 to convert millions of barrels to barrels so the number is more readable.
- Input Table: Energy
- Field Name: Oil_Production_per_Capita
- Field Type: Double
- Expression: !Oil_Production_Mbd! * 1000000 / !Population!
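The arithmetic behind the Calculate Field expression can be checked in plain Python (the figures below are invented for illustration, not real EIA values):

```python
# The same arithmetic as the Calculate Field expression above.
oil_production_mbd = 11.0   # millions of barrels per day (illustrative)
population = 330_000_000    # illustrative

# Multiply by 1,000,000 to convert millions of barrels to barrels
oil_production_per_capita = oil_production_mbd * 1_000_000 / population

print(round(oil_production_per_capita, 4))  # barrels per day per person
```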
Multiple Regression
mmregression
This tutorial uses a Python tool called mmregression that conveniently incorporates common spatial regression operations into a single tool. While ArcGIS Pro provides native tools to perform most of these operations, chaining the tools together is cumbersome, and ArcGIS Pro has no tools that can perform spatial regression.
mmregression has seven parameters:
- Input Feature Class: The feature class containing all variables that will be used for regression.
- Output Feature Class: The feature class created by the tool that will contain the model residuals as well as diagnostic values for analyzing the operation of the tool.
- Dependent Variable: The field in the Input Feature Class for the dependent variable.
- Independent Variables: The fields in the Input Feature Class for the independent variables.
- Standardize: A flag indicating whether to standardize variables to z-scores before running regression (see below)
- Transform: A flag indicating whether to perform a Box Cox transformation on the variables before running regression (see below)
- K Neighbors: A numeric value indicating the number of nearest neighbors (K) to use for calculating lagged values in spatial regression (see below)
To incorporate mmregression into a project:
- Download the mmregression.atbx toolbox file.
- Move the toolbox file into your project directory. The exact location on your machine will vary depending on your configuration, but project directories are usually located in your Documents\ArcGIS\Projects folder.
- In your ArcGIS Pro Catalog pane, right click on Toolboxes, select Add Toolbox and navigate to the toolbox you copied into your project directory.
- Expand the contents of the new toolbox and double click on the mmregression tool to open it.
- You can drag the tool into a ModelBuilder diagram so that you can preserve your tool parameters if you want to re-run or change them in the future.
Transformation
One of the assumptions of linear regression is that the variables are normally distributed, and running linear regression on non-normal variables can result in coefficients that are unreliable.
Skew is the direction that the distribution is shifted away from normal. The direction left or right is determined by the side that has the longer tail.
A transformation is a modification to variable values that brings the variable distribution closer to normality.
The Box Cox transformation is a general-purpose transformation that can be used with left-skewed, right-skewed, or unskewed data.
- When you select the Transform check box in mmregression, the tool will use the Box Cox transformation.
- λ (lambda) is a parameter for the Box Cox transformation that ranges from -5 to +5 and indicates the direction and extent of the transformation.
- λ is calculated by the software to transform to a distribution that is as close to normal as possible.
- λ = 1 means no transformation.
- λ = 0 uses the logarithmic transformation commonly used with right-skewed data.
- The transformation can only handle positive values, but the software will add an offset before transformation so that the lowest value is always greater than zero.
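The transformation formula itself is simple, and can be sketched in NumPy (a minimal illustration with invented data; mmregression estimates λ and applies the offset for you):

```python
import numpy as np

def box_cox(x, lam):
    """Box Cox transformation for strictly positive x.
    lam = 1 leaves the shape unchanged (only shifts values by 1);
    lam = 0 is the logarithmic transform."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Right-skewed sample containing a zero; an offset keeps all values positive
data = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 64.0])
offset = 1.0 - data.min() if data.min() <= 0 else 0.0
transformed = box_cox(data + offset, lam=0)  # log transform for right skew
```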
You can view a histogram chart for a variable to visually assess skew.
- Skewed variables will show a difference between the median and mean.
- Since histograms for otherwise normal variables can be distorted by the presence of outliers, you should base your transformation choice on differences between the median and mean, and the apparent fit of the curve shown when Show normal distribution is checked.
- If you have selected Transform in the mmregression tool, you can view the variable from the output feature class to see how the Box-Cox transformation changed the variable.
Standardization
Standardization involves converting different variables to a common scale so that coefficients for variables measuring different phenomena with different units can then be compared to each other.
One common standardization scale is z-score, which is the number of standard deviations that each individual value differs from the mean for the variable.
If you check the Standardize box, mmregression will standardize all variables to z-scores. If you transform your variables, you will probably also want to standardize them since the ranges of the transform values may bear little resemblance to the unit used for the untransformed variable.
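The z-score calculation can be sketched in a few lines of NumPy (illustrative values; mmregression performs this when the Standardize box is checked):

```python
import numpy as np

def z_score(x):
    """Standardize to z-scores: deviations from the mean
    expressed in standard-deviation units."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Illustrative GDP-per-capita values on a dollar scale
gdp_per_capita = np.array([1200.0, 5800.0, 14500.0, 43000.0, 61000.0])
z = z_score(gdp_per_capita)
# Standardized values have mean 0 and standard deviation 1,
# so coefficients for different variables become comparable
```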
Model Evaluation
After you run the mmregression tool, the messages contain summary information about the created models for evaluation.
The first summary in the output is for a non-spatial ordinary least squares (OLS) model.
Ordinary least squares regression is a multiple regression technique that estimates regression coefficients by minimizing the sum of the squares of the differences between dependent variable values and the results of the regression model (Wikipedia 2023).
Models can be evaluated based on a combination of the following criteria. These criteria are especially relevant when evaluating the candidate models created with exploratory regression.
- Model fit
- Information loss
- Multicollinearity
- Coefficients
- Residuals
- Autocorrelation
Model Fit
Model fit is evaluated with adjusted R2.
- The range of R2 is from 0.000 (no fit) to 1.000 (perfect fit).
- R2 values are generally interpreted as the percentage of the variance in the dependent variable explained by the model, although the interpretation of that value depends on the purpose of the model.
- Values of less than 0.100 usually indicate an insignificant model.
- In the social sciences where relationships often involve the complex interplay of ambiguous factors, values in the 0.200s or 0.300s can be considered strong enough to merit further investigation.
- In the natural sciences, values above 0.600 are often needed for a model to be considered strong.
The adjusted value (Adj. R-squared) is preferred to the unadjusted value (R-squared) because the adjusted value compensates for improvements in mathematical fit that will occur with the addition of more variables, even when the additional variables add no meaningful improvement to explanatory or predictive value of the model.
In the OLS model output below, the adjusted R-squared value of 0.448 indicates a fairly strong fit for this social science model.
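The adjustment is a simple formula that can be computed directly (the sample size below is invented for illustration):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes the mathematical fit gains
    that come from simply adding more variables.
    n = number of observations, p = number of independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: the adjusted value is slightly below the raw R-squared
print(adjusted_r2(0.46, n=180, p=4))
```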
Information Loss
Models are simplifications that result in the loss of information available in the variables.
- The Akaike information criterion (AIC) value for a model indicates the level of information from the input variables that is lost in a model (Akaike 1974; Wikipedia 2022).
- AIC values estimate information loss, so models with lower AIC values are better.
- Unlike R2 which can be used to compare the quality of different types of models, AIC is only comparable between models having the same dependent variable.
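For an OLS model, AIC can be sketched from the residuals using the common Gaussian-likelihood form (comparable across models up to a shared constant; the residual values below are invented):

```python
import numpy as np

def ols_aic(residuals, k):
    """AIC for an OLS model, up to an additive constant shared by all
    models on the same dependent variable. k = number of estimated
    parameters; more parameters raise AIC unless fit improves enough."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    rss = np.sum(residuals ** 2)
    return n * np.log(rss / n) + 2 * k

# A model with smaller residuals but the same k gets a lower (better) AIC
small = ols_aic(np.array([0.1, -0.2, 0.15, -0.05]), k=2)
large = ols_aic(np.array([1.0, -2.0, 1.5, -0.5]), k=2)
```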
Multicollinearity
Multicollinearity is when two or more independent variables in a regression model are correlated with each other in a way that biases the model coefficients and makes them unreliable.
While multicollinearity will not affect model fit, it can make model parameters ambiguous and cause coefficients to be marked as statistically insignificant even though their presence improves model fit (Ghosh 2017).
Variance inflation factor (VIF) is a value that can be used to identify independent variables that are correlated with each other.
- The VIF values for the independent variables are listed at the top of the output messages when you run mmregression.
- Values of VIF over five indicate the presence of multicollinearity.
- To reduce multicollinearity, successively remove the variables with the highest VIF until all VIF values are at least below five and, preferably, below 2.5.
- Because correlated variables often come in correlated pairs, deleting one of the high VIF variables may reduce the VIF for other variable(s).
- Consider the theoretical contribution that a variable makes to the model before removing it.
In the model output below, both GDP per capita and percent agriculture from GDP have high VIF values indicating that they are (negatively) correlated. Countries that are highly dependent on agriculture tend to be poorer since industrial and knowledge-based activities are generally more highly valued financially.
You would probably want to remove the agriculture variable since the primary concern is the effect of affluence on levels of democracy.
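VIF is computed by regressing each independent variable on all of the others. A minimal NumPy sketch with synthetic data (the correlated pair is constructed deliberately; mmregression reports these values for you):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (independent
    variables only). VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)   # strongly correlated with x1
x3 = rng.normal(size=500)                   # independent of the others
print(vif(np.column_stack([x1, x2, x3])))   # high, high, near 1
```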
Coefficients
Model coefficients indicate how differences in independent variable values are reflected in changes in the modeled dependent variable.
When variables are standardized to z-scores, the coefficients indicate the relative importance of each factor to the model.
- The strength of the contribution of each variable is assessed with the absolute value of the coefficient.
- Independent variable coefficients with a positive value indicate a positive contribution to the dependent variable, and negative coefficients indicate an inverse contribution to the dependent variable.
- In this example, the high coefficients indicate that GDP per capita and resource rent as a percentage of GDP are the most important variables.
- Agriculture as a percentage of GDP makes a small (negative) contribution, and economic energy intensity makes no meaningful contribution to the model.
The probability (P>|t|) values beside each explanatory variable indicate the probability that any effect of this variable on the response could be attributed to randomness.
- Low probability values (p-values) indicate higher statistical significance, and those values are flagged with an asterisk (*).
- Probability values are primarily useful when working with sampled data, and since this example uses population data, these values are irrelevant.
In the example below:
- GDP_per_Capita_PPP_Dollars makes a strong positive contribution to the model, consistent with the association of economic development and democracy in the literature.
- Resource_Rent_Percent_GDP makes a strong negative contribution to the model, consistent with the resource curse hypothesis.
- Neither MJ_per_Dollar_GDP (energy intensity of the economy) nor Agriculture_Percent_GDP makes a meaningful contribution to the model.
- Because these results are for non-spatial OLS regression, spatial autocorrelation may make these results unreliable and spatial regression should be used to get more reliable results.
Residuals
Residuals are the differences between modeled values and actual values.
Residuals = Actual - Modeled
Model misspecification is when the variables provided for the model are inadequate or inappropriate for meaningfully modeling the analyzed phenomena. With regression models, residuals are useful for identifying misspecification.
When working with geospatial data, you can review the map of the residuals to identify areas with high or low residuals that may indicate what variable(s) are missing.
The output feature class created by mmregression contains an olsresiduals field with the residuals from the non-spatial OLS regression model.
In this case, the low residuals in Russia, Eastern Europe, and the Middle East (overprediction of democracy) and the high residuals (underprediction of democracy) in the Americas hint at social characteristics or historical legacies in those areas that are not captured in the specified independent variables.
Autocorrelation
Another assumption of linear regression is that the variables are not autocorrelated.
- Autocorrelation in a variable means that values are correlated with other values based on their position in a sequence or, in the spatial case, their location in space.
- Geographical phenomena are very commonly clustered together in space, which means that geospatial variables very commonly exhibit spatial autocorrelation that causes multiple regression model coefficients and outputs to be biased so the model outputs are unreliable.
- If you have autocorrelated variables, spatial regression techniques (described below) can be used to compensate for autocorrelation.
You can detect whether spatial autocorrelation is an issue with your model by using the Spatial Autocorrelation (Global Moran's I) tool to identify spatial autocorrelation in model residuals. This statistic was originally developed by the Australian statistician Patrick Alfred Pierce Moran (1950).
- The range of the Moran's I statistic is -1 (perfect dispersion) to +1 (perfect clustering).
- The tool evaluates the Moran's I statistic as a z-score (number of standard deviations away from the expected mean if the distribution was random).
- High z-scores (> 1.96) indicate high levels of clustering.
- Low z-scores (< -1.96) indicate even dispersion.
- z-scores around zero (between those extremes) indicate a random distribution and an absence of autocorrelation.
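The statistic itself is straightforward to compute once a weights matrix is defined. A minimal NumPy sketch (the four-feature chain is invented for illustration; the ArcGIS Pro tool also computes the z-score and p-value):

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I. W is an n x n spatial weights matrix where
    W[i, j] > 0 when j is a neighbor of i; the diagonal is zero."""
    x = np.asarray(values, dtype=float)
    z = x - x.mean()          # deviations from the mean
    n = x.size
    s0 = W.sum()              # sum of all weights
    return (n / s0) * (z @ W @ z) / (z @ z)

# Four locations in a row, each neighboring the next (a simple chain)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

clustered = morans_i([1.0, 2.0, 9.0, 10.0], W)   # similar neighbors: positive
dispersed = morans_i([1.0, 10.0, 1.0, 10.0], W)  # alternating values: negative
```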
To evaluate autocorrelation using Moran's I in ArcGIS Pro:
- Input Feature Class: The layer of model residuals (Residuals)
- Input Field: The residuals variable (olsresiduals)
- Select Generate Report to create a graphic that can be used to present the results of the tool.
The report will be an HTML file placed in the project directory.
In this example, the high Moran's I value (0.22) and very high Z-score (7.86) confirms the autocorrelation visible in the residuals map, and indicates that the model coefficients are unreliable.
Spatial Regression
Spatial regression models incorporate variables that compensate for spatial autocorrelation so the model coefficients and outputs are more trustworthy.
Neighbors
When addressing autocorrelation in models and model outputs, you must specify how neighbors are defined so the model can assess whether feature values are autocorrelated with the values of their neighbors. Spatial lag is the autocorrelation of values across groups of neighboring features.
- mmregression uses a k-nearest-neighbors algorithm, which defines the neighbors of each feature as the fixed number (k) of features closest to it.
- k-nearest-neighbors is a simple algorithm that can be used with points, lines, or polygons.
- k-nearest-neighbors can miss adjacent neighbors when working with features of significantly different sizes, such as census tracts.
- The choice of k should be made based on the types of features and an estimate of how broadly the variable is lagged across neighbors.
- The default value of k in mmregression is five (5).
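A k-nearest-neighbors weights matrix can be sketched from point coordinates (a simplified illustration using centroid-style points; the weights construction inside mmregression may differ in detail):

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardized k-nearest-neighbor weights matrix
    built from point coordinates."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances between all points
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a feature is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d[i])[:k]  # the k closest features
        W[i, neighbors] = 1.0 / k         # row-standardize: rows sum to 1
    return W

# Five illustrative points: a cluster of three and a distant pair
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
W = knn_weights(coords, k=2)
```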
Spatial Lag Regression
Spatial lag regression performs multiple regression modeling by adding a model term that considers spatial lag of the dependent variable across neighboring areas (Sparks 2015).
y = Xβ + ρWy + ε
- y is the dependent variable
- β is the vector of regression coefficients
- X is the matrix of independent variables
- ρ (rho) is the coefficient for the lag variable
- W is the matrix of weights defining the neighbors for each feature
- ε is the error term (residuals).
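When W is row-standardized, the lag term Wy is simply the neighborhood average of the dependent variable. A minimal sketch (the values and weights are invented; mmregression estimates ρ from the data rather than by this shortcut):

```python
import numpy as np

# With a row-standardized weights matrix W, each element of W @ y is
# the mean dependent-variable value among a feature's neighbors.
# This only illustrates the construction of the Wy term.
y = np.array([0.2, 0.3, 0.8, 0.9])            # illustrative dependent values
W = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])          # row-standardized, k = 2

lagged_y = W @ y  # neighborhood means of the dependent variable
```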
In this case the AIC value for the spatial lag model (336.5) is lower than the AIC for the non-spatial OLS model (375.1), indicating that the spatial lag model is a better choice than the non-spatial OLS model.
The lagged dependent variable (ρWy) is included in the output feature class as the LAGVAR field, and you can map it to see the smoothed lag effect on the spatial distribution.
Notably, the spatial lag model significantly reduces autocorrelation in the residuals (lagresiduals).
Spatial Error Regression
In contrast to the focus of spatial lag regression on autocorrelation in the dependent variable, spatial error regression models spatial interactions among the independent variables that are reflected in the residuals (Eilers 2019).
y = Xβ + λWu + ε
- y is the dependent variable
- β is the vector of regression coefficients
- X is the matrix of independent variables
- λ (lambda) is the coefficient for the lagged error
- W is the matrix of weights defining the neighbors for each feature
- u is the residuals from OLS regression
- ε is the error term (residuals from the spatial error model).
In this case the AIC value for the spatial error model (329.8) is lower than the AIC for the non-spatial OLS model (375.1) or the spatial lag model (336.5), indicating that the spatial error model is the best choice of the three.
The LAGERR field with the lagged residuals is included in the output feature class, and you can map it to see the smoothed lag effect on the spatial distribution of the olsresiduals field.
The Moran's I of the residuals indicates an absence of spatial autocorrelation, which makes the spatial error model acceptable.
Exploratory Regression
Two common approaches for choosing from the pool of possible variables, codified by Hocking (1976), include:
- Forward selection involves adding variables one at a time until either all variables are used or some criteria for fit is met.
- Backward elimination involves initially including all available variables in the model and then successively removing variables when they are either redundant (multicollinear) or make no meaningful contribution to model fit.
Both forward selection and backward elimination can be tedious to perform manually.
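Backward elimination can be sketched as a loop that repeatedly drops the variable whose removal costs the least model fit. This simplified version uses adjusted R-squared as the sole criterion (real implementations often also consider p-values and VIF), with synthetic data:

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit returning coefficients and adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, adj

def backward_eliminate(X, names, y, min_cost=0.005):
    """Repeatedly drop the variable whose removal costs the least
    adjusted R-squared, stopping once every removal would cost more
    than min_cost. A simplified sketch of Hocking-style elimination."""
    names = list(names)
    _, current = fit_ols(X, y)
    while X.shape[1] > 1:
        adjs = [fit_ols(np.delete(X, j, axis=1), y)[1] for j in range(X.shape[1])]
        j = int(np.argmax(adjs))           # cheapest variable to remove
        if current - adjs[j] > min_cost:
            break                          # every remaining variable earns its keep
        X = np.delete(X, j, axis=1)
        current = adjs[j]
        names.pop(j)
    return names

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
noise_var = rng.normal(size=300)           # contributes nothing to y
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=300)
kept = backward_eliminate(np.column_stack([x1, x2, noise_var]),
                          ["x1", "x2", "noise"], y)
print(kept)  # the uninformative variable should be dropped
```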
An automated approach to variable selection and elimination is exploratory regression, which involves trying all possible combinations of explanatory variables to find the best models.
Exploratory regression violates the philosophy behind the (deductive) classical scientific method where you begin with a hypothesis and then use your models to test your hypothesis. Finding a set of variables that fit a particular data set but do not model other data sets well (overfitting) leads away from analysis of fundamental processes you are trying to understand.
However, in situations where those fundamental processes are not well understood, inductive analysis with tools like exploratory regression can be useful for giving new insights that inform the development of hypotheses that can then be tested on other data sets (ESRI 2023).
Exploratory regression is implemented in the Exploratory Regression tool.
- Input Features: Energy
- Dependent Variable: Democracy Index
- Candidate Explanatory Variables: Select all standardized variables except the dependent variable.
- Report File: Exploratory.txt
This tool may take a few minutes to run depending on the size of your data set and the number of independent variables you are exploring.
Find your report file and open it. This is a text file that by default will open in Notepad. The explored models are listed from the top in order of the number of variables explored.
Theoretical Logic
With exploratory regression we are seeking independent variables that offer some theoretical logic that explains the phenomenon represented by the dependent variable.
Following from the literature review and the quantitative criteria above, the four-variable model from the exploratory regression that best matches is:
- MJ_per_Dollar_GDP
- Agriculture_Percent_GDP
- Industry_Percent_GDP
- Arable_Land_HA_per_Capita
All of these indicators are proxies for post-industrial development, consistent with the literature asserting that democracy and development go hand in hand.
Variable Significance
A table at the bottom of the exploratory regression results lists the percentage of explored models where each of the explored variables was found to be significant. These results can be used when deciding which variables to consider important or unimportant as you explore your data further.
Geographically Weighted Regression
A second major challenge with spatial data is spatial heterogeneity (also called regional variation or nonstationarity) where the processes you are analyzing vary in different parts of the study area.
One exploratory technique for addressing this issue is the use of geographically weighted regression which builds multiple small models across the features that consider neighboring values in the regression.
In ArcGIS Pro, the Geographically Weighted Regression (GWR) tool can be used to perform geographically weighted multiple regression.
- Input Feature Class: Energy
- Dependent Variable: Democracy_Index_Z_Score
- Model Type: Continuous (Gaussian)
- Explanatory Variables: See above
- Output Features: Energy_GWR
- Neighborhood Type: Distance band
- Neighborhood Selection Method: Golden search
Viewing the coefficients in the output features shows how coefficients vary over space for the different independent variables.
- Energy intensity (MJ per dollar GDP) is noticeably low in extractive and industrial countries and high in North America.
- Industry percentage of GDP is noticeably high in South Asia and low in the US and Northern Europe.
- Agricultural percentage of GDP is especially low in the West and high in South Asia.
- Arable land per capita stands out in both in low agricultural productivity areas in Africa and high agricultural productivity areas in Eastern Europe.
Appendix: Regression with Native ArcGIS Pro Tools
ArcGIS Pro has native tools for performing standardization, transformation, and OLS regression. However, these tools are cumbersome to chain together and ArcGIS Pro does not currently have tools for performing spatial lag or spatial error regression.
Model Builder
While you can run the tools needed for data preparation and regression independently, it can be helpful to use a ModelBuilder diagram that will save you from repeated typing as you iterate through versions of the model. Use of ModelBuilder also preserves your workflow so that you can debug and reproduce your analysis in the future.
One unfortunate aspect of using ModelBuilder is that some of these tools modify feature classes in place rather than producing new feature classes that can simply be daisy-chained. This will require additional effort to build the diagrams iteratively.
Transform Field
You can use the Transform Field tool to transform skewed variables.
- This example transforms all skewed derived variables in the data set so that we have options when choosing variables later.
- The new transformed fields by default will have _LOGARITHMIC appended to the name of the original fields.
Standardize Field
The Standardize Field tool can be used to standardize model variables.
- If you have transformed variables, you will need to add and run this tool after you have added and run the Transform Field tool so that the transformed variables are available in the list of variables to standardize.
- If you change the variables you transform, you will need to delete (cut) the Standardize Field tool and add it again.
Unique ID Field
The OLS regression tool requires that each feature have a unique integer ID number, but ArcGIS Pro for some inexplicable reason hides the OBJECTID field available in all feature classes.
To copy the OBJECTID field to a new long integer FEATUREID field, you will need to use the Calculate Field tool.
OLS Regression Tool
- Under Analysis and Tools, add the Ordinary Least Squares tool to your model.
- Input Feature Class: Energy
- Unique ID Field: FEATUREID
- Output Feature Class: Residuals
- Dependent Variable: Democracy_Index_Z_Score
- Explanatory Variables: Described above
- Output Reports File: Regression.pdf
- Right click on the Output Feature Class and select Add to Display so the tool will add a new layer symbolized by the regression residuals.
- In the Catalog Pane, duplicate your dependent variable map and rename it (Residuals Map).