Regression Analysis in Python

Regression is "a functional relationship between two or more correlated variables that is often empirically determined from data and is used especially to predict values of one variable when given values of the others" (Merriam-Webster 2022).

A variety of different regression techniques are commonly used in statistical analysis. Regression with geospatial data is commonly used to analyze and model the relationships between different social and/or environmental characteristics across different locations. The motivation for such analysis is often to find meaningful relationships that can improve understanding and inform decision making.

This tutorial covers basic regression analysis of geospatial data in Python.

Example Data: World Energy Indicators

For this tutorial, the example data is country-level energy and socioeconomic data collected by the U.S. Energy Information Administration (EIA), The World Bank, and The V-Dem Institute. A GeoJSON file and metadata are available here..

NumPy is a Python package for scientific computing.
GeoPandas is a Python package for working with geospatial data.
Matplotlib is a Python package for plotting graphs.
Read the geospatial data from a file or web source into a GeoDataFrame object using the read_file() function. For this example, we will use the GeoJSON file of country-level energy indicators described above.
The to_crs() method is used to reproject the data to the commonly-used (but problematic) World Mercator projection that avoids the distortions associated with the unprojected lat/long data from the GeoJSON file.
Depending on your configuration, maps and other figure images generated in Jupyter notebooks can be small and difficult to read. You can increase the size by setting the figure.figsize plot parameter at the top of your notebook. The first value in the array is the width (9) and the second is the height (6), in inches.

import numpy

import geopandas

import matplotlib.pyplot as plt

countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson")

countries = countries.to_crs("EPSG:3857")

plt.rcParams['figure.figsize'] = [9, 6]

The GeoDataFrame plot() method can be used to with the name of the attribute to visualize the variable as a choropleth map.

The Matplotlib show() function displays the plotted map.

axis = countries.plot("MM_BTU_per_Capita", cmap = "coolwarm", legend=True, scheme="quantiles")

axis.set_axis_off()

plt.show()

Choropleth of per-capita energy use by country (The World Bank)

The Pandas info() method shows the available attributes with their data types and number of valid (non-null) values.

print(countries.info())


RangeIndex: 175 entries, 0 to 174
Data columns (total 52 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Country_Code                 175 non-null    object  
 1   Country_Name                 175 non-null    object  
 2   Longitude                    175 non-null    float64 
 3   Latitude                     175 non-null    float64 
 4   WB_Region                    171 non-null    object  
 5   WB_Income_Group              170 non-null    object  
 6   Population                   170 non-null    float64 
 7   GNI_PPP_B_Dollars            162 non-null    float64 
 8   GDP_per_Capita_PPP_Dollars   162 non-null    float64 
 9   MJ_per_Dollar_GDP            166 non-null    float64 
 10  Fuel_Percent_Exports         148 non-null    float64 
 11  Resource_Rent_Percent_GDP    168 non-null    float64 
 12  Exports_Percent_GDP          161 non-null    float64 
 13  Imports_Percent_GDP          161 non-null    float64 
 14  Industry_Percent_GDP         167 non-null    float64 
 15  Agriculture_Percent_GDP      167 non-null    float64 
 16  Military_Percent_GDP         150 non-null    float64 
 17  Gini_Index                   126 non-null    float64 
 18  Land_Area_sq_km              169 non-null    float64 
 19  Arable_Land_HA_per_Capita    168 non-null    float64 
 20  CO2_per_Capita_Tonnes        165 non-null    float64 
 21  HDI                          164 non-null    float64 
 22  Regime_Type                  164 non-null    object  
 23  Democracy_Index              164 non-null    float64 
 24  CO2_Emissions_MM_Tonnes      172 non-null    float64 
 25  Primary_Consumption_Quads    172 non-null    float64 
 26  Coal_Consumption_Quads       172 non-null    float64 
 27  Gas_Consumption_Quads        172 non-null    float64 
 28  Oil_Consumption_Quads        172 non-null    float64 
 29  Nuclear_Consumption_Quads    31 non-null     float64 
 30  Renewable_Consumption_Quads  172 non-null    float64 
 31  Primary_Production_Quads     172 non-null    float64 
 32  Coal_Production_Quads        172 non-null    float64 
 33  Gas_Production_Quads         172 non-null    float64 
 34  Oil_Production_Quads         172 non-null    float64 
 35  Nuclear_Production_Quads     31 non-null     float64 
 36  Renewable_Production_Quads   172 non-null    float64 
 37  Electricity_Capacity_gW      172 non-null    float64 
 38  Electricity_Generation_tWh   172 non-null    float64 
 39  Electricity_Consumption_tWh  172 non-null    float64 
 40  Oil_Production_Mbd           172 non-null    float64 
 41  Oil_Consumption_Mbd          172 non-null    float64 
 42  Oil_Reserves_B_Barrels       169 non-null    float64 
 43  Gas_Production_bcf           172 non-null    float64 
 44  Gas_Consumption_bcf          172 non-null    float64 
 45  Gas_Reserves_tcf             161 non-null    float64 
 46  Coal_Production_Mst          172 non-null    float64 
 47  Coal_Consumption_Mst         172 non-null    float64 
 48  Coal_Reserves_Mst            172 non-null    float64 
 49  MM_BTU_per_Capita            169 non-null    float64 
 50  Renewable_Percent            172 non-null    float64 
 51  geometry                     175 non-null    geometry
dtypes: float64(46), geometry(1), object(5)
memory usage: 71.2+ KB

Distributions

The dependent variable we will use in this example is MM_BTU_per_Capita (millions of BTU of energy per person per year).

The British thermal unit (BTU) is an energy unit commonly used in American energy literature which is equivalent to the amount of heat needed to raise the temperature of one pound of water at sea level by one degree Fahrenheit.
The BTU is slightly larger than the kilojoule that is commonly used for measuring amounts of energy in scientific literature.
MM is a common abbreviation for million also used in energy literature.
For context, US per capita energy consumption in 2022 was around 301 MM BTU (EIA 2023).

A distribution is the manner in which the values in a variable are spread across the range of possible values.

Histograms are charts commonly used to visualize distributions by showing the number of entries in different ranges of values.

The pyplot hist() function can be used to create a histogram.

The histogram of the distribution of annual energy use per capita by country is highly skewed to the right, showing that most countries use less energy per year per capita than the US, although there are a handful of countries in the right tail with very high energy consumption.

axis = plt.hist(countries["MM_BTU_per_Capita"])

plt.xlabel("MM BTU per Capita")

plt.ylabel("Number of Countries")

plt.show()

Histogram of per-capita energy use by country

A quantile shows the values in a distribution below given percentages of the population.

The 0% quantile is the minimum value.
The 50% quantile is the median where 50% of the values in the distribution are at or below the median value.
The 100% quantile is the maximum value.
Quantiles that divide the distribution into four groups are called quartiles and quantiles that divide the distribution into five groups are called quintiles.

The Pandas describe() function can be used to show quantile values as well as the mean and standard deviation.

With this variable, half of all countries use 53.7 MM BTU of energy per capita or less per year.
Because the distribution is heavily skewed to the right, the median is a more descriptive central tendency than the mean.

print(countries["MM_BTU_per_Capita"].describe())

count    169.000000
mean      89.017438
std      119.299077
min        0.777000
25%       10.699000
50%       53.718000
75%      109.448000
max      803.056000
Name: MM_BTU_per_Capita, dtype: float64

Ranking

We can use the Pandas sort_values() method to sort the data in order to see the highest and lowest values.

To simplify the listing of rows from a data set with a large number of columns, we subset only the country name and energy consumption columns.
To avoid cluttering the list with countries that have no data we use the dropna() method.
The reset_index() method resets the row numbers in the sorted data so they reflect the new sorted order when the data is displayed.
We use the ascending=False to sort in decreasing order so the most energy-intensive countries are at the top of the list.

The head() method shows the rows at the top (head) of the list.

The highest energy users are countries with energy-intensive industries. These highly localized areas of high releases are also visible on the map above.

consumption = countries[["Country_Name", "MM_BTU_per_Capita"]]

consumption = consumption.dropna()

consumption = consumption.sort_values("MM_BTU_per_Capita", ascending=False)

consumption = consumption.reset_index(drop=True)

print(consumption.head())

             Country_Name  MM_BTU_per_Capita
20                 Brunei            410.541
156   Trinidad and Tobago            480.882
162  United Arab Emirates            500.898
68                Iceland            598.031
127                 Qatar            803.056

The Pandas tail() function shows the rows at the bottom of the list, which are all countries at low levels of industrialization.

print(consumption.tail())

             Country_Name  MM_BTU_per_Capita
139               Somalia              0.777
27   Central African Rep.              1.021
23                Burundi              1.056
39        Dem. Rep. Congo              1.512
28                   Chad              1.563

Multiple Regression

One way to evaluate the influence of multiple factors on an effect is the use of multiple regression, which creates a model that considers multiple explanatory (independent) variables to predict the outcome of a response (dependent) variable (Hayes 2022).

Adding additional variables beyond simple bivariate correlation can improve both the explanatory and predictive value of a model.

y = β₁x₁ + β₂x₂ ... + e

y is the dependent variable.
x_n are the independent variables.
β_n are the regression parameters estimated from the data.
e is the error term.

The Resource Curse

The resource curse is a hypothesis that countries with an abundance of oil, mineral or other natural resource wealth often have poorer economic and political environments than countries with fewer natural resources.

This analysis investigates the resource curse using factors associated with high levels of democratic governance, as quantified by the V-Dem liberal democracy index which ranges from zero (autocracy) to one (liberal democracy)

axis = countries.plot("Democracy_Index", legend=True,
        cmap = "coolwarm_r", edgecolor = "#00000040", scheme="naturalbreaks")

axis.set_axis_off()

plt.show()

Model Variables

The appropriate approach to selection of which variables should be included in a multiple regression model is contested. Two common approaches codified by Hocking (1976) include:

Forward selection: Available variables are added one at a time until either all variables are use or some criterion for fit is met.
Backward elimination: All available variables are initially included in the model, and variables are successively removed when they are either redundant (multicollinearity) or make no meaningful contribution to model fit.

To avoid spurious correlations and overfitting, the initial pool of considered variables should be chosen based on a rationally hypothesized contribution that they could make to the model, such as level of development contributing to per capita energy use.

Correlation can be used to find candidates, although low correlation variables can sometimes be important to a model when the relationship between variables is complex.

r_squared = countries.select_dtypes(include=numpy.number)

r_squared = r_squared.corr()["Democracy_Index"]**2

r_squared = r_squared.sort_values(ascending=False)

r_squared = r_squared.round(3)

print(r_squared)

Democracy_Index                1.000
HDI                            0.363
GDP_per_Capita_PPP_Dollars     0.315
Agriculture_Percent_GDP        0.163
Resource_Rent_Percent_GDP      0.160
Industry_Percent_GDP           0.128
Fuel_Percent_Exports           0.091
MJ_per_Dollar_GDP              0.077
MM_BTU_per_Capita              0.075
Exports_Percent_GDP            0.062
Military_Percent_GDP           0.060
Gini_Index                     0.055
...

For this example examining the resource curse with the Democracy Index as the dependent variable, we create a GeoDataFrame with the initial variables below.

We also include the spatial geometry, Longitude, and Latitude columns to facilitate mapping and spatial regression later.
Because the regression functions will fail on missing falues, we use the dropna() method to remove those missing values.

dependent_name = ["Democracy_Index"]

independent_names = ["GDP_per_Capita_PPP_Dollars", "Military_Percent_GDP", 
        "Resource_Rent_Percent_GDP", "Industry_Percent_GDP"]

model_data = countries[dependent_name + independent_names + ["geometry", "Latitude", "Longitude"]]

model_data = model_data.dropna()

Transformation

One of the assumptions of OLS is that the variables are normally distributed. Resulting coefficients from OLS with non-normal variable distributions are unreliable.

The pyplot hist() method can be used to create histograms for all variables in a DataFrame, and the Pandas describe() method can be used to display basic descriptive statistics.

model_data.hist()

print(model_data.describe())

Histograms for the transfored model data

The GDP, military spending, and resource rent variables are all skewed right, since a handful of countries have unusually high values in all those categories.

A transformation is a modification to a variable that brings the variable distribution closer to normality. A commonly used transformation is the same logarithmic transformation used with the x/y scatter charts above. One is added to all values because some variables have values of zero and a logarithm of zero is undefined.

transform_vars = ["GDP_per_Capita_PPP_Dollars", "Military_Percent_GDP",
                "Resource_Rent_Percent_GDP"]

model_data[transform_vars] = numpy.log(model_data[transform_vars] + 1)

model_data.hist()

print(model_data.describe())

Histograms for transfored model variables

Standardization

Because the different variables measure different phenomena and different scales, we can standardize the data values by converting them to z-scores that can then be compared with each other.

A z-score is the number of standard deviations that a value differs from the mean for the whole data set.

standardize = dependent_name + independent_names

model_data[standardize] = (model_data[standardize] - model_data[standardize].mean()) / model_data[standardize].std()

model_data.hist()

Histograms showing the standardized distributions

OLS Regression

PySal is a Python package that includes non-spatial and spatial regression functions.

import pysal.lib

import pysal.model

Non-spatial ordinary least squares regression can be performed with the OLS() function.

ols_model = pysal.model.spreg.OLS(model_data[dependent_name].values,
        model_data[independent_names].values,
        name_y = dependent_name, name_x = independent_names)

print(ols_model.summary)

As with correlation, the quality of model fit is evaluated with the R-squared value, with higher values indicating better fit.

R-squared is commonly interpreted as the percentage of variance in the independent variable explained by the model.

Adjusted r-squared is the r-squared value that should be used with multiple regression models because it is adjusted to reflect the addition of independent variables that increase r-squared but do not make a significant contribution to model fit.

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :['Democracy_Index']         Number of Observations:         144
Mean dependent var  :     -0.0000                Number of Variables   :           5
S.D. dependent var  :      1.0000                Degrees of Freedom    :         139
R-squared           :      0.5872
Adjusted R-squared  :      0.5753
Sum squared residual:      59.037                F-statistic           :     49.4224
Sigma-square        :       0.425                Prob(F-statistic)     :   8.285e-26
S.E. of regression  :       0.652                Log likelihood        :    -140.128
Sigma-square ML     :       0.410                Akaike info criterion :     290.256
S.E of regression ML:      0.6403                Schwarz criterion     :     305.105

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.0000000       0.0543090       0.0000000       1.0000000
GDP_per_Capita_PPP_Dollars 0.6750323       0.0693305       9.7364443       0.0000000
Military_Percent_GDP      -0.2199530       0.0586037      -3.7532291       0.0002557
Resource_Rent_Percent_GDP  0.0308747       0.0836677       0.3690162       0.7126772
Industry_Percent_GDP      -0.4131628       0.0743013      -5.5606395       0.0000001
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER            2.757

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2           4.918           0.0855

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                4           4.756           0.3133
Koenker-Bassett test              4           6.822           0.1456
================================ END OF REPORT =====================================

Residuals

With regression models, misspecification is when the variables provided for the model are inadequate or inappropriate for meaningfully modelling the analyzed phenomena.

Residuals are the differences between modeled values and actual values.

residuals = actual - predicted

In this case, the low residual values in Western Asia, Russia, and China indicate lower levels of democracy than predicted by the model. Isolated high residual values indicate higher levels of democracy than predicted.

These residuals represent characteristic(s) of those areas not covered by the chosen variables and you would probably want to explore the addition of additional variables to improve model fit.

residuals = model_data

residuals["Residuals"] = ols_model.u

axis = residuals.plot("Residuals", legend=True,
        cmap = "coolwarm_r", edgecolor = "#00000040",
        scheme="stdmean")

axis.set_axis_off()

plt.show()

Multicollinearity

Multicollinearity is when two or more independent variables in a regression model are correlated with each other in a way that biases the model coefficients and makes them unreliable.

While multicollinearity will not affect model fit, it can make model parameters ambiguous and cause coefficients to be marked as statistically insignificant even though their presence improves model fit (Ghosh 2017).

Variance inflation factor (VIF) is a value that can be used to identify independent variables that are correlated with each other.

The PySAL spreg.vif() can be used to display VIF values. The VIF values for each independent variable are listed in the same order as the OLS model summary().

To reduce multicollinearity, successively remove the variables with the highest VIF until all VIF values are at least below five and, preferably, below 2.5 (Choueiry 2021).

In this case all VIFs (the first column) are below 2.5, so the model has low multicollinearity and all variables can stay.

pysal.model.spreg.vif(ols_model)

[None,
 (1.6183711717978753, 0.6179052231195415),
 (1.1563235730276584, 0.8648098363865844),
 (2.3569244378285616, 0.42428173935066904),
 (1.8587572304188513, 0.5379938722684406)]

Spatial Regression

One of the major issues when using multiple regression with geospatial data is that the ordinary least squares algorithm used to build the model assumes that the variables are not autocorrelated, meaning that that there is no correlation between sequences of values at different parts of the distribution.

Geographical phenomena are very commonly clustered together in space, which means that geospatial variables very commonly exhibit autocorrelation that causes multiple regression model coefficients and outputs to be biased so the model outputs are untrustworthy.

Spatial regression models involve techniques that compensate for spatial autocorrelation so the model coefficients and outputs are more trustworthy.

Weights Matrix

To use the spatial regression tools in pysal, we must first create a weights matrix that indicates the neighbors of each feature.

The KNN() function creates a weights matrix where you specify a value k, and the neighbors of each feature are the nearest k features.

import pysal.lib

weights = pysal.lib.weights.KNN.from_dataframe(model_data, k=4)

axis = model_data.plot(edgecolor="lightgray", facecolor="none")

model_data["index"] = model_data.index

weights.plot(gdf=model_data, indexed_on="index", ax=axis)

axis.set_axis_off()

plt.show()

The weights matrix can be passed to the OLS() function along with the spat_diag and moran parameters to display spatial autocorrelation statistics at the bottom of the model summary.

ols_model = pysal.model.spreg.OLS(model_data[dependent_name].values,
        model_data[independent_names].values, weights, spat_diag = True, moran=True,
        name_y = dependent_name, name_x = independent_names)

print(ols_model.summary)

Although this particular data set is population rather than sampled data, the p-values in the PROB column are useful for knowing whether spatial autocorrelation should be addressed in the model. The PROB are p-values for rejecting the null hypothesis that spatial autocorrelation is not a problem.

Low p-values (below 0.05) are indicators of spatial autocorrelation, and in all tests except spatial errors, spatial autocorrelation is an issue in this model.

...
DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Moran's I (error)              0.1434         3.126           0.0018
Lagrange Multiplier (lag)         1          10.796           0.0010
Robust LM (lag)                   1           3.921           0.0477
Lagrange Multiplier (error)       1           6.937           0.0084
Robust LM (error)                 1           0.061           0.8046
Lagrange Multiplier (SARMA)       2          10.858           0.0044

Spatial Lag Regression

Spatial lag regression performs multiple regression modeling by adding a model term that considers diffusion of the dependent variable across adjacent areas (Sparks 2015).

lag_model = pysal.model.spreg.ML_Lag(model_data[dependent_name].values,
        model_data[independent_names].values, weights,
	name_y = dependent_name[0], name_x = independent_names)

print(lag_model.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :Democracy_Index             Number of Observations:         144
Mean dependent var  :     -0.0000                Number of Variables   :           6
S.D. dependent var  :      1.0000                Degrees of Freedom    :         138
Pseudo R-squared    :      0.6223
Spatial Pseudo R-squared:  0.5973
Sigma-square ML     :       0.375                Log likelihood        :    -134.921
S.E of regression   :       0.612                Akaike info criterion :     281.842
                                                 Schwarz criterion     :     299.660

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.0244989       0.0510394       0.4800004       0.6312271
GDP_per_Capita_PPP_Dollars 0.5450982       0.0755767       7.2125178       0.0000000
Military_Percent_GDP      -0.1769419       0.0555886      -3.1830589       0.0014573
Resource_Rent_Percent_GDP  0.0450290       0.0794133       0.5670216       0.5706995
Industry_Percent_GDP      -0.3570708       0.0710419      -5.0262027       0.0000005
   W_Democracy_Index       0.0723270       0.0206303       3.5058652       0.0004551
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

Spatial Error Regression

In contrast to the focus of spatial lag regression on autocorrelation in the dependent variable, spatial error regression models spatial interactions in the independent variables that is reflected in the error terms (Eilers 2019).

err_model = pysal.model.spreg.ML_Error(model_data[dependent_name].values,
        model_data[independent_names].values, weights,
	name_y = dependent_name[0], name_x = independent_names)

print(err_model.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :Democracy_Index             Number of Observations:         144
Mean dependent var  :     -0.0000                Number of Variables   :           5
S.D. dependent var  :      1.0000                Degrees of Freedom    :         139
Pseudo R-squared    :      0.5844
Sigma-square ML     :       0.381                Log likelihood        :    -136.487
S.E of regression   :       0.618                Akaike info criterion :     282.974
                                                 Schwarz criterion     :     297.823

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.0293325       0.0767635       0.3821155       0.7023757
GDP_per_Capita_PPP_Dollars 0.6429572       0.0772592       8.3220837       0.0000000
Military_Percent_GDP      -0.1681504       0.0587230      -2.8634501       0.0041905
Resource_Rent_Percent_GDP -0.0171515       0.0867640      -0.1976799       0.8432955
Industry_Percent_GDP      -0.3729551       0.0748255      -4.9843313       0.0000006
              lambda       0.0822485       0.0257157       3.1983810       0.0013820
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

AIC

The Akaike information criterion (AIC) is an estimate of model prediction error, with lower values representing better fit (Akaike 1974; Wikipedia 2022).

In choosing between models using the same variables, the model with the lower AIC is the model with the better fit.

The spatial lag model has the lowest AIC of the three models, so it is chosen as the model with the best fit.

Model	AIC
OLS	290.256
Spatial lag	281.842 (best fit)
Spatial error	282.974

model_data["Residuals"] = lag_model.u

axis = model_data.plot("Residuals", legend=True, scheme="stdmean",
        cmap = "coolwarm_r", edgecolor = "#00000040")

axis.set_axis_off()

plt.show()

Geographically Weighted Regression

The regression methods used above assume that the relationships between the variables are the same over space (stationarity). However, the autocorrelation patterns in the residuals indicate that something is different about the Middle East, Russia, and China.

Geographically weighted regression (GWR) is an exploratory creates separate local models for each feature. Mapping of the local model coefficients gives a view of how factors differ in importance across the analyzed area and offers insights into possible additional factors that can be added to improve model fit.

mgwr is a Python package for geographically weighted regression.

import mgwr.gwr

import mgwr.sel_bw

coordinates = model_data[["Longitude", "Latitude"]]

gwr_selector = mgwr.sel_bw.Sel_BW(coordinates, model_data[dependent_name].values,
        model_data[independent_names].values)

gwr_bw = gwr_selector.search(bw_min=2)

print(gwr_bw)

gwr_model = mgwr.gwr.GWR(coordinates, model_data[dependent_name].values,
        model_data[independent_names].values, bw=99).fit()

gwr_model.summary()

===========================================================================
Model type                                                         Gaussian
Number of observations:                                                 144
Number of covariates:                                                     5

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                             59.037
Log-likelihood:                                                    -140.128
AIC:                                                                290.256
AICc:                                                               292.869
BIC:                                                               -631.767
R2:                                                                   0.587
Adj. R2:                                                              0.575

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                   0.000      0.054      0.000      1.000
X1                                   0.675      0.069      9.736      0.000
X2                                  -0.220      0.059     -3.753      0.000
X3                                   0.031      0.084      0.369      0.712
X4                                  -0.413      0.074     -5.561      0.000

Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Bandwidth used:                                                      99.000

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                             46.484
Effective number of parameters (trace(S)):                           15.836
Degree of freedom (n - trace(S)):                                   128.164
Sigma estimate:                                                       0.602
Log-likelihood:                                                    -122.917
AIC:                                                                279.506
AICc:                                                               284.267
BIC:                                                                329.507
R2:                                                                   0.675
Adjusted R2:                                                          0.634
Adj. alpha (95%):                                                     0.016
Adj. critical t value (95%):                                          2.443

Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       -0.068      0.147     -0.363     -0.072      0.213
X1                        0.593      0.180      0.239      0.619      0.891
X2                       -0.170      0.064     -0.345     -0.176     -0.043
X3                       -0.066      0.166     -0.322     -0.063      0.276
X4                       -0.319      0.117     -0.511     -0.343     -0.071
===========================================================================

The params property contains the local coefficients for each feature.

for z in range(1, len(independent_names) + 1):
	model_data["GDP_per_Capita_GWR"] = gwr_model.params[:,z]

	axis = model_data.plot("GDP_per_Capita_GWR", legend=True,
	    cmap = "coolwarm_r", edgecolor = "#00000040", scheme="naturalbreaks")

	plt.title(independent_names[z - 1])
	axis.set_axis_off()

	plt.show()

Notably, the percentage of GDP from resource rents has an especially negative relationship with democracy in the resource-dependent countries of central Africa and in Russia. This gives some credence to the resource curse hypothesis in specific situations.

GWR coefficient for GDP per capita vs. the democracy index

GWR coefficient for military percent of GDP vs. the democracy index

GWR coefficient for resource rent percent of GDP vs. the democracy index

GWR coefficient for industry percent of GDP vs. the democracy index