Regression Analysis in Python
Regression is "a functional relationship between two or more correlated variables that is often empirically determined from data and is used especially to predict values of one variable when given values of the others" (Merriam-Webster 2022).
A variety of different regression techniques are commonly used in statistical analysis. Regression with geospatial data is commonly used to analyze and model the relationships between different social and/or environmental characteristics across different locations. The motivation for such analysis is often to find meaningful relationships that can improve understanding and inform decision making.
This tutorial covers basic regression analysis of geospatial data in Python.
Example Data: World Energy Indicators
For this tutorial, the example data is country-level energy and socioeconomic data collected by the U.S. Energy Information Administration (EIA), The World Bank, and The V-Dem Institute. A GeoJSON file and metadata are available here..
- NumPy is a Python package for scientific computing.
- GeoPandas is a Python package for working with geospatial data.
- Matplotlib is a Python package for plotting graphs.
- Read the geospatial data from a file or web source into a GeoDataFrame object using the read_file() function. For this example, we will use the GeoJSON file of country-level energy indicators described above.
- The to_crs() method is used to reproject the data to the commonly-used (but problematic) World Mercator projection that avoids the distortions associated with the unprojected lat/long data from the GeoJSON file.
- Depending on your configuration, maps and other figure images generated in Jupyter notebooks can be small and difficult to read. You can increase the size by setting the figure.figsize plot parameter at the top of your notebook. The first value in the array is the width (9) and the second is the height (6), in inches.
import numpy import geopandas import matplotlib.pyplot as plt countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson") countries = countries.to_crs("EPSG:3857") plt.rcParams['figure.figsize'] = [9, 6]
The GeoDataFrame plot() method can be used to with the name of the attribute to visualize the variable as a choropleth map.
The Matplotlib show() function displays the plotted map.
axis = countries.plot("MM_BTU_per_Capita", cmap = "coolwarm", legend=True, scheme="quantiles") axis.set_axis_off() plt.show()
The Pandas info() method shows the available attributes with their data types and number of valid (non-null) values.
print(countries.info())
RangeIndex: 175 entries, 0 to 174 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country_Code 175 non-null object 1 Country_Name 175 non-null object 2 Longitude 175 non-null float64 3 Latitude 175 non-null float64 4 WB_Region 171 non-null object 5 WB_Income_Group 170 non-null object 6 Population 170 non-null float64 7 GNI_PPP_B_Dollars 162 non-null float64 8 GDP_per_Capita_PPP_Dollars 162 non-null float64 9 MJ_per_Dollar_GDP 166 non-null float64 10 Fuel_Percent_Exports 148 non-null float64 11 Resource_Rent_Percent_GDP 168 non-null float64 12 Exports_Percent_GDP 161 non-null float64 13 Imports_Percent_GDP 161 non-null float64 14 Industry_Percent_GDP 167 non-null float64 15 Agriculture_Percent_GDP 167 non-null float64 16 Military_Percent_GDP 150 non-null float64 17 Gini_Index 126 non-null float64 18 Land_Area_sq_km 169 non-null float64 19 Arable_Land_HA_per_Capita 168 non-null float64 20 CO2_per_Capita_Tonnes 165 non-null float64 21 HDI 164 non-null float64 22 Regime_Type 164 non-null object 23 Democracy_Index 164 non-null float64 24 CO2_Emissions_MM_Tonnes 172 non-null float64 25 Primary_Consumption_Quads 172 non-null float64 26 Coal_Consumption_Quads 172 non-null float64 27 Gas_Consumption_Quads 172 non-null float64 28 Oil_Consumption_Quads 172 non-null float64 29 Nuclear_Consumption_Quads 31 non-null float64 30 Renewable_Consumption_Quads 172 non-null float64 31 Primary_Production_Quads 172 non-null float64 32 Coal_Production_Quads 172 non-null float64 33 Gas_Production_Quads 172 non-null float64 34 Oil_Production_Quads 172 non-null float64 35 Nuclear_Production_Quads 31 non-null float64 36 Renewable_Production_Quads 172 non-null float64 37 Electricity_Capacity_gW 172 non-null float64 38 Electricity_Generation_tWh 172 non-null float64 39 Electricity_Consumption_tWh 172 non-null float64 40 Oil_Production_Mbd 172 non-null float64 41 Oil_Consumption_Mbd 172 non-null float64 42 Oil_Reserves_B_Barrels 169 non-null float64 43 Gas_Production_bcf 172 non-null float64 44 Gas_Consumption_bcf 172 non-null float64 45 Gas_Reserves_tcf 161 non-null float64 46 Coal_Production_Mst 172 non-null float64 47 Coal_Consumption_Mst 172 non-null float64 48 Coal_Reserves_Mst 172 non-null float64 49 MM_BTU_per_Capita 169 non-null float64 50 Renewable_Percent 172 non-null float64 51 geometry 175 non-null geometry dtypes: float64(46), geometry(1), object(5) memory usage: 71.2+ KB
Distributions
The dependent variable we will use in this example is MM_BTU_per_Capita (millions of BTU of energy per person per year).
- The British thermal unit (BTU) is an energy unit commonly used in American energy literature which is equivalent to the amount of heat needed to raise the temperature of one pound of water at sea level by one degree Fahrenheit.
- The BTU is slightly larger than the kilojoule that is commonly used for measuring amounts of energy in scientific literature.
- MM is a common abbreviation for million also used in energy literature.
- For context, US per capita energy consumption in 2022 was around 301 MM BTU (EIA 2023).
A distribution is the manner in which the values in a variable are spread across the range of possible values.
Histograms are charts commonly used to visualize distributions by showing the number of entries in different ranges of values.
The pyplot hist() function can be used to create a histogram.
The histogram of the distribution of annual energy use per capita by country is highly skewed to the right, showing that most countries use less energy per year per capita than the US, although there are a handful of countries in the right tail with very high energy consumption.
axis = plt.hist(countries["MM_BTU_per_Capita"]) plt.xlabel("MM BTU per Capita") plt.ylabel("Number of Countries") plt.show()
A quantile shows the values in a distribution below given percentages of the population.
- The 0% quantile is the minimum value.
- The 50% quantile is the median where 50% of the values in the distribution are at or below the median value.
- The 100% quantile is the maximum value.
- Quantiles that divide the distribution into four groups are called quartiles and quantiles that divide the distribution into five groups are called quintiles.
The Pandas describe() function can be used to show quantile values as well as the mean and standard deviation.
- With this variable, half of all countries use 53.7 MM BTU of energy per capita or less per year.
- Because the distribution is heavily skewed to the right, the median is a more descriptive central tendency than the mean.
print(countries["MM_BTU_per_Capita"].describe())
count 169.000000 mean 89.017438 std 119.299077 min 0.777000 25% 10.699000 50% 53.718000 75% 109.448000 max 803.056000 Name: MM_BTU_per_Capita, dtype: float64
Ranking
We can use the Pandas sort_values() method to sort the data in order to see the highest and lowest values.
- To simplify the listing of rows from a data set with a large number of columns, we subset only the country name and energy consumption columns.
- To avoid cluttering the list with countries that have no data we use the dropna() method.
- The reset_index() method resets the row numbers in the sorted data so they reflect the new sorted order when the data is displayed.
- We use the ascending=False to sort in decreasing order so the most energy-intensive countries are at the top of the list.
The head() method shows the rows at the top (head) of the list.
The highest energy users are countries with energy-intensive industries. These highly localized areas of high releases are also visible on the map above.
consumption = countries[["Country_Name", "MM_BTU_per_Capita"]] consumption = consumption.dropna() consumption = consumption.sort_values("MM_BTU_per_Capita", ascending=False) consumption = consumption.reset_index(drop=True) print(consumption.head())
Country_Name MM_BTU_per_Capita 20 Brunei 410.541 156 Trinidad and Tobago 480.882 162 United Arab Emirates 500.898 68 Iceland 598.031 127 Qatar 803.056
The Pandas tail() function shows the rows at the bottom of the list, which are all countries at low levels of industrialization.
print(consumption.tail())
Country_Name MM_BTU_per_Capita 139 Somalia 0.777 27 Central African Rep. 1.021 23 Burundi 1.056 39 Dem. Rep. Congo 1.512 28 Chad 1.563
Multiple Regression
One way to evaluate the influence of multiple factors on an effect is the use of multiple regression, which creates a model that considers multiple explanatory (independent) variables to predict the outcome of a response (dependent) variable (Hayes 2022).
Adding additional variables beyond simple bivariate correlation can improve both the explanatory and predictive value of a model.
y = β1x1 + β2x2 ... + e
- y is the dependent variable.
- xn are the independent variables.
- βn are the regression parameters estimated from the data.
- e is the error term.
The Resource Curse
The resource curse is a hypothesis that countries with an abundance of oil, mineral or other natural resource wealth often have poorer economic and political environments than countries with fewer natural resources.
This analysis investigates the resource curse using factors associated with high levels of democratic governance, as quantified by the V-Dem liberal democracy index which ranges from zero (autocracy) to one (liberal democracy)
axis = countries.plot("Democracy_Index", legend=True, cmap = "coolwarm_r", edgecolor = "#00000040", scheme="naturalbreaks") axis.set_axis_off() plt.show()
Model Variables
The appropriate approach to selection of which variables should be included in a multiple regression model is contested. Two common approaches codified by Hocking (1976) include:
- Forward selection: Available variables are added one at a time until either all variables are use or some criterion for fit is met.
- Backward elimination: All available variables are initially included in the model, and variables are successively removed when they are either redundant (multicollinearity) or make no meaningful contribution to model fit.
To avoid spurious correlations and overfitting, the initial pool of considered variables should be chosen based on a rationally hypothesized contribution that they could make to the model, such as level of development contributing to per capita energy use.
Correlation can be used to find candidates, although low correlation variables can sometimes be important to a model when the relationship between variables is complex.
r_squared = countries.select_dtypes(include=numpy.number) r_squared = r_squared.corr()["Democracy_Index"]**2 r_squared = r_squared.sort_values(ascending=False) r_squared = r_squared.round(3) print(r_squared)
Democracy_Index 1.000 HDI 0.363 GDP_per_Capita_PPP_Dollars 0.315 Agriculture_Percent_GDP 0.163 Resource_Rent_Percent_GDP 0.160 Industry_Percent_GDP 0.128 Fuel_Percent_Exports 0.091 MJ_per_Dollar_GDP 0.077 MM_BTU_per_Capita 0.075 Exports_Percent_GDP 0.062 Military_Percent_GDP 0.060 Gini_Index 0.055 ...
For this example examining the resource curse with the Democracy Index as the dependent variable, we create a GeoDataFrame with the initial variables below.
- We also include the spatial geometry, Longitude, and Latitude columns to facilitate mapping and spatial regression later.
- Because the regression functions will fail on missing falues, we use the dropna() method to remove those missing values.
dependent_name = ["Democracy_Index"] independent_names = ["GDP_per_Capita_PPP_Dollars", "Military_Percent_GDP", "Resource_Rent_Percent_GDP", "Industry_Percent_GDP"] model_data = countries[dependent_name + independent_names + ["geometry", "Latitude", "Longitude"]] model_data = model_data.dropna()
Transformation
One of the assumptions of OLS is that the variables are normally distributed. Resulting coefficients from OLS with non-normal variable distributions are unreliable.
The pyplot hist() method can be used to create histograms for all variables in a DataFrame, and the Pandas describe() method can be used to display basic descriptive statistics.
model_data.hist() print(model_data.describe())
The GDP, military spending, and resource rent variables are all skewed right, since a handful of countries have unusually high values in all those categories.
A transformation is a modification to a variable that brings the variable distribution closer to normality. A commonly used transformation is the same logarithmic transformation used with the x/y scatter charts above. One is added to all values because some variables have values of zero and a logarithm of zero is undefined.
transform_vars = ["GDP_per_Capita_PPP_Dollars", "Military_Percent_GDP", "Resource_Rent_Percent_GDP"] model_data[transform_vars] = numpy.log(model_data[transform_vars] + 1) model_data.hist() print(model_data.describe())
Standardization
Because the different variables measure different phenomena and different scales, we can standardize the data values by converting them to z-scores that can then be compared with each other.
A z-score is the number of standard deviations that a value differs from the mean for the whole data set.
standardize = dependent_name + independent_names model_data[standardize] = (model_data[standardize] - model_data[standardize].mean()) / model_data[standardize].std() model_data.hist()
OLS Regression
PySal is a Python package that includes non-spatial and spatial regression functions.
import pysal.lib import pysal.model
Non-spatial ordinary least squares regression can be performed with the OLS() function.
ols_model = pysal.model.spreg.OLS(model_data[dependent_name].values, model_data[independent_names].values, name_y = dependent_name, name_x = independent_names) print(ols_model.summary)
As with correlation, the quality of model fit is evaluated with the R-squared value, with higher values indicating better fit.
R-squared is commonly interpreted as the percentage of variance in the independent variable explained by the model.
Adjusted r-squared is the r-squared value that should be used with multiple regression models because it is adjusted to reflect the addition of independent variables that increase r-squared but do not make a significant contribution to model fit.
REGRESSION ---------- SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ----------------------------------------- Data set : unknown Weights matrix : None Dependent Variable :['Democracy_Index'] Number of Observations: 144 Mean dependent var : -0.0000 Number of Variables : 5 S.D. dependent var : 1.0000 Degrees of Freedom : 139 R-squared : 0.5872 Adjusted R-squared : 0.5753 Sum squared residual: 59.037 F-statistic : 49.4224 Sigma-square : 0.425 Prob(F-statistic) : 8.285e-26 S.E. of regression : 0.652 Log likelihood : -140.128 Sigma-square ML : 0.410 Akaike info criterion : 290.256 S.E of regression ML: 0.6403 Schwarz criterion : 305.105 ------------------------------------------------------------------------------------ Variable Coefficient Std.Error t-Statistic Probability ------------------------------------------------------------------------------------ CONSTANT 0.0000000 0.0543090 0.0000000 1.0000000 GDP_per_Capita_PPP_Dollars 0.6750323 0.0693305 9.7364443 0.0000000 Military_Percent_GDP -0.2199530 0.0586037 -3.7532291 0.0002557 Resource_Rent_Percent_GDP 0.0308747 0.0836677 0.3690162 0.7126772 Industry_Percent_GDP -0.4131628 0.0743013 -5.5606395 0.0000001 ------------------------------------------------------------------------------------ REGRESSION DIAGNOSTICS MULTICOLLINEARITY CONDITION NUMBER 2.757 TEST ON NORMALITY OF ERRORS TEST DF VALUE PROB Jarque-Bera 2 4.918 0.0855 DIAGNOSTICS FOR HETEROSKEDASTICITY RANDOM COEFFICIENTS TEST DF VALUE PROB Breusch-Pagan test 4 4.756 0.3133 Koenker-Bassett test 4 6.822 0.1456 ================================ END OF REPORT =====================================
Residuals
With regression models, misspecification is when the variables provided for the model are inadequate or inappropriate for meaningfully modelling the analyzed phenomena.
Residuals are the differences between modeled values and actual values.
residuals = actual - predicted
In this case, the low residual values in Western Asia, Russia, and China indicate lower levels of democracy than predicted by the model. Isolated high residual values indicate higher levels of democracy than predicted.
These residuals represent characteristic(s) of those areas not covered by the chosen variables and you would probably want to explore the addition of additional variables to improve model fit.
residuals = model_data residuals["Residuals"] = ols_model.u axis = residuals.plot("Residuals", legend=True, cmap = "coolwarm_r", edgecolor = "#00000040", scheme="stdmean") axis.set_axis_off() plt.show()
Multicollinearity
Multicollinearity is when two or more independent variables in a regression model are correlated with each other in a way that biases the model coefficients and makes them unreliable.
While multicollinearity will not affect model fit, it can make model parameters ambiguous and cause coefficients to be marked as statistically insignificant even though their presence improves model fit (Ghosh 2017).
Variance inflation factor (VIF) is a value that can be used to identify independent variables that are correlated with each other.
The PySAL spreg.vif() can be used to display VIF values. The VIF values for each independent variable are listed in the same order as the OLS model summary().
To reduce multicollinearity, successively remove the variables with the highest VIF until all VIF values are at least below five and, preferably, below 2.5 (Choueiry 2021).
In this case all VIFs (the first column) are below 2.5, so the model has low multicollinearity and all variables can stay.
pysal.model.spreg.vif(ols_model)
[None, (1.6183711717978753, 0.6179052231195415), (1.1563235730276584, 0.8648098363865844), (2.3569244378285616, 0.42428173935066904), (1.8587572304188513, 0.5379938722684406)]
Spatial Regression
One of the major issues when using multiple regression with geospatial data is that the ordinary least squares algorithm used to build the model assumes that the variables are not autocorrelated, meaning that that there is no correlation between sequences of values at different parts of the distribution.
Geographical phenomena are very commonly clustered together in space, which means that geospatial variables very commonly exhibit autocorrelation that causes multiple regression model coefficients and outputs to be biased so the model outputs are untrustworthy.
Spatial regression models involve techniques that compensate for spatial autocorrelation so the model coefficients and outputs are more trustworthy.
Weights Matrix
To use the spatial regression tools in pysal, we must first create a weights matrix that indicates the neighbors of each feature.
The KNN() function creates a weights matrix where you specify a value k, and the neighbors of each feature are the nearest k features.
import pysal.lib weights = pysal.lib.weights.KNN.from_dataframe(model_data, k=4) axis = model_data.plot(edgecolor="lightgray", facecolor="none") model_data["index"] = model_data.index weights.plot(gdf=model_data, indexed_on="index", ax=axis) axis.set_axis_off() plt.show()
The weights matrix can be passed to the OLS() function along with the spat_diag and moran parameters to display spatial autocorrelation statistics at the bottom of the model summary.
ols_model = pysal.model.spreg.OLS(model_data[dependent_name].values, model_data[independent_names].values, weights, spat_diag = True, moran=True, name_y = dependent_name, name_x = independent_names) print(ols_model.summary)
Although this particular data set is population rather than sampled data, the p-values in the PROB column are useful for knowing whether spatial autocorrelation should be addressed in the model. The PROB are p-values for rejecting the null hypothesis that spatial autocorrelation is not a problem.
Low p-values (below 0.05) are indicators of spatial autocorrelation, and in all tests except spatial errors, spatial autocorrelation is an issue in this model.
... DIAGNOSTICS FOR SPATIAL DEPENDENCE TEST MI/DF VALUE PROB Moran's I (error) 0.1434 3.126 0.0018 Lagrange Multiplier (lag) 1 10.796 0.0010 Robust LM (lag) 1 3.921 0.0477 Lagrange Multiplier (error) 1 6.937 0.0084 Robust LM (error) 1 0.061 0.8046 Lagrange Multiplier (SARMA) 2 10.858 0.0044
Spatial Lag Regression
Spatial lag regression performs multiple regression modeling by adding a model term that considers diffusion of the dependent variable across adjacent areas (Sparks 2015).
lag_model = pysal.model.spreg.ML_Lag(model_data[dependent_name].values, model_data[independent_names].values, weights, name_y = dependent_name[0], name_x = independent_names) print(lag_model.summary)
REGRESSION ---------- SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL) ----------------------------------------------------------------- Data set : unknown Weights matrix : unknown Dependent Variable :Democracy_Index Number of Observations: 144 Mean dependent var : -0.0000 Number of Variables : 6 S.D. dependent var : 1.0000 Degrees of Freedom : 138 Pseudo R-squared : 0.6223 Spatial Pseudo R-squared: 0.5973 Sigma-square ML : 0.375 Log likelihood : -134.921 S.E of regression : 0.612 Akaike info criterion : 281.842 Schwarz criterion : 299.660 ------------------------------------------------------------------------------------ Variable Coefficient Std.Error z-Statistic Probability ------------------------------------------------------------------------------------ CONSTANT 0.0244989 0.0510394 0.4800004 0.6312271 GDP_per_Capita_PPP_Dollars 0.5450982 0.0755767 7.2125178 0.0000000 Military_Percent_GDP -0.1769419 0.0555886 -3.1830589 0.0014573 Resource_Rent_Percent_GDP 0.0450290 0.0794133 0.5670216 0.5706995 Industry_Percent_GDP -0.3570708 0.0710419 -5.0262027 0.0000005 W_Democracy_Index 0.0723270 0.0206303 3.5058652 0.0004551 ------------------------------------------------------------------------------------ ================================ END OF REPORT =====================================
Spatial Error Regression
In contrast to the focus of spatial lag regression on autocorrelation in the dependent variable, spatial error regression models spatial interactions in the independent variables that is reflected in the error terms (Eilers 2019).
err_model = pysal.model.spreg.ML_Error(model_data[dependent_name].values, model_data[independent_names].values, weights, name_y = dependent_name[0], name_x = independent_names) print(err_model.summary)
REGRESSION ---------- SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL) ------------------------------------------------------------------- Data set : unknown Weights matrix : unknown Dependent Variable :Democracy_Index Number of Observations: 144 Mean dependent var : -0.0000 Number of Variables : 5 S.D. dependent var : 1.0000 Degrees of Freedom : 139 Pseudo R-squared : 0.5844 Sigma-square ML : 0.381 Log likelihood : -136.487 S.E of regression : 0.618 Akaike info criterion : 282.974 Schwarz criterion : 297.823 ------------------------------------------------------------------------------------ Variable Coefficient Std.Error z-Statistic Probability ------------------------------------------------------------------------------------ CONSTANT 0.0293325 0.0767635 0.3821155 0.7023757 GDP_per_Capita_PPP_Dollars 0.6429572 0.0772592 8.3220837 0.0000000 Military_Percent_GDP -0.1681504 0.0587230 -2.8634501 0.0041905 Resource_Rent_Percent_GDP -0.0171515 0.0867640 -0.1976799 0.8432955 Industry_Percent_GDP -0.3729551 0.0748255 -4.9843313 0.0000006 lambda 0.0822485 0.0257157 3.1983810 0.0013820 ------------------------------------------------------------------------------------ ================================ END OF REPORT =====================================
AIC
The Akaike information criterion (AIC) is an estimate of model prediction error, with lower values representing better fit (Akaike 1974; Wikipedia 2022).
In choosing between models using the same variables, the model with the lower AIC is the model with the better fit.
The spatial lag model has the lowest AIC of the three models, so it is chosen as the model with the best fit.
Model | AIC |
---|---|
OLS | 290.256 |
Spatial lag | 281.842 (best fit) |
Spatial error | 282.974 |
model_data["Residuals"] = lag_model.u axis = model_data.plot("Residuals", legend=True, scheme="stdmean", cmap = "coolwarm_r", edgecolor = "#00000040") axis.set_axis_off() plt.show()
Geographically Weighted Regression
The regression methods used above assume that the relationships between the variables are the same over space (stationarity). However, the autocorrelation patterns in the residuals indicate that something is different about the Middle East, Russia, and China.
Geographically weighted regression (GWR) is an exploratory creates separate local models for each feature. Mapping of the local model coefficients gives a view of how factors differ in importance across the analyzed area and offers insights into possible additional factors that can be added to improve model fit.
mgwr is a Python package for geographically weighted regression.
import mgwr.gwr import mgwr.sel_bw
coordinates = model_data[["Longitude", "Latitude"]] gwr_selector = mgwr.sel_bw.Sel_BW(coordinates, model_data[dependent_name].values, model_data[independent_names].values) gwr_bw = gwr_selector.search(bw_min=2) print(gwr_bw) gwr_model = mgwr.gwr.GWR(coordinates, model_data[dependent_name].values, model_data[independent_names].values, bw=99).fit() gwr_model.summary()
=========================================================================== Model type Gaussian Number of observations: 144 Number of covariates: 5 Global Regression Results --------------------------------------------------------------------------- Residual sum of squares: 59.037 Log-likelihood: -140.128 AIC: 290.256 AICc: 292.869 BIC: -631.767 R2: 0.587 Adj. R2: 0.575 Variable Est. SE t(Est/SE) p-value ------------------------------- ---------- ---------- ---------- ---------- X0 0.000 0.054 0.000 1.000 X1 0.675 0.069 9.736 0.000 X2 -0.220 0.059 -3.753 0.000 X3 0.031 0.084 0.369 0.712 X4 -0.413 0.074 -5.561 0.000 Geographically Weighted Regression (GWR) Results --------------------------------------------------------------------------- Spatial kernel: Adaptive bisquare Bandwidth used: 99.000 Diagnostic information --------------------------------------------------------------------------- Residual sum of squares: 46.484 Effective number of parameters (trace(S)): 15.836 Degree of freedom (n - trace(S)): 128.164 Sigma estimate: 0.602 Log-likelihood: -122.917 AIC: 279.506 AICc: 284.267 BIC: 329.507 R2: 0.675 Adjusted R2: 0.634 Adj. alpha (95%): 0.016 Adj. critical t value (95%): 2.443 Summary Statistics For GWR Parameter Estimates --------------------------------------------------------------------------- Variable Mean STD Min Median Max -------------------- ---------- ---------- ---------- ---------- ---------- X0 -0.068 0.147 -0.363 -0.072 0.213 X1 0.593 0.180 0.239 0.619 0.891 X2 -0.170 0.064 -0.345 -0.176 -0.043 X3 -0.066 0.166 -0.322 -0.063 0.276 X4 -0.319 0.117 -0.511 -0.343 -0.071 ===========================================================================
The params property contains the local coefficients for each feature.
for z in range(1, len(independent_names) + 1): model_data["GDP_per_Capita_GWR"] = gwr_model.params[:,z] axis = model_data.plot("GDP_per_Capita_GWR", legend=True, cmap = "coolwarm_r", edgecolor = "#00000040", scheme="naturalbreaks") plt.title(independent_names[z - 1]) axis.set_axis_off() plt.show()
Notably, the percentage of GDP from resource rents has an especially negative relationship with democracy in the resource-dependent countries of central Africa and in Russia. This gives some credence to the resource curse hypothesis in specific situations.