Area Maps in Python

Geospatial data representing areas is commonly used for describing social and environmental characteristics of our world.

This tutorial covers the basic descriptive analysis and visualization of geospatial area data in Python using GeoPandas and Matplotlib.

Areas

Geospatial data describes what is where. There are locations on the surface of the earth (where) that have names or characteristics (what).

Areas in geospatial data are represented with polygons.

Vectors vs. raster data

Pandas and GeoPandas

Pandas is a data analysis and manipulation module that can be used in Python.

At the core of pandas is a tabular object type called a DataFrame. Pandas DataFrames are conceptually similar to spreadsheet tables.

Figure
Pandas DataFrame

GeoPandas is an extension of pandas and shapely data types and operations for working with geospatial data.

GeoPandas GeoDataFrames are pandas DataFrames with an additional geometry column that contains geometries for representing points, lines or polygons associated with each row.

Figure
GeoPandas GeoDataFrame

Columns of values in DataFrames/GeoDataFrames can be accessed the column name inside square brackets:

countries["CO2_per_Capita"]
0      0.161
1      1.692
2      4.010
3      0.779
4      3.741
Name: CO2_per_Capita, Length: 5, dtype: float64

Columns can also be accessed with dot notation, but only if the column names adhere to Python variable naming limitations.

countries.CO2_per_Capita
0      0.161
1      1.692
2      4.010
3      0.779
4      3.741
Name: CO2_per_Capita, Length: 5, dtype: float64

Multiple columns can be accessed by including a list in the brackets.

countries[["Country_Name", "CO2_per_Capita"]]
    Country_Name  CO2_per_Capita
0    Afghanistan           0.161
1        Albania           1.692
2        Algeria           4.010
3         Angola           0.779
4      Argentina           3.741

[5 rows x 2 columns]

Individual rows can be accessed by referencing the DataFrame loc property.

countries.loc[0]
Country_Name               Afghanistan
CO2_per_Capita_                  0.161
MM_BTU_per_Capita                2.739
geometry		   POLYGON(...

Ranges of rows can be accessed with indices separated by a colon.

countries.loc[0:2]
   Country_Name		CO2_per_Capita		MM_BTU_per_Capita	geometry
0  Afghanistan          0.161			2.739			POLYGON(...
1  Albania		1.692			39.804			POLYGON(...
2  Algeria		4.010			61.465			POLYGON(...

[3 rows x 4 columns]

The head method can be used as a shortcut for accessing the top five rows.

 countries.head()
   Country_Name		CO2_per_Capita		MM_BTU_per_Capita	geometry
0  Afghanistan          0.161			2.739			POLYGON(...
1  Albania		1.692			39.804			POLYGON(...
2  Algeria		4.010			61.465			POLYGON(...
3  Angola		0.779			12.514			POLYGON(...
4  Argentina		3.791			82.431			POLYGON(...

Acquire and Examine the Data

Load the Data

For this tutorial, the example data is country-level energy and socioeconomic data collected by the U.S. Energy Information Administration (EIA), The World Bank, and The V-Dem Institute. A GeoJSON file and metadata are available here..

import geopandas

import matplotlib.pyplot as plt

countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson")

countries = countries.to_crs("EPSG:3857")

Depending on your configuration, maps and other figure images generated in Jupyter notebooks can be small and difficult to read. You can increase the size by setting the figure.figsize plot parameter at the top of your notebook. The first value in the array is the width and the second is the height, in inches.

plt.rcParams['figure.figsize'] = [9, 6]

List Fields

You can get a list of field (column) names from the .columns property of the GeoDataFrame.

countries.columns
Index(['Country_Code', 'Country_Name', 'Longitude', 'Latitude', 'Population',
       'GNI_PPP_B_Dollars', 'GDP_per_Capita_PPP_Dollars', 'MJ_per_Dollar_GDP',
       'Fuel_Percent_Exports', 'Resource_Rent_Percent_GDP',
       'Exports_Percent_GDP', 'Imports_Percent_GDP', 'Industry_Percent_GDP',
       'Agriculture_Percent_GDP', 'Military_Percent_GDP', 'Gini_Index',
       'Regime_Type', 'Democracy_Index', 'CO2_Emissions_MM_Tonnes',
       'Primary_Consumption_Quads', 'Coal_Consumption_Quads',
       'Gas_Consumption_Quads', 'Oil_Consumption_Quads',
       'Nuclear_Consumption_Quads', 'Renewable_Consumption_Quads',
       'Primary_Production_Quads', 'Coal_Production_Quads',
       'Gas_Production_Quads', 'Oil_Production_Quads',
       'Nuclear_Production_Quads', 'Renewable_Production_Quads',
       'Electricity_Capacity_gW', 'Electricity_Generation_tWh',
       'Electricity_Consumption_tWh', 'Oil_Production_Mbd',
       'Oil_Consumption_Mbd', 'Oil_Reserves_B_Barrels', 'Gas_Production_bcf',
       'Gas_Consumption_bcf', 'Gas_Reserves_tcf', 'Coal_Production_Mst',
       'Coal_Consumption_Mst', 'Coal_Reserves_Mst', 'MM_BTU_per_Capita',
       'Renewable_Percent', 'geometry'],
      dtype='object')

For a more-detailed listing with the data types and number of features, you can use the GeoDataFrame info() method.

countries.info()

RangeIndex: 175 entries, 0 to 174
Data columns (total 51 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Country_Code                 175 non-null    object  
 1   Country_Name                 175 non-null    object  
 2   Longitude                    175 non-null    float64 
 3   Latitude                     175 non-null    float64 
 4   WB_Region                    171 non-null    object  
 5   WB_Income_Group              170 non-null    object  
 6   Population                   170 non-null    float64 
 7   GNI_PPP_B_Dollars            162 non-null    float64 
 8   GDP_per_Capita_PPP_Dollars   162 non-null    float64 
 9   MJ_per_Dollar_GDP            166 non-null    float64 
 10  Fuel_Percent_Exports         148 non-null    float64 
 11  Resource_Rent_Percent_GDP    168 non-null    float64 
 12  Exports_Percent_GDP          161 non-null    float64 
 13  Imports_Percent_GDP          161 non-null    float64 
 14  Industry_Percent_GDP         167 non-null    float64 
 15  Agriculture_Percent_GDP      167 non-null    float64 
 16  Military_Percent_GDP         150 non-null    float64 
 17  Gini_Index                   126 non-null    float64 
 18  Land_Area_sq_km              169 non-null    float64 
 19  Arable_Land_HA_per_Capita    168 non-null    float64 
 20  HDI                          164 non-null    float64 
 21  Regime_Type                  164 non-null    object  
 22  Democracy_Index              164 non-null    float64 
 23  CO2_Emissions_MM_Tonnes      172 non-null    float64 
 24  Primary_Consumption_Quads    172 non-null    float64 
 25  Coal_Consumption_Quads       172 non-null    float64 
 26  Gas_Consumption_Quads        172 non-null    float64 
 27  Oil_Consumption_Quads        172 non-null    float64 
 28  Nuclear_Consumption_Quads    31 non-null     float64 
 29  Renewable_Consumption_Quads  172 non-null    float64 
 30  Primary_Production_Quads     172 non-null    float64 
 31  Coal_Production_Quads        172 non-null    float64 
 32  Gas_Production_Quads         172 non-null    float64 
 33  Oil_Production_Quads         172 non-null    float64 
 34  Nuclear_Production_Quads     31 non-null     float64 
 35  Renewable_Production_Quads   172 non-null    float64 
 36  Electricity_Capacity_gW      172 non-null    float64 
 37  Electricity_Generation_tWh   172 non-null    float64 
 38  Electricity_Consumption_tWh  172 non-null    float64 
 39  Oil_Production_Mbd           172 non-null    float64 
 40  Oil_Consumption_Mbd          172 non-null    float64 
 41  Oil_Reserves_B_Barrels       169 non-null    float64 
 42  Gas_Production_bcf           172 non-null    float64 
 43  Gas_Consumption_bcf          172 non-null    float64 
 44  Gas_Reserves_tcf             161 non-null    float64 
 45  Coal_Production_Mst          172 non-null    float64 
 46  Coal_Consumption_Mst         172 non-null    float64 
 47  Coal_Reserves_Mst            172 non-null    float64 
 48  MM_BTU_per_Capita            169 non-null    float64 
 49  Renewable_Percent            172 non-null    float64 
 50  geometry                     175 non-null    geometry
dtypes: float64(45), geometry(1), object(5)
memory usage: 69.9+ KB

Default Choropleth

A thematic map "is used to display the spatial pattern of a themes or attributes" (Slocum et al. 2009, 1). This is in contrast to a reference map (like Google Maps) which provides a general overview of information, often representing multiple variables.

A choropleth is a type of thematic map where areas are colored based on a single variable that describes some characteristic of those areas.

To display a default choropleth map:

axis = countries.plot("Renewable_Percent", legend=True)

plt.show()
Figure
Default choropleth: Percent renewable energy by country (EIA)

Descriptive Statistics

Distributions

The distribution of a set of values is the manner in which the values are spread across the possible range of values.

A histogram is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies" (Merriam-Webster 2023).

Histograms are commonly used to examine variables and determine their distribution.

The type of distribution can influence the design choices you make with your visualizations.

Example histograms
Histograms of common types of distributions (Python code)

The hist() function from the pyplot module can be used to create histograms.

The parameter should be the column from the GeoDataFrame.

axis = plt.hist(countries["Renewable_Percent"])

plt.show()
Figure
Histogram of percentage renewable energy of primary energy use by country

The histplot() function from the Seaborn module can be used to draw a histogram with a smoothed density curve line that can make the general shape of the histogram clearer when the data is noisy.

import seaborn as sns

axis = sns.histplot(countries["Renewable_Percent"], kde=True)

plt.show()
Figure
Kernel density histogram of percentage renewable energy of primary energy use by country

The Pandas skew() method can be used to quantify the level of skew, with positive numbers representing skew to the right and negative numbers representing skew to the left.

The high positive value clearly supports the observation from the histogram that the distribution is clearly skewed to the right (positive skew).

countries["Renewable_Percent"].skew()
1.5836553372979751

Central Tendency

Central tendency statistics (means, medians, modes) aggregate a set of values into a single value that is a representative central value for comparison with other central values.

The Pandas describe() method lists the central tendency values for a variable, along with the range and the quantiles.

countries["Renewable_Percent"].describe()
count    170.000000
mean      33.420588
std       27.956439
min        0.000000
25%       10.882500
50%       24.990000
75%       53.037500
max       96.240000

Ordinal Ranking

In some cases, you might like to know information about the areas with the highest and lowest values in the distribution.

The countries with the highest percentage of renewable energy usage tend to be countries with robust hydroelectric power, and the countries with the lowest percentage of renewable energy use are petrostates or fossil-fuel dependent island states.

renewables = countries[["Country_Name", "Renewable_Percent"]].dropna()

renewables["Rank"] = renewables["Renewable_Percent"].rank(ascending=False)

renewables = renewables.sort_values("Rank", ascending=True)

renewables.head()
Country_Name 	Renewable_Percent 	Rank
Bhutan 			82.766 		1.0
Iceland 		79.135 		2.0
Paraguay 		74.106 		3.0
Dem. Rep. Congo 	72.897 		4.0
Tajikistan 		62.814 		5.0
renewables.tail()
Country_Name 	Renewable_Percent 	Rank
Oman 			0.011 		168.0
Brunei 			0.010 		169.0
Trinidad and Tobago 	0.007 		170.0
W. Sahara 		0.000 		171.0
Turkmenistan 		-0.615 		172.0

Continuous Color Choropleths

Single Hue

Choropleths that use a range of levels of brightness of a single hue are commonly used for mapping continuous variables.

axis = countries.plot("Renewable_Percent", cmap = "copper", \
	legend=True, legend_kwds={"orientation":"horizontal"})

axis.set_axis_off()

plt.show()
Figure
A single hue choropleth with a colorbar

Diverging Color

A diverging color choropleth uses a colormap based on two opposite hues separated by a neutral color like white or gray.

The use of diverging color makes it easier to see the difference between areas of low or high values.

The two colors should be complementary (opposite or nearly opposite sides of the color wheel).

Note that attaching an _r reverses the colors of a color map. In this example the seismic colormap goes from blue (low values) to red (high values), while seismic_r goes from red (low values) to blue (high values).

You should generally avoid any color choices that require being able to distinguish between red and green in order to make your maps accessible to the 8% of men of European ancestry who are red-green color blind.

axis = countries.plot("Renewable_Percent", cmap = "seismic_r", \
	legend=True, legend_kwds={"orientation":"horizontal"})

axis.set_axis_off()

plt.show()
Figure
A diverging color choropleth with a colorbar: Percent of renewable energy use by country (EIA)

Custom Colormap

Custom colormaps can be created by passing a list of colors to the LinearSegmentedColormap.from_list() function from the Matplotlib colors module.

Colors can be specified in a number of ways. In this example, we use CSS color names.

import matplotlib.colors as colors

cmap = colors.LinearSegmentedColormap.from_list('cmap', ["sienna", "whitesmoke", "darkgreen"])

axis = countries.plot("Renewable_Percent", cmap = cmap, edgecolor = "#00000040",
	legend=True, legend_kwds={"orientation":"horizontal"})

axis.set_axis_off()

plt.show()
Figure
A custom colormap choropleth with a colorbar

Classified Choropleths

A classified choropleth groups ranges of variable values into classes displayed with a limited set of colors (usually five to seven) that make it easier to identify areas with similar values.

The classification scheme for a classified map indicates how to divide values into classes. The choice of classification scheme depends on the type of distribution and the ultimate purpose of your map.

The classification scheme can be specified in the plot() function call using the scheme parameter.

Classified scheme plots create boxed legends rather than colorbars.

axis = countries.plot("Renewable_Percent", legend=True, 
	cmap = "coolwarm_r", scheme="naturalbreaks", edgecolor = "#00000040",
	legend_kwds={"fontsize":8, "bbox_to_anchor":(0.2, 0.4)})

axis.set_axis_off()

plt.show()
Figure
A classified choropleth of percent of primary energy from renewable sources

Bubble Maps

A graduated symbol map uses differently sized symbols to map variable values. A common example of a graduated symbol map is the "bubble" map that uses differently sized circles based on the variable being mapped. Although circles are most common, other types of icons can be used for aesthetic variety.

When mapping counts or quantities, graduated symbol maps are more appropriate than choropleths when mapping counts rather than amounts (rates).

Because bubble maps do not have outlines of areas, a base map will need to be plotted before the bubbles so the base map sits under the bubbles and provides geographic context. For this base map, we use the country polygons.

centroids = countries.copy()

centroids["geometry"] = centroids.centroid

centroids = centroids.dropna()

centroids["size"] = 500 * centroids["CO2_Emissions_MM_Tonnes"] / max(centroids["CO2_Emissions_MM_Tonnes"])

basemap = countries.plot(facecolor='white', edgecolor='gray')

axis = centroids.plot(ax=basemap, markersize="size", color="navy", categorical=False, legend=True)

axis.set_axis_off()

plt.show()
Figure
A bubble map

Filters

If you want to show only a subset of your data, you can filter specific rows of your data. Filtering is also sometimes called selection or subsetting.

For example, this code filters only the member states of the Association of Southeast Asian Nations.

The isin() method on the Country_Code column returns an boolean array with a true/false value for each row based on whether the strings are contained in the array.

asean = countries[countries.Country_Code.isin(["BRN", "KHM", "IDN", "LAO", 
	"MYS", "MMR", "PHL", "SGP", "THA", "VNM"])]

axis = asean.plot("Renewable_Percent", legend=True, 
	cmap = "coolwarm_r", edgecolor = "#00000040", scheme="naturalbreaks")

axis.set_axis_off()

plt.show()
Figure
A filtered map showing renewable energy percent of total consumption in ASEAN member states

Categorical Maps

Categorical Variable

Categorical variables are variables that have a fixed number of values representing categories.

For example, in this data set, the Regime_Type is the type of national government based on the V-Dem Democracy Index, using the regime classification by Lührmann et al. (2018).

axis = countries.plot("Regime_Type", cmap = "coolwarm_r", edgecolor = "#00000040", 
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.4)})

axis.set_axis_off()

plt.show()
Figure
A categorical variable map by type of governing regieme

Categorical Selection

Categorical variable mapping can be used to map selected areas within a GeoDataFrame.

countries["EU"] = countries.Country_Code.isin(["AUT", "BEL", "BGR", "HRV", \
	"CYP", "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GRC", "HUN", \
	"IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD", "POL", "PRT", \
	"ROU", "SVK", "SVN", "ESP", "SWE"])

axis = countries.plot("EU", cmap = "Blues", edgecolor = "#00000040")

axis.set_axis_off()

plt.show()
Figure
A categorized map showing only EU member states

Base Maps

Maps of world countries are generally recognizable by anyone with at least a rudamentary knowledge of how countries of the world are represented on maps. However, there are situations where the areas being depicted are less obvious, especially with local maps. For those situations it can be helpful to provide a base map that gives a broader context for where the mapped features are located.

The contextily package can be used to plot a base map under your features.

import contextily as cx

asean = countries[countries.Country_Code.isin(["BRN", "KHM", "IDN", "LAO", 
	"MYS", "MMR", "PHL", "SGP", "THA", "VNM"])]

asean = asean.to_crs("EPSG:3857") # Web Mercator

axis = asean.plot("Renewable_Percent", legend=True, 
	cmap = "coolwarm_r", edgecolor = "#00000040", 
	scheme="naturalbreaks", alpha=0.7)

cx.add_basemap(axis)

axis.set_axis_off()

plt.show()
Figure
A filtered map with a base map: Percent renewable energy (EIA)

Interactive Maps with Folium

You can embed interactive maps in Jupyter notebooks with the Folium library.

Choropleths can be created using the folium.Choropleth() function, which has some quirks when working with GeoDataframes:

import folium

renewables = countries[["Country_Code", "Renewable_Percent", "geometry"]]

renewables = renewables.to_crs(epsg=4326)

fmap = folium.Map(location=[30,0], zoom_start=1)

folium.Choropleth(
    geo_data = renewables,
    data = renewables,
    columns=["Country_Code", "Renewable_Percent"],
    key_on="feature.properties.Country_Code",
    fill_color="RdBu",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Percent Renewable"
).add_to(fmap)

fmap
Figure
An interactive map using the folium library

Appendix: Example Distributions

import numpy

import seaborn as sns

fig, axis = plt.subplots(2, 2)

numpy.random.seed(42)

axis[0, 0].set_title("Uniform")

x = numpy.random.uniform(0.0, 1.0, 1000)

sns.histplot(x, ax=axis[0,0], kde=True)

axis[0, 1].set_title("Normal")

x = numpy.random.normal(0, 1, 1000)

sns.histplot(x, ax=axis[0,1], kde=True)

axis[1, 0].set_title("Skewed Right (Positive)")

x = numpy.random.chisquare(4, 1000)

sns.histplot(x, ax=axis[1,0], kde=True)

axis[1, 1].set_title("Bimodal")

x = 25 - numpy.random.chisquare(3, 1000)

x = numpy.concatenate((numpy.random.normal(-2,1, 600), numpy.random.normal(2,1,400)))

sns.histplot(x, ax=axis[1,1], kde=True)

plt.show()