Exploring Data in Python

This tutorial covers the creation of basic descriptive statistics in Python that can be used to explore geospatial data.

Loading and Mapping Data

The example data used for this tutorial is a collection of country-level energy indicators for the year 2019 from the US Energy Information Administration, the World Bank, and others.

The following code shows how to load an online GeoJSON file into a GeoPandas GeoDataFrame for visualization and analysis.

GeoPandas is an extension of Pandas and Shapely libraries for working with geospatial data.
Matplotlib is a visualization library. matplotlib.pyplot is a state-based interface into matplotlib.
The GeoPandas read_file() function reads a geospatial data file into a GeoDataFrame.
The figure.figsize in rcParams sets the size of plots (width, height). The appropriate value for your system will depend on the type of display you are using and your aesthetic preferences.
The GeoDataFrame plot() method creates a plot using matplotlib.

For this example, we will color the countries by the MM_BTU_per_Capita variable (millions of BTU of energy by the categorical PrimSource variable representing the primary fuel source for the countries.
The legend=True parameter indicates to plot a legend so the reader can tell what the dot colors mean.
The markersize=2 parameter indicates to plot a two point dot that is smaller than the default, which makes the large number of dots easier to distinguish on this small map. As a graphical size unit, a point is 1/72nd of an inch (Merriam-Webster 2023)

The plt.show() function needs to be called after plotting to display plots set up with prior plot() calls.

import geopandas

import matplotlib.pyplot as plt

countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson")

plt.rcParams['figure.figsize'] = [9, 6]

axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm")

plt.show()

2019 per capita power consumption by country

To improve cartographic representation we use the to_crs() method to reproject the data from WGS 84 lat/long to the World Robinson projection commonly used on print maps (WKID ESRI 54030). Projections are described in greater detail in this tutorial on Map Projections.

The set_axis_off() method turns off the unnecessary map frame and axis labels, which are largely meaningless when mapping in a web Mercator projection. Unfortunately, there is no global parameter to turn axes off, so this function must be called before each plt.show().

Mapping permits visual analysis of geospatial data for answering general where questions.

What regions of the world have the highest per capita energy consumption? (North America, Western Europe, the Global North)
What regions of the world have the lowest per capita energy consumption? (Sub-Saharan Africa, South Asia)

countries = countries.to_crs("ESRI:54030")

axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm")

axis.set_axis_off()

plt.show()

Metadata

When working with a data set for the first time, you should investigate whether metadata is available for the data set that will answer questions about the data like:

What time period does the data cover?
What were the original sources for the data?
Who created the data?
When was the data published?
What do the different data fields represent?
Are there restrictions on reuse or redistribution of the data?
Are there warnings about possible problems with the quality or validity of the data?

Because metadata is created separately from the data and is tedious to create and maintain, metadata is often incomplete and out-of-date.

You can embed an iframe of a web page in a Jupyter notebook using IPython.

import IPython

iframe = IPython.display.IFrame("https://michaelminn.net/tutorials/data/2019-world-energy-indicators-metadata.html", 
	width=700, height=350)

IPython.display.display(iframe)

The metadata page for the world energy indicators used in this tutorial

Available Fields

The info() method list the attribute columns along with the data types and number of non-null (not empty) values.

info() output can be used to answer questions.

What fields are available in the data?
What are the types of the available fields?
How many of the values in a field are blank?

print(countries.info())

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 52 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Country_Code                 175 non-null    object  
 1   Country_Name                 175 non-null    object  
 2   Longitude                    175 non-null    float64 
 3   Latitude                     175 non-null    float64 
 4   WB_Region                    171 non-null    object  
 5   WB_Income_Group              170 non-null    object  
 6   Population                   170 non-null    float64 
 7   GNI_PPP_B_Dollars            162 non-null    float64 
 8   GDP_per_Capita_PPP_Dollars   162 non-null    float64 
 9   MJ_per_Dollar_GDP            166 non-null    float64 
 10  Fuel_Percent_Exports         148 non-null    float64 
 11  Resource_Rent_Percent_GDP    168 non-null    float64 
 12  Exports_Percent_GDP          161 non-null    float64 
 13  Imports_Percent_GDP          161 non-null    float64 
 14  Industry_Percent_GDP         167 non-null    float64 
 15  Agriculture_Percent_GDP      167 non-null    float64 
 16  Military_Percent_GDP         150 non-null    float64 
 17  Gini_Index                   126 non-null    float64 
 18  Land_Area_sq_km              169 non-null    float64 
 19  Arable_Land_HA_per_Capita    168 non-null    float64 
 20  CO2_per_Capita_Tonnes        165 non-null    float64 
 21  HDI                          164 non-null    float64 
 22  Regime_Type                  164 non-null    object  
 23  Democracy_Index              164 non-null    float64 
 24  CO2_Emissions_MM_Tonnes      172 non-null    float64 
 25  Primary_Consumption_Quads    172 non-null    float64 
 26  Coal_Consumption_Quads       172 non-null    float64 
 27  Gas_Consumption_Quads        172 non-null    float64 
 28  Oil_Consumption_Quads        172 non-null    float64 
 29  Nuclear_Consumption_Quads    31 non-null     float64 
 30  Renewable_Consumption_Quads  172 non-null    float64 
 31  Primary_Production_Quads     172 non-null    float64 
 32  Coal_Production_Quads        172 non-null    float64 
 33  Gas_Production_Quads         172 non-null    float64 
 34  Oil_Production_Quads         172 non-null    float64 
 35  Nuclear_Production_Quads     31 non-null     float64 
 36  Renewable_Production_Quads   172 non-null    float64 
 37  Electricity_Capacity_gW      172 non-null    float64 
 38  Electricity_Generation_tWh   172 non-null    float64 
 39  Electricity_Consumption_tWh  172 non-null    float64 
 40  Oil_Production_Mbd           172 non-null    float64 
 41  Oil_Consumption_Mbd          172 non-null    float64 
 42  Oil_Reserves_B_Barrels       169 non-null    float64 
 43  Gas_Production_bcf           172 non-null    float64 
 44  Gas_Consumption_bcf          172 non-null    float64 
 45  Gas_Reserves_tcf             161 non-null    float64 
 46  Coal_Production_Mst          172 non-null    float64 
 47  Coal_Consumption_Mst         172 non-null    float64 
 48  Coal_Reserves_Mst            172 non-null    float64 
 49  MM_BTU_per_Capita            169 non-null    float64 
 50  Renewable_Percent            172 non-null    float64 
 51  geometry                     175 non-null    geometry
dtypes: float64(46), geometry(1), object(5)
memory usage: 71.2+ KB

If you only need the dimensions of the GeoDataFrame the shape of a DataFrame or GeoDataFrame is the number of rows and columns.

print(countries.shape)

(175, 52)

If you only need the column names, the Pandas columns attribute contains the column labels for the fields in the DataFrame / GeoDataFrame.

print(countries.columns)

Index(['Country_Code', 'Country_Name', 'Longitude', 'Latitude', 'WB_Region',
       'WB_Income_Group', 'Population', 'GNI_PPP_B_Dollars',
       'GDP_per_Capita_PPP_Dollars', 'MJ_per_Dollar_GDP',
       'Fuel_Percent_Exports', 'Resource_Rent_Percent_GDP',
       'Exports_Percent_GDP', 'Imports_Percent_GDP', 'Industry_Percent_GDP',
       'Agriculture_Percent_GDP', 'Military_Percent_GDP', 'Gini_Index',
       'Land_Area_sq_km', 'Arable_Land_HA_per_Capita', 'CO2_per_Capita_Tonnes',
       'HDI', 'Regime_Type', 'Democracy_Index', 'CO2_Emissions_MM_Tonnes',
       'Primary_Consumption_Quads', 'Coal_Consumption_Quads',
       'Gas_Consumption_Quads', 'Oil_Consumption_Quads',
       'Nuclear_Consumption_Quads', 'Renewable_Consumption_Quads',
       'Primary_Production_Quads', 'Coal_Production_Quads',
       'Gas_Production_Quads', 'Oil_Production_Quads',
       'Nuclear_Production_Quads', 'Renewable_Production_Quads',
       'Electricity_Capacity_gW', 'Electricity_Generation_tWh',
       'Electricity_Consumption_tWh', 'Oil_Production_Mbd',
       'Oil_Consumption_Mbd', 'Oil_Reserves_B_Barrels', 'Gas_Production_bcf',
       'Gas_Consumption_bcf', 'Gas_Reserves_tcf', 'Coal_Production_Mst',
       'Coal_Consumption_Mst', 'Coal_Reserves_Mst', 'MM_BTU_per_Capita',
       'Renewable_Percent', 'geometry'],
      dtype='object')

Category Counts

The Pandas value_counts() method can be used to list the frequency (count) of values in a categorical variable().

How many countries are in each World Bank income group?

print(countries["WB_Income_Group"].value_counts())

WB_Income_Group
High income            51
Lower middle income    48
Upper middle income    43
Low income             28
Name: count, dtype: int64

You can convert those counts as percentages by dividing by the Pandas count() method and multiplying by 100.

What percentage of countries are in each World Bank income group?

print(countries["WB_Income_Group"].value_counts() / countries["WB_Income_Group"].count() * 100)

WB_Income_Group
High income            30.000000
Lower middle income    28.235294
Upper middle income    25.294118
Low income             16.470588
Name: count, dtype: float64

Row Aggregation

The Pandas groupby() method can be used to aggregate quantitative variable values based on a categorical variable.

What was the median annual energy consumption by countries in the different World Bank income groups in 2019 (millions of BTU per capita)?

print(countries.groupby("WB_Income_Group")["MM_BTU_per_Capita"].median())

WB_Income_Group
High income            150.0705
Low income               2.9575
Lower middle income     14.4875
Upper middle income     60.5600
Name: MM_BTU_per_Capita, dtype: float64

You can aggregate by other functions.

What was the total annual energy consumption by countries in the different World Bank income groups in 2019 (quadrillion BTU)?

print(countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum())

WB_Income_Group
High income            255.894
Low income               2.978
Lower middle income     83.985
Upper middle income    249.126
Name: Primary_Consumption_Quads, dtype: float64

If you want the table to be in sorted order, you can use the Pandas sort_values() method to sort by the aggregated variable.

group_totals = countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum()

group_totals = group_totals.sort_values(ascending=False)

print(group_totals)

WB_Income_Group
High income            255.894
Upper middle income    249.126
Lower middle income     83.985
Low income               2.978
Name: Primary_Consumption_Quads, dtype: float64

Distribution

The distribution of a variable is the manner in which the values are spread across the range of possible values.

Distributions are commonly summarized with a central tendency like mean or median.
The mean (commonly called average) is the sum of all values divided by the number of values.
Normal distributions are further summarized with standard deviation that indicates how far values are spread away from the mean.
Skew is the extent to which values are evenly distributed around the mean. Skew will be negative if the left tail of the distribution is longer, and positive if the right tail of the distribution is longer. The skewness of a normal distribution is zero.
Kurtosis is a measurement of how flat or peaked the distribution is. Kurtosis is positive for a sharply peaked distribution and negative for a distribution that is flatter than a normal distribution. The kurtosis of a normal distribution is zero.

A quantile shows the values in a distribution below given percentages of the population.

The 0% quantile is the minimum value.
The 50% quantile is the median where 50% of the values in the distribution are at or below the median value.
The 100% quantile is the maximum value.
Medians are often preferred over means because medians give a clearer value for central tendency than means that can be distorted by skew and outliers.
Quantiles that divide the distribution into four groups are called quartiles and quantiles that divide the distribution into five groups are called quintiles.

Describe

The range and central tendency values for a DataFrame / GeoDataFrame can be displayed with the Pandas describe() method.

print(countries.describe())

        Longitude    Latitude    Population  GNI_PPP_B_Dollars   
count  175.000000  175.000000  1.700000e+02         162.000000  \
mean    21.815811   20.258486  4.527349e+07         816.854432   
std     60.610593   25.290149  1.556804e+08        2700.076881   
min   -112.599000  -51.713000  5.622500e+04           1.056000   
25%     -4.614500    5.226500  3.806434e+06          35.968250   
50%     22.720000   20.209000  1.068528e+07         113.593000   
75%     46.461000   40.745000  3.281965e+07         463.222750   
max    172.702000   74.770000  1.407745e+09       23376.230000   

       GDP_per_Capita_PPP_Dollars  MJ_per_Dollar_GDP  Fuel_Percent_Exports   
count                  162.000000         166.000000            148.000000  \
mean                 21451.140463           4.804271             16.707743   
std                  21655.703338           2.787646             26.508811   
min                    760.453000           0.630000              0.000000   
25%                   4985.353000           3.062500              1.063500   
50%                  13992.158000           4.065000              4.365000   
75%                  33023.375500           5.662500             19.698250   
max                 120174.755000          19.900000             99.986000   

       Resource_Rent_Percent_GDP  Exports_Percent_GDP  Imports_Percent_GDP   
count                 168.000000           161.000000           161.000000  \
mean                    6.121173            39.046832            43.324205   
std                     9.299281            26.385164            23.125305   
min                     0.000000             0.631000             0.588000   
25%                     0.476000            22.713000            28.420000   
50%                     2.128000            34.907000            39.375000   
75%                     7.409250            48.131000            52.075000   
max                    45.977000           204.321000           173.522000   

       ...  Oil_Consumption_Mbd  Oil_Reserves_B_Barrels  Gas_Production_bcf   
count  ...           172.000000              169.000000          172.000000  \
mean   ...           572.581953                9.817071          823.347541   
std    ...          2006.368698               38.986053         3349.269102   
min    ...             0.309000                0.000000            0.000000   
25%    ...            22.215750                0.000000            0.000000   
50%    ...            77.363000                0.015000            1.708000   
75%    ...           294.083750                0.600000          347.529000   
max    ...         20542.854000              302.809000        33899.021000   

       Gas_Consumption_bcf  Gas_Reserves_tcf  Coal_Production_Mst   
count           172.000000        161.000000         1.720000e+02  \
mean            809.929860         44.530932         5.163426e+04   
std            2909.592193        182.810288         3.406489e+05   
min               0.000000          0.000000         0.000000e+00   
25%               0.000000          0.000000         0.000000e+00   
50%              46.585000          0.220000         0.000000e+00   
75%             437.719000          8.500000         1.841574e+03   
max           31132.041000       1688.228000         4.239848e+06   

       Coal_Consumption_Mst  Coal_Reserves_Mst  MM_BTU_per_Capita   
count          1.720000e+02         172.000000         169.000000  \
mean           4.962037e+04        6711.257645          89.017438   
std            3.505100e+05       30589.108680         119.299077   
min            0.000000e+00           0.000000           0.777000   
25%            0.000000e+00           0.000000          10.699000   
50%            2.778840e+02           0.000000          53.718000   
75%            6.405231e+03         400.689750         109.448000   
max            4.430349e+06      252057.000000         803.056000   

       Renewable_Percent  
count         172.000000  
mean           16.739727  
std            16.866886  
min            -0.615000  
25%             3.748750  
50%            12.208500  
75%            25.269000  
max            82.766000

If you have a GeoDataFrame with a large number of columns you can run describe() on specified columns.

What was the median per capita energy consumption by country in 2019? (53.7 MM BTU)

print(countries["MM_BTU_per_Capita"].describe())

count    169.000000
mean      89.017438
std      119.299077
min        0.777000
25%       10.699000
50%       53.718000
75%      109.448000
max      803.056000
Name: MM_BTU_per_Capita, dtype: float64

Histograms

Histograms are charts commonly used to visualize distributions by showing bars with the number of entries in different ranges of values.

Histograms are used to answer the question, "How are the values in a variable distributed across the range of possible values?"

What is the distribution of energy consumption by country? This histogram and the describe() output shows that most countries used under 100 MM BTU of energy per capita in 2019, with a handful of countries with exceptionally high energy consumption.

The pyplot.hist() function can be used to generate histograms.

plt.hist(countries["MM_BTU_per_Capita"])

plt.show()

Histogram of per capita energy consumption by country in BTU in 2019

The jagged quality of a histogram can be smoothed into an overlaid density plot using the kde = True parameter to the histplot() function from the seaborn statistical data visualization library.

import seaborn

seaborn.histplot(countries["MM_BTU_per_Capita"], kde=True)

plt.show()

Box Plot

A boxplot() is a visualization for comparing distributions between groups. Box plots give a clearer comparison of the range of values between multiple groups compared to the single-number simplification of central tendency values.

countries.boxplot(column="MM_BTU_per_Capita", by="WB_Income_Group")

plt.show()

Box plot comparing per capita energy consumption by country between World Bank income groups

Rank Order

A rank order graphic or table displays the values in a variable sorted highest to lowest.

The Pandas sort_values() method can be used to order the GeoDataFrame.
The ascending=False parameter indicates to sort from greatest to least so the highest values would be at the top of the GeoDataFrame. ascending=True would sort from least to greatest and place the lowest values at the top of the GeoDataFrame.
dropna() is used to remove rows with no data. NaN rows will appear at the bottom of a sorted list.
reset_index() moves the aggregate group attribute into an attribute column and resequences the row indices as numbers that give an better indication of rank when displayed.

The head() method can be used to subset the five rows at the beginning of the GeoDataFrame.

What were the top five countries in per capita energy consumption in 2019?

rank_countries = countries.sort_values("MM_BTU_per_Capita", ascending=False)

rank_countries = rank_countries.dropna("MM_BTU_per_Capita")

rank_countries = rank_countries.reset_index()

print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].head())

           Country_Name  MM_BTU_per_Capita
0                 Qatar            803.056
1               Iceland            598.031
2  United Arab Emirates            500.898
3   Trinidad and Tobago            480.882
4                Brunei            410.541

The tail() method can be used to subset the five rows at the end of the GeoDataFrame.

What were bottom top five countries in per capita energy consumption in 2019?

print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].tail())

             Country_Name  MM_BTU_per_Capita
164                  Chad              1.563
165       Dem. Rep. Congo              1.512
166               Burundi              1.056
167  Central African Rep.              1.021
168               Somalia              0.777

As an alternative to a histogram, a rank order plot.bar() of a sorted variable can be used to show the distribution of values.

axis = rank_countries.plot.bar(y="MM_BTU_per_Capita")

axis.xaxis.set_visible(False)

plt.show()

Correlation

In exploring data, we often look for relationships between variables.

Correlation is "a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone" (Merriam-Webster 2022).

While it is important to always remember that correlation and causation are two different things, correlation analysis is a very powerful exploratory technique for determining whether there is a relationship between two variables.

The strength of a correlation is measured using the coefficient of determination which is more commonly called R-squared.This can be written as R², R squared, R-squared, or R^2.

Evaluation of R² to determine whether correlation should be considered strong or not depends on the type of phenomena being studied.

The range of R² is from 0.000 (no correlation) to 1.000 (perfect correlation).
Values of less than 0.100 can usually be considered to represent no meaningful correlation.
In the social sciences where relationships often involve the complex interplay of ambiguous factors, values in the 0.200s or 0.300s can be considered strongly correlated enough to merit further investigation.
In the natural sciences, values above 0.600 are often expected from variables that are strongly correlated.

The Pandas corr() method calculates correlation coefficients between all variables in a DataFrame.

The Pandas select_dtypes() method is used to subset only numeric columns since only numeric values can be correlated.
The correlation coefficients are squared to get R-squared.
sort_values() sorts the list from high to low.

import numpy

r_squared = countries.select_dtypes(include=numpy.number)

r_squared = r_squared.corr()["MM_BTU_per_Capita"]**2

r_squared = r_squared.sort_values(ascending=False).round(3)

print(r_squared)

MM_BTU_per_Capita              1.000
GDP_per_Capita_PPP_Dollars     0.606
Agriculture_Percent_GDP        0.236
Nuclear_Consumption_Quads      0.178
Nuclear_Production_Quads       0.178
Fuel_Percent_Exports           0.173
Gini_Index                     0.150
Gas_Reserves_tcf               0.120
Latitude                       0.119
...

Scatter Chart

Correlation can be visualized by plotting the two variables on an x/y scatter chart and looking for an upward or downward pattern of dots diagonally across the chart.

Negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
Positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
When there is no correlation, the X/Y scatter chart dots have no clear upward or downward pattern.

The Matplotlib pyplot.scatter() creates an x/y scatter chart.

axis = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"])

plt.show()

To spread out the large collection of small countries clumped at the bottom left, we can use logarithmic axes.

path = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"])

path.axes.set_xscale("log");

path.axes.set_yscale("log");

plt.show()

The Seaborn regplot() function can be used to draw a scatter plot with a regression line that highlights the correlation pattern and direction.

import seaborn

seaborn.regplot(x = "MM_BTU_per_Capita", y = "GDP_per_Capita_PPP_Dollars", 
	data=countries, line_kws={"color": "red"})

plt.show()

Because regplot() cannot display logarithmic y-axes, you will need to log transform the data prior to plotting if you want logarithmic x and y scales with a regression line.

import numpy

energy_economy = countries[["MM_BTU_per_Capita", "GDP_per_Capita_PPP_Dollars"]]

energy_economy = energy_economy.dropna()

energy_economy["MM_BTU_per_Capita_log"] = numpy.log(energy_economy["MM_BTU_per_Capita"])

energy_economy["GDP_per_Capita_log"] = numpy.log(energy_economy["GDP_per_Capita_PPP_Dollars"])

seaborn.regplot(x = "MM_BTU_per_Capita_log", y = "GDP_per_Capita_log", 
	data=energy_economy, line_kws={"color": "red"})

plt.show()

The Post Hoc Fallacy

While correlation may be interesting, what we are usually more interested in is causation. Correlation is a simple mathematical relationship between two variables, but causation means that there is a material cause-and-effect relationship between the two phenomenon we are measuring with our variables.

Correlation is empirical (based on observation), and causation is rational (based on reason). When we observe two phenomena occurring together and we observe that there is some mechanism connecting the two phenomena, we use reason and logic to tie those two phenomena together in a cause and effect relationship.

Assuming that correlation proves causation is the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this). A logical fallacy is "an often plausible argument using false or invalid inference."

For example, the correlation between per capita energy use and level of nuclear power production could lead to a fallacious inference that the use of nuclear power increases energy use.

However, both the presence of nuclear power and high per capita energy use are probably more causally related through a third variable of early economic development, which results in higher energy use and the development of complex and expensive nuclear technology that is both economically out of reach of poorer countries, and which is actively blocked from proliferation for political reasons by wealthy countries.

Correlation points to possible causal relationships, but does not prove them, and there are a variety of logical arguments to show how making a simple assumption that correlation is causation will lead you astray. Determining whether there is a cause-and-effect relationship requires more sophisticated techniques and domain knowledge beyond simple mathematical correlation.