Exploring Data in Python
This tutorial covers the creation of basic descriptive statistics in Python that can be used to explore geospatial data.
Loading and Mapping Data
The example data used for this tutorial is a collection of country-level energy indicators for the year 2019 from the US Energy Information Administration, the World Bank, and others.
The following code shows how to load an online GeoJSON file into a GeoPandas GeoDataFrame for visualization and analysis.
- GeoPandas is an extension of Pandas and Shapely libraries for working with geospatial data.
- Matplotlib is a visualization library. matplotlib.pyplot is a state-based interface into matplotlib.
- The GeoPandas read_file() function reads a geospatial data file into a GeoDataFrame.
- The figure.figsize in rcParams sets the size of plots (width, height). The appropriate value for your system will depend on the type of display you are using and your aesthetic preferences.
- The GeoDataFrame plot() method creates a plot using matplotlib.
- For this example, we will color the countries by the MM_BTU_per_Capita variable (millions of BTU of energy by the categorical PrimSource variable representing the primary fuel source for the countries.
- The legend=True parameter indicates to plot a legend so the reader can tell what the dot colors mean.
- The markersize=2 parameter indicates to plot a two point dot that is smaller than the default, which makes the large number of dots easier to distinguish on this small map. As a graphical size unit, a point is 1/72nd of an inch (Merriam-Webster 2023)
- The plt.show() function needs to be called after plotting to display plots set up with prior plot() calls.
import geopandas import matplotlib.pyplot as plt countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson") plt.rcParams['figure.figsize'] = [9, 6] axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm") plt.show()
Mapping permits visual analysis of geospatial data for answering general where questions.
- What regions of the world have the highest per capita energy consumption? (North America, Western Europe, the Global North)
- What regions of the world have the lowest per capita energy consumption? (Sub-Saharan Africa, South Asia)
countries = countries.to_crs("ESRI:54030") axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm") axis.set_axis_off() plt.show()
Metadata
When working with a data set for the first time, you should investigate whether metadata is available for the data set that will answer questions about the data like:
- What time period does the data cover?
- What were the original sources for the data?
- Who created the data?
- When was the data published?
- What do the different data fields represent?
- Are there restrictions on reuse or redistribution of the data?
- Are there warnings about possible problems with the quality or validity of the data?
Because metadata is created separately from the data and is tedious to create and maintain, metadata is often incomplete and out-of-date.
You can embed an iframe of a web page in a Jupyter notebook using IPython.
import IPython iframe = IPython.display.IFrame("https://michaelminn.net/tutorials/data/2019-world-energy-indicators-metadata.html", width=700, height=350) IPython.display.display(iframe)
Available Fields
The info() method list the attribute columns along with the data types and number of non-null (not empty) values.
info() output can be used to answer questions.
- What fields are available in the data?
- What are the types of the available fields?
- How many of the values in a field are blank?
print(countries.info())
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 175 entries, 0 to 174 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country_Code 175 non-null object 1 Country_Name 175 non-null object 2 Longitude 175 non-null float64 3 Latitude 175 non-null float64 4 WB_Region 171 non-null object 5 WB_Income_Group 170 non-null object 6 Population 170 non-null float64 7 GNI_PPP_B_Dollars 162 non-null float64 8 GDP_per_Capita_PPP_Dollars 162 non-null float64 9 MJ_per_Dollar_GDP 166 non-null float64 10 Fuel_Percent_Exports 148 non-null float64 11 Resource_Rent_Percent_GDP 168 non-null float64 12 Exports_Percent_GDP 161 non-null float64 13 Imports_Percent_GDP 161 non-null float64 14 Industry_Percent_GDP 167 non-null float64 15 Agriculture_Percent_GDP 167 non-null float64 16 Military_Percent_GDP 150 non-null float64 17 Gini_Index 126 non-null float64 18 Land_Area_sq_km 169 non-null float64 19 Arable_Land_HA_per_Capita 168 non-null float64 20 CO2_per_Capita_Tonnes 165 non-null float64 21 HDI 164 non-null float64 22 Regime_Type 164 non-null object 23 Democracy_Index 164 non-null float64 24 CO2_Emissions_MM_Tonnes 172 non-null float64 25 Primary_Consumption_Quads 172 non-null float64 26 Coal_Consumption_Quads 172 non-null float64 27 Gas_Consumption_Quads 172 non-null float64 28 Oil_Consumption_Quads 172 non-null float64 29 Nuclear_Consumption_Quads 31 non-null float64 30 Renewable_Consumption_Quads 172 non-null float64 31 Primary_Production_Quads 172 non-null float64 32 Coal_Production_Quads 172 non-null float64 33 Gas_Production_Quads 172 non-null float64 34 Oil_Production_Quads 172 non-null float64 35 Nuclear_Production_Quads 31 non-null float64 36 Renewable_Production_Quads 172 non-null float64 37 Electricity_Capacity_gW 172 non-null float64 38 Electricity_Generation_tWh 172 non-null float64 39 Electricity_Consumption_tWh 172 non-null float64 40 Oil_Production_Mbd 172 non-null float64 41 Oil_Consumption_Mbd 172 non-null float64 42 Oil_Reserves_B_Barrels 169 non-null float64 43 Gas_Production_bcf 172 non-null float64 44 Gas_Consumption_bcf 172 non-null float64 45 Gas_Reserves_tcf 161 non-null float64 46 Coal_Production_Mst 172 non-null float64 47 Coal_Consumption_Mst 172 non-null float64 48 Coal_Reserves_Mst 172 non-null float64 49 MM_BTU_per_Capita 169 non-null float64 50 Renewable_Percent 172 non-null float64 51 geometry 175 non-null geometry dtypes: float64(46), geometry(1), object(5) memory usage: 71.2+ KB
If you only need the dimensions of the GeoDataFrame the shape of a DataFrame or GeoDataFrame is the number of rows and columns.
print(countries.shape)
(175, 52)
If you only need the column names, the Pandas columns attribute contains the column labels for the fields in the DataFrame / GeoDataFrame.
print(countries.columns)
Index(['Country_Code', 'Country_Name', 'Longitude', 'Latitude', 'WB_Region', 'WB_Income_Group', 'Population', 'GNI_PPP_B_Dollars', 'GDP_per_Capita_PPP_Dollars', 'MJ_per_Dollar_GDP', 'Fuel_Percent_Exports', 'Resource_Rent_Percent_GDP', 'Exports_Percent_GDP', 'Imports_Percent_GDP', 'Industry_Percent_GDP', 'Agriculture_Percent_GDP', 'Military_Percent_GDP', 'Gini_Index', 'Land_Area_sq_km', 'Arable_Land_HA_per_Capita', 'CO2_per_Capita_Tonnes', 'HDI', 'Regime_Type', 'Democracy_Index', 'CO2_Emissions_MM_Tonnes', 'Primary_Consumption_Quads', 'Coal_Consumption_Quads', 'Gas_Consumption_Quads', 'Oil_Consumption_Quads', 'Nuclear_Consumption_Quads', 'Renewable_Consumption_Quads', 'Primary_Production_Quads', 'Coal_Production_Quads', 'Gas_Production_Quads', 'Oil_Production_Quads', 'Nuclear_Production_Quads', 'Renewable_Production_Quads', 'Electricity_Capacity_gW', 'Electricity_Generation_tWh', 'Electricity_Consumption_tWh', 'Oil_Production_Mbd', 'Oil_Consumption_Mbd', 'Oil_Reserves_B_Barrels', 'Gas_Production_bcf', 'Gas_Consumption_bcf', 'Gas_Reserves_tcf', 'Coal_Production_Mst', 'Coal_Consumption_Mst', 'Coal_Reserves_Mst', 'MM_BTU_per_Capita', 'Renewable_Percent', 'geometry'], dtype='object')
Category Counts
The Pandas value_counts() method can be used to list the frequency (count) of values in a categorical variable().
How many countries are in each World Bank income group?
print(countries["WB_Income_Group"].value_counts())
WB_Income_Group High income 51 Lower middle income 48 Upper middle income 43 Low income 28 Name: count, dtype: int64
You can convert those counts as percentages by dividing by the Pandas count() method and multiplying by 100.
What percentage of countries are in each World Bank income group?
print(countries["WB_Income_Group"].value_counts() / countries["WB_Income_Group"].count() * 100)
WB_Income_Group High income 30.000000 Lower middle income 28.235294 Upper middle income 25.294118 Low income 16.470588 Name: count, dtype: float64
Row Aggregation
The Pandas groupby() method can be used to aggregate quantitative variable values based on a categorical variable.
What was the median annual energy consumption by countries in the different World Bank income groups in 2019 (millions of BTU per capita)?
print(countries.groupby("WB_Income_Group")["MM_BTU_per_Capita"].median())
WB_Income_Group High income 150.0705 Low income 2.9575 Lower middle income 14.4875 Upper middle income 60.5600 Name: MM_BTU_per_Capita, dtype: float64
You can aggregate by other functions.
What was the total annual energy consumption by countries in the different World Bank income groups in 2019 (quadrillion BTU)?
print(countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum())
WB_Income_Group High income 255.894 Low income 2.978 Lower middle income 83.985 Upper middle income 249.126 Name: Primary_Consumption_Quads, dtype: float64
If you want the table to be in sorted order, you can use the Pandas sort_values() method to sort by the aggregated variable.
group_totals = countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum() group_totals = group_totals.sort_values(ascending=False) print(group_totals)
WB_Income_Group High income 255.894 Upper middle income 249.126 Lower middle income 83.985 Low income 2.978 Name: Primary_Consumption_Quads, dtype: float64
Distribution
The distribution of a variable is the manner in which the values are spread across the range of possible values.
- Distributions are commonly summarized with a central tendency like mean or median.
- The mean (commonly called average) is the sum of all values divided by the number of values.
- Normal distributions are further summarized with standard deviation that indicates how far values are spread away from the mean.
- Skew is the extent to which values are evenly distributed around the mean. Skew will be negative if the left tail of the distribution is longer, and positive if the right tail of the distribution is longer. The skewness of a normal distribution is zero.
- Kurtosis is a measurement of how flat or peaked the distribution is. Kurtosis is positive for a sharply peaked distribution and negative for a distribution that is flatter than a normal distribution. The kurtosis of a normal distribution is zero.
A quantile shows the values in a distribution below given percentages of the population.
- The 0% quantile is the minimum value.
- The 50% quantile is the median where 50% of the values in the distribution are at or below the median value.
- The 100% quantile is the maximum value.
- Medians are often preferred over means because medians give a clearer value for central tendency than means that can be distorted by skew and outliers.
- Quantiles that divide the distribution into four groups are called quartiles and quantiles that divide the distribution into five groups are called quintiles.
Describe
The range and central tendency values for a DataFrame / GeoDataFrame can be displayed with the Pandas describe() method.
print(countries.describe())
Longitude Latitude Population GNI_PPP_B_Dollars count 175.000000 175.000000 1.700000e+02 162.000000 \ mean 21.815811 20.258486 4.527349e+07 816.854432 std 60.610593 25.290149 1.556804e+08 2700.076881 min -112.599000 -51.713000 5.622500e+04 1.056000 25% -4.614500 5.226500 3.806434e+06 35.968250 50% 22.720000 20.209000 1.068528e+07 113.593000 75% 46.461000 40.745000 3.281965e+07 463.222750 max 172.702000 74.770000 1.407745e+09 23376.230000 GDP_per_Capita_PPP_Dollars MJ_per_Dollar_GDP Fuel_Percent_Exports count 162.000000 166.000000 148.000000 \ mean 21451.140463 4.804271 16.707743 std 21655.703338 2.787646 26.508811 min 760.453000 0.630000 0.000000 25% 4985.353000 3.062500 1.063500 50% 13992.158000 4.065000 4.365000 75% 33023.375500 5.662500 19.698250 max 120174.755000 19.900000 99.986000 Resource_Rent_Percent_GDP Exports_Percent_GDP Imports_Percent_GDP count 168.000000 161.000000 161.000000 \ mean 6.121173 39.046832 43.324205 std 9.299281 26.385164 23.125305 min 0.000000 0.631000 0.588000 25% 0.476000 22.713000 28.420000 50% 2.128000 34.907000 39.375000 75% 7.409250 48.131000 52.075000 max 45.977000 204.321000 173.522000 ... Oil_Consumption_Mbd Oil_Reserves_B_Barrels Gas_Production_bcf count ... 172.000000 169.000000 172.000000 \ mean ... 572.581953 9.817071 823.347541 std ... 2006.368698 38.986053 3349.269102 min ... 0.309000 0.000000 0.000000 25% ... 22.215750 0.000000 0.000000 50% ... 77.363000 0.015000 1.708000 75% ... 294.083750 0.600000 347.529000 max ... 20542.854000 302.809000 33899.021000 Gas_Consumption_bcf Gas_Reserves_tcf Coal_Production_Mst count 172.000000 161.000000 1.720000e+02 \ mean 809.929860 44.530932 5.163426e+04 std 2909.592193 182.810288 3.406489e+05 min 0.000000 0.000000 0.000000e+00 25% 0.000000 0.000000 0.000000e+00 50% 46.585000 0.220000 0.000000e+00 75% 437.719000 8.500000 1.841574e+03 max 31132.041000 1688.228000 4.239848e+06 Coal_Consumption_Mst Coal_Reserves_Mst MM_BTU_per_Capita count 1.720000e+02 172.000000 169.000000 \ mean 4.962037e+04 6711.257645 89.017438 std 3.505100e+05 30589.108680 119.299077 min 0.000000e+00 0.000000 0.777000 25% 0.000000e+00 0.000000 10.699000 50% 2.778840e+02 0.000000 53.718000 75% 6.405231e+03 400.689750 109.448000 max 4.430349e+06 252057.000000 803.056000 Renewable_Percent count 172.000000 mean 16.739727 std 16.866886 min -0.615000 25% 3.748750 50% 12.208500 75% 25.269000 max 82.766000
If you have a GeoDataFrame with a large number of columns you can run describe() on specified columns.
What was the median per capita energy consumption by country in 2019? (53.7 MM BTU)
print(countries["MM_BTU_per_Capita"].describe())
count 169.000000 mean 89.017438 std 119.299077 min 0.777000 25% 10.699000 50% 53.718000 75% 109.448000 max 803.056000 Name: MM_BTU_per_Capita, dtype: float64
Histograms
Histograms are charts commonly used to visualize distributions by showing bars with the number of entries in different ranges of values.
Histograms are used to answer the question, "How are the values in a variable distributed across the range of possible values?"
What is the distribution of energy consumption by country? This histogram and the describe() output shows that most countries used under 100 MM BTU of energy per capita in 2019, with a handful of countries with exceptionally high energy consumption.
The pyplot.hist() function can be used to generate histograms.
plt.hist(countries["MM_BTU_per_Capita"]) plt.show()
The jagged quality of a histogram can be smoothed into an overlaid density plot using the kde = True parameter to the histplot() function from the seaborn statistical data visualization library.
import seaborn seaborn.histplot(countries["MM_BTU_per_Capita"], kde=True) plt.show()
Box Plot
A boxplot() is a visualization for comparing distributions between groups. Box plots give a clearer comparison of the range of values between multiple groups compared to the single-number simplification of central tendency values.
countries.boxplot(column="MM_BTU_per_Capita", by="WB_Income_Group") plt.show()
Rank Order
A rank order graphic or table displays the values in a variable sorted highest to lowest.
- The Pandas sort_values() method can be used to order the GeoDataFrame.
- The ascending=False parameter indicates to sort from greatest to least so the highest values would be at the top of the GeoDataFrame. ascending=True would sort from least to greatest and place the lowest values at the top of the GeoDataFrame.
- dropna() is used to remove rows with no data. NaN rows will appear at the bottom of a sorted list.
- reset_index() moves the aggregate group attribute into an attribute column and resequences the row indices as numbers that give an better indication of rank when displayed.
The head() method can be used to subset the five rows at the beginning of the GeoDataFrame.
What were the top five countries in per capita energy consumption in 2019?
rank_countries = countries.sort_values("MM_BTU_per_Capita", ascending=False) rank_countries = rank_countries.dropna("MM_BTU_per_Capita") rank_countries = rank_countries.reset_index() print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].head())
Country_Name MM_BTU_per_Capita 0 Qatar 803.056 1 Iceland 598.031 2 United Arab Emirates 500.898 3 Trinidad and Tobago 480.882 4 Brunei 410.541
The tail() method can be used to subset the five rows at the end of the GeoDataFrame.
What were bottom top five countries in per capita energy consumption in 2019?
print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].tail())
Country_Name MM_BTU_per_Capita 164 Chad 1.563 165 Dem. Rep. Congo 1.512 166 Burundi 1.056 167 Central African Rep. 1.021 168 Somalia 0.777
As an alternative to a histogram, a rank order plot.bar() of a sorted variable can be used to show the distribution of values.
axis = rank_countries.plot.bar(y="MM_BTU_per_Capita") axis.xaxis.set_visible(False) plt.show()
Correlation
In exploring data, we often look for relationships between variables.
Correlation is "a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone" (Merriam-Webster 2022).
While it is important to always remember that correlation and causation are two different things, correlation analysis is a very powerful exploratory technique for determining whether there is a relationship between two variables.
The strength of a correlation is measured using the coefficient of determination which is more commonly called R-squared.This can be written as R2, R squared, R-squared, or R^2.
Evaluation of R2 to determine whether correlation should be considered strong or not depends on the type of phenomena being studied.
- The range of R2 is from 0.000 (no correlation) to 1.000 (perfect correlation).
- Values of less than 0.100 can usually be considered to represent no meaningful correlation.
- In the social sciences where relationships often involve the complex interplay of ambiguous factors, values in the 0.200s or 0.300s can be considered strongly correlated enough to merit further investigation.
- In the natural sciences, values above 0.600 are often expected from variables that are strongly correlated.
The Pandas corr() method calculates correlation coefficients between all variables in a DataFrame.
- The Pandas select_dtypes() method is used to subset only numeric columns since only numeric values can be correlated.
- The correlation coefficients are squared to get R-squared.
- sort_values() sorts the list from high to low.
import numpy r_squared = countries.select_dtypes(include=numpy.number) r_squared = r_squared.corr()["MM_BTU_per_Capita"]**2 r_squared = r_squared.sort_values(ascending=False).round(3) print(r_squared)
MM_BTU_per_Capita 1.000 GDP_per_Capita_PPP_Dollars 0.606 Agriculture_Percent_GDP 0.236 Nuclear_Consumption_Quads 0.178 Nuclear_Production_Quads 0.178 Fuel_Percent_Exports 0.173 Gini_Index 0.150 Gas_Reserves_tcf 0.120 Latitude 0.119 ...
Scatter Chart
Correlation can be visualized by plotting the two variables on an x/y scatter chart and looking for an upward or downward pattern of dots diagonally across the chart.
- Negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
- Positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
- When there is no correlation, the X/Y scatter chart dots have no clear upward or downward pattern.
The Matplotlib pyplot.scatter() creates an x/y scatter chart.
axis = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"]) plt.show()
To spread out the large collection of small countries clumped at the bottom left, we can use logarithmic axes.
path = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"]) path.axes.set_xscale("log"); path.axes.set_yscale("log"); plt.show()
The Seaborn regplot() function can be used to draw a scatter plot with a regression line that highlights the correlation pattern and direction.
import seaborn seaborn.regplot(x = "MM_BTU_per_Capita", y = "GDP_per_Capita_PPP_Dollars", data=countries, line_kws={"color": "red"}) plt.show()
Because regplot() cannot display logarithmic y-axes, you will need to log transform the data prior to plotting if you want logarithmic x and y scales with a regression line.
import numpy energy_economy = countries[["MM_BTU_per_Capita", "GDP_per_Capita_PPP_Dollars"]] energy_economy = energy_economy.dropna() energy_economy["MM_BTU_per_Capita_log"] = numpy.log(energy_economy["MM_BTU_per_Capita"]) energy_economy["GDP_per_Capita_log"] = numpy.log(energy_economy["GDP_per_Capita_PPP_Dollars"]) seaborn.regplot(x = "MM_BTU_per_Capita_log", y = "GDP_per_Capita_log", data=energy_economy, line_kws={"color": "red"}) plt.show()
The Post Hoc Fallacy
While correlation may be interesting, what we are usually more interested in is causation. Correlation is a simple mathematical relationship between two variables, but causation means that there is a material cause-and-effect relationship between the two phenomenon we are measuring with our variables.
Correlation is empirical (based on observation), and causation is rational (based on reason). When we observe two phenomena occurring together and we observe that there is some mechanism connecting the two phenomena, we use reason and logic to tie those two phenomena together in a cause and effect relationship.
Assuming that correlation proves causation is the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this). A logical fallacy is "an often plausible argument using false or invalid inference."
For example, the correlation between per capita energy use and level of nuclear power production could lead to a fallacious inference that the use of nuclear power increases energy use.
However, both the presence of nuclear power and high per capita energy use are probably more causally related through a third variable of early economic development, which results in higher energy use and the development of complex and expensive nuclear technology that is both economically out of reach of poorer countries, and which is actively blocked from proliferation for political reasons by wealthy countries.
Correlation points to possible causal relationships, but does not prove them, and there are a variety of logical arguments to show how making a simple assumption that correlation is causation will lead you astray. Determining whether there is a cause-and-effect relationship requires more sophisticated techniques and domain knowledge beyond simple mathematical correlation.