Exploring Data in Python

This tutorial covers the creation of basic descriptive statistics in Python that can be used to explore geospatial data.

Loading and Mapping Data

The example data used for this tutorial is a collection of country-level energy indicators for the year 2019 from the US Energy Information Administration, the World Bank, and others.

The following code shows how to load an online GeoJSON file into a GeoPandas GeoDataFrame for visualization and analysis.

import geopandas

import matplotlib.pyplot as plt

countries = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-world-energy-indicators.geojson")

plt.rcParams['figure.figsize'] = [9, 6]

axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm")

plt.show()
Figure
2019 per capita power consumption by country
  • To improve cartographic representation we use the to_crs() method to reproject the data from WGS 84 lat/long to the World Robinson projection commonly used on print maps (WKID ESRI 54030). Projections are described in greater detail in this tutorial on Map Projections.
  • The set_axis_off() method turns off the unnecessary map frame and axis labels, which are largely meaningless when mapping in a web Mercator projection. Unfortunately, there is no global parameter to turn axes off, so this function must be called before each plt.show().
  • Mapping permits visual analysis of geospatial data for answering general where questions.

    countries = countries.to_crs("ESRI:54030")
    
    axis = countries.plot("MM_BTU_per_Capita", legend=True, scheme="quantiles", cmap="coolwarm")
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    2019 per capita power consumption by country

    Metadata

    When working with a data set for the first time, you should investigate whether metadata is available for the data set that will answer questions about the data like:

    Because metadata is created separately from the data and is tedious to create and maintain, metadata is often incomplete and out-of-date.

    You can embed an iframe of a web page in a Jupyter notebook using IPython.

    import IPython
    
    iframe = IPython.display.IFrame("https://michaelminn.net/tutorials/data/2019-world-energy-indicators-metadata.html", 
    	width=700, height=350)
    
    IPython.display.display(iframe)
    
    Figure
    The metadata page for the world energy indicators used in this tutorial

    Available Fields

    The info() method list the attribute columns along with the data types and number of non-null (not empty) values.

    info() output can be used to answer questions.

    print(countries.info())
    
    <class 'geopandas.geodataframe.GeoDataFrame'>
    RangeIndex: 175 entries, 0 to 174
    Data columns (total 52 columns):
     #   Column                       Non-Null Count  Dtype   
    ---  ------                       --------------  -----   
     0   Country_Code                 175 non-null    object  
     1   Country_Name                 175 non-null    object  
     2   Longitude                    175 non-null    float64 
     3   Latitude                     175 non-null    float64 
     4   WB_Region                    171 non-null    object  
     5   WB_Income_Group              170 non-null    object  
     6   Population                   170 non-null    float64 
     7   GNI_PPP_B_Dollars            162 non-null    float64 
     8   GDP_per_Capita_PPP_Dollars   162 non-null    float64 
     9   MJ_per_Dollar_GDP            166 non-null    float64 
     10  Fuel_Percent_Exports         148 non-null    float64 
     11  Resource_Rent_Percent_GDP    168 non-null    float64 
     12  Exports_Percent_GDP          161 non-null    float64 
     13  Imports_Percent_GDP          161 non-null    float64 
     14  Industry_Percent_GDP         167 non-null    float64 
     15  Agriculture_Percent_GDP      167 non-null    float64 
     16  Military_Percent_GDP         150 non-null    float64 
     17  Gini_Index                   126 non-null    float64 
     18  Land_Area_sq_km              169 non-null    float64 
     19  Arable_Land_HA_per_Capita    168 non-null    float64 
     20  CO2_per_Capita_Tonnes        165 non-null    float64 
     21  HDI                          164 non-null    float64 
     22  Regime_Type                  164 non-null    object  
     23  Democracy_Index              164 non-null    float64 
     24  CO2_Emissions_MM_Tonnes      172 non-null    float64 
     25  Primary_Consumption_Quads    172 non-null    float64 
     26  Coal_Consumption_Quads       172 non-null    float64 
     27  Gas_Consumption_Quads        172 non-null    float64 
     28  Oil_Consumption_Quads        172 non-null    float64 
     29  Nuclear_Consumption_Quads    31 non-null     float64 
     30  Renewable_Consumption_Quads  172 non-null    float64 
     31  Primary_Production_Quads     172 non-null    float64 
     32  Coal_Production_Quads        172 non-null    float64 
     33  Gas_Production_Quads         172 non-null    float64 
     34  Oil_Production_Quads         172 non-null    float64 
     35  Nuclear_Production_Quads     31 non-null     float64 
     36  Renewable_Production_Quads   172 non-null    float64 
     37  Electricity_Capacity_gW      172 non-null    float64 
     38  Electricity_Generation_tWh   172 non-null    float64 
     39  Electricity_Consumption_tWh  172 non-null    float64 
     40  Oil_Production_Mbd           172 non-null    float64 
     41  Oil_Consumption_Mbd          172 non-null    float64 
     42  Oil_Reserves_B_Barrels       169 non-null    float64 
     43  Gas_Production_bcf           172 non-null    float64 
     44  Gas_Consumption_bcf          172 non-null    float64 
     45  Gas_Reserves_tcf             161 non-null    float64 
     46  Coal_Production_Mst          172 non-null    float64 
     47  Coal_Consumption_Mst         172 non-null    float64 
     48  Coal_Reserves_Mst            172 non-null    float64 
     49  MM_BTU_per_Capita            169 non-null    float64 
     50  Renewable_Percent            172 non-null    float64 
     51  geometry                     175 non-null    geometry
    dtypes: float64(46), geometry(1), object(5)
    memory usage: 71.2+ KB
    

    If you only need the dimensions of the GeoDataFrame the shape of a DataFrame or GeoDataFrame is the number of rows and columns.

    print(countries.shape)
    
    (175, 52)
    

    If you only need the column names, the Pandas columns attribute contains the column labels for the fields in the DataFrame / GeoDataFrame.

    print(countries.columns)
    
    Index(['Country_Code', 'Country_Name', 'Longitude', 'Latitude', 'WB_Region',
           'WB_Income_Group', 'Population', 'GNI_PPP_B_Dollars',
           'GDP_per_Capita_PPP_Dollars', 'MJ_per_Dollar_GDP',
           'Fuel_Percent_Exports', 'Resource_Rent_Percent_GDP',
           'Exports_Percent_GDP', 'Imports_Percent_GDP', 'Industry_Percent_GDP',
           'Agriculture_Percent_GDP', 'Military_Percent_GDP', 'Gini_Index',
           'Land_Area_sq_km', 'Arable_Land_HA_per_Capita', 'CO2_per_Capita_Tonnes',
           'HDI', 'Regime_Type', 'Democracy_Index', 'CO2_Emissions_MM_Tonnes',
           'Primary_Consumption_Quads', 'Coal_Consumption_Quads',
           'Gas_Consumption_Quads', 'Oil_Consumption_Quads',
           'Nuclear_Consumption_Quads', 'Renewable_Consumption_Quads',
           'Primary_Production_Quads', 'Coal_Production_Quads',
           'Gas_Production_Quads', 'Oil_Production_Quads',
           'Nuclear_Production_Quads', 'Renewable_Production_Quads',
           'Electricity_Capacity_gW', 'Electricity_Generation_tWh',
           'Electricity_Consumption_tWh', 'Oil_Production_Mbd',
           'Oil_Consumption_Mbd', 'Oil_Reserves_B_Barrels', 'Gas_Production_bcf',
           'Gas_Consumption_bcf', 'Gas_Reserves_tcf', 'Coal_Production_Mst',
           'Coal_Consumption_Mst', 'Coal_Reserves_Mst', 'MM_BTU_per_Capita',
           'Renewable_Percent', 'geometry'],
          dtype='object')
    

    Category Counts

    The Pandas value_counts() method can be used to list the frequency (count) of values in a categorical variable().

    How many countries are in each World Bank income group?

    print(countries["WB_Income_Group"].value_counts())
    
    WB_Income_Group
    High income            51
    Lower middle income    48
    Upper middle income    43
    Low income             28
    Name: count, dtype: int64
    

    You can convert those counts as percentages by dividing by the Pandas count() method and multiplying by 100.

    What percentage of countries are in each World Bank income group?

    print(countries["WB_Income_Group"].value_counts() / countries["WB_Income_Group"].count() * 100)
    
    WB_Income_Group
    High income            30.000000
    Lower middle income    28.235294
    Upper middle income    25.294118
    Low income             16.470588
    Name: count, dtype: float64
    

    Row Aggregation

    The Pandas groupby() method can be used to aggregate quantitative variable values based on a categorical variable.

    What was the median annual energy consumption by countries in the different World Bank income groups in 2019 (millions of BTU per capita)?

    print(countries.groupby("WB_Income_Group")["MM_BTU_per_Capita"].median())
    
    WB_Income_Group
    High income            150.0705
    Low income               2.9575
    Lower middle income     14.4875
    Upper middle income     60.5600
    Name: MM_BTU_per_Capita, dtype: float64
    

    You can aggregate by other functions.

    What was the total annual energy consumption by countries in the different World Bank income groups in 2019 (quadrillion BTU)?

    print(countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum())
    
    WB_Income_Group
    High income            255.894
    Low income               2.978
    Lower middle income     83.985
    Upper middle income    249.126
    Name: Primary_Consumption_Quads, dtype: float64
    

    If you want the table to be in sorted order, you can use the Pandas sort_values() method to sort by the aggregated variable.

    group_totals = countries.groupby("WB_Income_Group")["Primary_Consumption_Quads"].sum()
    
    group_totals = group_totals.sort_values(ascending=False)
    
    print(group_totals)
    
    WB_Income_Group
    High income            255.894
    Upper middle income    249.126
    Lower middle income     83.985
    Low income               2.978
    Name: Primary_Consumption_Quads, dtype: float64
    

    Distribution

    The distribution of a variable is the manner in which the values are spread across the range of possible values.

    A quantile shows the values in a distribution below given percentages of the population.

    Describe

    The range and central tendency values for a DataFrame / GeoDataFrame can be displayed with the Pandas describe() method.

    print(countries.describe())
    
            Longitude    Latitude    Population  GNI_PPP_B_Dollars   
    count  175.000000  175.000000  1.700000e+02         162.000000  \
    mean    21.815811   20.258486  4.527349e+07         816.854432   
    std     60.610593   25.290149  1.556804e+08        2700.076881   
    min   -112.599000  -51.713000  5.622500e+04           1.056000   
    25%     -4.614500    5.226500  3.806434e+06          35.968250   
    50%     22.720000   20.209000  1.068528e+07         113.593000   
    75%     46.461000   40.745000  3.281965e+07         463.222750   
    max    172.702000   74.770000  1.407745e+09       23376.230000   
    
           GDP_per_Capita_PPP_Dollars  MJ_per_Dollar_GDP  Fuel_Percent_Exports   
    count                  162.000000         166.000000            148.000000  \
    mean                 21451.140463           4.804271             16.707743   
    std                  21655.703338           2.787646             26.508811   
    min                    760.453000           0.630000              0.000000   
    25%                   4985.353000           3.062500              1.063500   
    50%                  13992.158000           4.065000              4.365000   
    75%                  33023.375500           5.662500             19.698250   
    max                 120174.755000          19.900000             99.986000   
    
           Resource_Rent_Percent_GDP  Exports_Percent_GDP  Imports_Percent_GDP   
    count                 168.000000           161.000000           161.000000  \
    mean                    6.121173            39.046832            43.324205   
    std                     9.299281            26.385164            23.125305   
    min                     0.000000             0.631000             0.588000   
    25%                     0.476000            22.713000            28.420000   
    50%                     2.128000            34.907000            39.375000   
    75%                     7.409250            48.131000            52.075000   
    max                    45.977000           204.321000           173.522000   
    
           ...  Oil_Consumption_Mbd  Oil_Reserves_B_Barrels  Gas_Production_bcf   
    count  ...           172.000000              169.000000          172.000000  \
    mean   ...           572.581953                9.817071          823.347541   
    std    ...          2006.368698               38.986053         3349.269102   
    min    ...             0.309000                0.000000            0.000000   
    25%    ...            22.215750                0.000000            0.000000   
    50%    ...            77.363000                0.015000            1.708000   
    75%    ...           294.083750                0.600000          347.529000   
    max    ...         20542.854000              302.809000        33899.021000   
    
           Gas_Consumption_bcf  Gas_Reserves_tcf  Coal_Production_Mst   
    count           172.000000        161.000000         1.720000e+02  \
    mean            809.929860         44.530932         5.163426e+04   
    std            2909.592193        182.810288         3.406489e+05   
    min               0.000000          0.000000         0.000000e+00   
    25%               0.000000          0.000000         0.000000e+00   
    50%              46.585000          0.220000         0.000000e+00   
    75%             437.719000          8.500000         1.841574e+03   
    max           31132.041000       1688.228000         4.239848e+06   
    
           Coal_Consumption_Mst  Coal_Reserves_Mst  MM_BTU_per_Capita   
    count          1.720000e+02         172.000000         169.000000  \
    mean           4.962037e+04        6711.257645          89.017438   
    std            3.505100e+05       30589.108680         119.299077   
    min            0.000000e+00           0.000000           0.777000   
    25%            0.000000e+00           0.000000          10.699000   
    50%            2.778840e+02           0.000000          53.718000   
    75%            6.405231e+03         400.689750         109.448000   
    max            4.430349e+06      252057.000000         803.056000   
    
           Renewable_Percent  
    count         172.000000  
    mean           16.739727  
    std            16.866886  
    min            -0.615000  
    25%             3.748750  
    50%            12.208500  
    75%            25.269000  
    max            82.766000  
    

    If you have a GeoDataFrame with a large number of columns you can run describe() on specified columns.

    What was the median per capita energy consumption by country in 2019? (53.7 MM BTU)

    print(countries["MM_BTU_per_Capita"].describe())
    
    count    169.000000
    mean      89.017438
    std      119.299077
    min        0.777000
    25%       10.699000
    50%       53.718000
    75%      109.448000
    max      803.056000
    Name: MM_BTU_per_Capita, dtype: float64
    

    Histograms

    Histograms are charts commonly used to visualize distributions by showing bars with the number of entries in different ranges of values.

    Histograms are used to answer the question, "How are the values in a variable distributed across the range of possible values?"

    What is the distribution of energy consumption by country? This histogram and the describe() output shows that most countries used under 100 MM BTU of energy per capita in 2019, with a handful of countries with exceptionally high energy consumption.

    The pyplot.hist() function can be used to generate histograms.

    plt.hist(countries["MM_BTU_per_Capita"])
    
    plt.show()
    
    Figure
    Histogram of per capita energy consumption by country in BTU in 2019

    The jagged quality of a histogram can be smoothed into an overlaid density plot using the kde = True parameter to the histplot() function from the seaborn statistical data visualization library.

    import seaborn
    
    seaborn.histplot(countries["MM_BTU_per_Capita"], kde=True)
    
    plt.show()
    
    Figure
    Histogram with a density plot

    Box Plot

    A boxplot() is a visualization for comparing distributions between groups. Box plots give a clearer comparison of the range of values between multiple groups compared to the single-number simplification of central tendency values.

    countries.boxplot(column="MM_BTU_per_Capita", by="WB_Income_Group")
    
    plt.show()
    
    Figure
    Box plot comparing per capita energy consumption by country between World Bank income groups

    Rank Order

    A rank order graphic or table displays the values in a variable sorted highest to lowest.

    The head() method can be used to subset the five rows at the beginning of the GeoDataFrame.

    What were the top five countries in per capita energy consumption in 2019?

    rank_countries = countries.sort_values("MM_BTU_per_Capita", ascending=False)
    
    rank_countries = rank_countries.dropna("MM_BTU_per_Capita")
    
    rank_countries = rank_countries.reset_index()
    
    print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].head())
    
               Country_Name  MM_BTU_per_Capita
    0                 Qatar            803.056
    1               Iceland            598.031
    2  United Arab Emirates            500.898
    3   Trinidad and Tobago            480.882
    4                Brunei            410.541
    

    The tail() method can be used to subset the five rows at the end of the GeoDataFrame.

    What were bottom top five countries in per capita energy consumption in 2019?

    print(rank_countries[["Country_Name", "MM_BTU_per_Capita"]].tail())
    
                 Country_Name  MM_BTU_per_Capita
    164                  Chad              1.563
    165       Dem. Rep. Congo              1.512
    166               Burundi              1.056
    167  Central African Rep.              1.021
    168               Somalia              0.777
    

    As an alternative to a histogram, a rank order plot.bar() of a sorted variable can be used to show the distribution of values.

    axis = rank_countries.plot.bar(y="MM_BTU_per_Capita")
    
    axis.xaxis.set_visible(False)
    
    plt.show()
    
    Figure
    Rank order bar plot

    Correlation

    In exploring data, we often look for relationships between variables.

    Correlation is "a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone" (Merriam-Webster 2022).

    While it is important to always remember that correlation and causation are two different things, correlation analysis is a very powerful exploratory technique for determining whether there is a relationship between two variables.

    The strength of a correlation is measured using the coefficient of determination which is more commonly called R-squared.This can be written as R2, R squared, R-squared, or R^2.

    Evaluation of R2 to determine whether correlation should be considered strong or not depends on the type of phenomena being studied.

    The Pandas corr() method calculates correlation coefficients between all variables in a DataFrame.

    import numpy
    
    r_squared = countries.select_dtypes(include=numpy.number)
    
    r_squared = r_squared.corr()["MM_BTU_per_Capita"]**2
    
    r_squared = r_squared.sort_values(ascending=False).round(3)
    
    print(r_squared)
    
    MM_BTU_per_Capita              1.000
    GDP_per_Capita_PPP_Dollars     0.606
    Agriculture_Percent_GDP        0.236
    Nuclear_Consumption_Quads      0.178
    Nuclear_Production_Quads       0.178
    Fuel_Percent_Exports           0.173
    Gini_Index                     0.150
    Gas_Reserves_tcf               0.120
    Latitude                       0.119
    ...
    

    Scatter Chart

    Correlation can be visualized by plotting the two variables on an x/y scatter chart and looking for an upward or downward pattern of dots diagonally across the chart.

    Example x/y scatter charts

    The Matplotlib pyplot.scatter() creates an x/y scatter chart.

    axis = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"])
    
    plt.show()
    
    Figure
    Energy vs. GDP per capita scatter chart

    To spread out the large collection of small countries clumped at the bottom left, we can use logarithmic axes.

    path = plt.scatter(countries["MM_BTU_per_Capita"], countries["GDP_per_Capita_PPP_Dollars"])
    
    path.axes.set_xscale("log");
    
    path.axes.set_yscale("log");
    
    plt.show()
    
    Figure
    Logarithmic axes

    The Seaborn regplot() function can be used to draw a scatter plot with a regression line that highlights the correlation pattern and direction.

    import seaborn
    
    seaborn.regplot(x = "MM_BTU_per_Capita", y = "GDP_per_Capita_PPP_Dollars", 
    	data=countries, line_kws={"color": "red"})
    
    plt.show()
    
    Figure
    Linear regression line

    Because regplot() cannot display logarithmic y-axes, you will need to log transform the data prior to plotting if you want logarithmic x and y scales with a regression line.

    import numpy
    
    energy_economy = countries[["MM_BTU_per_Capita", "GDP_per_Capita_PPP_Dollars"]]
    
    energy_economy = energy_economy.dropna()
    
    energy_economy["MM_BTU_per_Capita_log"] = numpy.log(energy_economy["MM_BTU_per_Capita"])
    
    energy_economy["GDP_per_Capita_log"] = numpy.log(energy_economy["GDP_per_Capita_PPP_Dollars"])
    
    seaborn.regplot(x = "MM_BTU_per_Capita_log", y = "GDP_per_Capita_log", 
    	data=energy_economy, line_kws={"color": "red"})
    
    plt.show()
    
    Figure
    Log regression line

    The Post Hoc Fallacy

    While correlation may be interesting, what we are usually more interested in is causation. Correlation is a simple mathematical relationship between two variables, but causation means that there is a material cause-and-effect relationship between the two phenomenon we are measuring with our variables.

    Correlation is empirical (based on observation), and causation is rational (based on reason). When we observe two phenomena occurring together and we observe that there is some mechanism connecting the two phenomena, we use reason and logic to tie those two phenomena together in a cause and effect relationship.

    Assuming that correlation proves causation is the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this). A logical fallacy is "an often plausible argument using false or invalid inference."

    For example, the correlation between per capita energy use and level of nuclear power production could lead to a fallacious inference that the use of nuclear power increases energy use.

    However, both the presence of nuclear power and high per capita energy use are probably more causally related through a third variable of early economic development, which results in higher energy use and the development of complex and expensive nuclear technology that is both economically out of reach of poorer countries, and which is actively blocked from proliferation for political reasons by wealthy countries.

    Causal link diagram for nuclear power

    Correlation points to possible causal relationships, but does not prove them, and there are a variety of logical arguments to show how making a simple assumption that correlation is causation will lead you astray. Determining whether there is a cause-and-effect relationship requires more sophisticated techniques and domain knowledge beyond simple mathematical correlation.

    Correlation vs. Causation