Point Data Analysis in Python

Geospatial point data is used to represent things that exist or events that occurred at specific locations on the surface of the earth. Examples of points include crime locations, animal nests, street trees, vehicle charging stations, cellphone towers, WiFi hotspots, GPS waypoints, etc. Areas are sometimes represented with points (centroids) to simplify mapping and navigation, such as with restaurants, houses, stores, or transit stations.

This tutorial demonstrates basic techniques for analysis of point data in Python.

Loading Point Data

The example point data used for this tutorial is 2022 electrical power plant location data from the US Energy Information Administration.

The oil market disruptions of 1973 resulted in the creation of a wide variety of federal systems and organizations for collecting and managing national energy data, and these functions were centralized in the US Energy Information Administration by The Department of Energy Organization Act of 1977. The US Energy Information Administration (EIA) "collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment" (EIA 2023).

A GeoJSON file and metadata containing field information are available here.

Current data is also available directly from the EIA's US Energy Atlas.

Figure
Power plants data from the EIA

Loading Geospatial Data Files

The following code shows how to load an online GeoJSON file into a GeoPandas GeoDataFrame for visualization and analysis.

import geopandas

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

plants = geopandas.read_file('https://michaelminn.net/tutorials/data/2022-power-plants.geojson')

plants.plot(markersize = 2)

plt.show()
Figure
Raw power plant data

The info() method lists the attribute columns along with their data types and the number of non-null (not empty) values.

plants.info()
RangeIndex: 11569 entries, 0 to 11568
Data columns (total 33 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   OBJECTID        11569 non-null  int64   
 1   Plant_Code      11569 non-null  int64   
 2   Plant_Name      11569 non-null  object  
 3   Utility_ID      11569 non-null  int64   
 4   Utility_Name    11565 non-null  object  
 5   sector_name     11569 non-null  object  
 6   Street_Address  11541 non-null  object  
 7   City            11549 non-null  object  
 8   County          11534 non-null  object  
 9   State           11569 non-null  object  
 10  Zip             11547 non-null  float64 
 11  PrimSource      11569 non-null  object  
 12  source_desc     11569 non-null  object  
 13  tech_desc       11569 non-null  object  
 14  Install_MW      11569 non-null  float64 
 15  Total_MW        11569 non-null  float64 
 16  Bat_MW          379 non-null    float64 
 17  Bio_MW          667 non-null    float64 
 18  Coal_MW         258 non-null    float64 
 19  Geo_MW          67 non-null     float64 
 20  Hydro_MW        1456 non-null   float64 
 21  HydroPS_MW      40 non-null     float64 
 22  NG_MW           2034 non-null   float64 
 23  Nuclear_MW      54 non-null     float64 
 24  Crude_MW        1115 non-null   float64 
 25  Solar_MW        4816 non-null   float64 
 26  Wind_MW         1300 non-null   float64 
 27  Other_MW        95 non-null     float64 
 28  Source          11569 non-null  object  
 29  Period          11569 non-null  object  
 30  Longitude       11569 non-null  float64 
 31  Latitude        11569 non-null  float64 
 32  geometry        11569 non-null  geometry
dtypes: float64(17), geometry(1), int64(3), object(12)
memory usage: 2.9+ MB

Load CSV Points

Points can also be loaded from CSV files using the GeoPandas points_from_xy function if you are working with data files that have columns of latitude and longitude.

import pandas

csv = pandas.read_csv('https://michaelminn.net/tutorials/data/2019-mlb-ballparks.csv')

points = geopandas.points_from_xy(csv['Longitude'], csv['Latitude'])

baseball = geopandas.GeoDataFrame(csv, geometry = points, crs='EPSG:4326')

baseball.plot()

plt.show()
Figure
Locations of US baseball parks in 2019 imported from a CSV file with latitudes and longitudes (Wikipedia)

Point Maps

Simple Point Map

As explained above, a simple point map can be created with the plot() method.

plants.plot(markersize = 2)

plt.show()
Figure
Power plant points

Filtering and Reprojecting

Additional processing can be used to clarify point visualizations.

plants = plants[~plants['State'].isin(['Alaska', 'Hawaii', 'Puerto Rico'])]

plants = plants.to_crs('EPSG:3857')

axis = plants.plot(markersize=2)

axis.set_axis_off()

plt.show()
Figure
Points filtered to the continental US and reprojected to web Mercator

Positive filtering can be used to create a subset containing only coal-fired power plants.

The *_MW columns give the rated capacity for each plant in megawatts from the different fuel sources. Commercial-scale fossil-fueled or nuclear electricity plants typically range from 500 MW to 2,000 MW or more.

Note that some plants have reserve or alternative coal-burning capability even if their primary fuel source (PrimSource) is something other than coal.

coal_plants = plants[plants['Coal_MW'] > 0]

axis = coal_plants.plot(markersize=2, color='darkred')

axis.set_axis_off()

plt.show()

Figure
Filtered points showing only coal-fired power plants

    Base Map

    A base map is a simple reference map that you place under mapped features to provide geographic context for where those features are located on the surface of the earth.

    While area map borders can provide geographic context on their own, point maps usually need base maps so it is possible for the user to see where the points are located relative to other known features.

    For these maps of power plants, we will use a GeoDataFrame of states for state outlines as a base map.

    states = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson")
    
    states = states[~states["ST"].isin(['AK', 'HI', 'PR'])]
    
    states = states.to_crs(plants.crs)
    

To plot two GeoDataFrames on the same plot, we need to capture the axis return value from the first plot() call and pass it as the ax parameter to subsequent calls that draw on the same map.

    axis = coal_plants.plot(markersize = 4, color = 'darkred')
    
    states.plot(facecolor='none', edgecolor='gray', ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    State outlines as a base map

    Tiled Base Map

    Contextily can be used to display tiled base maps using the add_basemap function.

    Contextily base maps can be slow to load, and highly-detailed base maps can make it difficult to distinguish point symbology. Therefore, state outlines are used for base maps in subsequent examples in this tutorial.

    import contextily
    
    axis = coal_plants.plot(markersize = 4, color = "red")
    
    contextily.add_basemap(axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Tiled base map

    Mapping a Categorical Attribute

    If your point data has a categorical attribute, you can map it by passing the name of the attribute as the first parameter to the plot() call.

    You will also need to set legend=True to add a legend that will allow the user to know what the different dot colors mean.

    axis = plants.plot("PrimSource", legend=True, markersize=2)
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Example of mapping a categorical attribute: Power plants by primary fuel type

    Mapping a Quantitative Variable

    A common method of visualizing points with a quantitative attribute is the graduated symbol (bubble) map that uses differently sized shapes based on the variable being mapped. Although circles are most common, other types of icons can be used for aesthetic variety.

il_plants = plants[plants["State"] == "Illinois"]

bubble_size = 200 * il_plants["Install_MW"] / max(il_plants["Install_MW"])

axis = il_plants.plot(markersize=bubble_size)
    

    As with all point maps, a base map should be used with bubble maps to provide geographic context. For this example we will use county border polygons.

    il_state = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson")
    
    il_state = il_state[il_state["ST"] == "IL"]
    
    il_state = il_state.to_crs(plants.crs)
    
    il_state.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    

As would be expected, we see a clustering of large power plants in the northern urban parts of Illinois, which are more densely populated than the rural areas in the central and southern parts of the state.

    Figure
    A bubble map

    Voronoi Polygons

Voronoi polygons are polygons around points that each contain one point and have edges that evenly divide the areas between adjacent points. Voronoi diagrams are named after Russian mathematician Georgy Feodosievych Voronoy. They are also sometimes called Thiessen polygons after American meteorologist Alfred Thiessen (Wikipedia 2023).

    While Voronoi polygons are a useful way of visualizing points as areas, the polygons are mathematical abstractions that may not represent actual areas of influence or jurisdiction in the real world, and care must be taken to communicate that limitation.

    import numpy
    
    import libpysal.cg.voronoi
    
    geothermal = plants[(plants["Geo_MW"] > 0) & (plants["State"] == "Nevada")]
    
    points = numpy.column_stack((geothermal.geometry.x, geothermal.geometry.y))
    
    regions, vertices = libpysal.cg.voronoi.voronoi_frames(points)
    
    axis = geothermal.plot(color="Red", markersize=4)
    
    regions.plot(facecolor="#00008020", edgecolor="navy", ax=axis)
    
    nevada = states[states["ST"] == "NV"]
                  
    nevada.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Voronoi diagram of geothermal plants in Nevada

    Hulls

Points can also be bounded by hulls to more clearly demonstrate the spatial extent of the points.

A convex hull can be created with the convex_hull attribute after combining the points into a single geometry with unary_union.

    geothermal = plants[plants["Geo_MW"] > 0]
    
    axis = geothermal.plot(color = "Red", markersize = 4)
    
    hull = geopandas.GeoSeries(geothermal.unary_union.convex_hull)
    
    hull.plot(facecolor="#00008020", edgecolor="navy", ax=axis)
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Convex hull around western geothermal electricity plants

    Point Autocorrelation

One of the primary ways of analyzing the spatial distribution of phenomena is examining where points group together, or where high or low values are clustered together.

    Spatial autocorrelation is the clustering of similar values together in adjacent or nearby areas.

    Spatial data is usually spatially autocorrelated, and this tendency is useful since clusters often provide insights into the characteristics of the clustered phenomena.

While visual observation of spatial autocorrelation is useful when exploring data, more-rigorous means of quantifying autocorrelation are helpful when performing serious research.

    Ripley's K

B.D. Ripley (1977) developed a collection of letter-named techniques for analyzing spatial point correlation.

    The Ripley's K function iterates through each point and counts the number of other points within a given distance.

    Autocorrelation is evaluated by analyzing the position of the observed line relative to the theoretical lines.
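The core computation can be sketched in plain Python: for each point, count the other points within distance d, then scale by the study area and the number of points. This is a sketch of the concept only; it omits the edge corrections and simulation envelopes that pointpats handles.

```python
import math

def ripley_k(points, d, area):
    """Naive Ripley's K: count of other points within distance d of each
    point, scaled by area / (n * (n - 1)). No edge correction."""
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= d:
                count += 1
    return (area / (n * (n - 1))) * count

# Four clustered points vs. four spread-out points in a 10 x 10 unit area
clustered = [(1, 1), (1, 2), (2, 1), (2, 2)]
dispersed = [(1, 1), (1, 9), (9, 1), (9, 9)]

# For clustered points K exceeds the random expectation (pi * d**2);
# for dispersed points no neighbors fall within d, so K is zero
print(ripley_k(clustered, 2, 100))
print(ripley_k(dispersed, 2, 100))
```

Clustering is indicated when the observed K is larger than the value expected for a random pattern at the same distance.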

Projected x/y locations must first be converted to an array of points.

The k_test() function from the pointpats package runs the Ripley's K test.

Note that if you have a large number of points, k_test() may take a minute or more to complete the analysis.

    import numpy
    
    import pointpats
    
    il_plants = plants[plants["State"] == "Illinois"]
    
    il_points = numpy.column_stack((il_plants.centroid.x, il_plants.centroid.y))
    
    k_test = pointpats.k_test(il_points, keep_simulations=True)
    
    plt.plot(k_test.support, k_test.simulations.T, color='navy', alpha=.01)
    
    plt.plot(k_test.support, k_test.statistic, color='red')
    
    plt.xlabel('Distance')
    
    plt.ylabel('K Function')
    
    plt.show()
    
Figure
Ripley's K showing clustering of electricity plants in Illinois

    Kernel Density

A classical method of mapping areas of high spatial autocorrelation of points is kernel density analysis, in which a kernel of a given radius is used to systematically scan an area, and the density at any particular location is based on the number of points within the kernel surrounding that location. Locations surrounded by clusters of points will have higher values, and areas with few or no points will have lower values.

    Kernel density demonstration animation (R code)
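The underlying idea can be sketched in one dimension with plain Python: the density at a location is the average of Gaussian kernel contributions from all points, so locations near clusters of points score higher. This is only a one-dimensional illustration; seaborn's kdeplot() performs the analogous computation in two dimensions with automatic bandwidth selection.

```python
import math

def gaussian_kernel_density(x, points, bandwidth):
    """Average of Gaussian kernel contributions from every point at x."""
    return sum(
        math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
        / (bandwidth * math.sqrt(2 * math.pi))
        for p in points
    ) / len(points)

points = [1.0, 1.2, 1.4, 8.0]   # three clustered points and one isolated point

# Density is higher inside the cluster than near the isolated point
print(gaussian_kernel_density(1.2, points, bandwidth=0.5))
print(gaussian_kernel_density(8.0, points, bandwidth=0.5))
```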

A kernel density plot can be created using the kdeplot() function from the seaborn graphing package.

The bw_adjust=0.3 parameter scales the kernel bandwidth and can be increased for greater smoothing or reduced for greater detail.

    import seaborn
    
    axis = seaborn.kdeplot(x = plants.centroid.x, y = plants.centroid.y,
            fill=True, gridsize=100, bw_adjust=0.3, cmap="coolwarm")
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Kernel density plot by count

    Weighting the graph by the plant installed capacity shows clusters associated with larger facilities in the south and midwest.

    import seaborn
    
    axis = seaborn.kdeplot(x = plants.centroid.x, y = plants.centroid.y,
            weights = plants["Install_MW"],
            fill=True, gridsize=100, bw_adjust=0.3, cmap="coolwarm")
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Kernel density plot weighted by installed capacity

Focusing specifically on coal plants shows a distinct clustering in Appalachia that is not immediately visible from the facility location dot map.

axis = seaborn.kdeplot(x = coal_plants.centroid.x, y = coal_plants.centroid.y,
        weights = coal_plants["Install_MW"],
        fill=True, gridsize=100, bw_adjust=0.3, cmap="coolwarm")
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Kernel density plot for coal plants weighted by installed capacity

    The code below increases the bw_adjust parameter to 0.5 to increase the level of smoothing. This demonstrates the arbitrary nature of kernel density estimation, which can be overcome somewhat by the use of more rigorous areal techniques like local Moran's I and Getis-Ord GI* as described below.

axis = seaborn.kdeplot(x = coal_plants.centroid.x, y = coal_plants.centroid.y,
        weights = coal_plants["Install_MW"],
        fill=True, gridsize=100, bw_adjust=0.5, cmap="coolwarm")
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Example of increased kernel size (bw_adjust = 0.5)

    Attribute Aggregation

    When analyzing point data, it can be helpful to aggregate counts and attribute values by areas and then analyze the values in those areas for relationships.

For this example of coal-fired plants, we will aggregate by state. Because this power plant data set has a State attribute, we can use the Pandas groupby() method to find total installed capacity by state.

    If your data does not have an attribute that can be used for aggregating by area, you will need to use spatial aggregation as described below.

    Note that because power plants are interconnected by grids that extend across state lines, the aggregated production values may not reflect actual consumption values within state lines.

    coal_states = coal_plants.groupby("State")["Install_MW"].sum()
    
    coal_states = coal_states.sort_values(ascending = False)
    
    coal_states = coal_states.reset_index()
    
    print(coal_states.head())
    
               State  Install_MW
    0          Texas     20586.1
    1        Indiana     15835.6
    2       Kentucky     13389.7
    3  West Virginia     13090.2
    4       Missouri     11888.7
    
    print(coal_states.tail())
    
               State  Install_MW
    36      Delaware       464.1
    37  South Dakota       451.0
    38    New Jersey       285.0
    39    California        62.5
    40         Idaho        10.2
    

    Attribute Join

Because the point data contains no state polygon information, we need to perform an attribute join to be able to map the aggregated data as a choropleth.

    An attribute join connects two datasets based on common key values.

    Attribute Join
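The logic of an attribute join can be sketched in plain Python: rows are matched where key values are equal, and a left join keeps unmatched rows, here filled with 0 the way fillna(0) does. This is a toy illustration using totals from the table above, not the Pandas implementation.

```python
# Toy attribute join on a common key; Vermont has no matching row
state_rows = [{"Name": "Indiana"}, {"Name": "Kentucky"}, {"Name": "Vermont"}]
coal_totals = {"Indiana": 15835.6, "Kentucky": 13389.7}

# Left join: keep every state, fill missing matches with 0
joined = [{"Name": s["Name"], "Install_MW": coal_totals.get(s["Name"], 0)}
          for s in state_rows]

print(joined)
```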

    The GeoPandas merge() method can be used to merge the aggregated attributes with the states polygons loaded above.

    coal_states = states.merge(coal_states, how="left", left_on="Name", right_on="State")
    
    coal_states["Install_MW"] = coal_states["Install_MW"].fillna(0)
    
    axis = coal_states.plot("Install_MW", legend=True, 
            scheme="naturalbreaks", cmap="coolwarm",
            legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Total coal-fired plant installed capacity in megawatts by state

    Normalization

    States with larger populations will have higher electricity demands and larger amounts of installed power plant capacity. What is often more meaningful for the purposes of analysis is the capacity per resident of the state (per capita).

    Normalization is the adjustment of variable values to a common scale so they are comparable across space and time (Wikipedia 2023).

    For this example, we normalize by population to get per capita installed capacity values.

    We multiply by 1000 to get kilowatts from megawatts since home energy use is usually measured in kilowatt-hours.
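With hypothetical round numbers, a state with 12,000 MW of installed capacity and 6 million residents would have 2 kilowatts of capacity per resident:

```python
install_mw = 12000.0      # hypothetical installed capacity in megawatts
population = 6_000_000    # hypothetical state population

# 1000 converts megawatts to kilowatts before dividing by population
kw_per_capita = 1000 * install_mw / population

print(kw_per_capita)  # 2.0
```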

    Note that in this case, states with high levels of coal production (and strong political influence from coal producers) now stand out with the highest per capita electricity generation from coal.

    coal_states["KW_Per_Capita"] = 1000 * coal_states["Install_MW"] / coal_states["Total Population"]
    
    axis = coal_states.plot("KW_Per_Capita", legend=True, 
            scheme="naturalbreaks", cmap="coolwarm",
            legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Per capita kilowatts of coal-fired plant installed capacity by state

    Analyzing Change

    If you have data for two different time periods, you can analyze spatiotemporal change between those two time periods.

    For this example, we use an older data file of power plants from 2018 and groupby() to get the sum() of installed capacity for coal plants by state.

    coal_2018 = geopandas.read_file("https://michaelminn.net/tutorials/data/2018-power-plants.geojson")
    
    coal_2018["Coal_2018"] = coal_2018["Install_MW"]
    
    coal_2018 = coal_2018[coal_2018["Coal_MW"] > 0]
    
    coal_2018 = coal_2018.groupby("State")["Coal_2018"].sum()
    
    coal_2018 = coal_2018.reset_index()
    
    coal_2018.head()
    
    State
    Alabama       9130.2
    Alaska         196.4
    Arizona       6510.9
    Arkansas      5487.0
    California      62.5
Name: Coal_2018, dtype: float64
    

    Next calculate the same total coal capacity from the 2022 data.

    Note that we rename Install_MW to Coal_2018 and Coal_2022 to avoid a name conflict when the two data frames are joined.

coal_2022 = coal_plants.copy()
    
    coal_2022["Coal_2022"] = coal_2022["Install_MW"]
    
    coal_2022 = coal_2022.groupby("State")["Coal_2022"].sum()
    
    coal_2022 = coal_2022.reset_index()
    
    coal_2022.head()
    
    

    We merge both DataFrames with the states polygons and calculate the percentage change in installed capacity.

    coal_change = states.merge(coal_2018, left_on="Name", right_on="State")
    
    coal_change = coal_change.merge(coal_2022, left_on="Name", right_on="State")
    
    coal_change["Coal_Change"] = 100 * ((coal_change["Coal_2022"] / coal_change["Coal_2018"]) - 1)
    
    coal_change[["Name", "Coal_Change"]].head()
    
    Name 		Coal_Change
    Alabama 	-18.671004
    Arizona 	-43.362669
    Arkansas 	0.000000
    California 	0.000000
    Colorado 	-6.846350
    

    Finally, we can map the polygons as a choropleth styled by the percentage change.

    axis = coal_change.plot("Coal_Change", legend=True, 
            scheme="stdmean", cmap="coolwarm",
            legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Map of percentage change in coal-fired power plant installed capacity between 2018 and 2022

    Spatial Aggregation

    In situations where your point data does not contain an attribute that you can use to aggregate, or if you want to aggregate into areas that are different from any specified in the point attributes, it is possible to aggregate using a spatial join.

A spatial join connects two datasets based on a spatial relationship.

    Spatial Join
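The idea behind a spatial join can be sketched in plain Python with rectangular regions: each point takes the name of the region that contains it. GeoPandas sjoin() does this with arbitrary polygons and a spatial index; this is only an illustration of the containment test.

```python
# Toy spatial join: assign each point to the rectangle containing it
regions = [{"name": "West", "bounds": (0, 0, 5, 10)},   # (xmin, ymin, xmax, ymax)
           {"name": "East", "bounds": (5, 0, 10, 10)}]

points = [(2, 3), (7, 8), (4, 1)]

def contains(bounds, point):
    xmin, ymin, xmax, ymax = bounds
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

joined = [(p, r["name"]) for p in points for r in regions
          if contains(r["bounds"], p)]

print(joined)
```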

For this example, we will aggregate power plants by North American Electric Reliability Corporation (NERC) administrative regions that are used to manage the North American electrical grid. NERC "develops and enforces reliability standards; annually assesses seasonal and long-term reliability; monitors the bulk power system through system awareness; and educates, trains, and certifies industry personnel" (NERC 2023).

    The NERC region polygons are read directly from a GeoJSON service provided by the EIA through the US Energy Atlas.

    nerc_regions = geopandas.read_file("https://services7.arcgis.com/FGr1D95XCGALKXqM/arcgis/rest/services/NERC_Regions_EIA/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
    
    nerc_regions = nerc_regions.to_crs(plants.crs)
    
    axis = nerc_regions.plot("NERC", legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    North American Electric Reliability Corporation (NERC) regions

    For this example, we use a spatial join to get the count of coal-fired plants by NERC region.

    nerc_plants = coal_plants.sjoin(nerc_regions)
    
    nerc_plants["Count"] = 1
    
    nerc_plants = nerc_plants.groupby("NERC")["Count"].count()
    
    nerc_plants = nerc_plants.reset_index()
    
    nerc_plants = nerc_regions.merge(nerc_plants, how="left")
    
    nerc_plants["Count"] = nerc_plants["Count"].fillna(0)
    
    axis = nerc_plants.plot("Count", scheme="naturalbreaks", cmap="coolwarm", 
    		legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    NERC regions by number of coal-fired power plants

    Joined Sum

    We can sum() the installed capacity to get installed capacity by region.

    nerc_installed = coal_plants.sjoin(nerc_regions)
    
    nerc_installed = nerc_installed.groupby("NERC")["Install_MW"].sum()
    
    nerc_installed = nerc_installed.reset_index()
    
    nerc_installed = nerc_regions.merge(nerc_installed)
    
    axis = nerc_installed.plot("Install_MW",
                    scheme="naturalbreaks", cmap="coolwarm", legend=True,
                             legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    NERC regions by installed coal-fired power plant capacity

    Normalized Sum

We can use a similar spatial join to find the total installed capacity from all sources in each region, then normalize to get the percentage of installed capacity that comes from coal-fired plants.

    nerc_percent = plants.sjoin(nerc_regions)
    
    nerc_percent["Total_MW"] = nerc_percent["Install_MW"]
    
    nerc_percent = nerc_percent.groupby("NERC")["Total_MW"].sum()
    
    nerc_percent = nerc_percent.reset_index()
    
    nerc_percent = nerc_installed.merge(nerc_percent)
    
    nerc_percent["Coal_Percent"] = 100 * nerc_percent["Install_MW"] / nerc_percent["Total_MW"]
    
    axis = nerc_percent.plot("Coal_Percent",
                    scheme="naturalbreaks", cmap="coolwarm", legend=True,
                             legend_kwds={"bbox_to_anchor":(0.2, 0.4)})
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
NERC regions by percent of installed capacity from coal-fired plants

    Area Labels

    To label the areas we can use the pyplot text() function.

    axis = nerc_percent.plot("Coal_Percent",
                    scheme="naturalbreaks", cmap="coolwarm")
    
    for row in range(len(nerc_percent)):
        plt.text(nerc_percent.centroid.x[row],
                 nerc_percent.centroid.y[row],
                 str(round(nerc_percent["Coal_Percent"].loc[row])) + "%")
        
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Labeled area map

    Categorical Aggregation

    After completing a spatial join, aggregation can be performed on attributes other than area.

    For this example, we spatially join the coal plants with state-level election results from the 2020 presidential election to find installed capacity by political leaning.

    electoral_states = geopandas.read_file("https://michaelminn.net/tutorials/data/2020-electoral-states.geojson")
    
    electoral_states = electoral_states.to_crs(coal_plants.crs)
    
    electoral_plants = coal_plants.sjoin(electoral_states)
    
    electoral_plants = electoral_plants.groupby("Winner 2020")["Install_MW"].sum()
    
    electoral_plants = electoral_plants.reset_index()
    
    plt.bar(electoral_plants["Winner 2020"], electoral_plants["Install_MW"], color=["navy", "darkred"])
    
    plt.show()
    
    Figure
    Total installed coal-fired power plant capacity in megawatts by state winner of the 2020 US presidential election

    Nearest Aggregation

The GeoPandas sjoin_nearest() method can be used to find the nearest feature in another layer for each point. Analysis of proximity is covered in more detail in the tutorial Proximity Analysis in Python.

For example, this code sums installed wind capacity by the nearest metropolitan statistical area (MSA) with over 500,000 residents.

    wind_plants = plants[plants['Wind_MW'] > 0]
    
    msa = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-cbsa.geojson')
    
    msa = msa.to_crs(wind_plants.crs)
    
    msa = msa[msa['Total Population'] >= 500000]
    
    wind_msa = wind_plants.sjoin_nearest(msa)
    
    wind_msa = wind_msa.groupby('Name')['Install_MW'].sum()
    
    wind_msa = wind_msa.reset_index()
    
    wind_msa = msa.merge(wind_msa)
    
    wind_msa['Wind_KW_Per_Capita'] = 1000 * wind_msa['Install_MW'] / wind_msa['Total Population']
    

    A map shows that only a handful of Midwestern MSAs have significant local wind resources. This demonstrates why visions of a renewable energy future often include long-distance electricity transmission lines to carry wind energy from the Midwest to the high-density population areas on the coasts (Holmes 2022).

axis = wind_msa.plot("Wind_KW_Per_Capita", scheme="naturalbreaks", cmap="coolwarm", legend=True)
    
    states.plot(facecolor="none", edgecolor="gray", ax=axis)
    
    wind_plants.plot(markersize=3, color="#00000040", ax=axis)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Installed wind capacity in kilowatts per capita in MSAs nearest to local wind farms

    Area Autocorrelation

    After points have been aggregated into areas, techniques with areas can be used to analyze and visualize autocorrelation.

    Example Data

    For these examples we will use total installed coal-fired power plant capacity in megawatts by state.

    Note that the merge() method uses how="left" to perform a left join that keeps all state polygons regardless of whether there are any coal-fired power plants in the state.

    Because the autocorrelation tools fail on data with missing (NA) values, the Pandas fillna() method is used to fill the NA kilowatts per capita with zero for states with no coal-fired power plants.

coal_states = coal_plants.groupby("State")["Install_MW"].sum().reset_index()
    
    coal_states = states.merge(coal_states, how="left", left_on="Name", right_on="State")
    
    coal_states["Install_MW"] = coal_states["Install_MW"].fillna(0)
    
    axis = coal_states.plot("Install_MW", scheme="naturalbreaks", cmap="coolwarm", legend=True)
    
    axis.set_axis_off()
    
    plt.show()
    
    Figure
    Coal-fired power plant installed capacity by state

    Neighbors

    In order to detect autocorrelation between neighboring areas, we must first answer the ancient question, "Who is my neighbor?"

    Neighboring areas can be defined by adjacency or proximity.

    Adjacency vs. proximity

To use the spatial analysis tools in pysal, we must first create a weights matrix that indicates how each feature is spatially related to every other feature.

PySAL is a Python package that includes a wide variety of spatial analysis functions, including spatial statistics and regression.

    import pysal.lib
    
    import pysal.model
    

    pysal functions require matrices of weights that indicate whether areas are neighbors or not.

For proximity neighbors, the KNN() function creates a nearest-neighbor weights matrix by taking a specific number (k) of the closest features to each feature.

    coal_states["ID"] = range(coal_states.shape[0])
    
    state_weights = pysal.lib.weights.KNN.from_dataframe(coal_states, ids="ID", k=4)
    
    state_weights.plot(gdf=coal_states, indexed_on="ID")
    
    plt.show()
    
Figure
Neighbors defined by proximity with KNN()

Matrices of adjacent neighbors can be created with the Queen() or Rook() function, depending on whether areas that touch only at a corner are considered adjacent.
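The difference can be illustrated on a toy grid of square cells, where rook contiguity uses only the four edge-sharing neighbors and queen contiguity adds the four corner-sharing neighbors. This is a sketch of the concept, not the pysal implementation.

```python
def rook_neighbors(cell):
    """Cells sharing an edge with the given (row, column) cell."""
    r, c = cell
    return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}

def queen_neighbors(cell):
    """Cells sharing an edge or a corner with the given cell."""
    r, c = cell
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)}

center = (1, 1)
print(len(rook_neighbors(center)))    # 4
print(len(queen_neighbors(center)))   # 8
```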

    coal_states["ID"] = range(coal_states.shape[0])
    
    state_weights = pysal.lib.weights.Queen.from_dataframe(coal_states, ids="ID")
    
    state_weights.plot(gdf=coal_states, indexed_on="ID")
    
    plt.show()
    
    
Figure
Neighbors defined by adjacency with Queen()

    Global Moran's I

    Patrick Alfred Pierce Moran (1950) developed a technique for quantifying autocorrelation within values associated with areas. Moran's I is a method for assessing area autocorrelation analogous to the use of Ripley's K with points.

    There are two types of Moran's I statistics: global and local.

    Global Moran's I assesses autocorrelation over the entire area of analysis as a single number, with values of the global Moran's I statistic varying from -1 (evenly dispersed = evenly spread out) to 0 (no autocorrelation) to +1 (completely clustered).

    Simplified examples of different values of the Moran's I statistic
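The behavior of the statistic can be checked by hand. Global Moran's I is I = (n / W) * sum of wij * (xi - mean)(xj - mean) over all pairs, divided by the sum of (xi - mean)^2, where W is the sum of all weights. A toy implementation on a chain of six areas (each neighboring the areas beside it) shows clustered values producing a positive statistic and alternating values a negative one:

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a dict of
    {(i, j): weight} neighbor pairs."""
    n = len(values)
    mean = sum(values) / n
    w_total = sum(weights.values())
    num = sum(w * (values[i] - mean) * (values[j] - mean)
              for (i, j), w in weights.items())
    den = sum((v - mean) ** 2 for v in values)
    return (n / w_total) * num / den

# A row of six areas; each area neighbors the areas immediately beside it
chain = {(i, j): 1 for i in range(6) for j in range(6) if abs(i - j) == 1}

print(morans_i([1, 1, 1, 9, 9, 9], chain))  # clustered: positive
print(morans_i([1, 9, 1, 9, 1, 9], chain))  # alternating: negative
```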

    Once you have a weights matrix, the Moran() function calculates the Moran's I statistic for the areas being analyzed.

    Because Moran() cannot handle NA values, the Pandas fillna() method is used below to fill the missing Install_MW values with zeroes under the assumption that there are no coal plants in those states.

    The splot.esda functions create plots for Moran's I. The plot_moran() function can be used to visualize and analyze autocorrelation.

    import esda.moran
    
    import splot.esda
    
    coal_states["Install_MW"] = coal_states["Install_MW"].fillna(0)
    
    moran = esda.moran.Moran(coal_states["Install_MW"], state_weights)
    
    splot.esda.plot_moran(moran, zstandard=True, figsize=(10,4))
    
    plt.show()
    
    Figure
    Global Moran's I density and scatter plots

    Local Moran's I

    Local indicators of spatial association (LISA) were initially developed by Luc Anselin (1995) to decompose global indicators of spatial autocorrelation (like global Moran's I) and assess the contribution of each individual area.

    LISA analysis with the local Moran's I statistic is used to identify clusters of areas with similar values and outlier areas with values that stand out among the neighbors.

    coal_states["Install_MW"] = coal_states["Install_MW"].fillna(0)
    
    moran_local = esda.moran.Moran_Local(coal_states["Install_MW"], state_weights)
    
    splot.esda.lisa_cluster(moran_local, coal_states[["Install_MW", "geometry"]])
    
    plt.show()
    
    Figure
    Local Moran's I

    Getis-Ord GI*

    The Getis-Ord GI* statistic was developed by Arthur Getis and J.K. Ord and introduced in 1992. This algorithm locates areas with high values that are also surrounded by high values.

Unlike simple observation or techniques like kernel density analysis, the Getis-Ord GI* algorithm uses statistical comparisons of all areas to create p-values indicating how probable it is that a cluster of high values in a specific area could have occurred by chance. This rigorous use of statistics makes it possible to have a higher level of confidence (but not complete confidence) that the patterns are more than just a visual illusion and represent clustering that is worth investigating.

    The Getis-Ord GI* statistic is a z-score on a normal probability density curve indicating the probability that an area is in a cluster that did not occur by random chance. A z-score of 1.282 represents a 90% probability that the cluster did not occur randomly, and a z-score of 1.645 represents a 95% probability.
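These thresholds come from the standard normal distribution and can be checked with the cdf() method of Python's built-in statistics.NormalDist:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed probabilities for the z-score thresholds mentioned above
print(round(standard_normal.cdf(1.282), 3))  # about 0.90
print(round(standard_normal.cdf(1.645), 3))  # about 0.95
```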

    The esda.G_Local() function implements Getis-Ord GI*, with the returned Zs field representing the z-score.

    import esda.getisord
    
    getisord = esda.getisord.G_Local(coal_states["Install_MW"],
    	state_weights, star=True, transform="B")
    
getis_states = coal_states.copy()
    
    getis_states["G"] = getisord.Zs
    
    bins = [-10, -1.645, -1.282, 1.282, 1.645, 10]
    
    getis_states.plot("G", cmap="coolwarm", legend=True, 
    	scheme="user_defined", classification_kwds={'bins':bins},
            edgecolor = "#00000040", legend_kwds={"bbox_to_anchor":(0.2, 0.2)})
    
    plt.show()
    
Figure
Getis-Ord GI* for coal-fired plant installed capacity by state