Proximity Analysis in Python

Proximity analysis is the extraction of information from geospatial data based on distance between features. Proximity analysis is especially useful for evaluating two characteristics:

Proximity analysis assumes distance decay, where the power of the relationship between locations decreases as distance between locations increases. Distance decay is a fundamental assumption behind gravity models.

Three methods involving proximity are covered in this tutorial.

Categorical Proximity

Categorical proximity analysis divides areas into two classes (adjacent and non-adjacent) based on proximity to specific locations.

This example will analyze census tract proximity to petroleum refineries to answer these research questions:

Location Data: Refineries

Proximity analysis can be performed on two sets of features, either of which can be points, lines, or polygons.

For this example, we will analyze proximity between points and areas.

The example point locations will be petroleum refinery locations from the Energy Information Administration (EIA) U.S. Energy Atlas.

Oil refineries are large facilities that transform crude oil pumped from the ground into gasoline and a wide variety of other products that are essential to contemporary life in the United States.

import geopandas

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

us_refineries = geopandas.read_file('https://services7.arcgis.com/FGr1D95XCGALKXqM/arcgis/rest/services/Petroleum_Refineries_US_EIA/FeatureServer/22/query?outFields=*&where=1%3D1&f=geojson')

us_refineries = us_refineries.to_crs("ESRI:102008")

axis = us_refineries.plot()

axis.set_axis_off()

plt.show()
Figure

Area Data: Census Tracts

The areas we use or this example will be census tracts in Louisiana with demographic data.

The American Community Survey (ACS), an ongoing survey by the US Census Bureau that provides information about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.

Census tracts are subdivisions of counties that are drawn by the US Census Bureau based on clearly identifiable features to ideally contain around 4,000 residents, although in practice the range of population is usually between 1,200 and 8,000 (USCB 2019). The census tract is the smallest reliable area of aggregation in data made public by the US Census Bureau.

The code below loads GeoJSON files of areas that have been prejoined to commonly-used variables from the 2015 - 2019 American Community survey five-year estimates.

Note that there are over 70K census tracts in the US and this GeoJSON file may take a minute or two to load, depending on the speed of your internet connection.

us_tracts = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-tracts.geojson')

us_tracts = us_tracts.to_crs(us_refineries.crs)

axis = us_tracts.plot('Median Household Income', scheme='naturalbreaks', 
	cmap='coolwarm_r', legend=True, edgecolor='none',
        legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

axis.set_axis_off()

plt.show()
Figure
Median household income by census tract in 2015 - 2019 (USCB)

Subset

For this analysis we will filter our locations to focus on refineries and tracts in the US state of Louisiana, which is home to a collection of petrochemical facilities known as Cancer Alley (Baurick 2019).

We also load and plot a file of county boundaries to provide geographic context.

tracts = us_tracts[us_tracts['ST'] == 'LA']

refineries = us_refineries[us_refineries['State'] == 'Louisiana']

us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

counties = us_counties[us_counties['ST'] == 'LA']

counties = counties.to_crs(tracts.crs)

axis = tracts.plot('Median Household Income', scheme='naturalbreaks', 
	cmap='coolwarm_r', legend=True, edgecolor='none')

counties.plot(edgecolor='#00000080', facecolor='none', ax=axis)

refineries.plot(ax = axis, color='black')

axis.set_axis_off()

plt.show()
Figure
Refineries and income by census tract in Louisiana

Buffers

One common technique for categorical proximity analysis is buffer and overlay, where buffer polygons (usually circles) of a fixed size or radius are placed around locations and those buffers are overlaid over another set of areas to find areas that are adjacent to the buffered locations.

Oil refineries emit a wide variety of toxic chemicals (UCAR 2023). Accordingly, we might hypothesize that the areas around oil refineries would be considered undesirable places to live, reducing the property values and rents in those areas and attracting marginalized populations without the economic means to live in more desirable locations.

To represent the presumed undesirable area around refineries, we will create ten mile (16,000 meter) buffers corresponding to research finding increased cancer risk in the ten miles around refineries (Williams et al. 2020).

The GeoPandas buffer() method can be used to create buffers.

refinery_buffers = refineries.copy()

refinery_buffers['geometry'] = refinery_buffers.buffer(16000)

axis = refinery_buffers.plot(edgecolor='black', facecolor='none')

counties.plot(edgecolor='#00000040', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
Figure
10 mile buffers around refineries in Louisiana

The small size of the buffers relative to the size of the state makes them indistinguishable from dots. We can use the set_xlim() and set_ylim() with the total_bounds bounding box of a transformed set of coordinates that define a zoom area that makes the buffers more visible around the stop locations.

zoom = counties[counties['Name'] == 'St. Charles']

axis = tracts.plot(edgecolor='#00008040', facecolor='none')

refinery_buffers.plot(edgecolor='black', facecolor='none', ax=axis)

axis.set_xlim(zoom.total_bounds[0], zoom.total_bounds[2])

axis.set_ylim(zoom.total_bounds[1], zoom.total_bounds[3])

axis.set_axis_off()

plt.show()
Figure
Close-up of refinery buffers in St. Charles Parish

Overlay

To flag the census tracts that intersect buffers, we use the sjoin() method.

count_buffers = refinery_buffers[['geometry']].copy()

count_buffers['Refinery'] = 'Refinery'

refinery_tracts = tracts.sjoin(count_buffers, how='left')

refinery_tracts = refinery_tracts.drop_duplicates('FactFinder GEOID')

refinery_tracts['Refinery'] = refinery_tracts['Refinery'].fillna('Non-Refinery')

axis = refinery_tracts.plot('Refinery', cmap='bwr_r', legend=True, edgecolor='none')

counties.plot(edgecolor='#00000020', facecolor='none', ax=axis)

refinery_buffers.plot(ax = axis, edgecolor='black', facecolor='none')

axis.set_axis_off()

plt.show()
Figure
Census tracts within 10 miles of a refinery

Group Totals

Count variables (like population) can be aggregated using groupby().

print(refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum())

print((refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum() \
	* 100 / refinery_tracts['Total Population'].sum()).round())

print(refinery_tracts[['geometry','Refinery']].groupby('Refinery').count())
              Total Population
Refinery                      
Non-Refinery         2768488.0
Refinery             1895874.0

              Total Population
Refinery                      
Non-Refinery         59.0
Refinery             41.0

              geometry
Refinery              
Non-Refinery       609
Refinery           531

Group Comparison

To explore demographic differences between tracts surrounding refineries and all other tracts in the state, we can create a sorted table based on the percentage difference in the variable medians between the two groups.

import numpy

refinery_values = refinery_tracts.select_dtypes(include=numpy.number)

refinery_values['Refinery'] = refinery_tracts['Refinery']

refinery_values = refinery_values.groupby('Refinery').median().T.reset_index()

refinery_values['Diff'] = round(100 * ((refinery_values['Refinery'] / refinery_values['Non-Refinery']) - 1))

refinery_values = refinery_values.sort_values('Diff', ascending=False)

print(refinery_values.head(n=15))

Counterintuitively, refinery tracts tend to be more urban (Pop per Square Mile), with more immigrants (Percent Foreign-Born), renters (Percent Renters), and residents with college degrees (Percent Graduate Degree, Percent Bachelors Degree).

Refinery                     index  Non-Refinery     Refinery   Diff
27             Pop per Square Mile    355.500000  3055.830000  760.0
5             Percent Foreign-Born      1.700000     3.350000   97.0
20                 Percent Renters     28.100000    42.900000   53.0
21              Percent No Vehicle      6.200000     9.100000   47.0
9          Percent Graduate Degree      5.100000     6.750000   32.0
8         Percent Bachelors Degree     11.000000    14.200000   29.0
3           Percent Single Mothers     38.100000    46.400000   22.0
16             Median Monthly Rent    769.000000   927.500000   21.0
13            Percent Unemployment      3.200000     3.700000   16.0
15         Median Monthly Mortgage   1134.000000  1287.500000   14.0
17            Percent Vacant Units     15.000000    15.300000    2.0
10               Percent Broadband     74.200000    74.300000    0.0
18           Percent Pre-War Units      3.200000     3.200000    0.0
14        Percent Health Insurance     90.600000    90.400000   -0.0
28                        Latitude     30.484552    30.030161   -1.0

Likewise, the refinery tracts are smaller (Square Miles, Total Population, Total Households) with lower birth rates (Annual Births per 1K Women) and lower commute times (Mean Minutes to Work).

print(refinery_values.tail(n=15))
Refinery                       index  Non-Refinery      Refinery  Diff
29                         Longitude    -91.846109    -90.268526  -2.0
11           Median Household Income  45907.000000  44898.500000  -2.0
26                        Median Age     38.400000     37.300000  -3.0
2             Average Household Size      2.600000      2.530000  -3.0
24                  Percent Under 18     23.400000     22.050000  -6.0
6                   Percent Veterans      6.600000      6.100000  -8.0
25                   Percent 65 Plus     16.000000     14.500000  -9.0
12              Mean Minutes To Work     25.800000     23.200000 -10.0
7                   Percent Disabled     16.100000     14.500000 -10.0
4         Annual Births per 1K Women     51.000000     44.000000 -14.0
1                   Total Households   1489.000000   1178.000000 -21.0
19                Percent Homeowners     71.900000     57.100000 -21.0
22                  Total Population   4044.000000   3113.000000 -23.0
0                       Square Miles     12.873035      0.998627 -92.0
30                       index_right           NaN     48.000000   NaN

However, when viewing a box plot, we see that general tendencies are not always predictive. In the case of the percent of foreign-born residents, There are census tracts with high percentages of foreign-born residents regardless of whether there is a refinery nearby.

refinery_tracts.boxplot('Percent Foreign-Born', 'Refinery')

plt.show()
Figure
Distribution of percent foreign-born by census tract refinery proximity

Limitations of Buffer and Overlay

Use of simple circular buffers operates on the assumption of influence of or accessibility to a point evenly around the point.

In the case of dispersions of toxic emissions a transport model is needed to model wind and water flow patterns and how they transport the toxics. These models can be quite complex (Beyea and Hatch 1999).

In the case of transit accessibility, circular buffers assume direct access to locations, while actual walking distance and effort is affected by street grids, physical obstacles (like roads, buildings, and private property), and topographical barriers (like creeks or steep hills).

Weighted Proximity Analysis

Weighted proximity analysis uses distance between features to quantify proximity. These continuous distance values (wieghts) can then be used with techniques like multiple regression.

This example will analyze county proximity to the nearest Amtrak passenger rail station in order to answer these research questions:

Location Data: Amtrak Stations

The destination points used for these examples will be Amtrak stations originally downloaded from the US Bureau of Transportation Statistics Geospatial at BTS Open Data Catalog.

Amtrak is a quasi-public US passenger railroad corporation that was formed by the US federal government in 1971 to take over US intercity passenger service when private railroads were facing economic crisis and were eager to discontinue unprofitable passenger operations (Minn 2016).

Unlike transit which has utility only when within a convenient walking or driving distance, Amtrak carries a significant number of leisure riders where the value is the journey as much or more than the destination (Losada-Rojas et al. 2019). This hub and spoke analysis assumes that some Amtrak riders may be willing to drive considerable distance to access service, although longer distances would make the trip less desirable (distance decay).

import geopandas

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

amtrak_stations = geopandas.read_file('http://michaelminn.net/tutorials/python-proximity/2023-amtrak-stations.geojson')

amtrak_stations = amtrak_stations.to_crs('ESRI:102008')

axis = amtrak_stations.plot(markersize=4, color='red')

axis.set_axis_off()

plt.show()
Figure
Amtrak stations

Area Data: Counties

The source areas for riders will be US counties, as loaded above with boundaries and 2015-2019 ACS demographic data originally sourced from the US Census Bureau. Counties in the United States are geographical divisions of states for government administration.

The code below loads a GeoJSON file of counties that have been prejoined to commonly-used variables from the 2015 - 2019 American Community survey five-year estimates. We also load a GeoJSON of states to provide geographic context.

To make the maps clearer, we subset only counties and states in the 48 continental states where Amtrak provides service.

us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

us_counties = us_counties[~us_counties['ST'].isin(['AK', 'HI', 'PR'])]

us_counties = us_counties.to_crs(amtrak_stations.crs)

states = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson')

states = states[~states['ST'].isin(['AK', 'HI', 'PR'])]

states = states.to_crs(amtrak_stations.crs)

axis = us_counties.plot('Median Household Income', scheme='naturalbreaks', 
	cmap='coolwarm_r', legend=True, edgecolor='none',
        legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, color='black', ax=axis)

axis.set_axis_off()

plt.show()
Figure
Amtrak stations and median household income by county from the 2015 - 2019 ACS five-year estimates

Hub and Spoke Analysis

Hub and spoke analysis is a weighted proximity analysis technique that analyzes accessibility to a small number of centralized hubs where distance from peripheral points to the closest hub is calculated with imaginary spoke lines. Hub-and-spoke is useful for estimating or optimizing accessibility for hubs like distribution centers or transit stops.

Figure
Hub-and-spoke analysis
amtrak_counties = us_counties.copy()

amtrak_counties['geometry'] = amtrak_counties.centroid

amtrak_counties = amtrak_counties.sjoin_nearest(amtrak_stations, how='left', distance_col='Meters to Station')

amtrak_counties = amtrak_counties.drop_duplicates('FactFinder GEOID')

amtrak_counties['Km to Station'] = (amtrak_counties['Meters to Station'] / 1000).round()

amtrak_counties = us_counties.merge(amtrak_counties[['FactFinder GEOID', 'StationNam', 'Km to Station']], how='left')

axis = amtrak_counties.plot('Km to Station', scheme='naturalbreaks',
        cmap='coolwarm', legend=True, edgecolor='none',
        legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
Figure
Km to nearest Amtrak station by county

Subsetting

Subsetting can be used to display specific states.

analysis_state = 'IL'

state_amtrak = amtrak_counties[amtrak_counties['ST'] == analysis_state]

axis = state_amtrak.plot('Km to Station', cmap='coolwarm', scheme='naturalbreaks', 
	legend=True, legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

axis.set_axis_off()

plt.show()
Figure
Km to nearest Amtrak station by Illinois county

Spoke Lines

Spoke lines can be added to visualize nearest station relationships.

hubs = amtrak_stations.copy()

hubs['endx'] = hubs.centroid.x

hubs['endy'] = hubs.centroid.y

spoke_lines = amtrak_counties.copy()

spoke_lines['startx'] = spoke_lines.centroid.x

spoke_lines['starty'] = spoke_lines.centroid.y

spoke_lines = spoke_lines.merge(hubs[['endx', 'endy', 'StationNam']])

spoke_lines['wkt'] = 'LINESTRING(' + spoke_lines.startx.astype(str) + ' ' + \
	spoke_lines.starty.astype(str) + ', ' + \
	spoke_lines.endx.astype(str) + ' ' + \
	spoke_lines.endy.astype(str) + ')'

lines = geopandas.GeoSeries.from_wkt(spoke_lines['wkt'])

spoke_lines = geopandas.GeoDataFrame(spoke_lines, geometry = lines, crs=amtrak_counties.crs)

axis = spoke_lines.plot(color='darkred', linewidth=0.5)

amtrak_stations.plot(markersize=4, ax=axis)

us_counties.plot(edgecolor='#00000080', facecolor='none', linewidth=0.5, ax=axis)

axis.set_axis_off()

plt.show()
Figure
County spoke lines to Amtrak station hubs

Ordinal Ranking

Underserved counties that have high populations but distant service can be identified by multiplying population by distance. Counties can then be sorted by that score to find the areas that are most underserved.

Two groups of counties stand out on the list: moderate population counties (such as the counties in the Texas Rio Grande valley) that are far from an Amtrak station, and high population counties (such as Los Angeles or San Diego) that have a limited number of large stations.

amtrak_underserved = amtrak_counties.copy()

amtrak_underserved['Person Km'] = amtrak_underserved['Total Population'] * amtrak_underserved['Km to Station']

amtrak_underserved = amtrak_underserved.sort_values('Person Km', ascending=False)

print(amtrak_underserved[['Name', 'ST', 'Total Population', 'Km to Station', 'Person Km']].head(n=10))
                Name  ST  Total Population  Km to Station    Person Km
2596         Hidalgo  TX          855176.0          348.0  297601248.0
74          Maricopa  AZ         4328810.0           42.0  181810020.0
2519         Cameron  TX          421666.0          387.0  163184742.0
193        San Diego  CA         3316073.0           48.0  159171504.0
192   San Bernardino  CA         2149031.0           72.0  154730232.0
2640         Lubbock  TX          304808.0          375.0  114303000.0
333       Miami-Dade  FL         2699428.0           40.0  107977120.0
2168           Tulsa  OK          646419.0          155.0  100194945.0
175      Los Angeles  CA        10081570.0            9.0   90734130.0
2440            Knox  TN          461104.0          174.0   80232096.0
top_underserved = amtrak_underserved.copy()

top_underserved = top_underserved.iloc[0:10]

axis = top_underserved.plot(facecolor='red')

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, ax=axis)

axis.set_axis_off()

plt.show()
Figure
Map of 20 counties most underserved by Amtrak

Aggregated Areas

County-level travel distances can be aggregated by state to identify the best and worst states for Amtrak access.

def weighted_mean_distance(x):
	return (x['Km to Station'] * x['Total Population']).sum() / x['Total Population'].sum()

amtrak_states = amtrak_counties.groupby('ST').apply(weighted_mean_distance)

amtrak_states = amtrak_states.sort_values(ascending=False).reset_index()

amtrak_states = amtrak_states.rename(columns={0: 'Weighted Mean Km'})

print(amtrak_states.head())
   ST  Weighted Mean Km
0  NH          0.000000
1  OR          0.288263
2  CA          0.378485
3  MA          0.762972
4  RI          0.774991
print(weighted_states.tail())
    ST  Weighted Mean Km
43  OK         69.325894
44  IA         70.195880
45  AR         71.169017
46  MT        115.531178
47  SD        226.585449

We can map a choropleth by merging with state polygons.

amtrak_states = states.merge(amtrak_states)

axis = amtrak_states.plot('Weighted Mean Km', legend=True, scheme='naturalbreaks',
        cmap = 'coolwarm', edgecolor = '#00000020')

axis.set_axis_off()

plt.show()
Figure
Population weighted mean distance to Amtrak stations by state

Centrographics

Centrography is "statistical analyses concerned with centers of population, median centers, median points, and related methods" (Sviatlovsky and Eells 1939). Much like general statistical measures of central tendency, centrographic measures can be useful for summarizing large amounts of data, assessing change over time, and optimizing proximity.

Location Data: Major League Ballparks

This example will use the locations of major league ballparks in the US.

ballparks = geopandas.read_file('https://michaelminn.net/tutorials/data/2019-mlb-ballparks.geojson')

ballparks = ballparks.to_crs(states.crs)

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
Figure
US Major League baseball parks (2019)

The mean center is the point at the mean of all latitudes and mean of all longitudes for a group of features.

When working with points of interest, the mean center represents an optimal location with the minimum amount of Euclidean distance to each point of interest.

One use for a mean center is to find a central location that would minimize travel to a centralized location from a group of geographically dispersed cities. If baseball general managers were planning on having a group meeting, or if a championship game were planned at a neutral site, the mean center might be a good choice.

The mean center of US major league ballparks is in Missouri between Kansas City and St. Louis at lat/long 39.18905, -92.63477. This is slightly north of the 2020 census center of population in south central Missouri at 37.415725, -92.346525 (USCB 2021).

Because any meeting of any size would presumably need to be in a major city with copious lodging and a major airport, St. Louis or Kansas City would be suitable locations.

x = ballparks['geometry'].centroid.x.mean()

y = ballparks['geometry'].centroid.y.mean()

mean_center = geopandas.GeoDataFrame(geometry = geopandas.points_from_xy([x], [y], crs=states.crs))

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

mean_center.plot(marker='*', color='red', ax=axis, markersize=96)

axis.set_axis_off()

plt.show()

print(mean_center.to_crs('EPSG:4326')['geometry'])
POINT (-92.63477 39.18905)
Figure
The mean center of US Major League baseball parks

Limitations

The major caveat with mean centers is that they may not represent optimal practical travel distance since transportation networks or physical obstacles may impose indirect routing and access.