Proximity Analysis in Python
Proximity analysis is the extraction of information from geospatial data based on the distance between features. It is especially useful for evaluating two characteristics:
- Accessibility (periphery to core): Ability to reach desirable destinations like transit hubs or green space
- Influence (core to periphery): Projection of materials or power from central locations outward, such as political control, chemical emissions, or habitat extent
Proximity analysis assumes distance decay, where the strength of the relationship between locations decreases as the distance between them increases. Distance decay is a fundamental assumption behind gravity models.
- With desirable characteristics (like appeal to tourists), distance decay means that distant locations are less desirable than closer locations (closer locations require less travel effort).
- With undesirable characteristics (like toxic emissions), this means that distant locations are more desirable than closer locations (pollution disperses over space and is less intense the further you are from the source).
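Distance decay is often represented with an inverse-power weighting function. The sketch below is a minimal illustration rather than a calibrated gravity model; the function name and the exponent of two are assumptions for demonstration only.

# Minimal sketch of inverse-power distance decay. The exponent (beta)
# is an assumption; gravity models are typically calibrated to the
# phenomenon being studied.

def decay_weight(distance, beta=2):
    return 1 / (distance ** beta)

print(decay_weight(10))   # 0.01: stronger relationship at short distances
print(decay_weight(100))  # 0.0001: weaker relationship at long distances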
Three methods involving proximity are covered in this tutorial.
Categorical Proximity
Categorical proximity analysis divides areas into two classes (adjacent and non-adjacent) based on proximity to specific locations.
This example will analyze census tract proximity to petroleum refineries to answer these research questions:
- Which neighborhoods are adjacent to refineries in the state of Louisiana?
- How are neighborhoods adjacent to refineries demographically different from neighborhoods that are not adjacent to refineries?
Location Data: Refineries
Proximity analysis can be performed on two sets of features, either of which can be points, lines, or polygons.
For this example, we will analyze proximity between points and areas.
The example point locations will be petroleum refinery locations from the Energy Information Administration (EIA) U.S. Energy Atlas.
Oil refineries are large facilities that transform crude oil pumped from the ground into gasoline and a wide variety of other products that are essential to contemporary life in the United States.
- GeoPandas is a Python package for working with geospatial data.
- To create plots you will need the pyplot module from Matplotlib. The as keyword creates an alias (plt) that saves you from having to type the whole module name (matplotlib.pyplot) when referencing a function from that module.
- Points can be loaded directly as GeoJSON from the atlas' feature service using geopandas.read_file().
- To facilitate calculating distances, we reproject the data into the North America Albers Equal Area Conic Projection (ESRI 102008) (in meters) using the to_crs() method.
import geopandas
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

us_refineries = geopandas.read_file('https://services7.arcgis.com/FGr1D95XCGALKXqM/arcgis/rest/services/Petroleum_Refineries_US_EIA/FeatureServer/22/query?outFields=*&where=1%3D1&f=geojson')

us_refineries = us_refineries.to_crs("ESRI:102008")

axis = us_refineries.plot()

axis.set_axis_off()

plt.show()
Area Data: Census Tracts
The areas we use for this example will be census tracts in Louisiana with demographic data.
The American Community Survey (ACS) is an ongoing survey by the US Census Bureau that provides information about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.
Census tracts are subdivisions of counties that are drawn by the US Census Bureau based on clearly identifiable features to ideally contain around 4,000 residents, although in practice the range of population is usually between 1,200 and 8,000 (USCB 2019). The census tract is the smallest reliable area of aggregation in data made public by the US Census Bureau.
The code below loads a GeoJSON file of census tracts that have been prejoined to commonly-used variables from the 2015-2019 American Community Survey five-year estimates.
Note that there are over 70K census tracts in the US and this GeoJSON file may take a minute or two to load, depending on the speed of your internet connection.
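If you expect to run this analysis repeatedly, one optional workaround (a sketch, not part of the tutorial workflow; the local file name is an arbitrary choice) is to cache the download locally and read the cached copy on subsequent runs.

import os

# Optional caching sketch: download the large tracts file once and
# reuse the local copy on later runs. The local file name is arbitrary.
tracts_file = '2015-2019-acs-tracts.geojson'

if not os.path.exists(tracts_file):
    downloaded = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-tracts.geojson')
    downloaded.to_file(tracts_file, driver='GeoJSON')

us_tracts = geopandas.read_file(tracts_file)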
us_tracts = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-tracts.geojson')

us_tracts = us_tracts.to_crs(us_refineries.crs)

axis = us_tracts.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

axis.set_axis_off()

plt.show()
Subset
For this analysis we will filter our locations to focus on refineries and tracts in the US state of Louisiana, which is home to a collection of petrochemical facilities known as Cancer Alley (Baurick 2019).
We also load and plot a file of county boundaries to provide geographic context.
tracts = us_tracts[us_tracts['ST'] == 'LA']

refineries = us_refineries[us_refineries['State'] == 'Louisiana']

us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

counties = us_counties[us_counties['ST'] == 'LA']

counties = counties.to_crs(tracts.crs)

axis = tracts.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none')

counties.plot(edgecolor='#00000080', facecolor='none', ax=axis)

refineries.plot(ax=axis, color='black')

axis.set_axis_off()

plt.show()
Buffers
One common technique for categorical proximity analysis is buffer and overlay, where buffer polygons (usually circles) of a fixed size or radius are placed around locations, and those buffers are overlaid on another set of areas to find areas that are adjacent to the buffered locations.
Oil refineries emit a wide variety of toxic chemicals (UCAR 2023). Accordingly, we might hypothesize that the areas around oil refineries would be considered undesirable places to live, reducing the property values and rents in those areas and attracting marginalized populations without the economic means to live in more desirable locations.
To represent the presumed undesirable area around refineries, we will create ten-mile (roughly 16,000-meter) buffers, corresponding to research finding increased cancer risk within ten miles of refineries (Williams et al. 2020).
The GeoPandas buffer() method can be used to create buffers.
refinery_buffers = refineries.copy()

refinery_buffers['geometry'] = refinery_buffers.buffer(16000)

axis = refinery_buffers.plot(edgecolor='black', facecolor='none')

counties.plot(edgecolor='#00000040', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
The small size of the buffers relative to the size of the state makes them indistinguishable from dots. We can use set_xlim() and set_ylim() with the total_bounds bounding box of a selected county to define a zoom area that makes the buffers around the refinery locations more visible.
zoom = counties[counties['Name'] == 'St. Charles']

axis = tracts.plot(edgecolor='#00008040', facecolor='none')

refinery_buffers.plot(edgecolor='black', facecolor='none', ax=axis)

axis.set_xlim(zoom.total_bounds[0], zoom.total_bounds[2])

axis.set_ylim(zoom.total_bounds[1], zoom.total_bounds[3])

axis.set_axis_off()

plt.show()
Overlay
To flag the census tracts that intersect buffers, we use the sjoin() method.
- We use a left join (rather than the default inner join) so that all tracts are returned from the join even if there was no intersecting refinery buffer.
- We use the fillna() method to fill blank (NA) values with the label 'Non-Refinery' in tracts that did not intersect refinery buffers.
- We use the Pandas drop_duplicates() method to leave only one record for tracts that intersect with multiple refinery buffers.
count_buffers = refinery_buffers[['geometry']].copy()

count_buffers['Refinery'] = 'Refinery'

refinery_tracts = tracts.sjoin(count_buffers, how='left')

refinery_tracts = refinery_tracts.drop_duplicates('FactFinder GEOID')

refinery_tracts['Refinery'] = refinery_tracts['Refinery'].fillna('Non-Refinery')

axis = refinery_tracts.plot('Refinery', cmap='bwr_r', legend=True, edgecolor='none')

counties.plot(edgecolor='#00000020', facecolor='none', ax=axis)

refinery_buffers.plot(ax=axis, edgecolor='black', facecolor='none')

axis.set_axis_off()

plt.show()
Group Totals
Count variables (like population) can be aggregated using groupby().
- sum() can be used to get totals within groups (total state population in refinery vs. non-refinery tracts).
- Sums can be divided by the overall sum and multiplied by 100 to get percentages (percent of state population in refinery vs. non-refinery tracts).
- count() can be used to find the number of features (number of refinery vs. non-refinery tracts).
print(refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum())

print((refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum() \
    * 100 / refinery_tracts['Total Population'].sum()).round())

print(refinery_tracts[['geometry','Refinery']].groupby('Refinery').count())
              Total Population
Refinery
Non-Refinery         2768488.0
Refinery             1895874.0

              Total Population
Refinery
Non-Refinery              59.0
Refinery                  41.0

              geometry
Refinery
Non-Refinery       609
Refinery           531
Group Comparison
To explore demographic differences between tracts surrounding refineries and all other tracts in the state, we can create a sorted table based on the percentage difference in the variable medians between the two groups.
- The Pandas select_dtypes() method selects only the numeric columns, excluding text and geometry columns.
- We groupby() presence of a refinery and take the median() for each variable.
- The Pandas T property gives a transposed version of the table so the output is long rather than wide.
- The Pandas reset_index() method converts the list to a DataFrame and copies the variable names into a column.
- We divide the refinery tract medians by the non-refinery tract medians and multiply by 100 to get the percentage difference between the two groups.
- We sort_values() so the variables with positive differences are at the top of the list and negative differences are at the bottom of the list.
- We print() the head() and tail() of the list to see the variables with the greatest differences.
import numpy

refinery_values = refinery_tracts.select_dtypes(include=numpy.number)

refinery_values['Refinery'] = refinery_tracts['Refinery']

refinery_values = refinery_values.groupby('Refinery').median().T.reset_index()

refinery_values['Diff'] = round(100 * ((refinery_values['Refinery'] / refinery_values['Non-Refinery']) - 1))

refinery_values = refinery_values.sort_values('Diff', ascending=False)

print(refinery_values.head(n=15))
Counterintuitively, refinery tracts tend to be more urban (Pop per Square Mile), with more immigrants (Percent Foreign-Born), renters (Percent Renters), and residents with college degrees (Percent Graduate Degree, Percent Bachelors Degree).
Refinery                       index  Non-Refinery     Refinery   Diff
27               Pop per Square Mile    355.500000  3055.830000  760.0
5               Percent Foreign-Born      1.700000     3.350000   97.0
20                   Percent Renters     28.100000    42.900000   53.0
21                Percent No Vehicle      6.200000     9.100000   47.0
9            Percent Graduate Degree      5.100000     6.750000   32.0
8           Percent Bachelors Degree     11.000000    14.200000   29.0
3             Percent Single Mothers     38.100000    46.400000   22.0
16               Median Monthly Rent    769.000000   927.500000   21.0
13              Percent Unemployment      3.200000     3.700000   16.0
15           Median Monthly Mortgage   1134.000000  1287.500000   14.0
17              Percent Vacant Units     15.000000    15.300000    2.0
10                 Percent Broadband     74.200000    74.300000    0.0
18             Percent Pre-War Units      3.200000     3.200000    0.0
14          Percent Health Insurance     90.600000    90.400000   -0.0
28                          Latitude     30.484552    30.030161   -1.0
Likewise, the refinery tracts are smaller (Square Miles, Total Population, Total Households) with lower birth rates (Annual Births per 1K Women) and lower commute times (Mean Minutes to Work).
print(refinery_values.tail(n=15))
Refinery                         index  Non-Refinery      Refinery  Diff
29                           Longitude    -91.846109    -90.268526  -2.0
11             Median Household Income  45907.000000  44898.500000  -2.0
26                          Median Age     38.400000     37.300000  -3.0
2               Average Household Size      2.600000      2.530000  -3.0
24                    Percent Under 18     23.400000     22.050000  -6.0
6                     Percent Veterans      6.600000      6.100000  -8.0
25                     Percent 65 Plus     16.000000     14.500000  -9.0
12                Mean Minutes To Work     25.800000     23.200000 -10.0
7                     Percent Disabled     16.100000     14.500000 -10.0
4           Annual Births per 1K Women     51.000000     44.000000 -14.0
1                     Total Households   1489.000000   1178.000000 -21.0
19                  Percent Homeowners     71.900000     57.100000 -21.0
22                    Total Population   4044.000000   3113.000000 -23.0
0                         Square Miles     12.873035      0.998627 -92.0
30                         index_right           NaN     48.000000   NaN
However, when viewing a box plot, we see that general tendencies are not always predictive. In the case of the percent of foreign-born residents, there are census tracts with high percentages of foreign-born residents regardless of whether there is a refinery nearby.
refinery_tracts.boxplot('Percent Foreign-Born', 'Refinery')

plt.show()
Limitations of Buffer and Overlay
Simple circular buffers assume that influence of (or accessibility to) a point extends evenly in all directions around that point.
In the case of toxic emissions, a transport model is needed to capture wind and water flow patterns and how they carry the toxic material. These models can be quite complex (Beyea and Hatch 1999).
In the case of transit accessibility, circular buffers assume direct access to locations, while actual walking distance and effort are affected by street grids, physical obstacles (like roads, buildings, and private property), and topographical barriers (like creeks or steep hills).
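As a rough illustration of the difference between straight-line and grid-constrained travel, the sketch below compares Euclidean distance with rectilinear (Manhattan) distance between two arbitrary points in a projected coordinate system. A realistic accessibility analysis would route along an actual street network.

import math

# Illustration: straight-line vs. grid-constrained distance between
# two arbitrary points (coordinates in meters).
start_x, start_y = 0, 0
end_x, end_y = 3000, 4000

euclidean = math.hypot(end_x - start_x, end_y - start_y)
manhattan = abs(end_x - start_x) + abs(end_y - start_y)

print(euclidean)  # 5000.0 meters as the crow flies
print(manhattan)  # 7000 meters along a rectilinear street grid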
Weighted Proximity Analysis
Weighted proximity analysis uses distance between features to quantify proximity. These continuous distance values (weights) can then be used with techniques like multiple regression.
This example will analyze county proximity to the nearest Amtrak passenger rail station in order to answer these research questions:
- Which counties have the greatest and least accessibility to Amtrak stations?
- Which high-population counties could potentially merit added Amtrak stations?
Location Data: Amtrak Stations
The destination points used for these examples will be Amtrak stations originally downloaded from the US Bureau of Transportation Statistics (BTS) Open Data Catalog.
Amtrak is a quasi-public US passenger railroad corporation that was formed by the US federal government in 1971 to take over US intercity passenger service when private railroads were facing economic crisis and were eager to discontinue unprofitable passenger operations (Minn 2016).
Unlike transit, which has utility only within a convenient walking or driving distance, Amtrak carries a significant number of leisure riders for whom the value is the journey as much as or more than the destination (Losada-Rojas et al. 2019). This hub-and-spoke analysis assumes that some Amtrak riders may be willing to drive a considerable distance to access service, although longer distances make the trip less desirable (distance decay).
import geopandas
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

amtrak_stations = geopandas.read_file('http://michaelminn.net/tutorials/python-proximity/2023-amtrak-stations.geojson')

amtrak_stations = amtrak_stations.to_crs('ESRI:102008')

axis = amtrak_stations.plot(markersize=4, color='red')

axis.set_axis_off()

plt.show()
Area Data: Counties
The source areas for riders will be US counties, as loaded above with boundaries and 2015-2019 ACS demographic data originally sourced from the US Census Bureau. Counties in the United States are geographical divisions of states for government administration.
The code below loads a GeoJSON file of counties that have been prejoined to commonly-used variables from the 2015-2019 American Community Survey five-year estimates. We also load a GeoJSON file of states to provide geographic context.
To make the maps clearer, we subset only counties and states in the 48 continental states where Amtrak provides service.
us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

us_counties = us_counties[~us_counties['ST'].isin(['AK', 'HI', 'PR'])]

us_counties = us_counties.to_crs(amtrak_stations.crs)

states = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson')

states = states[~states['ST'].isin(['AK', 'HI', 'PR'])]

states = states.to_crs(amtrak_stations.crs)

axis = us_counties.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, color='black', ax=axis)

axis.set_axis_off()

plt.show()
Hub and Spoke Analysis
Hub and spoke analysis is a weighted proximity analysis technique that analyzes accessibility to a small number of centralized hubs where distance from peripheral points to the closest hub is calculated with imaginary spoke lines. Hub-and-spoke is useful for estimating or optimizing accessibility for hubs like distribution centers or transit stops.
- The county geometries are first converted to their centroid points so the distances to stations are calculated from the center of each county.
- The GeoPandas sjoin_nearest() function finds the nearest Amtrak stop to each county.
- The distance_col parameter indicates the name of the attribute to add for the distance, which is in meters with this projection.
- The Pandas drop_duplicates() method is needed because sjoin_nearest() results "will include multiple output records for a single input record where there are multiple equidistant nearest or intersected neighbors."
- The nearest station name and distance are joined back into the county polygons with the Pandas merge() method.
amtrak_counties = us_counties.copy()

amtrak_counties['geometry'] = amtrak_counties.centroid

amtrak_counties = amtrak_counties.sjoin_nearest(amtrak_stations, how='left', distance_col='Meters to Station')

amtrak_counties = amtrak_counties.drop_duplicates('FactFinder GEOID')

amtrak_counties['Km to Station'] = (amtrak_counties['Meters to Station'] / 1000).round()

amtrak_counties = us_counties.merge(amtrak_counties[['FactFinder GEOID', 'StationNam', 'Km to Station']], how='left')

axis = amtrak_counties.plot('Km to Station', scheme='naturalbreaks', cmap='coolwarm',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
Subsetting
Subsetting can be used to display specific states.
analysis_state = 'IL'

state_amtrak = amtrak_counties[amtrak_counties['ST'] == analysis_state]

axis = state_amtrak.plot('Km to Station', cmap='coolwarm', scheme='naturalbreaks',
    legend=True, legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

axis.set_axis_off()

plt.show()
Spoke Lines
Spoke lines can be added to visualize nearest station relationships.
hubs = amtrak_stations.copy()

hubs['endx'] = hubs.centroid.x

hubs['endy'] = hubs.centroid.y

spoke_lines = amtrak_counties.copy()

spoke_lines['startx'] = spoke_lines.centroid.x

spoke_lines['starty'] = spoke_lines.centroid.y

spoke_lines = spoke_lines.merge(hubs[['endx', 'endy', 'StationNam']])

spoke_lines['wkt'] = 'LINESTRING(' + spoke_lines.startx.astype(str) + ' ' + \
    spoke_lines.starty.astype(str) + ', ' + \
    spoke_lines.endx.astype(str) + ' ' + \
    spoke_lines.endy.astype(str) + ')'

lines = geopandas.GeoSeries.from_wkt(spoke_lines['wkt'])

spoke_lines = geopandas.GeoDataFrame(spoke_lines, geometry=lines, crs=amtrak_counties.crs)

axis = spoke_lines.plot(color='darkred', linewidth=0.5)

amtrak_stations.plot(markersize=4, ax=axis)

us_counties.plot(edgecolor='#00000080', facecolor='none', linewidth=0.5, ax=axis)

axis.set_axis_off()

plt.show()
Ordinal Ranking
Underserved counties that have high populations but distant service can be identified by multiplying population by distance. Counties can then be sorted by that score to find the areas that are most underserved.
Two groups of counties stand out on the list: moderate population counties (such as the counties in the Texas Rio Grande valley) that are far from an Amtrak station, and high population counties (such as Los Angeles or San Diego) that have a limited number of large stations.
amtrak_underserved = amtrak_counties.copy()

amtrak_underserved['Person Km'] = amtrak_underserved['Total Population'] * amtrak_underserved['Km to Station']

amtrak_underserved = amtrak_underserved.sort_values('Person Km', ascending=False)

print(amtrak_underserved[['Name', 'ST', 'Total Population', 'Km to Station', 'Person Km']].head(n=10))
                Name  ST  Total Population  Km to Station    Person Km
2596         Hidalgo  TX          855176.0          348.0  297601248.0
74          Maricopa  AZ         4328810.0           42.0  181810020.0
2519         Cameron  TX          421666.0          387.0  163184742.0
193        San Diego  CA         3316073.0           48.0  159171504.0
192   San Bernardino  CA         2149031.0           72.0  154730232.0
2640         Lubbock  TX          304808.0          375.0  114303000.0
333       Miami-Dade  FL         2699428.0           40.0  107977120.0
2168           Tulsa  OK          646419.0          155.0  100194945.0
175      Los Angeles  CA        10081570.0            9.0   90734130.0
2440            Knox  TN          461104.0          174.0   80232096.0
top_underserved = amtrak_underserved.copy()

top_underserved = top_underserved.iloc[0:10]

axis = top_underserved.plot(facecolor='red')

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, ax=axis)

axis.set_axis_off()

plt.show()
Aggregated Areas
County-level travel distances can be aggregated by state to identify the best and worst states for Amtrak access.
- The mean (average) of the distances of all counties in a state is weighted by population so that access in counties where most people live counts more strongly than access in less-populous counties.
- Higher values (more people traveling further to stations) give an indication of which states are underserved compared to other states.
- Weighted mean distance is calculated by multiplying each county's distance by its population, then dividing the sum of those products by the sum of the total population. This calculation needs to be placed in a function because the aggregation functions used with groupby() like mean() or sum() only accept one column, and we need to perform the calculation with both the distance and population fields.
- We use the Pandas groupby() method to group the county values by state and then apply() the function defined above to each grouping of counties by state.
- We sort_values() to make it easier to see the states with the highest and lowest weighted mean distances.
- Because apply() results in a column named 0, we rename() it before displaying the top five values with head().
def weighted_mean_distance(x):
    return (x['Km to Station'] * x['Total Population']).sum() / x['Total Population'].sum()

amtrak_states = amtrak_counties.groupby('ST').apply(weighted_mean_distance)

amtrak_states = amtrak_states.sort_values().reset_index()

amtrak_states = amtrak_states.rename(columns={0: 'Weighted Mean Km'})

print(amtrak_states.head())
   ST  Weighted Mean Km
0  NH          0.000000
1  OR          0.288263
2  CA          0.378485
3  MA          0.762972
4  RI          0.774991
print(amtrak_states.tail())
    ST  Weighted Mean Km
43  OK         69.325894
44  IA         70.195880
45  AR         71.169017
46  MT        115.531178
47  SD        226.585449
We can map a choropleth by merging with state polygons.
amtrak_states = states.merge(amtrak_states)

axis = amtrak_states.plot('Weighted Mean Km', legend=True, scheme='naturalbreaks',
    cmap='coolwarm', edgecolor='#00000020')

axis.set_axis_off()

plt.show()
Centrographics
Centrography is "statistical analyses concerned with centers of population, median centers, median points, and related methods" (Sviatlovsky and Eells 1939). Much like general statistical measures of central tendency, centrographic measures can be useful for summarizing large amounts of data, assessing change over time, and optimizing proximity.
Location Data: Major League Ballparks
This example will use the locations of major league ballparks in the US.
ballparks = geopandas.read_file('https://michaelminn.net/tutorials/data/2019-mlb-ballparks.geojson')

ballparks = ballparks.to_crs(states.crs)

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
The mean center is the point at the mean of all latitudes and mean of all longitudes for a group of features.
When working with points of interest, the mean center represents a central location that minimizes the sum of squared Euclidean distances to the points (the closely-related median center minimizes the total unsquared distance).
One use for a mean center is to find a central location that would minimize travel to a centralized location from a group of geographically dispersed cities. If baseball general managers were planning on having a group meeting, or if a championship game were planned at a neutral site, the mean center might be a good choice.
The mean center of US major league ballparks is in Missouri between Kansas City and St. Louis at lat/long 39.18905, -92.63477. This is slightly north of the 2020 census center of population in south central Missouri at 37.415725, -92.346525 (USCB 2021).
Because any meeting of any size would presumably need to be in a major city with copious lodging and a major airport, St. Louis or Kansas City would be suitable locations.
x = ballparks['geometry'].centroid.x.mean()

y = ballparks['geometry'].centroid.y.mean()

mean_center = geopandas.GeoDataFrame(geometry = geopandas.points_from_xy([x], [y], crs=states.crs))

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

mean_center.plot(marker='*', color='red', ax=axis, markersize=96)

axis.set_axis_off()

plt.show()

print(mean_center.to_crs('EPSG:4326')['geometry'])
POINT (-92.63477 39.18905)
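For comparison with the mean center, the sketch below approximates the median center (geometric median) of the ballparks, which minimizes total rather than squared distance, using Weiszfeld's iterative algorithm. The fixed iteration count and the small floor on distances are arbitrary implementation choices.

import numpy

# Sketch: approximate the median center of the ballparks with
# Weiszfeld's algorithm, starting from the mean center.
points = numpy.column_stack([ballparks.centroid.x, ballparks.centroid.y])

center = points.mean(axis=0)

for iteration in range(100):
    distances = numpy.linalg.norm(points - center, axis=1)
    weights = 1 / numpy.maximum(distances, 1e-9)  # guard against division by zero
    center = (points * weights[:, None]).sum(axis=0) / weights.sum()

median_center = geopandas.GeoDataFrame(geometry =
    geopandas.points_from_xy([center[0]], [center[1]], crs=states.crs))

print(median_center.to_crs('EPSG:4326')['geometry'])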
Limitations
The major caveat with mean centers is that they may not represent optimal practical travel distance since transportation networks or physical obstacles may impose indirect routing and access.