Proximity Analysis in Python
Proximity analysis is the extraction of information from geospatial data based on the distance between features. It is especially useful for evaluating two characteristics:
- Accessibility (periphery to core): Ability to reach desirable destinations like transit hubs or green space
- Influence (core to periphery): Projection of materials or power from central locations outward, such as political control, chemical emissions, or habitat extent
Proximity analysis assumes distance decay, where the strength of the relationship between locations decreases as the distance between them increases. Distance decay is a fundamental assumption behind gravity models.
- With desirable characteristics (like appeal to tourists), distance decay means that distant locations are less desirable than closer locations (closer locations require less travel effort).
- With undesirable characteristics (like toxic emissions), this means that distant locations are more desirable than closer locations (pollution disperses over space and is less intense the further you are from the source).
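Distance decay is often represented with an inverse-power weighting function. The sketch below is a minimal illustration rather than a calibrated gravity model; the function name and the exponent of two are assumptions for demonstration only.

# Minimal sketch of inverse-power distance decay. The exponent (beta)
# is an assumption; gravity models are typically calibrated to the
# phenomenon being studied.

def decay_weight(distance, beta=2):
    return 1 / (distance ** beta)

print(decay_weight(10))   # 0.01: stronger relationship at short distances
print(decay_weight(100))  # 0.0001: weaker relationship at long distances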
Three methods involving proximity are covered in this tutorial.
Categorical Proximity
Categorical proximity analysis divides areas into two classes (adjacent and non-adjacent) based on proximity to specific locations.
This example will analyze census tract proximity to petroleum refineries to answer these research questions:
- Which neighborhoods are adjacent to refineries in the state of Louisiana?
- How are neighborhoods adjacent to refineries demographically different from neighborhoods that are not adjacent to refineries?
Location Data: Refineries
Proximity analysis can be performed on two sets of features, either of which can be points, lines, or polygons.
For this example, we will analyze proximity between points and areas.
The example point locations will be petroleum refinery locations from the Energy Information Administration (EIA) U.S. Energy Atlas.
Oil refineries are large facilities that transform crude oil pumped from the ground into gasoline and a wide variety of other products that are essential to contemporary life in the United States.
- GeoPandas is a Python package for working with geospatial data.
- To create plots you will need the pyplot module from Matplotlib. The as keyword creates an alias (plt) that saves you from having to type the whole module name (matplotlib.pyplot) when referencing a function from that module.
- Points can be loaded directly as GeoJSON from the atlas' feature service using geopandas.read_file().
- To facilitate calculating distances, we reproject the data into the North America Albers Equal Area Conic Projection (ESRI 102008) (in meters) using the to_crs() method.
import geopandas
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

us_refineries = geopandas.read_file('https://services7.arcgis.com/FGr1D95XCGALKXqM/arcgis/rest/services/Petroleum_Refineries_US_EIA/FeatureServer/22/query?outFields=*&where=1%3D1&f=geojson')

us_refineries = us_refineries.to_crs("ESRI:102008")

axis = us_refineries.plot()

axis.set_axis_off()

plt.show()
Area Data: Census Tracts
The areas we use for this example will be census tracts in Louisiana with demographic data.
The American Community Survey (ACS) is an ongoing survey by the US Census Bureau that provides information about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.
Census tracts are subdivisions of counties that are drawn by the US Census Bureau based on clearly identifiable features to ideally contain around 4,000 residents, although in practice the range of population is usually between 1,200 and 8,000 (USCB 2019). The census tract is the smallest reliable area of aggregation in data made public by the US Census Bureau.
The code below loads a GeoJSON file of census tracts that have been prejoined to commonly-used variables from the 2015-2019 American Community Survey five-year estimates.
Note that there are over 70K census tracts in the US and this GeoJSON file may take a minute or two to load, depending on the speed of your internet connection.
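If you expect to run this analysis repeatedly, one optional workaround (a sketch, not part of the tutorial workflow; the local file name is an arbitrary choice) is to cache the download locally and read the cached copy on subsequent runs.

import os

# Optional caching sketch: download the large tracts file once and
# reuse the local copy on later runs. The local file name is arbitrary.
tracts_file = '2015-2019-acs-tracts.geojson'

if not os.path.exists(tracts_file):
    downloaded = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-tracts.geojson')
    downloaded.to_file(tracts_file, driver='GeoJSON')

us_tracts = geopandas.read_file(tracts_file)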
us_tracts = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-tracts.geojson')

us_tracts = us_tracts.to_crs(us_refineries.crs)

axis = us_tracts.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

axis.set_axis_off()

plt.show()
Subset
For this analysis we will filter our locations to focus on refineries and tracts in the US state of Louisiana, which is home to a collection of petrochemical facilities known as Cancer Alley (Baurick 2019).
We also load and plot a file of county boundaries to provide geographic context.
tracts = us_tracts[us_tracts['ST'] == 'LA']

refineries = us_refineries[us_refineries['State'] == 'Louisiana']

us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

counties = us_counties[us_counties['ST'] == 'LA']

counties = counties.to_crs(tracts.crs)

axis = tracts.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none')

counties.plot(edgecolor='#00000080', facecolor='none', ax=axis)

refineries.plot(ax=axis, color='black')

axis.set_axis_off()

plt.show()
Buffers
One common technique for categorical proximity analysis is buffer and overlay, where buffer polygons (usually circles) of a fixed size or radius are placed around locations, and those buffers are overlaid on another set of areas to find areas that are adjacent to the buffered locations.
Oil refineries emit a wide variety of toxic chemicals (UCAR 2023). Accordingly, we might hypothesize that the areas around oil refineries would be considered undesirable places to live, reducing the property values and rents in those areas and attracting marginalized populations without the economic means to live in more desirable locations.
To represent the presumed undesirable area around refineries, we will create ten-mile (roughly 16,000-meter) buffers, corresponding to research finding increased cancer risk within ten miles of refineries (Williams et al. 2020).
The GeoPandas buffer() method can be used to create buffers.
refinery_buffers = refineries.copy()

refinery_buffers['geometry'] = refinery_buffers.buffer(16000)

axis = refinery_buffers.plot(edgecolor='black', facecolor='none')

counties.plot(edgecolor='#00000040', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
The small size of the buffers relative to the size of the state makes them indistinguishable from dots. We can use set_xlim() and set_ylim() with the total_bounds bounding box of a selected county to define a zoom area that makes the buffers around the refinery locations more visible.
zoom = counties[counties['Name'] == 'St. Charles']

axis = tracts.plot(edgecolor='#00008040', facecolor='none')

refinery_buffers.plot(edgecolor='black', facecolor='none', ax=axis)

axis.set_xlim(zoom.total_bounds[0], zoom.total_bounds[2])

axis.set_ylim(zoom.total_bounds[1], zoom.total_bounds[3])

axis.set_axis_off()

plt.show()
Overlay
To flag the census tracts that intersect buffers, we use the sjoin() method.
- We use a left join (rather than the default inner join) so that all tracts are returned from the join even if there was no intersecting refinery buffer.
- We use the fillna() method to fill blank (NA) values with the label 'Non-Refinery' in tracts that did not intersect refinery buffers.
- We use the Pandas drop_duplicates() method to leave only one record for tracts that intersect with multiple refinery buffers.
count_buffers = refinery_buffers[['geometry']].copy()

count_buffers['Refinery'] = 'Refinery'

refinery_tracts = tracts.sjoin(count_buffers, how='left')

refinery_tracts = refinery_tracts.drop_duplicates('FactFinder GEOID')

refinery_tracts['Refinery'] = refinery_tracts['Refinery'].fillna('Non-Refinery')

axis = refinery_tracts.plot('Refinery', cmap='bwr_r', legend=True, edgecolor='none')

counties.plot(edgecolor='#00000020', facecolor='none', ax=axis)

refinery_buffers.plot(ax=axis, edgecolor='black', facecolor='none')

axis.set_axis_off()

plt.show()
Group Totals
Count variables (like population) can be aggregated using groupby().
- sum() can be used to get totals within groups (total state population in refinery vs. non-refinery tracts).
- Sums can be divided by the overall sum and multiplied by 100 to get percentages (percent of state population in refinery vs. non-refinery tracts).
- count() can be used to find the number of features (number of refinery vs. non-refinery tracts).
print(refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum())

print((refinery_tracts[['Total Population','Refinery']].groupby('Refinery').sum() \
    * 100 / refinery_tracts['Total Population'].sum()).round())

print(refinery_tracts[['geometry','Refinery']].groupby('Refinery').count())
              Total Population
Refinery
Non-Refinery         2768488.0
Refinery             1895874.0

              Total Population
Refinery
Non-Refinery              59.0
Refinery                  41.0

              geometry
Refinery
Non-Refinery       609
Refinery           531
Group Comparison
To explore demographic differences between tracts surrounding refineries and all other tracts in the state, we can create a sorted table based on the percentage difference in the variable medians between the two groups.
- The Pandas select_dtypes() method selects only the numeric columns, excluding text and geometry columns.
- We groupby() presence of a refinery and take the median() for each variable.
- The Pandas T property gives a transposed version of the table so the output is long rather than wide.
- The Pandas reset_index() method converts the list to a DataFrame and copies the variable names into a column.
- We divide the refinery tract medians by the non-refinery tract medians and multiply by 100 to get the percentage difference between the two groups.
- We sort_values() so the variables with positive differences are at the top of the list and negative differences are at the bottom of the list.
- We print() the head() and tail() of the list to see the variables with the greatest differences.
import numpy

refinery_values = refinery_tracts.select_dtypes(include=numpy.number)

refinery_values['Refinery'] = refinery_tracts['Refinery']

refinery_values = refinery_values.groupby('Refinery').median().T.reset_index()

refinery_values['Diff'] = round(100 * ((refinery_values['Refinery'] / refinery_values['Non-Refinery']) - 1))

refinery_values = refinery_values.sort_values('Diff', ascending=False)

print(refinery_values.head(n=15))
Counterintuitively, refinery tracts tend to be more urban (Pop per Square Mile), with more immigrants (Percent Foreign-Born), renters (Percent Renters), and residents with college degrees (Percent Graduate Degree, Percent Bachelors Degree).
Refinery                       index  Non-Refinery     Refinery   Diff
27               Pop per Square Mile    355.500000  3055.830000  760.0
5               Percent Foreign-Born      1.700000     3.350000   97.0
20                   Percent Renters     28.100000    42.900000   53.0
21                Percent No Vehicle      6.200000     9.100000   47.0
9            Percent Graduate Degree      5.100000     6.750000   32.0
8           Percent Bachelors Degree     11.000000    14.200000   29.0
3             Percent Single Mothers     38.100000    46.400000   22.0
16               Median Monthly Rent    769.000000   927.500000   21.0
13              Percent Unemployment      3.200000     3.700000   16.0
15           Median Monthly Mortgage   1134.000000  1287.500000   14.0
17              Percent Vacant Units     15.000000    15.300000    2.0
10                 Percent Broadband     74.200000    74.300000    0.0
18             Percent Pre-War Units      3.200000     3.200000    0.0
14          Percent Health Insurance     90.600000    90.400000   -0.0
28                          Latitude     30.484552    30.030161   -1.0
Likewise, the refinery tracts are smaller (Square Miles, Total Population, Total Households) with lower birth rates (Annual Births per 1K Women) and lower commute times (Mean Minutes to Work).
print(refinery_values.tail(n=15))
Refinery                         index  Non-Refinery      Refinery  Diff
29                           Longitude    -91.846109    -90.268526  -2.0
11             Median Household Income  45907.000000  44898.500000  -2.0
26                          Median Age     38.400000     37.300000  -3.0
2               Average Household Size      2.600000      2.530000  -3.0
24                    Percent Under 18     23.400000     22.050000  -6.0
6                     Percent Veterans      6.600000      6.100000  -8.0
25                     Percent 65 Plus     16.000000     14.500000  -9.0
12                Mean Minutes To Work     25.800000     23.200000 -10.0
7                     Percent Disabled     16.100000     14.500000 -10.0
4           Annual Births per 1K Women     51.000000     44.000000 -14.0
1                     Total Households   1489.000000   1178.000000 -21.0
19                  Percent Homeowners     71.900000     57.100000 -21.0
22                    Total Population   4044.000000   3113.000000 -23.0
0                         Square Miles     12.873035      0.998627 -92.0
30                         index_right           NaN     48.000000   NaN
However, when viewing a box plot, we see that general tendencies are not always predictive. In the case of the percent of foreign-born residents, there are census tracts with high percentages of foreign-born residents regardless of whether there is a refinery nearby.
refinery_tracts.boxplot('Percent Foreign-Born', 'Refinery')

plt.show()
Limitations of Buffer and Overlay
Simple circular buffers assume that influence of (or accessibility to) a point extends evenly in all directions around that point.
In the case of toxic emissions, a transport model is needed to capture wind and water flow patterns and how they carry the toxic material. These models can be quite complex (Beyea and Hatch 1999).
In the case of transit accessibility, circular buffers assume direct access to locations, while actual walking distance and effort are affected by street grids, physical obstacles (like roads, buildings, and private property), and topographical barriers (like creeks or steep hills).
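As a rough illustration of the difference between straight-line and grid-constrained travel, the sketch below compares Euclidean distance with rectilinear (Manhattan) distance between two arbitrary points in a projected coordinate system. A realistic accessibility analysis would route along an actual street network.

import math

# Illustration: straight-line vs. grid-constrained distance between
# two arbitrary points (coordinates in meters).
start_x, start_y = 0, 0
end_x, end_y = 3000, 4000

euclidean = math.hypot(end_x - start_x, end_y - start_y)
manhattan = abs(end_x - start_x) + abs(end_y - start_y)

print(euclidean)  # 5000.0 meters as the crow flies
print(manhattan)  # 7000 meters along a rectilinear street grid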
Weighted Proximity Analysis
Weighted proximity analysis uses distance between features to quantify proximity. These continuous distance values (weights) can then be used with techniques like multiple regression.
This example will analyze county proximity to the nearest Amtrak passenger rail station in order to answer these research questions:
- Which counties have the greatest and least accessibility to Amtrak stations?
- Which high-population counties could potentially merit added Amtrak stations?
Location Data: Amtrak Stations
The destination points used for these examples will be Amtrak stations originally downloaded from the US Bureau of Transportation Statistics (BTS) Open Data Catalog.
Amtrak is a quasi-public US passenger railroad corporation that was formed by the US federal government in 1971 to take over US intercity passenger service when private railroads were facing economic crisis and were eager to discontinue unprofitable passenger operations (Minn 2016).
Unlike transit, which has utility only within a convenient walking or driving distance, Amtrak carries a significant number of leisure riders for whom the value is the journey as much as or more than the destination (Losada-Rojas et al. 2019). This hub-and-spoke analysis assumes that some Amtrak riders may be willing to drive a considerable distance to access service, although longer distances make the trip less desirable (distance decay).
import geopandas
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

amtrak_stations = geopandas.read_file('http://michaelminn.net/tutorials/python-proximity/2023-amtrak-stations.geojson')

amtrak_stations = amtrak_stations.to_crs('ESRI:102008')

axis = amtrak_stations.plot(markersize=4, color='red')

axis.set_axis_off()

plt.show()
Area Data: Counties
The source areas for riders will be US counties, as loaded above with boundaries and 2015-2019 ACS demographic data originally sourced from the US Census Bureau. Counties in the United States are geographical divisions of states for government administration.
The code below loads a GeoJSON file of counties that have been prejoined to commonly-used variables from the 2015-2019 American Community Survey five-year estimates. We also load a GeoJSON file of states to provide geographic context.
To make the maps clearer, we subset only counties and states in the 48 continental states where Amtrak provides service.
us_counties = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson')

us_counties = us_counties[~us_counties['ST'].isin(['AK', 'HI', 'PR'])]

us_counties = us_counties.to_crs(amtrak_stations.crs)

states = geopandas.read_file('https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson')

states = states[~states['ST'].isin(['AK', 'HI', 'PR'])]

states = states.to_crs(amtrak_stations.crs)

axis = us_counties.plot('Median Household Income', scheme='naturalbreaks', cmap='coolwarm_r',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.3, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, color='black', ax=axis)

axis.set_axis_off()

plt.show()
Hub and Spoke Analysis
Hub and spoke analysis is a weighted proximity analysis technique that analyzes accessibility to a small number of centralized hubs where distance from peripheral points to the closest hub is calculated with imaginary spoke lines. Hub-and-spoke is useful for estimating or optimizing accessibility for hubs like distribution centers or transit stops.
- The county geometries are first converted to their centroid points so the distances to stations are calculated from the center of each county.
- The GeoPandas sjoin_nearest() function finds the nearest Amtrak stop to each county.
- The distance_col parameter indicates the name of the attribute to add for the distance, which is in meters with this projection.
- The Pandas drop_duplicates() method is needed because sjoin_nearest() results "will include multiple output records for a single input record where there are multiple equidistant nearest or intersected neighbors."
- The nearest station name and distance are joined back into the county polygons with the Pandas merge() method.
amtrak_counties = us_counties.copy()

amtrak_counties['geometry'] = amtrak_counties.centroid

amtrak_counties = amtrak_counties.sjoin_nearest(amtrak_stations, how='left', distance_col='Meters to Station')

amtrak_counties = amtrak_counties.drop_duplicates('FactFinder GEOID')

amtrak_counties['Km to Station'] = (amtrak_counties['Meters to Station'] / 1000).round()

amtrak_counties = us_counties.merge(amtrak_counties[['FactFinder GEOID', 'StationNam', 'Km to Station']], how='left')

axis = amtrak_counties.plot('Km to Station', scheme='naturalbreaks', cmap='coolwarm',
    legend=True, edgecolor='none', legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
Subsetting
Subsetting can be used to display specific states.
analysis_state = 'IL'

state_amtrak = amtrak_counties[amtrak_counties['ST'] == analysis_state]

axis = state_amtrak.plot('Km to Station', cmap='coolwarm', scheme='naturalbreaks',
    legend=True, legend_kwds={'bbox_to_anchor':(0.2, 0.4)})

axis.set_axis_off()

plt.show()
Spoke Lines
Spoke lines can be added to visualize nearest station relationships.
hubs = amtrak_stations.copy()

hubs['endx'] = hubs.centroid.x

hubs['endy'] = hubs.centroid.y

spoke_lines = amtrak_counties.copy()

spoke_lines['startx'] = spoke_lines.centroid.x

spoke_lines['starty'] = spoke_lines.centroid.y

spoke_lines = spoke_lines.merge(hubs[['endx', 'endy', 'StationNam']])

spoke_lines['wkt'] = 'LINESTRING(' + spoke_lines.startx.astype(str) + ' ' + \
    spoke_lines.starty.astype(str) + ', ' + \
    spoke_lines.endx.astype(str) + ' ' + \
    spoke_lines.endy.astype(str) + ')'

lines = geopandas.GeoSeries.from_wkt(spoke_lines['wkt'])

spoke_lines = geopandas.GeoDataFrame(spoke_lines, geometry=lines, crs=amtrak_counties.crs)

axis = spoke_lines.plot(color='darkred', linewidth=0.5)

amtrak_stations.plot(markersize=4, ax=axis)

us_counties.plot(edgecolor='#00000080', facecolor='none', linewidth=0.5, ax=axis)

axis.set_axis_off()

plt.show()
Ordinal Ranking
Underserved counties that have high populations but distant service can be identified by multiplying population by distance. Counties can then be sorted by that score to find the areas that are most underserved.
Two groups of counties stand out on the list: moderate population counties (such as the counties in the Texas Rio Grande valley) that are far from an Amtrak station, and high population counties (such as Los Angeles or San Diego) that have a limited number of large stations.
amtrak_underserved = amtrak_counties.copy()

amtrak_underserved['Person Km'] = amtrak_underserved['Total Population'] * amtrak_underserved['Km to Station']

amtrak_underserved = amtrak_underserved.sort_values('Person Km', ascending=False)

print(amtrak_underserved[['Name', 'ST', 'Total Population', 'Km to Station', 'Person Km']].head(n=10))
                Name  ST  Total Population  Km to Station    Person Km
2596         Hidalgo  TX          855176.0          348.0  297601248.0
74          Maricopa  AZ         4328810.0           42.0  181810020.0
2519         Cameron  TX          421666.0          387.0  163184742.0
193        San Diego  CA         3316073.0           48.0  159171504.0
192   San Bernardino  CA         2149031.0           72.0  154730232.0
2640         Lubbock  TX          304808.0          375.0  114303000.0
333       Miami-Dade  FL         2699428.0           40.0  107977120.0
2168           Tulsa  OK          646419.0          155.0  100194945.0
175      Los Angeles  CA        10081570.0            9.0   90734130.0
2440            Knox  TN          461104.0          174.0   80232096.0
top_underserved = amtrak_underserved.copy()

top_underserved = top_underserved.iloc[0:10]

axis = top_underserved.plot(facecolor='red')

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

amtrak_stations.plot(markersize=4, ax=axis)

axis.set_axis_off()

plt.show()
Aggregated Areas
County-level travel distances can be aggregated by state to identify the best and worst states for Amtrak access.
- The mean (average) of the distances of all counties in a state is weighted by population so that access in counties where most people live counts more strongly than access in less-populous counties.
- Higher values (more people traveling further to stations) give an indication of which states are underserved compared to other states.
- Weighted mean distance is calculated by multiplying each county's distance by its population, then dividing the sum of those products by the sum of the total population. This calculation needs to be placed in a function because the aggregation functions used with groupby() like mean() or sum() only accept one column, and we need to perform the calculation with both the distance and population fields.
- We use the Pandas groupby() method to group the county values by state and then apply() the function defined above to each grouping of counties by state.
- We sort_values() to make it easier to see the states with the highest and lowest weighted mean distances.
- Because apply() results in a column named 0, we rename() it before displaying the top five values with head().
def weighted_mean_distance(x):
    return (x['Km to Station'] * x['Total Population']).sum() / x['Total Population'].sum()

amtrak_states = amtrak_counties.groupby('ST').apply(weighted_mean_distance)

amtrak_states = amtrak_states.sort_values().reset_index()

amtrak_states = amtrak_states.rename(columns={0: 'Weighted Mean Km'})

print(amtrak_states.head())
   ST  Weighted Mean Km
0  NH          0.000000
1  OR          0.288263
2  CA          0.378485
3  MA          0.762972
4  RI          0.774991
print(amtrak_states.tail())
    ST  Weighted Mean Km
43  OK         69.325894
44  IA         70.195880
45  AR         71.169017
46  MT        115.531178
47  SD        226.585449
We can map a choropleth by merging with state polygons.
amtrak_states = states.merge(amtrak_states)

axis = amtrak_states.plot('Weighted Mean Km', legend=True, scheme='naturalbreaks',
    cmap='coolwarm', edgecolor='#00000020')

axis.set_axis_off()

plt.show()
Centrographics
Centrography is "statistical analyses concerned with centers of population, median centers, median points, and related methods" (Sviatlovsky and Eells 1939). Much like general statistical measures of central tendency, centrographic measures can be useful for summarizing large amounts of data, assessing change over time, and optimizing proximity.
Location Data: Major League Ballparks
This example will use the locations of major league ballparks in the US.
ballparks = geopandas.read_file('https://michaelminn.net/tutorials/data/2019-mlb-ballparks.geojson')

ballparks = ballparks.to_crs(states.crs)

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

axis.set_axis_off()

plt.show()
The mean center is the point at the mean of all latitudes and mean of all longitudes for a group of features.
When working with points of interest, the mean center represents a central location that minimizes the sum of squared Euclidean distances to the points (the closely-related median center minimizes the total unsquared distance).
One use for a mean center is to find a central location that would minimize travel to a centralized location from a group of geographically dispersed cities. If baseball general managers were planning on having a group meeting, or if a championship game were planned at a neutral site, the mean center might be a good choice.
The mean center of US major league ballparks is in Missouri between Kansas City and St. Louis at lat/long 39.18905, -92.63477. This is slightly north of the 2020 census center of population in south central Missouri at 37.415725, -92.346525 (USCB 2021).
Because any meeting of any size would presumably need to be in a major city with copious lodging and a major airport, St. Louis or Kansas City would be suitable locations.
x = ballparks['geometry'].centroid.x.mean()

y = ballparks['geometry'].centroid.y.mean()

mean_center = geopandas.GeoDataFrame(geometry = geopandas.points_from_xy([x], [y], crs=states.crs))

axis = ballparks.plot()

states.plot(edgecolor='#00000080', facecolor='none', ax=axis)

mean_center.plot(marker='*', color='red', ax=axis, markersize=96)

axis.set_axis_off()

plt.show()

print(mean_center.to_crs('EPSG:4326')['geometry'])
POINT (-92.63477 39.18905)
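For comparison with the mean center, the sketch below approximates the median center (geometric median) of the ballparks, which minimizes total rather than squared distance, using Weiszfeld's iterative algorithm. The fixed iteration count and the small floor on distances are arbitrary implementation choices.

import numpy

# Sketch: approximate the median center of the ballparks with
# Weiszfeld's algorithm, starting from the mean center.
points = numpy.column_stack([ballparks.centroid.x, ballparks.centroid.y])

center = points.mean(axis=0)

for iteration in range(100):
    distances = numpy.linalg.norm(points - center, axis=1)
    weights = 1 / numpy.maximum(distances, 1e-9)  # guard against division by zero
    center = (points * weights[:, None]).sum(axis=0) / weights.sum()

median_center = geopandas.GeoDataFrame(geometry =
    geopandas.points_from_xy([center[0]], [center[1]], crs=states.crs))

print(median_center.to_crs('EPSG:4326')['geometry'])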
Limitations
The major caveat with mean centers is that they may not represent optimal practical travel distance since transportation networks or physical obstacles may impose indirect routing and access.