Suitability Analysis in Python

Businesses and planning organizations frequently need to consider multiple criteria when deciding where to construct or open new facilities. These decisions can cover geographic scales from individual cities or states to whole nations, or potential locations around the world.

Suitability analysis is the quantification, comparison, and ranking of of locations based on how closely they adhere to specified criteria (ESRI 2023).

Criteria used in this tutorial for suitability analysis include:

This tutorial will cover the use of Python to perform suitability analysis for evaluating which US counties are most desirable for a given profession.

Criteria Data: BLS Occupation Data

The Occupational Outlook Handbook (OOH) provides information on " what workers do, the work environment, education, training, and other qualifications, pay, the job outlook, information on state and area data, similar occupations, and sources of additional information for more than 300 occupational profiles covering about 4 out of 5 jobs in the economy" (BLS 2022).

The A-Z index is a good place to start looking for information on a specific occupation.

OOH surveying and mapping technicians

BLS Occupational Employment and Wage Statistics

The US Bureau of Labor Statistics' Occupational Employment and Wage Statistics (OEWS) program uses surveys to produces employment and wage estimates annually for approximately 830 occupations. Occupational groups are numbered with Standard Occupational Classification (SOC) codes.

The US Bureau of Labor Statistics occupation category used as the example for this tutorial will be 17-3031 Surveying and Mapping Technicians.

The smallest area of aggregation available from the OEWS are metropolitan statistical areas, each of which represents "a core area containing a substantial population nucleus, together with adjacent communities having a high degree of economic and social integration with that core" (USCB 2021). Although some of these areas are spatially large, workers commonly commute within MSAs, which makes them useful for analyzing spatial concentrations of jobs in particular occupations.

  1. Go to the Occupational Employment and Wage Statistics Query System page.
  2. Select One occupation for multiple geographic areas.
  3. Select one occupation (Architecture and Engineering Occupations, 17-3031 Surveying and Mapping Technicians)
  4. Select a geographic type: Metropolitan or Non Metropolitan Area
  5. Select one or more areas: All MSA in this list
  6. Select one or more datatypes: All data types
  7. Select one or more release dates: Use the most recent release
  8. Select an output type: Excel
Downloading occupation data from the BLS

Spreadsheet Import

The spreadsheet is formatted for display and will need some cleanup. This code and the subsequent join code are examples of the complex code commonly needed to access open data sets that are distributed in formats chosen without consideration for interoperability.

import pandas

import geopandas

bls = pandas.read_excel("https://michaelminn.net/tutorials/python-suitability/2022-05-surveying-mapping-tech.xlsx", skiprows=5)

bls["msa7"] = bls["Area Name"].str.extract(r'(\d{7})')

bls = bls[["msa7", "Area Name", "Employment(1)", "Annual median wage(2)", 
		"Employment per 1,000 jobs", "Location Quotient"]]

bls.columns = ["msa7", "msa_name", "employment", "median_wage", "employment_per_1k", "location_quotient"]

for column in ["employment", "median_wage", "employment_per_1k", "location_quotient"]:
	bls[column] = pandas.to_numeric(bls[column], errors="coerce")

bls = bls.dropna()

bls = bls.sort_values("employment", ascending=False).reset_index(drop=True)

bls[["msa_name", "employment"]].head()
                                                  name  employment
0        Houston-The Woodlands-Sugar Land, TX(0026420)      2540.0
1             Dallas-Fort Worth-Arlington, TX(0019100)      1990.0
2                  Denver-Aurora-Lakewood, CO(0019740)      1890.0
3       New York-Newark-Jersey City, NY-NJ-PA(0035620)      1660.0
4           Atlanta-Sandy Springs-Roswell, GA(0012060)      1640.0

BLS MSA Polygon Data

While the data has MSA place names, in order to create choropleths of this data, you will need to join the table with data from a shapefile of MSA boundaries that you can also download from the BLS.

A location quotient is an economic metric calculated by dividing a local value by the overall national value as a means of assessing local strength or weakness in a specific economic characteristic like employment or economic activity. This example maps the location quotient for employment in this analysis occupation calculated by dividing the MSA rate of employment in this occupation divided by the overall national rate of employment in this occupation.

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

msa = geopandas.read_file("https://michaelminn.net/tutorials/python-suitability/2019-msashape2.zip")

msa["msa7"] = msa["msa7"].str.zfill(7)

bls_msa = msa.merge(bls, on="msa7", how="left")

bls_msa["location_quotient"] = bls_msa["location_quotient"].fillna(0)

axis = bls_msa.plot("location_quotient", cmap="coolwarm",
	legend=True, scheme="quantiles")

axis.set_axis_off()

plt.show()
Figure
Choropleth of 2021 mapping and surveying technician location quotients by MSA

Criteria Data: ACS Homeowner Costs

We can get housing cost data for counties across the country from the US Census Bureau's American Community Survey.

counties = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson")

counties = counties[~counties["ST"].isin(['AK', 'HI', 'PR'])]

counties = counties[["Name", "ST", "FactFinder GEOID", "Median Monthly Mortgage", "geometry"]]

counties = counties.to_crs("ESRI:102008")

states = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson")

states = states[~states["ST"].isin(['AK', 'HI', 'PR'])]

states = states.to_crs(counties.crs)

axis = counties.plot("Median Monthly Mortgage", cmap="coolwarm",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()
Figure
2015-2019 median monthly mortgage costs by county

Spatial Join Occupation and Housing Data

In order to analyze housing and occupational data, we need to join that data together.

A spatial join is a join where data from a join data set is copied into a target data set based on proximity of features in the two data sets.

Spatial joins for attributes

Because MSAs and counties are different types of areas where an attribute join is not possible, we will do a spatial join to transfer data from the MSAs to the counties that make up those MSAs.

bls_msa = bls_msa.to_crs(counties.crs)

bls_counties = counties[["FactFinder GEOID", "Name", "ST", "Median Monthly Mortgage", "geometry"]]

bls_counties = bls_counties.sjoin(bls_msa, how="left")

bls_counties = bls_counties.drop("index_right", axis=1)

axis = bls_counties.plot("median_wage", cmap="coolwarm_r", 
	legend=True, scheme="quantiles",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()
Figure
County median wage

Housing Affordability Index

Housing costs tend to be higher in urban areas than in rural areas, although incomes also tend to be higher in urban areas than in rural areas, so housing affordability analysis needs to consider not only the absolute costs of owning a home in different areas, but also the typical income of homeowners in those areas.

A typical rule of thumb is that total housing costs with a mortgage should consume no more than 35% to 45% of your pre-tax income (Muller 2023).

In the 2021 ACS, US median household income was $68,717 and monthly US homeowner costs (mortgage payments + taxes, insurance, utilities, etc.) were $1,672 ($20,064 annually), which indicates housing costs typically consume 29% of income.

This code creates an affordability index where an index of 100 indicates the normal percentage of income going to housing, values above 100 reflect affordability better than normal across the US, and values below 100 reflect lower affordability than normal.

bls_counties["affordability"] = 100 * (bls_counties["median_wage"] * 0.29) / (bls_counties["Median Monthly Mortgage"] * 12)

bls_counties = bls_counties.dropna()

bls_counties = bls_counties.sort_values("affordability", ascending=False).reset_index(drop=True)

bls_counties[["Name", "ST", "affordability"]].head()
	Name 		ST 	affordability
0 	Quitman 	MS 	196.634146
1 	Tallahatchie 	MS 	192.181168
2 	Calhoun 	MS 	184.696449
3 	Choctaw 	MS 	184.274286
4 	Chicot 		AR 	184.063927
bls_counties[["Name", "ST", "affordability"]].tail()
	Name 		ST 	affordability
2867 	Arlington 	VA 	45.283405
2868 	Westchester 	NY 	45.042970
2869 	Morris 		NJ 	44.900000
2870 	Monroe 		FL 	44.705143
2871 	Falls Church 	VA 	39.891262
axis = bls_counties.plot("affordability", cmap="coolwarm_r",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()
Figure
County housing affordability

Criteria Data: Family Points

Beyond professional and economic considerations, two major factors affecting residential choice include proximity to family and proximity to lifestyle locations, such as recreational activities or particularly desirable cities.

Points can vary in importance. For example, when elderly family members need caregiving, if moving them to you is not an option, you will need to move to them. Therefore, high importance point(s) should be in a separate layer so they can be prioritized.

For this example, we will use five midwestern points for family members.

family = [ ["Chicago", 41.87906573068059, -87.6359546674838],
	["Rolla", 37.95434820224575, -91.77403115678084],
	["Conway", 35.07797668119237, -92.45756175631256],
	["Monroe", 32.52867100392165, -92.07356205299774],
	["St. Louis", 38.63948197166543, -90.26258891961275] ]

family = pandas.DataFrame(family, columns=["City", "Latitude", "Longitude"])

geometry = geopandas.points_from_xy(family["Longitude"], family["Latitude"])

family = geopandas.GeoDataFrame(family, geometry = geometry, crs="EPSG:4326")

family.info()
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   City       5 non-null      object  
 1   Latitude   5 non-null      float64 
 2   Longitude  5 non-null      float64 
 3   geometry   5 non-null      geometry
dtypes: float64(2), geometry(1), object(1)

The GeoPandas sjoin_nearest() function is used to find the distance from each county to the nearest family member.

The Pandas drop_duplicates() method is needed because sjoin_nearest() results "will include multiple output records for a single input record where there are multiple equidistant nearest or intersected neighbors."

family = family.to_crs(bls_counties.crs)

bls_counties = bls_counties.sjoin_nearest(family, how="left", distance_col="family_distance")

bls_counties["family_distance"] = bls_counties["family_distance"] / 1000

bls_counties = bls_counties.drop_duplicates("FactFinder GEOID")

bls_counties = bls_counties.drop("index_right", axis=1)

bls_counties = bls_counties.sort_values("family_distance").reset_index(drop=True)

bls_counties[["Name", "ST", "family_distance"]].head()

Mapping the calculated distance shows a clear halo around the midwestern locations.

axis = bls_counties.plot("family_distance", cmap="coolwarm",
	legend=True, scheme="quantiles",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Family distance map

Categorical Suitability Analysis

Categorical suitability analysis involves the selection of areas based on satisfaction of specific categorical conditions, usually involving thresholds for quantitative values.

For this example we will select counties suitable to be considered for a job search based on available employment in the specified occupation, housing costs that are affordable on the regional salary in that occupation, and proximity to family.

Employment

Suitable counties will have a location quotient of one or higher indicating that a normal or greater percentage of the available jobs are in the analyzed occupation.

bls_counties["Suitable Employment"] = (bls_counties["location_quotient"] >= 1)

axis = bls_counties.plot("Suitable Employment", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Map of counties meeting employment criteria

Housing Affordability

Suitable counties will have an affordability index of 100 or greater.

bls_counties["Suitable Housing"] = (bls_counties["affordability"] >= 100)

axis = bls_counties.plot("Suitable Housing", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Map of counties meeting housing criteria

Family Proximity

For proximity to family we choose an arbitrary maximum of 400 kilometers (around 250 miles).

bls_counties["Suitable Family"] = (bls_counties["family_distance"] < 400)

axis = bls_counties.plot("Suitable Family", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Map of counties within 250 miles of family

Composite Criteria

Finally, we can use logical AND operator (&) to determine counties that meet all criteria.

For each county, the Categorical_Suitability is set to true only when employment is suitable AND housing is suitable AND distance to family is suitable.

bls_counties["Categorical Suitability"] = bls_counties["Suitable Employment"] & \
                bls_counties["Suitable Housing"] & bls_counties["Suitable Family"]

axis = bls_counties.plot("Categorical Suitability", cmap="coolwarm_r",
        legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Map of counties satisfying employment, housing, and family categorical criteria

You can display the list of suitable counties by subsetting rows based on the suitability column.

list_counties = bls_counties[bls_counties["Categorical Suitability"]]

list_counties[["Name", "ST"]]
	Name 		ST
0 	Quitman 	MS
1 	Tallahatchie 	MS
2 	Chicot 		AR
3 	Polk 		TX
4 	Choctaw 	MS
... 	... 	...
1050 	Stewart 	TN
1064 	Madison 	TN
1067 	Gladwin 	MI
1086 	Chase 		KS
1115 	St. James 	LA

Weighted Suitability Analysis

One challenge with categorical suitability is that it only shows which areas meet all minimum criteria, but does not indicate which of the possible options are the best.

Weighted suitability analysis involves the creation of a linear model that is a weighted sum of the different criteria (ESRI 2022).

Y = Β1x1 + Β2x2 ...

Criteria values are normalized to a range of zero to one, and the total of the weights is one (100%), so Final_Score is in a range of zero to one, with higher values representing higher levels of suitability.

Suitability analysis with this kind of calculation can be thought of as the opposite of multiple regression. Instead of starting with an existing dependent variable (Y) and calculating coefficients (Β) that maximize the fit of the dependent variable to the equation, in suitability analysis you specify coefficients (weights) and then use the calculated values of the dependent variable to identify the most suitable areas.

Normalized Criteria

In order to be able to use comparable weights with the different criteria, we create normalized values where all three criteria are in a range of zero to one.

bls_counties["employment_index"] = (bls_counties["employment"] - min(bls_counties["employment"])) / \
			(max(bls_counties["employment"]) - min(bls_counties["employment"]))

bls_counties["affordability_index"] = (bls_counties["affordability"] - min(bls_counties["affordability"])) / \
			(max(bls_counties["affordability"]) - min(bls_counties["affordability"]))

bls_counties["family_index"] = (max(bls_counties["family_distance"]) - bls_counties["family_distance"]) / \
			(max(bls_counties["family_distance"]) - min(bls_counties["family_distance"]))

bls_counties[["employment_index", "affordability_index", "family_index"]].head()
      employment_index  affordability_index  family_index
0             0.011952             1.000000      0.926797
1             0.011952             0.971591      0.921884
2             0.011952             0.919804      0.970642
3             0.011952             0.906550      0.984103
4             1.000000             0.453178      0.893672

We can verify normalization with the Pandas describe() method that shows the ranges and distributions for the normalized variables.

bls_counties[["employment_index", "affordability_index", "family_index"]].describe()
	employment_index 	affordability_index 	family_index
count 	2872.000000 		2872.000000 		2872.000000
mean 	0.057960 		0.384476		0.718761
std 	0.125077		0.135012		0.206112
min 	0.000000		0.000000		0.000000
25% 	0.003984		0.296570		0.641421
50% 	0.011952		0.377626		0.756629
75% 	0.051793		0.467164		0.868047
max 	1.000000		1.000000		1.000000

Composite Score

The Weighted Suitability attribute in the suitability analysis layer is the model output for suitability on a scale of zero to one, with higher values representing higher suitability.

Weights are used to indicate which criteria are more important than others. These values should sum to 1.0.

bls_counties["Weighted Suitability"] = (0.4 * bls_counties["employment_index"]) + \
		(0.2 * bls_counties["affordability_index"]) + \
		(0.4 * bls_counties["family_index"])

bls_counties = bls_counties.sort_values("Weighted Suitability", ascending=False).reset_index(drop=True)

bls_counties[["Name", "ST", "Weighted Suitability"]].head()
	Name 		ST 	Weighted Suitability
0 	Polk 		TX 	0.848104
1 	Liberty 	TX 	0.824499
2 	Hardin 		TX 	0.821633
3 	Jefferson 	TX 	0.817298
4 	Harris 		TX 	0.788002
axis = bls_counties.plot("Weighted Suitability", cmap="coolwarm_r",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()
Figure
Weighted suitability map