Suitability Analysis in Python

Businesses and planning organizations frequently need to consider multiple criteria when deciding where to construct or open new facilities. These decisions can cover geographic scales from individual cities or states to whole nations, or potential locations around the world.

Suitability analysis is the quantification, comparison, and ranking of of locations based on how closely they adhere to specified criteria (ESRI 2023).

Criteria used in this tutorial for suitability analysis include:

Location quotients for employment in the analysis profession
Housing affordability relative to median salary in the analysis profession
Proximity to family

This tutorial will cover the use of Python to perform suitability analysis for evaluating which US counties are most desirable for a given profession.

Criteria Data: BLS Occupation Data

The Occupational Outlook Handbook (OOH) provides information on " what workers do, the work environment, education, training, and other qualifications, pay, the job outlook, information on state and area data, similar occupations, and sources of additional information for more than 300 occupational profiles covering about 4 out of 5 jobs in the economy" (BLS 2022).

The A-Z index is a good place to start looking for information on a specific occupation.

BLS Occupational Employment and Wage Statistics

The US Bureau of Labor Statistics' Occupational Employment and Wage Statistics (OEWS) program uses surveys to produces employment and wage estimates annually for approximately 830 occupations. Occupational groups are numbered with Standard Occupational Classification (SOC) codes.

The US Bureau of Labor Statistics occupation category used as the example for this tutorial will be 17-3031 Surveying and Mapping Technicians.

The smallest area of aggregation available from the OEWS are metropolitan statistical areas, each of which represents "a core area containing a substantial population nucleus, together with adjacent communities having a high degree of economic and social integration with that core" (USCB 2021). Although some of these areas are spatially large, workers commonly commute within MSAs, which makes them useful for analyzing spatial concentrations of jobs in particular occupations.

Go to the Occupational Employment and Wage Statistics Query System page.
Select One occupation for multiple geographic areas.
Select one occupation (Architecture and Engineering Occupations, 17-3031 Surveying and Mapping Technicians)
Select a geographic type: Metropolitan or Non Metropolitan Area
Select one or more areas: All MSA in this list
Select one or more datatypes: All data types
Select one or more release dates: Use the most recent release
Select an output type: Excel

Downloading occupation data from the BLS

Spreadsheet Import

The spreadsheet is formatted for display and will need some cleanup. This code and the subsequent join code are examples of the complex code commonly needed to access open data sets that are distributed in formats chosen without consideration for interoperability.

The BLS spreadsheet can be imported using the pandas.read_excel() function.
skipfows=5 is needed to skip the informational text at the top of the sheet.
The str.extract() regular expression function in needed to extract the seven-digit MSA code from the name field.
The numeric columns are initially imported as text, and the pandas.to_numeric() function is used to convert them to numeric values.
Some values are null, notably the lines at the bottom of the spreadsheet that are informational lines rather than data rows.
The whole data frame is sorted in order to list the highest and lowest MSAs by employment.

import pandas

import geopandas

bls = pandas.read_excel("https://michaelminn.net/tutorials/python-suitability/2022-05-surveying-mapping-tech.xlsx", skiprows=5)

bls["msa7"] = bls["Area Name"].str.extract(r'(\d{7})')

bls = bls[["msa7", "Area Name", "Employment(1)", "Annual median wage(2)", 
		"Employment per 1,000 jobs", "Location Quotient"]]

bls.columns = ["msa7", "msa_name", "employment", "median_wage", "employment_per_1k", "location_quotient"]

for column in ["employment", "median_wage", "employment_per_1k", "location_quotient"]:
	bls[column] = pandas.to_numeric(bls[column], errors="coerce")

bls = bls.dropna()

bls = bls.sort_values("employment", ascending=False).reset_index(drop=True)

bls[["msa_name", "employment"]].head()

                                                  name  employment
0        Houston-The Woodlands-Sugar Land, TX(0026420)      2540.0
1             Dallas-Fort Worth-Arlington, TX(0019100)      1990.0
2                  Denver-Aurora-Lakewood, CO(0019740)      1890.0
3       New York-Newark-Jersey City, NY-NJ-PA(0035620)      1660.0
4           Atlanta-Sandy Springs-Roswell, GA(0012060)      1640.0

BLS MSA Polygon Data

While the data has MSA place names, in order to create choropleths of this data, you will need to join the table with data from a shapefile of MSA boundaries that you can also download from the BLS.

The 2019 zipped shapefile is available from the link in the code below. You can also download the BLS MSA shapefile from the OES FAQ page. You will need to click Expand All at the top of the page to see the links under the shapefile questions.
Note that the BLS MSA areas appear to be slightly different from the MSA areas used by the US Census Bureau.
Also note that the seven-digit msa7 code in the shapefile is missing the leading zeroes used in the BLS codes, so they are added below using the Pandas zfill() method.

A location quotient is an economic metric calculated by dividing a local value by the overall national value as a means of assessing local strength or weakness in a specific economic characteristic like employment or economic activity. This example maps the location quotient for employment in this analysis occupation calculated by dividing the MSA rate of employment in this occupation divided by the overall national rate of employment in this occupation.

A location quotient of one indicates an area that employs people in the occupation at the national rate.
A value less than one indicates fewer employees in the occupation relative to the national rate.
A value greater than one indicates more employees in the occupation relative to the national rate.
Areas with higher location quotients would presumably have more job opportunities in a particular occupation than similarly-sized areas with lower location quotients.
For this occupation, location quotients are notably higher in areas of the West and Appalachia with large natural resource industries.

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [9, 6]

msa = geopandas.read_file("https://michaelminn.net/tutorials/python-suitability/2019-msashape2.zip")

msa["msa7"] = msa["msa7"].str.zfill(7)

bls_msa = msa.merge(bls, on="msa7", how="left")

bls_msa["location_quotient"] = bls_msa["location_quotient"].fillna(0)

axis = bls_msa.plot("location_quotient", cmap="coolwarm",
	legend=True, scheme="quantiles")

axis.set_axis_off()

plt.show()

Choropleth of 2021 mapping and surveying technician location quotients by MSA

Criteria Data: ACS Homeowner Costs

We can get housing cost data for counties across the country from the US Census Bureau's American Community Survey.

The American Community Survey (ACS) is an ongoing survey that provides information on an annual basis about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.
For this example, we use the GeoPandas read_file() function to load an existing GeoJSON file of county-level 2015-2019 ACS data.
To simplify mapping and analysis, we subset to consider only the 48 continental states.
The homeowner cost variable we will use is Median Monthly Mortgage, the median selected monthly owner costs in dollars for housing units with a mortgage. Current values for this variable are available from the USCB in the B25088 median selected monthly owner costs (dollars) by mortgage status table.
To clarify which counties are in which states we overlay the polygons of state boundaries.
We use the to_crs() method to reproject the counties to a North America Albers Equal Area Conic projection that gives a clearer cartographic representation of the US and which uses a coordinate system in meters for more accurate measurement of relative distances.

counties = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson")

counties = counties[~counties["ST"].isin(['AK', 'HI', 'PR'])]

counties = counties[["Name", "ST", "FactFinder GEOID", "Median Monthly Mortgage", "geometry"]]

counties = counties.to_crs("ESRI:102008")

states = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson")

states = states[~states["ST"].isin(['AK', 'HI', 'PR'])]

states = states.to_crs(counties.crs)

axis = counties.plot("Median Monthly Mortgage", cmap="coolwarm",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()

2015-2019 median monthly mortgage costs by county

Spatial Join Occupation and Housing Data

In order to analyze housing and occupational data, we need to join that data together.

A spatial join is a join where data from a join data set is copied into a target data set based on proximity of features in the two data sets.

Because MSAs and counties are different types of areas where an attribute join is not possible, we will do a spatial join to transfer data from the MSAs to the counties that make up those MSAs.

The GeoPandas sjoin() function finds the MSA (if any) that intersects each county.

bls_msa = bls_msa.to_crs(counties.crs)

bls_counties = counties[["FactFinder GEOID", "Name", "ST", "Median Monthly Mortgage", "geometry"]]

bls_counties = bls_counties.sjoin(bls_msa, how="left")

bls_counties = bls_counties.drop("index_right", axis=1)

axis = bls_counties.plot("median_wage", cmap="coolwarm_r", 
	legend=True, scheme="quantiles",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()

Housing Affordability Index

Housing costs tend to be higher in urban areas than in rural areas, although incomes also tend to be higher in urban areas than in rural areas, so housing affordability analysis needs to consider not only the absolute costs of owning a home in different areas, but also the typical income of homeowners in those areas.

A typical rule of thumb is that total housing costs with a mortgage should consume no more than 35% to 45% of your pre-tax income (Muller 2023).

In the 2021 ACS, US median household income was $68,717 and monthly US homeowner costs (mortgage payments + taxes, insurance, utilities, etc.) were $1,672 ($20,064 annually), which indicates housing costs typically consume 29% of income.

This code creates an affordability index where an index of 100 indicates the normal percentage of income going to housing, values above 100 reflect affordability better than normal across the US, and values below 100 reflect lower affordability than normal.

bls_counties["affordability"] = 100 * (bls_counties["median_wage"] * 0.29) / (bls_counties["Median Monthly Mortgage"] * 12)

bls_counties = bls_counties.dropna()

bls_counties = bls_counties.sort_values("affordability", ascending=False).reset_index(drop=True)

bls_counties[["Name", "ST", "affordability"]].head()

	Name 		ST 	affordability
0 	Quitman 	MS 	196.634146
1 	Tallahatchie 	MS 	192.181168
2 	Calhoun 	MS 	184.696449
3 	Choctaw 	MS 	184.274286
4 	Chicot 		AR 	184.063927

bls_counties[["Name", "ST", "affordability"]].tail()

	Name 		ST 	affordability
2867 	Arlington 	VA 	45.283405
2868 	Westchester 	NY 	45.042970
2869 	Morris 		NJ 	44.900000
2870 	Monroe 		FL 	44.705143
2871 	Falls Church 	VA 	39.891262

axis = bls_counties.plot("affordability", cmap="coolwarm_r",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()

Criteria Data: Family Points

Beyond professional and economic considerations, two major factors affecting residential choice include proximity to family and proximity to lifestyle locations, such as recreational activities or particularly desirable cities.

Points can vary in importance. For example, when elderly family members need caregiving, if moving them to you is not an option, you will need to move to them. Therefore, high importance point(s) should be in a separate layer so they can be prioritized.

For this example, we will use five midwestern points for family members.

family = [ ["Chicago", 41.87906573068059, -87.6359546674838],
	["Rolla", 37.95434820224575, -91.77403115678084],
	["Conway", 35.07797668119237, -92.45756175631256],
	["Monroe", 32.52867100392165, -92.07356205299774],
	["St. Louis", 38.63948197166543, -90.26258891961275] ]

family = pandas.DataFrame(family, columns=["City", "Latitude", "Longitude"])

geometry = geopandas.points_from_xy(family["Longitude"], family["Latitude"])

family = geopandas.GeoDataFrame(family, geometry = geometry, crs="EPSG:4326")

family.info()

RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   City       5 non-null      object  
 1   Latitude   5 non-null      float64 
 2   Longitude  5 non-null      float64 
 3   geometry   5 non-null      geometry
dtypes: float64(2), geometry(1), object(1)

The GeoPandas sjoin_nearest() function is used to find the distance from each county to the nearest family member.

The Pandas drop_duplicates() method is needed because sjoin_nearest() results "will include multiple output records for a single input record where there are multiple equidistant nearest or intersected neighbors."

family = family.to_crs(bls_counties.crs)

bls_counties = bls_counties.sjoin_nearest(family, how="left", distance_col="family_distance")

bls_counties["family_distance"] = bls_counties["family_distance"] / 1000

bls_counties = bls_counties.drop_duplicates("FactFinder GEOID")

bls_counties = bls_counties.drop("index_right", axis=1)

bls_counties = bls_counties.sort_values("family_distance").reset_index(drop=True)

bls_counties[["Name", "ST", "family_distance"]].head()

Mapping the calculated distance shows a clear halo around the midwestern locations.

axis = bls_counties.plot("family_distance", cmap="coolwarm",
	legend=True, scheme="quantiles",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()

Categorical Suitability Analysis

Categorical suitability analysis involves the selection of areas based on satisfaction of specific categorical conditions, usually involving thresholds for quantitative values.

For this example we will select counties suitable to be considered for a job search based on available employment in the specified occupation, housing costs that are affordable on the regional salary in that occupation, and proximity to family.

Employment

Suitable counties will have a location quotient of one or higher indicating that a normal or greater percentage of the available jobs are in the analyzed occupation.

bls_counties["Suitable Employment"] = (bls_counties["location_quotient"] >= 1)

axis = bls_counties.plot("Suitable Employment", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()

Map of counties meeting employment criteria

Housing Affordability

Suitable counties will have an affordability index of 100 or greater.

bls_counties["Suitable Housing"] = (bls_counties["affordability"] >= 100)

axis = bls_counties.plot("Suitable Housing", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

axis.set_axis_off()

plt.show()

Map of counties meeting housing criteria

Family Proximity

For proximity to family we choose an arbitrary maximum of 400 kilometers (around 250 miles).

bls_counties["Suitable Family"] = (bls_counties["family_distance"] < 400)

axis = bls_counties.plot("Suitable Family", cmap="coolwarm_r",
	legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()

Map of counties within 250 miles of family

Composite Criteria

Finally, we can use logical AND operator (&) to determine counties that meet all criteria.

For each county, the Categorical_Suitability is set to true only when employment is suitable AND housing is suitable AND distance to family is suitable.

bls_counties["Categorical Suitability"] = bls_counties["Suitable Employment"] & \
                bls_counties["Suitable Housing"] & bls_counties["Suitable Family"]

axis = bls_counties.plot("Categorical Suitability", cmap="coolwarm_r",
        legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()

Map of counties satisfying employment, housing, and family categorical criteria

You can display the list of suitable counties by subsetting rows based on the suitability column.

list_counties = bls_counties[bls_counties["Categorical Suitability"]]

list_counties[["Name", "ST"]]

	Name 		ST
0 	Quitman 	MS
1 	Tallahatchie 	MS
2 	Chicot 		AR
3 	Polk 		TX
4 	Choctaw 	MS
... 	... 	...
1050 	Stewart 	TN
1064 	Madison 	TN
1067 	Gladwin 	MI
1086 	Chase 		KS
1115 	St. James 	LA

Weighted Suitability Analysis

One challenge with categorical suitability is that it only shows which areas meet all minimum criteria, but does not indicate which of the possible options are the best.

Weighted suitability analysis involves the creation of a linear model that is a weighted sum of the different criteria (ESRI 2022).

Y = Β₁x₁ + Β₂x_{2_...}

Y is the final score
Β_{1 - n} are the weights
x_{1 - n} are the criteria

Criteria values are normalized to a range of zero to one, and the total of the weights is one (100%), so Final_Score is in a range of zero to one, with higher values representing higher levels of suitability.

Suitability analysis with this kind of calculation can be thought of as the opposite of multiple regression. Instead of starting with an existing dependent variable (Y) and calculating coefficients (Β) that maximize the fit of the dependent variable to the equation, in suitability analysis you specify coefficients (weights) and then use the calculated values of the dependent variable to identify the most suitable areas.

Normalized Criteria

In order to be able to use comparable weights with the different criteria, we create normalized values where all three criteria are in a range of zero to one.

bls_counties["employment_index"] = (bls_counties["employment"] - min(bls_counties["employment"])) / \
			(max(bls_counties["employment"]) - min(bls_counties["employment"]))

bls_counties["affordability_index"] = (bls_counties["affordability"] - min(bls_counties["affordability"])) / \
			(max(bls_counties["affordability"]) - min(bls_counties["affordability"]))

bls_counties["family_index"] = (max(bls_counties["family_distance"]) - bls_counties["family_distance"]) / \
			(max(bls_counties["family_distance"]) - min(bls_counties["family_distance"]))

bls_counties[["employment_index", "affordability_index", "family_index"]].head()

      employment_index  affordability_index  family_index
0             0.011952             1.000000      0.926797
1             0.011952             0.971591      0.921884
2             0.011952             0.919804      0.970642
3             0.011952             0.906550      0.984103
4             1.000000             0.453178      0.893672

We can verify normalization with the Pandas describe() method that shows the ranges and distributions for the normalized variables.

bls_counties[["employment_index", "affordability_index", "family_index"]].describe()

	employment_index 	affordability_index 	family_index
count 	2872.000000 		2872.000000 		2872.000000
mean 	0.057960 		0.384476		0.718761
std 	0.125077		0.135012		0.206112
min 	0.000000		0.000000		0.000000
25% 	0.003984		0.296570		0.641421
50% 	0.011952		0.377626		0.756629
75% 	0.051793		0.467164		0.868047
max 	1.000000		1.000000		1.000000

Composite Score

The Weighted Suitability attribute in the suitability analysis layer is the model output for suitability on a scale of zero to one, with higher values representing higher suitability.

Weights are used to indicate which criteria are more important than others. These values should sum to 1.0.

bls_counties["Weighted Suitability"] = (0.4 * bls_counties["employment_index"]) + \
		(0.2 * bls_counties["affordability_index"]) + \
		(0.4 * bls_counties["family_index"])

bls_counties = bls_counties.sort_values("Weighted Suitability", ascending=False).reset_index(drop=True)

bls_counties[["Name", "ST", "Weighted Suitability"]].head()

	Name 		ST 	Weighted Suitability
0 	Polk 		TX 	0.848104
1 	Liberty 	TX 	0.824499
2 	Hardin 		TX 	0.821633
3 	Jefferson 	TX 	0.817298
4 	Harris 		TX 	0.788002

axis = bls_counties.plot("Weighted Suitability", cmap="coolwarm_r",
	legend=True, scheme="naturalbreaks",
        legend_kwds={"bbox_to_anchor":(0.2, 0.3)})

states.plot(edgecolor="#00000040", facecolor="none", ax=axis)

family.plot(marker="*", color="black", ax=axis)

axis.set_axis_off()

plt.show()