Suitability Analysis in Python
Businesses and planning organizations frequently need to consider multiple criteria when deciding where to construct or open new facilities. These decisions can cover geographic scales from individual cities or states to whole nations, or potential locations around the world.
Suitability analysis is the quantification, comparison, and ranking of of locations based on how closely they adhere to specified criteria (ESRI 2023).
Criteria used in this tutorial for suitability analysis include:
- Location quotients for employment in the analysis profession
- Housing affordability relative to median salary in the analysis profession
- Proximity to family
This tutorial will cover the use of Python to perform suitability analysis for evaluating which US counties are most desirable for a given profession.
Criteria Data: BLS Occupation Data
The Occupational Outlook Handbook (OOH) provides information on " what workers do, the work environment, education, training, and other qualifications, pay, the job outlook, information on state and area data, similar occupations, and sources of additional information for more than 300 occupational profiles covering about 4 out of 5 jobs in the economy" (BLS 2022).
The A-Z index is a good place to start looking for information on a specific occupation.
BLS Occupational Employment and Wage Statistics
The US Bureau of Labor Statistics' Occupational Employment and Wage Statistics (OEWS) program uses surveys to produces employment and wage estimates annually for approximately 830 occupations. Occupational groups are numbered with Standard Occupational Classification (SOC) codes.
The US Bureau of Labor Statistics occupation category used as the example for this tutorial will be 17-3031 Surveying and Mapping Technicians.
The smallest area of aggregation available from the OEWS are metropolitan statistical areas, each of which represents "a core area containing a substantial population nucleus, together with adjacent communities having a high degree of economic and social integration with that core" (USCB 2021). Although some of these areas are spatially large, workers commonly commute within MSAs, which makes them useful for analyzing spatial concentrations of jobs in particular occupations.
- Go to the Occupational Employment and Wage Statistics Query System page.
- Select One occupation for multiple geographic areas.
- Select one occupation (Architecture and Engineering Occupations, 17-3031 Surveying and Mapping Technicians)
- Select a geographic type: Metropolitan or Non Metropolitan Area
- Select one or more areas: All MSA in this list
- Select one or more datatypes: All data types
- Select one or more release dates: Use the most recent release
- Select an output type: Excel
Spreadsheet Import
The spreadsheet is formatted for display and will need some cleanup. This code and the subsequent join code are examples of the complex code commonly needed to access open data sets that are distributed in formats chosen without consideration for interoperability.
- The BLS spreadsheet can be imported using the pandas.read_excel() function.
- skipfows=5 is needed to skip the informational text at the top of the sheet.
- The str.extract() regular expression function in needed to extract the seven-digit MSA code from the name field. The column names are renamed because they are long and have unnecessary footnote marks.
- The numeric columns are initially imported as text, and the pandas.to_numeric() function is used to convert them to numeric values.
- Some values are null, notably the lines at the bottom of the spreadsheet that are informational lines rather than data rows.
- The whole data frame is sorted in order to list the highest and lowest MSAs by employment.
import pandas import geopandas bls = pandas.read_excel("https://michaelminn.net/tutorials/python-suitability/2022-05-surveying-mapping-tech.xlsx", skiprows=5) bls["msa7"] = bls["Area Name"].str.extract(r'(\d{7})') bls = bls[["msa7", "Area Name", "Employment(1)", "Annual median wage(2)", "Employment per 1,000 jobs", "Location Quotient"]] bls.columns = ["msa7", "msa_name", "employment", "median_wage", "employment_per_1k", "location_quotient"] for column in ["employment", "median_wage", "employment_per_1k", "location_quotient"]: bls[column] = pandas.to_numeric(bls[column], errors="coerce") bls = bls.dropna() bls = bls.sort_values("employment", ascending=False).reset_index(drop=True) bls[["msa_name", "employment"]].head()
name employment 0 Houston-The Woodlands-Sugar Land, TX(0026420) 2540.0 1 Dallas-Fort Worth-Arlington, TX(0019100) 1990.0 2 Denver-Aurora-Lakewood, CO(0019740) 1890.0 3 New York-Newark-Jersey City, NY-NJ-PA(0035620) 1660.0 4 Atlanta-Sandy Springs-Roswell, GA(0012060) 1640.0
BLS MSA Polygon Data
While the data has MSA place names, in order to create choropleths of this data, you will need to join the table with data from a shapefile of MSA boundaries that you can also download from the BLS.
- The 2019 zipped shapefile is available from the link in the code below. You can also download the BLS MSA shapefile from the OES FAQ page. You will need to click Expand All at the top of the page to see the links under the shapefile questions.
- Note that the BLS MSA areas appear to be slightly different from the MSA areas used by the US Census Bureau.
- Also note that the seven-digit msa7 code in the shapefile is missing the leading zeroes used in the BLS codes, so they are added below using the Pandas zfill() method.
A location quotient is an economic metric calculated by dividing a local value by the overall national value as a means of assessing local strength or weakness in a specific economic characteristic like employment or economic activity. This example maps the location quotient for employment in this analysis occupation calculated by dividing the MSA rate of employment in this occupation divided by the overall national rate of employment in this occupation.
- A location quotient of one indicates an area that employs people in the occupation at the national rate.
- A value less than one indicates fewer employees in the occupation relative to the national rate.
- A value greater than one indicates more employees in the occupation relative to the national rate.
- Areas with higher location quotients would presumably have more job opportunities in a particular occupation than similarly-sized areas with lower location quotients.
- For this occupation, location quotients are notably higher in areas of the West and Appalachia with large natural resource industries.
import matplotlib.pyplot as plt plt.rcParams['figure.figsize'] = [9, 6] msa = geopandas.read_file("https://michaelminn.net/tutorials/python-suitability/2019-msashape2.zip") msa["msa7"] = msa["msa7"].str.zfill(7) bls_msa = msa.merge(bls, on="msa7", how="left") bls_msa["location_quotient"] = bls_msa["location_quotient"].fillna(0) axis = bls_msa.plot("location_quotient", cmap="coolwarm", legend=True, scheme="quantiles") axis.set_axis_off() plt.show()
Criteria Data: ACS Homeowner Costs
We can get housing cost data for counties across the country from the US Census Bureau's American Community Survey.
- The American Community Survey (ACS) is an ongoing survey that provides information on an annual basis about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.
- For this example, we use the GeoPandas read_file() function to load an existing GeoJSON file of county-level 2015-2019 ACS data.
- To simplify mapping and analysis, we subset to consider only the 48 continental states.
- The homeowner cost variable we will use is Median Monthly Mortgage, the median selected monthly owner costs in dollars for housing units with a mortgage. Current values for this variable are available from the USCB in the B25088 median selected monthly owner costs (dollars) by mortgage status table.
- To clarify which counties are in which states we overlay the polygons of state boundaries.
- We use the to_crs() method to reproject the counties to a North America Albers Equal Area Conic projection that gives a clearer cartographic representation of the US and which uses a coordinate system in meters for more accurate measurement of relative distances.
counties = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-counties.geojson") counties = counties[~counties["ST"].isin(['AK', 'HI', 'PR'])] counties = counties[["Name", "ST", "FactFinder GEOID", "Median Monthly Mortgage", "geometry"]] counties = counties.to_crs("ESRI:102008") states = geopandas.read_file("https://michaelminn.net/tutorials/data/2015-2019-acs-states.geojson") states = states[~states["ST"].isin(['AK', 'HI', 'PR'])] states = states.to_crs(counties.crs) axis = counties.plot("Median Monthly Mortgage", cmap="coolwarm", legend=True, scheme="naturalbreaks", legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) axis.set_axis_off() plt.show()
Spatial Join Occupation and Housing Data
In order to analyze housing and occupational data, we need to join that data together.
A spatial join is a join where data from a join data set is copied into a target data set based on proximity of features in the two data sets.
Because MSAs and counties are different types of areas where an attribute join is not possible, we will do a spatial join to transfer data from the MSAs to the counties that make up those MSAs.
- The GeoPandas sjoin() function finds the MSA (if any) that intersects each county.
bls_msa = bls_msa.to_crs(counties.crs) bls_counties = counties[["FactFinder GEOID", "Name", "ST", "Median Monthly Mortgage", "geometry"]] bls_counties = bls_counties.sjoin(bls_msa, how="left") bls_counties = bls_counties.drop("index_right", axis=1) axis = bls_counties.plot("median_wage", cmap="coolwarm_r", legend=True, scheme="quantiles", legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) axis.set_axis_off() plt.show()
Housing Affordability Index
Housing costs tend to be higher in urban areas than in rural areas, although incomes also tend to be higher in urban areas than in rural areas, so housing affordability analysis needs to consider not only the absolute costs of owning a home in different areas, but also the typical income of homeowners in those areas.
A typical rule of thumb is that total housing costs with a mortgage should consume no more than 35% to 45% of your pre-tax income (Muller 2023).
In the 2021 ACS, US median household income was $68,717 and monthly US homeowner costs (mortgage payments + taxes, insurance, utilities, etc.) were $1,672 ($20,064 annually), which indicates housing costs typically consume 29% of income.
This code creates an affordability index where an index of 100 indicates the normal percentage of income going to housing, values above 100 reflect affordability better than normal across the US, and values below 100 reflect lower affordability than normal.
bls_counties["affordability"] = 100 * (bls_counties["median_wage"] * 0.29) / (bls_counties["Median Monthly Mortgage"] * 12) bls_counties = bls_counties.dropna() bls_counties = bls_counties.sort_values("affordability", ascending=False).reset_index(drop=True) bls_counties[["Name", "ST", "affordability"]].head()
Name ST affordability 0 Quitman MS 196.634146 1 Tallahatchie MS 192.181168 2 Calhoun MS 184.696449 3 Choctaw MS 184.274286 4 Chicot AR 184.063927
bls_counties[["Name", "ST", "affordability"]].tail()
Name ST affordability 2867 Arlington VA 45.283405 2868 Westchester NY 45.042970 2869 Morris NJ 44.900000 2870 Monroe FL 44.705143 2871 Falls Church VA 39.891262
axis = bls_counties.plot("affordability", cmap="coolwarm_r", legend=True, scheme="naturalbreaks", legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) axis.set_axis_off() plt.show()
Criteria Data: Family Points
Beyond professional and economic considerations, two major factors affecting residential choice include proximity to family and proximity to lifestyle locations, such as recreational activities or particularly desirable cities.
Points can vary in importance. For example, when elderly family members need caregiving, if moving them to you is not an option, you will need to move to them. Therefore, high importance point(s) should be in a separate layer so they can be prioritized.
For this example, we will use five midwestern points for family members.
family = [ ["Chicago", 41.87906573068059, -87.6359546674838], ["Rolla", 37.95434820224575, -91.77403115678084], ["Conway", 35.07797668119237, -92.45756175631256], ["Monroe", 32.52867100392165, -92.07356205299774], ["St. Louis", 38.63948197166543, -90.26258891961275] ] family = pandas.DataFrame(family, columns=["City", "Latitude", "Longitude"]) geometry = geopandas.points_from_xy(family["Longitude"], family["Latitude"]) family = geopandas.GeoDataFrame(family, geometry = geometry, crs="EPSG:4326") family.info()
RangeIndex: 5 entries, 0 to 4 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 City 5 non-null object 1 Latitude 5 non-null float64 2 Longitude 5 non-null float64 3 geometry 5 non-null geometry dtypes: float64(2), geometry(1), object(1)
The GeoPandas sjoin_nearest() function is used to find the distance from each county to the nearest family member.
The Pandas drop_duplicates() method is needed because sjoin_nearest() results "will include multiple output records for a single input record where there are multiple equidistant nearest or intersected neighbors."
family = family.to_crs(bls_counties.crs) bls_counties = bls_counties.sjoin_nearest(family, how="left", distance_col="family_distance") bls_counties["family_distance"] = bls_counties["family_distance"] / 1000 bls_counties = bls_counties.drop_duplicates("FactFinder GEOID") bls_counties = bls_counties.drop("index_right", axis=1) bls_counties = bls_counties.sort_values("family_distance").reset_index(drop=True) bls_counties[["Name", "ST", "family_distance"]].head()
Mapping the calculated distance shows a clear halo around the midwestern locations.
axis = bls_counties.plot("family_distance", cmap="coolwarm", legend=True, scheme="quantiles", legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) family.plot(marker="*", color="black", ax=axis) axis.set_axis_off() plt.show()
Categorical Suitability Analysis
Categorical suitability analysis involves the selection of areas based on satisfaction of specific categorical conditions, usually involving thresholds for quantitative values.
For this example we will select counties suitable to be considered for a job search based on available employment in the specified occupation, housing costs that are affordable on the regional salary in that occupation, and proximity to family.
Employment
Suitable counties will have a location quotient of one or higher indicating that a normal or greater percentage of the available jobs are in the analyzed occupation.
bls_counties["Suitable Employment"] = (bls_counties["location_quotient"] >= 1) axis = bls_counties.plot("Suitable Employment", cmap="coolwarm_r", legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) axis.set_axis_off() plt.show()
Housing Affordability
Suitable counties will have an affordability index of 100 or greater.
bls_counties["Suitable Housing"] = (bls_counties["affordability"] >= 100) axis = bls_counties.plot("Suitable Housing", cmap="coolwarm_r", legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) axis.set_axis_off() plt.show()
Family Proximity
For proximity to family we choose an arbitrary maximum of 400 kilometers (around 250 miles).
bls_counties["Suitable Family"] = (bls_counties["family_distance"] < 400) axis = bls_counties.plot("Suitable Family", cmap="coolwarm_r", legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) family.plot(marker="*", color="black", ax=axis) axis.set_axis_off() plt.show()
Composite Criteria
Finally, we can use logical AND operator (&) to determine counties that meet all criteria.
For each county, the Categorical_Suitability is set to true only when employment is suitable AND housing is suitable AND distance to family is suitable.
bls_counties["Categorical Suitability"] = bls_counties["Suitable Employment"] & \ bls_counties["Suitable Housing"] & bls_counties["Suitable Family"] axis = bls_counties.plot("Categorical Suitability", cmap="coolwarm_r", legend=True, legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) family.plot(marker="*", color="black", ax=axis) axis.set_axis_off() plt.show()
You can display the list of suitable counties by subsetting rows based on the suitability column.
list_counties = bls_counties[bls_counties["Categorical Suitability"]] list_counties[["Name", "ST"]]
Name ST 0 Quitman MS 1 Tallahatchie MS 2 Chicot AR 3 Polk TX 4 Choctaw MS ... ... ... 1050 Stewart TN 1064 Madison TN 1067 Gladwin MI 1086 Chase KS 1115 St. James LA
Weighted Suitability Analysis
One challenge with categorical suitability is that it only shows which areas meet all minimum criteria, but does not indicate which of the possible options are the best.
Weighted suitability analysis involves the creation of a linear model that is a weighted sum of the different criteria (ESRI 2022).
Y = Β1x1 + Β2x2 ...
- Y is the final score
- Β1 - n are the weights
- x1 - n are the criteria
Criteria values are normalized to a range of zero to one, and the total of the weights is one (100%), so Final_Score is in a range of zero to one, with higher values representing higher levels of suitability.
Suitability analysis with this kind of calculation can be thought of as the opposite of multiple regression. Instead of starting with an existing dependent variable (Y) and calculating coefficients (Β) that maximize the fit of the dependent variable to the equation, in suitability analysis you specify coefficients (weights) and then use the calculated values of the dependent variable to identify the most suitable areas.
Normalized Criteria
In order to be able to use comparable weights with the different criteria, we create normalized values where all three criteria are in a range of zero to one.
bls_counties["employment_index"] = (bls_counties["employment"] - min(bls_counties["employment"])) / \ (max(bls_counties["employment"]) - min(bls_counties["employment"])) bls_counties["affordability_index"] = (bls_counties["affordability"] - min(bls_counties["affordability"])) / \ (max(bls_counties["affordability"]) - min(bls_counties["affordability"])) bls_counties["family_index"] = (max(bls_counties["family_distance"]) - bls_counties["family_distance"]) / \ (max(bls_counties["family_distance"]) - min(bls_counties["family_distance"])) bls_counties[["employment_index", "affordability_index", "family_index"]].head()
employment_index affordability_index family_index 0 0.011952 1.000000 0.926797 1 0.011952 0.971591 0.921884 2 0.011952 0.919804 0.970642 3 0.011952 0.906550 0.984103 4 1.000000 0.453178 0.893672
We can verify normalization with the Pandas describe() method that shows the ranges and distributions for the normalized variables.
bls_counties[["employment_index", "affordability_index", "family_index"]].describe()
employment_index affordability_index family_index count 2872.000000 2872.000000 2872.000000 mean 0.057960 0.384476 0.718761 std 0.125077 0.135012 0.206112 min 0.000000 0.000000 0.000000 25% 0.003984 0.296570 0.641421 50% 0.011952 0.377626 0.756629 75% 0.051793 0.467164 0.868047 max 1.000000 1.000000 1.000000
Composite Score
The Weighted Suitability attribute in the suitability analysis layer is the model output for suitability on a scale of zero to one, with higher values representing higher suitability.
Weights are used to indicate which criteria are more important than others. These values should sum to 1.0.
bls_counties["Weighted Suitability"] = (0.4 * bls_counties["employment_index"]) + \ (0.2 * bls_counties["affordability_index"]) + \ (0.4 * bls_counties["family_index"]) bls_counties = bls_counties.sort_values("Weighted Suitability", ascending=False).reset_index(drop=True) bls_counties[["Name", "ST", "Weighted Suitability"]].head()
Name ST Weighted Suitability 0 Polk TX 0.848104 1 Liberty TX 0.824499 2 Hardin TX 0.821633 3 Jefferson TX 0.817298 4 Harris TX 0.788002
axis = bls_counties.plot("Weighted Suitability", cmap="coolwarm_r", legend=True, scheme="naturalbreaks", legend_kwds={"bbox_to_anchor":(0.2, 0.3)}) states.plot(edgecolor="#00000040", facecolor="none", ax=axis) family.plot(marker="*", color="black", ax=axis) axis.set_axis_off() plt.show()