Analyzing Areas of Influence in ArcGIS Pro
We often have questions about areas around a set of locations. For example:
- Which of a specific set of potential retail locations will maximize access to targeted customers?
- Which proposed transit stops will serve the largest possible current or future number of people?
- Are specific sources of emissions associated with negative health effects in surrounding neighborhoods?
This tutorial will describe how to perform a buffer and overlay (spatial join) in ArcGIS Pro, which is a technique that can help you answer questions like those above. For this example, we will investigate whether counties around coal-fired power plants in Illinois have higher rates of lung cancer than other counties in the state.
Literature Review and Hypothesis
Cancer is a general name given to a variety of different diseases where the body's cells divide without stopping. In normal life, new cells divide, grow, and die in an orderly process that renews the various parts of the body. However, cancerous cells survive when they should die, and extra cells divide without stopping, forming tumors. Cancer can affect all organs and systems of the body, and types of cancer are usually named after the tissues where the cancers form (NCI 2018).
Particulates generated by coal-fired power plants are associated with long-term health consequences. Lin et al. (2019) found an "association between lung cancer incidence and increased reliance on coal for energy generation" at the national level.
Therefore, our hypothesis is that counties around coal-fired power plants in Illinois will have higher rates of lung cancer than other Illinois counties.
National Cancer Institute (NCI). 2018. "Cancer Statistics." Accessed 12 October 2018. https://www.cancer.gov/about-cancer/understanding/statistics.
Lin, Cheng-Kuan, Ro-Ting Lin, Tom Chen, Corwin Zigler, Yaguang Wei, and David C. Christiani. 2019. "A global perspective on coal-fired power plants and burden of lung cancer." Environmental Health 18 (1): 9. https://doi.org/10.1186/s12940-019-0448-8.
Methods and Data Sources
Counties within 25 miles of specific energy plants were considered potential areas of influence. The means of cancer rates and survey margins of error were compared between areas of influence and all other US counties.
Our power plant data for 2018 is from the US Energy Information Administration. The data may be accessed here.
Our cancer data source is the "State Cancer Profiles, Incidence Rate Report for United States by County, 2010-2014," from the Centers for Disease Control and Prevention and The National Cancer Institute. The data may be accessed here.
The first step in analyzing the geospatial distribution of a phenomenon is generally to map it and look for patterns.
The video below demonstrates the creation of a choropleth using an existing feature service with lung cancer incidence rates by county, overlaid with coal-fired power plant locations.
- Create a new project in ArcGIS Pro and give it a meaningful name.
- Use Add Data to add the cancer layer (Minn 2013-2017 County Cancer Profiles).
- Symbolize for the disease of interest. In this example, we use lung cancer incidence per 100K, and use a single color scheme from gray to navy.
- Go to the Appearance tab and adjust the transparency so you can see the base map.
- You might consider changing the base map to a less complex map so the data layer is easier to see.
- Add the power plant layer (Minn 2018 Power Plants).
- Right click on the layer and select Properties -> Definition Query.
- Add a New definition query where Primary Fuel Source is equal to the type of plant you are analyzing (in this example, coal).
- Symbolize the points so they stand out over the disease layer. In this case we use red dots that contrast with the blue disease layer.
Although there are hot spots (clusters) of cancer around the country, there is a notable cancer belt that extends from the Deep South through Appalachia.
National Map Print Layout
- Insert -> New Layout to create a new print layout.
- Add a map frame.
- Right click and Activate the map to center the map contents if needed.
- Add a meaningful title.
- Change the text symbol appearance to make the font bold and larger.
- Change the text symbol position to center the title in the text box.
- Add a legend.
- Rename the layers so the legend displays correct information.
- Format the legend display to add a background and border with a gap.
- Modify the legend items properties to show only needed headings.
- Right click on the cancer layer and select Symbology -> Advanced Symbol Options to reduce the decimal points to an appropriate precision.
- Add a text box with marginalia (cartographer, date, data source).
- Export to a PDF and proof.
State Map Using Definition Queries
To take a subset of data in ArcGIS Pro, you use a definition query. A definition query allows you to determine which features from the feature class should be used based on criteria that you specify. A definition query is analogous to a filter (in ArcGIS Online) or a SELECT when working with databases.
- Rename the national map and layout so you can distinguish between maps in your project.
- Open the catalog pane and duplicate (copy and paste) the national map as a second map in your project for the state map.
- Right click on the cancer layer and select Properties -> Definition Query.
- Add a New definition query to select the state you are analyzing. This data does not have a state name field, but does have a STATEFP Federal Information Processing Standard (FIPS) code. For states, these are two-digit numbers. You can find the FIPS code for your analysis state by looking at the attributes for one of the counties in the analysis state. For Illinois, this code is 17.
- Modify the expression for the power plant layer to similarly limit display only to power plants in the analysis state. In this data, there is a state name field.
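Because a definition query behaves like the WHERE clause of a SQL SELECT, its logic can be sketched outside ArcGIS Pro. The minimal example below uses Python's standard sqlite3 module with a hypothetical counties table. The county names and rates are made-up placeholders, and storing the FIPS code as text is an assumption, so check the field type in your own data.

```python
import sqlite3

# Hypothetical stand-in for the county attribute table (placeholder values, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counties (name TEXT, statefp TEXT, lung_rate REAL)")
conn.executemany(
    "INSERT INTO counties VALUES (?, ?, ?)",
    [
        ("County A", "17", 60.0),
        ("County B", "17", 75.0),
        ("County C", "18", 70.0),
    ],
)

# The definition query corresponds to the WHERE clause of this SELECT.
state_fips = "17"  # FIPS code for Illinois
illinois = conn.execute(
    "SELECT name FROM counties WHERE statefp = ? ORDER BY name",
    (state_fips,),
).fetchall()

print(illinois)
```

Only the rows whose state code matches are returned, which mirrors how the definition query limits the features drawn and analyzed in the state map.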
Areas of Influence With Buffers
To find which counties might be affected by the emissions from the power plants, we create buffers. Buffers are rings or polygons created around features to represent the area surrounding those features. For this example, we create buffers with an arbitrarily chosen distance of 25 miles around each plant to assess the influence that the plants may have on the health of the surrounding communities.
- On the Analysis tab, open the Tools and search for the Buffer tool.
- The Input Features that will be buffered in this example are the power plants.
- Give the Output Feature Class a meaningful name. You should use only letters in this name (no spaces or punctuation).
- The Distance is the radius of the buffers (25 miles for this example).
- Run the tool.
- Change the buffer Symbology and Appearance -> Transparency so they stand out against the choropleth.
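Conceptually, a 25-mile buffer around a point contains every location whose distance to that point is at most 25 miles. The sketch below illustrates that idea in plain Python with a great-circle (haversine) distance. It is not how the Buffer tool is implemented, the coordinates are hypothetical, and the Earth radius is an approximation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_buffer(point, plant, radius_miles=25.0):
    """True if the point lies inside the circular buffer around the plant."""
    return haversine_miles(point[0], point[1], plant[0], plant[1]) <= radius_miles


# Hypothetical coordinates (latitude, longitude), not real plant locations.
plant = (40.0, -89.0)
nearby = (40.2, -89.0)   # about 14 miles north of the plant
distant = (41.0, -89.0)  # about 69 miles north of the plant

print(within_buffer(nearby, plant), within_buffer(distant, plant))
```

The Buffer tool goes further than this distance test by producing polygon features that can be symbolized and used in later overlay steps.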
Spatial Join Overlay
To flag which counties are within the area of influence, we perform a spatial join. Like an attribute join that copies data from one layer to another, the spatial join copies attributes from the join layer to the features of the target layer that it overlaps spatially, and writes the result to an output layer.
For this analysis, we will simply use the Join Count attribute that the join tool also adds to the output layer to indicate how many features from the join layer (the areas around power plants) overlapped each target feature (county).
- Right click the Target Features layer (the county cancer layer) and select Joins and Relates -> Spatial Join.
- The Join Features should be the layer of buffers around the plants.
- Leave the Output Feature Class name as the default because the name will not matter.
- Run the tool.
- Symbolize the layer by the Join Count. If the join worked as expected, the areas under the buffers should have a join count greater than zero.
- If you need a map to proof, you can export it here.
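The Join Count logic can be sketched in plain Python under strong simplifying assumptions: counties are represented by hypothetical axis-aligned rectangles rather than real polygons, coordinates are planar and measured in miles, and a buffer counts as overlapping a county if the circle touches the rectangle at all (an intersect-style match). All names and coordinates are illustrative.

```python
from dataclasses import dataclass


@dataclass
class County:
    """Hypothetical rectangular stand-in for a county polygon (planar miles)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def buffer_intersects_county(cx, cy, radius, county):
    """True if a circle centered at (cx, cy) touches or overlaps the rectangle."""
    # Closest point of the rectangle to the circle center.
    nearest_x = min(max(cx, county.xmin), county.xmax)
    nearest_y = min(max(cy, county.ymin), county.ymax)
    return (cx - nearest_x) ** 2 + (cy - nearest_y) ** 2 <= radius ** 2


def join_count(counties, plants, radius=25.0):
    """Number of plant buffers overlapping each county, keyed by county name."""
    return {
        county.name: sum(
            buffer_intersects_county(px, py, radius, county) for px, py in plants
        )
        for county in counties
    }


counties = [
    County("A", 0, 0, 30, 30),
    County("B", 30, 0, 60, 30),
    County("C", 100, 100, 130, 130),
]
plants = [(4, 15), (32, 15)]  # hypothetical plant locations

counts = join_count(counties, plants)
print(counts)
```

A county with a count of zero lies outside every buffer, while a count of one or more places it inside the area of influence.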
Analyze The Areas of Influence
The Join Count attribute associated with each county can now be used to distinguish between counties that are within the area of influence around each power plant (Join Count > 0) or outside the area (Join Count = 0).
Using a box plot chart, we can see whether the incidence rates in each group tend to be lower, higher, or about the same. Each box and its whiskers divide the values into four parts (quartiles). The ends of the whiskers show the highest and lowest values, while the thick box in the middle shows the middle 50% of the values. If one box is generally higher than the other, the values in that group are generally higher.
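The numbers behind a box plot can be sketched as a five-number summary (minimum, first quartile, median, third quartile, maximum). The example below groups records into inside and outside the area of influence, whereas the chart built in the steps that follow draws one box per Join Count value. The rates are made-up placeholders, and the "inclusive" quartile method is an assumption that may not match how ArcGIS Pro computes quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two values."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def summarize_by_influence(records):
    """records: iterable of (join_count, rate) pairs, one per county."""
    groups = {"inside": [], "outside": []}
    for count, rate in records:
        groups["inside" if count > 0 else "outside"].append(rate)
    # Quartiles need at least two values, so smaller groups are skipped.
    return {
        label: five_number_summary(rates)
        for label, rates in groups.items()
        if len(rates) >= 2
    }


# Placeholder (join count, incidence per 100K) pairs, not real data.
records = [
    (0, 50.0), (0, 60.0), (0, 70.0), (0, 80.0), (0, 90.0),
    (1, 65.0), (2, 75.0), (1, 85.0), (1, 95.0), (3, 105.0),
]

summary = summarize_by_influence(records)
print(summary)
```

Comparing the medians and the quartile ranges of the two groups gives the same reading as comparing the height of the boxes in the chart.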
- In the Contents pane, click on the layer created with the join, and in the Feature Layer -> Data tab, select Create Chart -> Box Plot.
- Select the variable you are analyzing. For this example it is lung cancer incidence.
- In the Category box, select the Join Count variable. This will create separate bars for different numbers of plants that overlap the counties.
- On the General pane, change the chart title and X and Y axes titles to meaningful names.
- Hide the joined layer because you will not need to include it on your map.
- View the Catalog Pane and duplicate (copy and paste) the national map layout for your state layout.
- Right click on the map frame and select Properties. Change the map displayed in the Map Frame to your state map.
- Activate the map frame and zoom in to your state.
- Add a Dynamic Text -> Service Layer Credits box to your layout to remove the distracting credit text from the map, and drag that text box off the printable area of the map.
- Right click the legend and select Properties -> Legend Items -> Show Properties to hide the cancer layer heading.
- In the legend Drawing Order drag the coal plant points and buffers below the choropleth legend items.
- Insert a Chart Frame with the chart you created above.
- If the chart frame only displays the text "Chart Frame," right click to format the Properties and unclick the Options to Only show chart data visible in the map frame since the layer you are analyzing is not actually visible on the map.
- Move and resize the different map elements to maximize the use of space in an aesthetically pleasing way. Depending on the shape of your analysis state, you may want to change the orientation of your print layout to portrait (taller than wider).
- Modify the map title as appropriate.
- Export the layout to a PDF.
Save Your Project
You should save your project as a project package to ArcGIS Online so you can open the project on another machine, and so you have a backup.
- Go to the Share tab and select Project.
- Provide a name to save the project under. The default is the name of the current project.
- Copy the name into the Tags and Summary fields.
- Click the Share outside of organization box so your project database containing your buffer and join layers is included in your project package.
- Unclick the Include Toolboxes and Include History Items check boxes so that history or toolbox errors do not cause your upload to fail.
- Analyze the project to find any problems.
- Package the project to upload it to ArcGIS Online.
A general look at the spatial distribution of the highest and lowest lung cancer incidence rates, irrespective of power plants, shows high cancer rates throughout the state, although, oddly, Chicagoland is not a high lung cancer area.
Looking at the box plot, the counties near one or more coal-fired power plants generally do seem to have slightly higher lung cancer incidence rates, although the one county near five plants has a fairly low rate.
Since there is not a clear indication that lung cancer rates are consistently higher in areas around coal-fired power plants, this analysis only weakly corroborates our hypothesis that counties around coal-fired power plants in Illinois will have higher rates of lung cancer than other Illinois counties.
While this comparatively simple type of analysis gives intuitive results with a minimum amount of effort, it is subject to a number of limitations that make it primarily useful for exploratory data analysis that can inform more rigorous work.
Covariates are additional independent variables that might influence the dependent variable.
Cancer has numerous risk factors, but in this example, we only looked at one of those variables (proximity to power plants). A more sophisticated analysis would consider covariates in an attempt to determine how important each one of those variables is in explaining different county cancer rates. Common analytical techniques used to do this include multiple linear regression analysis, principal component analysis, and factor analysis.
For example, a regression model might consider not only whether a county is near a power plant, but also incorporate covariates that quantify lifestyle risk factors like obesity rates, rates of regular consumption of fruits and vegetables, and rates of smoking. The resulting model would contain coefficients that estimate how much the cancer rate changes with each variable when the other variables are held constant.
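As a minimal sketch of multiple linear regression, the example below fits an ordinary least squares model by solving the normal equations in plain Python. The data are synthetic and were constructed from known coefficients purely to show that the fit recovers them. The variable names and values are hypothetical and say nothing about real cancer rates.

```python
def fit_ols(predictors, response):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in row] for row in predictors]  # add intercept
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, response))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; the model cannot be fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Synthetic data: rate = 50 + 5 * near_plant + 2 * smoking_rate (no noise).
near_plant = [0, 1, 0, 1, 0, 1]
smoking_rate = [10, 12, 15, 20, 18, 25]
rate = [50 + 5 * n + 2 * s for n, s in zip(near_plant, smoking_rate)]

coefficients = fit_ols(list(zip(near_plant, smoking_rate)), rate)
print([round(c, 3) for c in coefficients])
```

With real data the fit would not be exact, and a statistical package would also report standard errors and significance tests that this sketch omits.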
A confounder is a third variable that correlates with both the presumed cause and effect variables.
For example, power plants are often located in poor areas where residents have little political power. Because lifestyle and medical care issues associated with poverty could also increase cancer rates, those factors could increase cancer rates around power plants while the emissions from the power plant actually have no carcinogenic effect.
The Modifiable Areal Unit Problem
Using different types of areas to aggregate data can result in different results. Any spatial analysis that uses areas is subject to the MAUP.
For example, the cancer data used in this analysis is aggregated by county. If we had used smaller areas such as zip codes or census tracts, we might have found more meaningful differences in cancer rates in the areas closest to the plants.
Similarly, the use of a 25-mile area of influence (buffer) around plants is arbitrary. Larger or smaller areas might have yielded different results.
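A small arithmetic sketch can make the MAUP concrete. In the hypothetical example below, the same four small units are grouped into two zones in two different ways, and the resulting zone rates differ even though the underlying cases and populations are unchanged. All unit names, case counts, and populations are made up.

```python
def rate_per_100k(cases, population):
    """Incidence rate per 100,000 people."""
    return cases * 100_000 / population


# Hypothetical small units: name -> (cases, population).
units = {"a": (10, 1000), "b": (2, 1000), "c": (2, 1000), "d": (10, 1000)}


def aggregate(zoning):
    """zoning maps a zone name to its member units; returns zone -> rate per 100K."""
    rates = {}
    for zone, members in zoning.items():
        cases = sum(units[m][0] for m in members)
        population = sum(units[m][1] for m in members)
        rates[zone] = rate_per_100k(cases, population)
    return rates


# Two ways of drawing boundaries around the same four units.
rates_1 = aggregate({"north": ["a", "b"], "south": ["c", "d"]})
rates_2 = aggregate({"east": ["a", "d"], "west": ["b", "c"]})

print(rates_1)
print(rates_2)
```

The first zoning hides all variation because both zones have the same rate, while the second zoning reveals a fivefold difference between zones.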
Point-source pollutants are often spread by wind and water currents far beyond the source. This can either reduce the effects of those emissions (dilution is the solution to pollution) or concentrate those pollutants downwind or downstream. Pollution can become concentrated in rivers as those rivers flow past multiple sources of pollution.
Human mobility is also a problem. Because people usually change locations during the day (home, work, school) and occasionally relocate residences, the effects of carcinogen exposure may not become visible in the area where the actual exposure took place.
This is a variation on the modifiable areal unit problem (MAUP), where changing the areas used for analysis can change the results of the analysis, even if the underlying phenomenon is the same.
There are numerous ways of minimizing (but not eliminating) issues with spatial lag:
- Areas of influence can be based on analysis of air and water flows rather than arbitrary 25-mile areas of influence
- The research could focus on where diagnosed individuals have lived and worked to investigate the exposure to environmental carcinogens in the past
- Spatial lag regression is a variation on simple regression that takes into account spatial autocorrelation
Cancer often takes months or years to develop. Accordingly, there may be a temporal (time) lag between exposure(s) and diagnoses of cancer. Many people may also move between the period of the exposure and the onset of disease. Such temporal lag can hide clear connections between carcinogens and cancer.
With long-lived assets like power plants that persist in communities for decades, this effect is reduced. However, when studying the effects of new facilities, observations over extended periods of time (longitudinal studies) and statistical techniques that take temporal lag into account are needed to compensate for temporal lag. Indeed, temporal lag can become a variable in the analysis when increased cancer rates follow the opening of a new plant.
The Small Numbers Problem
In areas with small populations, the occurrence (or absence) of one rare health condition can have an outsized influence on the incidence or prevalence rate for that area. Because there are a large number of counties with small populations in the US, these extremes can distort the descriptive statistics derived from those values.
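The arithmetic behind the small numbers problem can be sketched as follows. The case counts and populations are hypothetical and were chosen so that both counties start at the same rate of 60 per 100K.

```python
def incidence_rate_per_100k(cases, population):
    """Incidence rate per 100,000 people."""
    return cases * 100_000 / population


def effect_of_one_more_case(cases, population):
    """Change in the rate per 100K caused by a single additional case."""
    return (
        incidence_rate_per_100k(cases + 1, population)
        - incidence_rate_per_100k(cases, population)
    )


# Hypothetical counties with the same starting rate of 60 per 100K.
small_county_change = effect_of_one_more_case(cases=3, population=5_000)
large_county_change = effect_of_one_more_case(cases=300, population=500_000)

print(round(small_county_change, 1), round(large_county_change, 1))
```

One additional case moves the small county's rate by about 20 per 100K but the large county's rate by only about 0.2 per 100K, which is why rates for sparsely populated counties are less stable.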
The ecological fallacy is the use of aggregated statistics to explain individual situations.
For this example, the causes of cancer are complex and often specific to individuals (such as genetics or lifestyle). If someone near a power plant gets cancer, using this aggregated data to make an argument that the power plant had nothing to do with their cancer would fall into the ecological fallacy.
Similarly, a small group of dirtier power plants might be emitting carcinogens that affect people in their vicinity, but since those numbers are aggregated with all other power plants, the effects of those specific plants would not show up in the aggregated mean. Using this analysis to argue that no coal-fired power plants increase the rate of cancer in their vicinity would also fall into the ecological fallacy.