Analyzing Areas of Influence in ArcGIS Online
This tutorial describes one type of analysis of areas of influence that can be performed using ArcGIS Online and Google Sheets.
This example examines potential differences in incidence rates for all cancers in counties surrounding nuclear power plants in the USA versus counties outside those areas.
Our question is, "Are incidence rates for all types of cancer higher near nuclear power plants?"
Our hypothesis is that cancer rates will be higher near nuclear power plants.
Map the Areas
The first step in analyzing the geospatial distribution of a phenomenon is generally to map it and look for patterns. The county-level cancer rate data for this analysis come from the State Cancer Registry maintained by the National Institutes of Health and the Centers for Disease Control (CSV file, Metadata).
The video below demonstrates the creation of a choropleth using a cancer rate feature layer already created in this ArcGIS Online organization. In this case we symbolize by the incidence rate (per 100,000 population) for all cancers.
Although there are hot spots (clusters) of cancer around the country, there is a notable cancer belt that extends from the Deep South through Appalachia.
Map the Points
Areas of influence are centered around specific points. For this example, we use point locations for nuclear power plants from the US Energy Information Administration (CSV file, Metadata). The video below demonstrates adding points from a power plant feature layer already created in this ArcGIS Online organization.
In some cases you may need to filter your points to isolate a specific type of influence. For example, this feature layer contains points for all types of power plants. The video below demonstrates how to filter the layer so that only the plants with nuclear as their primary fuel source are displayed.
We also color the points black so they will stand out over the completed map.
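Outside ArcGIS Online, the same filter can be sketched in a few lines of Python. This is only an illustration: the column names (`plant_name`, `primary_fuel`) and the sample rows are hypothetical and do not reflect the actual layout of the Energy Information Administration file.

```python
import csv
import io

# Hypothetical sample rows; the real power plant file has its own column names.
SAMPLE_CSV = """plant_name,primary_fuel,latitude,longitude
Plant A,nuclear,35.0,-85.0
Plant B,coal,36.0,-86.0
Plant C,natural gas,40.0,-75.0
Plant D,nuclear,41.0,-88.0
"""


def filter_by_fuel(rows, fuel):
    """Keep only the rows whose primary fuel matches `fuel` (case-insensitive)."""
    wanted = fuel.lower()
    return [row for row in rows if row["primary_fuel"].strip().lower() == wanted]


plants = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
nuclear_plants = filter_by_fuel(plants, "nuclear")
print([plant["plant_name"] for plant in nuclear_plants])
```

The layer filter in the map does the equivalent job interactively: it hides every point whose fuel attribute does not match the chosen value.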
Find Nearest Locations
To isolate the counties around the power plants that we hypothesize may have higher cancer rates, we use the Find Locations, Find Existing Locations analysis tool.
- For this example, we ADD EXPRESSION to select counties that have at least some area within a distance of 25 miles (arbitrarily chosen) from a plant
- Give the resulting layer a meaningful name. You may want to include your last name in the name to avoid duplicating a name used by someone else in your organization
- Uncheck Use current map extent so that any areas outside the area currently being mapped will also be selected
- Show credits to make sure you haven't made a mistake that will use up your credit quota. Credits required over 10 generally indicate some problem
- We symbolize them by cancer rate with a red color ramp so they will stand out in the map
To isolate counties outside that area, we repeat the operation, but choose not within a distance of. We symbolize them blue so they are visually subdued compared to the red areas of influence.
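The idea behind the two selections can be sketched in Python. Note the simplification: the ArcGIS Online tool tests whether any part of a county polygon lies within the distance, whereas this sketch only measures the great-circle (haversine) distance from a single representative point per county to each plant, so it is a rough approximation and not the tool's algorithm. All coordinates and names below are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def split_by_distance(counties, plants, max_miles=25.0):
    """Split counties into (near, far) lists by distance to the closest plant."""
    near, far = [], []
    for county in counties:
        is_near = any(
            haversine_miles(county["lat"], county["lon"], plant["lat"], plant["lon"]) <= max_miles
            for plant in plants
        )
        (near if is_near else far).append(county)
    return near, far


# Hypothetical plant and county points, for illustration only.
plants = [{"name": "Plant A", "lat": 35.0, "lon": -85.0}]
counties = [
    {"name": "County A", "lat": 35.1, "lon": -85.1},
    {"name": "County B", "lat": 36.0, "lon": -85.0},
    {"name": "County C", "lat": 35.0, "lon": -84.7},
]

near, far = split_by_distance(counties, plants)
print("near:", [c["name"] for c in near])
print("far:", [c["name"] for c in far])
```

Because it uses points instead of polygons, this sketch will miss large counties whose edge is within 25 miles of a plant but whose representative point is not.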
Now that we have a layer of areas that should be influenced by the points, and an area that should not be influenced, we can compare the statistics to see if there is any difference in the dependent variable - in this case incidence rates for all cancers.
Share the Map and Layers
Export to Excel Files
To perform further analysis on this data, we will need to export both layers to Excel files and import them into a spreadsheet.
Creating Excel files requires two steps: exporting the data to a file in ArcGIS Online, and then downloading that ArcGIS Online file as an Excel file to your hard drive.
Import Into Google Sheets
Once the exported files are on your hard drive, you can import them into Google Sheets as separate worksheets in a single workbook.
Descriptive Statistics Table
Once you have your data in a spreadsheet, you can perform descriptive statistical analysis. For example:
- Counties: Use the ROWS() function to count the number of rows in each sheet of counties
- Population: If your data has a population column, use the SUM() function to sum the population in each sheet of counties
- Maximum incidence rate: Use the MAX() function with the incidence rate column
- Minimum incidence rate: Use the MIN() function with the incidence rate column
- Mean incidence rate: Use the AVERAGE() function to get the mean incidence rates
- Standard deviation: Use the STDEV() function to get the standard deviations of the incidence rates, which describe the width of the distribution better than the min and max outliers
- Percent difference: Divide the mean incidence rate for the counties in the areas of influence by the mean for the counties outside them, subtract one (100%), and format the cell as a percentage
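- Additional incidence: Multiply the additional incidence (the difference between the two mean rates, divided by 100,000 because the rates are per 100,000 population) by the population of the counties in the areas of influence to estimate the additional number of new diagnoses that could be attributed to the analyzed effect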
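For readers who want to check the spreadsheet results, the same table can be sketched with Python's standard `statistics` module. The county rows below are hypothetical values made up for illustration, not data from the cancer registry. `statistics.stdev` computes the sample standard deviation, as STDEV() does in Google Sheets.

```python
import statistics

# Hypothetical rows: (population, incidence rate per 100,000 population).
near = [(120_000, 470.0), (45_000, 455.0), (300_000, 480.0)]
far = [(80_000, 440.0), (15_000, 460.0), (500_000, 450.0), (60_000, 430.0)]


def describe(rows):
    """Return the descriptive statistics used in the table for one set of counties."""
    rates = [rate for _, rate in rows]
    return {
        "counties": len(rows),                       # ROWS()
        "population": sum(pop for pop, _ in rows),   # SUM()
        "max": max(rates),                           # MAX()
        "min": min(rates),                           # MIN()
        "mean": statistics.mean(rates),              # AVERAGE()
        "stdev": statistics.stdev(rates),            # STDEV() (sample standard deviation)
    }


near_stats = describe(near)
far_stats = describe(far)

# Percent difference: ratio of the two means, minus one.
percent_difference = near_stats["mean"] / far_stats["mean"] - 1

# Additional incidence: rate difference (per 100,000) scaled to the near-plant population.
additional_rate = near_stats["mean"] - far_stats["mean"]
additional_cases = additional_rate * near_stats["population"] / 100_000

print(f"Percent difference: {percent_difference:.1%}")
print(f"Estimated additional diagnoses: {additional_cases:.1f}")
```

As in the spreadsheet, the mean here is an unweighted average of county rates, so a county of 15,000 people counts as much as a county of 500,000.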
Give the sheet a meaningful title.
Turn off the spreadsheet gridlines and add formatting.
One way of visualizing a distribution is a histogram, which divides the distribution into a set of ranges and then displays the number of values in each range.
To compare the two distributions, we can add a second data series. While the height of the bars is different due to the larger number of counties outside the nuclear plant areas of influence, the peak values of the nuclear plant counties are slightly to the right of the non-nuclear counties.
Configuring the histogram to place the outlier 1% in the outside bars keeps the chart from being too wide.
Setting the bucket size explicitly (10 in this case) controls the number of bars: a larger bucket size gives fewer, wider bars, which may make the curve clearer, depending on your distribution.
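The effect of bucket size on the number of bars can be seen in a short Python sketch. The rates are hypothetical, and the bucket boundaries chosen here (multiples of the bucket size) are an assumption that may not match how Google Sheets places its buckets.

```python
import math
from collections import Counter


def bucket_counts(values, bucket_size):
    """Count values per bucket; each key is the lower edge of a bucket."""
    counts = Counter(math.floor(value / bucket_size) * bucket_size for value in values)
    return dict(sorted(counts.items()))


# Hypothetical incidence rates per 100,000 population.
rates = [412.0, 438.5, 441.2, 447.9, 452.3, 455.0, 459.9, 463.1, 478.4]

narrow = bucket_counts(rates, 10)
wide = bucket_counts(rates, 25)

print(len(narrow), "bars with a bucket size of 10:", narrow)
print(len(wide), "bars with a bucket size of 25:", wide)
```

With these values the wider buckets produce fewer bars, at the cost of hiding detail within each range.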
If you are collaborating with others, or wish to include a spreadsheet and chart in a Story Map, you need to publish the chart. Note that you can publish the whole workbook or just the current spreadsheet or chart. You should choose the latter if placing the chart in a Story Map as a visualization.
While this comparatively simple type of analysis gives intuitive results with a minimum amount of effort, it is subject to a number of limitations that should be noted when attempting to interpret the results of the analysis.
- Confounding - the third variable problem: The factors that went into deciding where to place the points of influence may also be factors that contribute to the effect. For this example, power plants are probably built in less-affluent areas to avoid community opposition, and affluence is inversely correlated with cancer risk. Power plants may also depress surrounding property values, attracting low-income residents who already have high cancer risk. Finally, power plants may be built in areas with existing industrial infrastructure that is itself a source of carcinogens.
- Modifiable areal unit problem (MAUP): Counties are irregularly sized both in terms of area and population. Accordingly, large, sparsely populated areas might dilute the influence while small, densely populated areas might overstate the effect.
- Missing variables: The causes of conditions like cancer are complex and can involve multiple factors like genetics, lifestyle, and other environmental issues beyond the observed variable. A more robust analysis would need to consider these other factors and attempt to control for them before making a definitive attribution of cause to one specific influence in the area.
- Small numbers problem: In areas with small populations, the occurrence (or absence) of one rare health condition can have an outsized influence on the incidence or prevalence rate for that area. Because there are a large number of counties with small populations in the US, these extremes can distort the descriptive statistics derived from those values.
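A small worked example makes the small numbers problem concrete. The populations and case counts below are hypothetical; the sketch only shows how much one additional diagnosis moves a rate expressed per 100,000 population.

```python
def rate_per_100k(cases, population):
    """Incidence rate expressed per 100,000 population."""
    return cases / population * 100_000


# One extra diagnosis in a hypothetical county of 2,000 people.
small_county_shift = rate_per_100k(11, 2_000) - rate_per_100k(10, 2_000)

# One extra diagnosis in a hypothetical county of 2,000,000 people.
large_county_shift = rate_per_100k(10_001, 2_000_000) - rate_per_100k(10_000, 2_000_000)

print(f"Small county: rate changes by about {small_county_shift:.2f} per 100,000")
print(f"Large county: rate changes by about {large_county_shift:.2f} per 100,000")
```

The same single case shifts the small county's rate about a thousand times more than the large county's, which is why minimum and maximum values drawn from small counties should be read with caution.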