Geographic Correlation Analysis With ArcGIS Online
One of the useful capabilities of GIS software is the ability to overlay multiple layers of data so you can observe where characteristics correlate with each other in space.
This tutorial will describe how to perform simple geographic correlation analysis with ArcGIS Online, and export data to Google Sheets for visualization. The example will use stroke incidence rates as the dependent variable and a variety of independent variables derived from stroke risk factors.
Stroke is a disease that affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or ruptures, resulting in failure of parts of the brain to get needed oxygen, and death of brain cells. Around 800,000 people in the USA have a stroke each year, and around 130,000 Americans are killed by stroke, making stroke the fifth leading cause of death in the United States (American Stroke Association 2018a; 2018b).
The symptoms of stroke vary in severity based on the scale of the blockage or hemorrhage, as well as the specific location in the brain where that event occurs. Symptoms of stroke can include (American Stroke Association 2018a):
- Mild to severe paralysis
- Loss or impairment of speech and communication skills
- Memory loss
- Vision impairment or loss
- Emotional and behavioral changes
The risk factors for stroke are similar to those for heart disease, and stroke often has co-morbidity with other health conditions common in contemporary American life. Risk factors include (American Stroke Association 2018c):
- Diets high in saturated fats, trans fat, cholesterol, and sodium
- Physical inactivity
- High blood pressure
- Family history
This research will focus on geographic correlates of stroke in the United States. While stroke is common throughout the country, there are especially high rates of stroke incidence in a number of states of the old Confederacy in the southeast portion of the country (Wikipedia 2018).
American Stroke Association. 2018a. "About Stroke." Accessed 30 September 2018. http://www.strokeassociation.org/STROKEORG/AboutStroke/About-Stroke_UCM_308529_SubHomePage.jsp
American Stroke Association. 2018b. "Impact of Stroke (Stroke statistics)." Accessed 30 September 2018. http://www.strokeassociation.org/STROKEORG/AboutStroke/Impact-of-Stroke-Stroke-statistics_UCM_310728_Article.jsp
American Stroke Association. 2018c. "Stroke Risk." Accessed 27 October 2018. http://www.strokeassociation.org/STROKEORG/AboutStroke/UnderstandingRisk/Understanding-Stroke-Risk_UCM_308539_SubHomePage.jsp
Wikipedia. 2018. "Stroke Belt." Accessed 30 September 2018. https://en.wikipedia.org/wiki/Stroke_Belt
Questions and Hypotheses
Given a stereotypical association of many stroke risk factors with poverty, our first research question is: Do areas that have lower income also have higher prevalence of stroke?
Our hypothesis is that stroke prevalence will be inversely correlated with median household income by state.
Since smoking is commonly named as a risk factor for stroke, our second research question is: Do areas that have higher smoking rates also have higher prevalence of stroke?
Our hypothesis is that stroke prevalence will be correlated with smoking rates by state.
Fast food has a reputation for being high in saturated fats, cholesterol and sodium. Therefore, our third question is: Do areas that have high levels of fast food consumption also have higher levels of stroke?
Our hypothesis is that areas where a higher percentage of residents eat fast food regularly will also have higher prevalence of stroke.
Methods and Data Sources
Independent variables were tested for correlation with the dependent variable using least-squares regression in Google Sheets.
Our dependent variable is the prevalence of stroke among US adults (18+) as a percentage of the population. The data is for 2016 from the Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System.
Our independent variable for median household income is from the US Census Bureau's American Community Survey five-year estimates, 2012-2016.
Our independent variable for smoking prevalence is also for 2016 from the Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System.
Our independent variable for fast food consumption is the index "Went to fast food/drive-in restaurant 9+ times/mo" from Esri and the market research firm GfK MRI. The index expresses each area's rate relative to the national average, with 100 representing the national average.
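As a rough illustration of how an index of this kind can be read, the sketch below assumes the index is simply the local rate divided by the national rate, multiplied by 100. That formula is our assumption based on the description above, not a documented Esri or GfK MRI method, and the function name and rates are hypothetical.

```python
def consumption_index(local_rate: float, national_rate: float) -> float:
    """Return an index where 100 means the local rate equals the national rate.

    Assumed formula (not confirmed by the data provider): 100 * local / national.
    """
    if national_rate <= 0:
        raise ValueError("national_rate must be positive")
    return 100.0 * local_rate / national_rate


# Hypothetical rates, not taken from the Esri / GfK MRI data.
example = consumption_index(local_rate=0.33, national_rate=0.30)
print(round(example, 1))  # 110.0
```

Under this reading, a value of 110 would mean residents are about 10 percent more likely than the national average to report the behavior.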
A CSV file of these example variables is available here.
When dealing with causation, the causes are measured with independent variables and the effects are measured with dependent variables. Effects are dependent upon the independent causes.
For this tutorial, our example dependent variable is the prevalence of stroke among US adults (18+) for 2015 from the CDC's Behavioral Risk Factor Surveillance System (BRFSS). A CSV file of data used in this tutorial is here.
This data was uploaded to an ArcGIS Online hosted feature layer and this video shows how to create a map of the stroke variable.
Independent Variables: Joining Data From Another Layer
While ArcGIS Online does not have a feature for creating X/Y scatter charts directly, we can create a layer containing all dependent and independent variables, export it to a CSV file, import it into Google Sheets, and create charts there.
However, we first must get all the variables into one layer.
If one of your variables is in an existing layer, you can perform a spatial join of the layer with the independent variable(s) to the layer with the dependent variable to get a combined layer.
For this example, we use a layer of median household income from the 2016 American Community Survey. Example data can be downloaded here as a CSV.
- Add the layer that you will be joining to the map
- Select Summarize Data, Join Features from the analysis icon under the layer containing the dependent variable
- The Target layer is the layer with the dependent variable, and the layer to join to the target layer is the layer with the independent variable(s)
- The type of join is Choose a spatial relationship, Identical to
- Give the layer a meaningful name
- Uncheck Use current map extent so that any features that are outside the current map view will also get joined
- Select Show credits to make sure your operation is reasonable. Joins over 5 credits should be examined to make sure you actually want to do what you are about to do
- Run Analysis
- Examine the attribute table of the resulting layer to make sure all attributes are there and look like you expect
- Symbolize by the independent variable to make sure the joined data is what you expect
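The steps above use a spatial join inside ArcGIS Online. For readers who want to see what the combined layer amounts to as a table, here is a minimal sketch in Python that joins two small tables on a shared state name instead of on geometry. It is not what ArcGIS Online does internally; the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical miniature tables; column names and values are illustrative only.
STROKE_CSV = """state,stroke_pct
Alabama,4.5
Colorado,2.3
"""
INCOME_CSV = """state,median_income
Alabama,45000
Colorado,65000
Utah,62000
"""


def join_on_key(target_rows, join_rows, key):
    """Attach the attributes of join_rows to each target row with the same key."""
    lookup = {row[key]: row for row in join_rows}
    joined = []
    for row in target_rows:
        match = lookup.get(row[key])
        if match is not None:  # keep only features present in both tables
            joined.append({**row, **match})
    return joined


stroke = list(csv.DictReader(io.StringIO(STROKE_CSV)))
income = list(csv.DictReader(io.StringIO(INCOME_CSV)))
combined = join_on_key(stroke, income, "state")
for row in combined:
    print(row)
```

As with the ArcGIS Online join, the target table (the dependent variable) determines which rows appear in the result, so Utah is dropped in this example.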
Independent Variables: Layer Enrichment
Another option for getting an independent variable is the ArcGIS Online Enrich Layer tool, which gives you access to a wide variety of demographic data that includes not only publicly-available data like that available from the US Census Bureau's American Community Survey, but also proprietary data on purchasing habits that can be useful to businesses.
An enrichment operation is similar to a spatial join, except it saves you the tedium of having to download, process and join the data by hand.
- Select the Data Enrichment, Enrich Layer tool from the analysis icon underneath the layer you want to enrich
- Click the Select Variables icon to choose variables from groups of variables. For this example, we will enrich the data layer with 2018 Food Away From Home - Dinner at Fast Food Restaurant: Index
- Give a meaningful name for the layer that will result from enrichment
ArcGIS Online is a cloud service where you pay only for what you use. The software keeps track of your usage by charging you a certain number of credits for each action you take in ArcGIS Online. Some operations like displaying layers require only fractions of a credit, while other operations can cost considerably more.
Before running your analysis, you should Show credits to find out how many credits an operation will take. Enrichment can be a very credit-consumptive process if you have a large number of features, a large number of variables, or need to repeat the operation multiple times. If you exhaust your credits, you will need to contact an administrator (if you have an institutional account) or purchase more credits (if you are on a personal account) to be able to continue working.
If your enrichment is going to consume more than five credits, you may want to reduce the number of variables you are using, enrich a smaller number of features, or double-check your choices so you do not have to repeat a credit-expensive enrichment task again.
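To get a feel for how quickly enrichment costs grow, the sketch below estimates credits from the number of features and variables. The rate of 10 credits per 1,000 attributes (features multiplied by variables) is an assumption that may not match current Esri pricing, and the function names are hypothetical. The Show credits option in the tool remains the authoritative figure.

```python
def estimate_enrichment_credits(feature_count, variable_count,
                                credits_per_1000_attributes=10.0):
    """Rough estimate: attributes = features x variables, billed per 1,000.

    The default rate is an assumption; confirm it against Show credits.
    """
    attributes = feature_count * variable_count
    return attributes * credits_per_1000_attributes / 1000.0


def needs_review(credits, threshold=5.0):
    """Flag operations above the five-credit rule of thumb used in this tutorial."""
    return credits > threshold


states = estimate_enrichment_credits(feature_count=51, variable_count=1)
counties = estimate_enrichment_credits(feature_count=3000, variable_count=5)
print(round(states, 2), needs_review(states))      # 0.51 False
print(round(counties, 2), needs_review(counties))  # 150.0 True
```

Under these assumptions, enriching a state-level layer with one variable is cheap, while enriching a few thousand features with several variables is the kind of operation worth double-checking first.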
To transfer your data to Google Sheets for visualization, you need to export the attributes of your combined layer to a CSV file, and then download that CSV file to your computer.
- View the item details for your layer containing your dependent and independent variables
- Select Export Data -> Export to CSV file
- When given the CSV file page, Download a zip archive of the CSV file
- Extract the CSV file from the zip archive and copy it to your desktop
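If you would rather inspect the exported CSV with a script before importing it into Google Sheets, a minimal sketch using Python's standard csv module follows. The column names are hypothetical, and an in-memory sample stands in for the downloaded file; in practice you would pass an open file object for the CSV you extracted.

```python
import csv
import io


def read_numeric_columns(csv_file, x_name, y_name):
    """Return paired lists of floats for two columns, skipping rows with blanks."""
    xs, ys = [], []
    for row in csv.DictReader(csv_file):
        x_text = row.get(x_name) or ""
        y_text = row.get(y_name) or ""
        if not x_text.strip() or not y_text.strip():
            continue  # skip features with missing values
        xs.append(float(x_text))
        ys.append(float(y_text))
    return xs, ys


# Hypothetical sample standing in for the exported file.
sample = io.StringIO(
    "NAME,stroke_pct,smoking_pct\n"
    "A,3.1,17.0\n"
    "B,,20.0\n"
    "C,4.0,22.5\n"
)
smoking, stroke = read_numeric_columns(sample, "smoking_pct", "stroke_pct")
print(smoking, stroke)  # [17.0, 22.5] [3.1, 4.0]
```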
Store all files associated with this analysis in a separate directory so you can keep track of what files go with what project.
- Save and share your map
- On your Content page, create a new, meaningfully-named directory
- Delete any layers that were created during unsuccessful operations
- Move all files and maps associated with this analysis into that directory to keep everything organized
Correlation Analysis: X/Y Scatter Chart
Correlation is a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone.
R² is the coefficient of determination, which measures the strength of the correlation. It ranges from 0.000 (no correlation) to 1.000 (perfect correlation).
Exactly how R² should be evaluated depends on the type of data being studied. In the natural sciences, values above 0.600 are often expected. However, in the social sciences where relationships often involve the complex interplay of ambiguous factors, values in the 0.200s or 0.300s can be considered meaningful for further investigation.
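For reference, the coefficient of determination for a least-squares trendline can be written as follows. This is the standard textbook definition rather than anything specific to Google Sheets. The symbols are introduced here for illustration: y_i are the observed values of the dependent variable, \hat{y}_i are the values predicted by the trendline, \bar{y} is the mean of the observed values, and r is the Pearson correlation coefficient.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2 = r^2 \quad \text{for a straight-line fit with one independent variable}
```

Because the value is a square, it does not show whether the relationship is positive or inverse; the slope of the trendline shows the direction.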
To create an X/Y scatter chart in Google Sheets and find the R²:
- Create a new Google Sheets spreadsheet in Google Drive
- Import the CSV into your spreadsheet
- Delete unneeded columns
- Rename any cryptically-named columns to something more meaningful
- Give the spreadsheet a meaningful name
- Select two columns you want to compare (ctrl-click to select the second column)
- Insert, Chart to create a new chart with the two columns
- Add axis titles so you know what the chart shows
- Series, Add Trendline to add a trendline
- Adjust the color and thickness for the trendline as desired
- Display the R² value to measure the strength of the correlation
- Move to own sheet
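As a cross-check on the trendline values reported by the chart, the same least-squares line and R-squared can be computed directly. The sketch below is a plain-Python version of simple linear regression; the function name and the sample numbers are hypothetical and are not the tutorial's state data.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept, r_squared) for a least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length columns with at least two rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a column with no variation has no defined correlation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up numbers for illustration only.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 1, 4, 3, 5]
slope, intercept, r_squared = linear_fit(x_values, y_values)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f}")
# slope=0.800 intercept=0.600 R2=0.640
```

A negative slope indicates an inverse correlation, which the R-squared value alone does not reveal.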
Median household income has a fairly strong inverse correlation with stroke prevalence by state, with an R-squared of 0.410. This corroborates our hypothesis that income would be inversely correlated with stroke prevalence.
Because the relationship between income and stroke risk is complex, care should be taken to avoid the ecological fallacy in assuming that all poor people are at high risk of stroke, or that wealthy people are at low risk. The aggregation of data at the state level presents the modifiable areal unit problem that may cause this spatial correlation to be higher or lower than the relationship at an individual level. Income should also be seen as a proxy for the actual risk factors that lead to stroke.
Smoking has a very strong correlation with stroke prevalence by state, with an R-squared of 0.558. This corroborates our hypothesis that smoking rates would be correlated with stroke prevalence by state.
The aggregation of data at the state level presents the modifiable areal unit problem that may cause this spatial correlation to be higher or lower than the relationship at an individual level.
Fast food consumption has a weak negative correlation with stroke prevalence by state, with an R-squared of 0.284. This unexpected result fails to corroborate our hypothesis that fast food consumption would be positively correlated with stroke prevalence by state.
The high level of heteroskedasticity as well as the fairly narrow range of values for the fast food consumption index calls into question the quality of the fast food consumption data. This may also be less of a declaration of the health value of fast food than the presence of confounding variables, such as highly active lifestyles and careers that are both associated with better health and inadequate time to cook at home or take time for a sit-down dinner. In addition, the modifiable areal unit problem may be a factor where analysis with state-level statistics misses important distinctions that might be clearer with smaller areal units like zip codes or counties.
Empirical Limitations: Problems With Data
The use of simple correlation between geographic variables is a very useful tool for exploratory data analysis, but this technique has numerous weaknesses that require more-sophisticated statistical techniques to overcome.
The Modifiable Areal Unit Problem
The modifiable areal unit problem (MAUP) arises when working with data aggregated by areas or regions. If different area boundaries are used, the analysis can yield different results, even if the underlying phenomenon is the same.
For example, county boundaries reflect historical and political processes rather than clear boundaries between areas where health outcomes are different. Therefore, use of these boundaries, which is often necessary to preserve confidentiality, limits the accuracy of analysis that is performed using data aggregated within those boundaries.
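The effect of boundary choice can be shown with a small synthetic example. The sketch below takes eight made-up local values for a risk factor and an outcome, aggregates them into four zones under two different hypothetical boundary schemes, and computes R-squared on the zone averages. The data, zone definitions, and function names are all invented for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def zone_means(values, zones):
    """Average the values of the units that fall inside each zone."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]


# Synthetic unit-level values (indices 0-7 are eight small areas).
risk = [1, 2, 3, 4, 5, 6, 7, 8]
outcome = [2, 1, 4, 3, 6, 5, 8, 7]

# Two hypothetical ways of drawing four zones around the same eight units.
zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]
zoning_b = [(0, 2), (1, 3), (4, 6), (5, 7)]

results = {}
for name, zones in (("A", zoning_a), ("B", zoning_b)):
    results[name] = r_squared(zone_means(risk, zones), zone_means(outcome, zones))
    print(name, round(results[name], 3))
# A 1.0
# B 0.779
```

The underlying values never change, yet one set of boundaries produces a perfect correlation and the other a noticeably weaker one.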
Survey data is subject to a number of issues that affect its accuracy. For example, areas with sparse populations and limited medical facilities might underreport stroke diagnoses to researchers. People in some areas might also be suspicious, unresponsive, or untruthful to researchers, resulting in less-accurate data.
Data quality can also be affected by the methodology by which it is created. In this example, data on fast food consumption is (apparently) modeled marketing research data which may reflect questionable modeling assumptions or limited availability of accurate data for model calibration.
Temporal and Spatial Lag
Stroke can be the result of years of unhealthy lifestyle choices that may have been made in locations other than the home of the stroke victim at the time of onset. Indeed, a stroke victim might be making very different lifestyle choices in retirement than they made when they were younger and began accumulating comorbidities. Accordingly, there may be a temporal (time) lag between data that would reflect the risk factors and the data that measures stroke. Such temporal lag can hide clear connections between risk factors and stroke incidence.
Human mobility is also a problem. Because people usually change locations during the day (home, work, school) and occasionally relocate residences, there may be a spatial lag between data reflecting risk factors and data on where people are when they have strokes.
Rational Limitations: Problems Of Interpretation
While correlation may be interesting, what we are usually more interested in is causation - whether one of the phenomena we are observing causes the other phenomenon. In this example, we are interested in whether low income, smoking, or fast food consumption causes stroke. Such knowledge could be used to justify public health policies that target those risk factors.
While correlation is empirical (based on observation), causation is based on reason. When we observe two phenomena occurring together and we observe that there is some mechanism connecting the two phenomena, we use reason and logic to tie those two phenomena together in a cause and effect relationship.
While correlation can be an indicator of causation, assuming that correlation proves causation is a logical fallacy known by the Latin phrase cum hoc ergo propter hoc (with this, therefore because of this), a close relative of the post hoc fallacy, post hoc ergo propter hoc (after this, therefore because of this). A logical fallacy is an often plausible argument using false or invalid inference.
In this example, the strong correlation between smoking rates and stroke rates is reflective of a well-documented causal mechanism connecting smoking with a wide variety of health maladies. Correlation corroborates our understanding that smoking increases the risk of stroke.
However, the negative correlation between fast food consumption and stroke rates runs counter to the commonly assumed link between the poor nutritional value of fast food and stroke risk factors. It seems unlikely that increased fast food consumption reduces stroke risk. Something else is likely involved here, which is discussed below.
Correlation points to possible causal relationships, but does not prove them.
Covariates are additional independent variables that might influence the dependent variable.
For example, smoking correlates with a wide variety of other lifestyle choices that are also risk factors for stroke. Separating out how much of the increased stroke risk is due to smoking and how much is due to those other factors requires more complex techniques like multiple regression analysis, principal component analysis, or factor analysis.
Confounders: The Third Variable Problem
Two variables that correlate with each other may individually have causation rooted in a third variable. In complex systems, these indirect relationships are often hard to identify and are referred to as confounding.
In this example, the counterintuitive negative relationship between fast food consumption and stroke rates may have a confounder in active lifestyles. People with busy lives may be more inclined to eat fast food because they do not have time to prepare healthier foods. It is also possible that stroke-prone senior citizens are less inclined to eat fast food, resulting in areas with large numbers of senior citizens having reduced market potential for fast food.
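One common way to probe for a suspected confounder is a partial correlation, which measures how strongly two variables correlate after the linear influence of a third has been removed. This technique is not used elsewhere in this tutorial. The sketch below applies it to synthetic data constructed so that both variables depend on the same third variable; all names and numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data: both variables are the confounder plus unrelated noise.
confounder = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, 1, -1, -1, 1, -1, 1]
variable_x = [z + e for z, e in zip(confounder, noise_x)]
variable_y = [z + e for z, e in zip(confounder, noise_y)]

raw = pearson_r(variable_x, variable_y)
controlled = partial_r(variable_x, variable_y, confounder)
print(f"raw correlation: {raw:.2f}")
print(f"magnitude after controlling for the confounder: {abs(controlled):.2f}")
```

In this constructed case the two variables look strongly related (a raw correlation of about 0.84), but the relationship essentially vanishes once the shared third variable is accounted for. Real data will rarely separate this cleanly.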
Reverse Causation and Simultaneity
Reverse causation is when the assumed cause in a cause-and-effect relationship is actually the effect. For example, if a recreational drug user also has symptoms of mental illness, the assumption that the drug use is causing the mental illness may be reversed - the person may be self-medicating to attempt to ameliorate the symptoms of pre-existing mental illness.
Simultaneity occurs when cause and effect occur together, often in a feedback loop. For example, children from wealthy families often have opportunities to get into more prestigious schools, which improves their chances of being wealthy and passing on those advantages to their own children. Wealth causes improved educational opportunity which causes wealth.
In this example, the relationship between stroke and fast food consumption may actually be reversed. Stroke victims may be wary of continuing the lifestyle choices that led to their strokes, and may eschew fast food, lowering market demand.
The Ecological Fallacy
The ecological fallacy is making an assumption about individuals in an area based on aggregated statistics. For example, just because the median household income in a neighborhood is high does not mean that every individual person in that neighborhood lives in a high-income household.
Accordingly, making assumptions about causation in individual situations based on geographic correlation is fallacious. While areas that have above average numbers of smokers also tend to have high levels of stroke prevalence, there are smokers who never have strokes, and people who have strokes who have never smoked.
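The median income example can be made concrete with a few made-up numbers. The sketch below uses hypothetical household incomes for a single neighborhood and an arbitrary low-income cutoff chosen only for illustration.

```python
from statistics import median

# Hypothetical household incomes for one neighborhood.
household_incomes = [28_000, 35_000, 95_000, 110_000, 120_000, 140_000, 250_000]

area_median = median(household_incomes)
low_income_cutoff = 50_000  # arbitrary threshold for illustration
low_income_households = [i for i in household_incomes if i < low_income_cutoff]

print(area_median)  # 110000
print(len(low_income_households), "of", len(household_incomes), "households fall below the cutoff")
```

The aggregate statistic describes the neighborhood as high income even though two of the seven households are well below the cutoff, which is exactly the kind of individual variation that aggregated data hides.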
Abraham Lincoln is a Science, Technology, and Society major in the Department of Science, Technology, and Society at Farmingdale State College, with a concentration in Health, Wellness, and Society. He currently lives in Washington, DC. His research interests include human trafficking and military history. He plans to become a politician and save the union.