Geographic Correlation Analysis With ArcGIS Online
One of the useful capabilities of GIS software is the ability to overlay multiple layers of data so you can observe where characteristics correlate with each other in space.
This tutorial will describe how to perform simple geographic correlation analysis with ArcGIS Online, and export data to Google Sheets for visualization.
Correlation is a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone.
For example, gross domestic product per capita (the total amount of economic activity in a country divided by the number of people) correlates with electricity use per capita. When those two variables are compared with each other on an X/Y scatter chart, the points on the chart form a rough line rising from left to right. More economic activity usually means more electricity use.
Correlation is empirical - based on observation. In contrast, causation is rational - we draw a conclusion in our mind about something being a cause of some kind of effect. We can observe correlation, but we are usually interested in understanding causation, which can explain the phenomena we see around us and also enable us to manipulate causes to achieve desired effects - such as helping people quit smoking (cause) to reduce lung cancer deaths (effect).
Dependent and Independent Variables
When dealing with causation, the causes are measured with independent variables and the effects are measured with dependent variables. Effects are dependent upon the independent causes.
For this tutorial, the example dependent variable is county-level stroke mortality for New York State in 2014, taken from the Stroke Mortality Data Among US Adults (35+) by State/Territory and County dataset from the Centers for Disease Control and Prevention. Downloading and importing CDC data is described in greater detail in this tutorial.
A zipped shapefile of the data used in this tutorial is here. This video shows how to import a shapefile into an ArcGIS Online map.
To understand what you are mapping, you should know what fields your layers contain.
When working with fields that are absolute values (such as population), bubble maps are more appropriate visualizations than choropleths, where coloring of large but sparsely-populated areas can give a misleading impression of the data.
When working with fields that are relative values (such as rates or proportions) choropleths are often a good visualization. In this case, the number of annual stroke deaths per 100,000 people is a relative value that can be used for a choropleth.
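The difference between an absolute value and a relative value can be made concrete with a small sketch. The function name and the numbers below are made up for illustration and are not taken from the CDC data; the sketch only shows how a count is converted to a rate per 100,000 people.

```python
def rate_per_100k(count, population):
    """Convert an absolute count to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical areas with the same number of deaths but different populations
small_area_rate = rate_per_100k(50, 100_000)
large_area_rate = rate_per_100k(50, 1_000_000)

print(small_area_rate, large_area_rate)
```

The two areas have identical counts, but the rate in the smaller area is ten times higher, which is why rates rather than counts are the appropriate field for a choropleth.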
When you click the Show Table icon under the layer name, a pane opens at the bottom of the screen showing the attributes. Clicking a field name and selecting sort ascending or sort descending lets you see the range of values in that field.
Joining Data From Another Layer
While ArcGIS Online does not have a feature for creating X/Y scatter charts directly, we can create a layer containing all dependent and independent variables, export it to a CSV file, import it into Google Sheets, and create charts there.
However, we first must get all the variables into one layer.
If one of your variables is in an existing layer, you can perform a spatial join of the layer with the independent variable(s) to the layer with the dependent variable to get a combined layer.
For this example, we use a layer of median household income from the 2016 American Community Survey. Example data can be downloaded here as a zipped shapefile.
- Select Summarize Data, Join Features from the analysis icon under the layer containing the dependent variable
- The Target layer is the layer with the dependent variable, and the layer to join to the target layer is the layer with the independent variable(s)
- The type of join is Choose a spatial relationship, Identical to
- Give the layer a meaningful name
- Uncheck Use current map extent so that any features that are outside the current map view will also get joined
- Select Show credits to make sure your operation is reasonable. Joins over 5 credits should be examined to make sure you actually want to do what you are about to do
- Run Analysis
- Examine the attribute table of the resulting layer to make sure all attributes are there and look like you expect
- Symbolize by the independent variable to make sure the joined data is what you expect
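The check on the joined attribute table in the steps above can also be sketched outside ArcGIS Online. ArcGIS performs the join on geometry; the sketch below instead assumes both tables have been exported and share a county identifier. The field names (county_fips, stroke_rate, median_income) and all values are hypothetical, chosen only to show how unmatched features can be detected.

```python
def join_by_key(target_rows, join_rows, key):
    """Attach fields from join_rows to target_rows where the key value matches.

    Returns the joined rows plus the key values of target rows with no match.
    """
    lookup = {row[key]: row for row in join_rows}
    joined = []
    unmatched = []
    for row in target_rows:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row[key])
            continue
        combined = dict(match)
        combined.update(row)  # target fields win if names clash
        joined.append(combined)
    return joined, unmatched


# Made-up rows for illustration; not real CDC or ACS values
stroke = [
    {"county_fips": "36001", "stroke_rate": 70.0},
    {"county_fips": "36003", "stroke_rate": 85.0},
    {"county_fips": "36005", "stroke_rate": 60.0},
]
income = [
    {"county_fips": "36001", "median_income": 62000},
    {"county_fips": "36003", "median_income": 45000},
]

joined, unmatched = join_by_key(stroke, income, "county_fips")
print(len(joined), "joined;", "unmatched:", unmatched)
```

A non-empty unmatched list is the tabular equivalent of features that are missing from the joined layer and is worth investigating before charting.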
The Enrich Layer tool gives you access to a wide variety of demographic data that includes not only publicly available data, such as the US Census Bureau's American Community Survey, but also proprietary data on purchasing habits that can be useful to businesses.
An enrichment operation is similar to a spatial join, except that it saves you the tedium of having to download, process and join the data by hand.
- Select the Data Enrichment, Enrich Layer tool from the analysis icon underneath the layer you want to enrich
- Click the Select Variables icon to choose variables from groups of variables. For this example, we will enrich the data layer with three independent variables that we think might be geographically correlated with stroke mortality:
  - Smoked cigarettes in the last 12 months - Index
  - Bought athletic shoes in the last 12 months (proxy for active lifestyle)
  - Dinner at a fast food restaurant (proxy for poor diet)
- Give a meaningful name for the layer that will result from enrichment
ArcGIS Online is a cloud service where you pay only for what you use. The software keeps track of your usage by charging you a certain number of credits for each action you take in ArcGIS Online. Some operations like displaying layers require only fractions of a credit, while other operations can cost considerably more.
Before running your analysis, select Show credits to find out how many credits an operation will take. Enrichment can consume a large number of credits if you have a large number of features, a large number of variables, or need to repeat the operation multiple times. If you exhaust your credits, you will need to contact an administrator (if you have an institutional account) or purchase more credits (if you are on a personal account) to be able to continue working.
If your enrichment is going to consume more than five credits, you may want to reduce the number of variables you are using, enrich a smaller number of features, or double-check your choices so you do not have to repeat a credit-expensive enrichment task again.
When the tool completes, you should symbolize the layer by one of the enriched variables to make sure it looks like you expect.
To transfer your data to Google Sheets for visualization, you need to export the attributes of your combined layer to a CSV file, and then download that CSV file to your computer.
Store all files associated with this analysis in a separate directory so you can keep track of what files go with what project.
- Save and share your map
- On your Content page, create a new, meaningfully-named directory
- Delete any layers that were created during unsuccessful operations
- Move all files and maps associated with this analysis into that directory to keep everything organized
Create X/Y Scatter Charts
- Create a new Google Sheets spreadsheet in Google Drive
- Import the CSV into your spreadsheet
- Delete unneeded columns
- Rename any cryptically-named columns to something more meaningful
- Give the spreadsheet a meaningful name
- Select two columns you want to compare (ctrl-click to select the second column)
- Insert, Chart to create a new chart with the two columns
- Series, Add Trendline to add a trendline
- Adjust the color and thickness for the trendline as desired
- Display the R² value to measure the strength of the correlation
- Move to own sheet
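The import and cleanup steps above can also be sketched in Python for readers who prefer a script to a spreadsheet. The column names in this sketch (Data_Value, X9002_I) and the sample rows are hypothetical stand-ins for whatever the exported CSV actually contains; an in-memory string is used in place of a downloaded file so the example is self-contained.

```python
import csv
import io


def load_columns(csv_file, rename):
    """Read a CSV and keep only the columns named in `rename`, under new names.

    `rename` maps original column names to more meaningful ones.
    Rows with a blank or non-numeric value in any kept column are skipped.
    """
    rows = []
    for record in csv.DictReader(csv_file):
        try:
            rows.append({new: float(record[old]) for old, new in rename.items()})
        except ValueError:
            continue  # blank or non-numeric cell
    return rows


# Made-up sample standing in for the exported attribute table
sample_csv = io.StringIO(
    "OBJECTID,NAME,Data_Value,X9002_I\n"
    "1,Albany,70.1,98\n"
    "2,Bronx,65.3,\n"
    "3,Erie,80.5,112\n"
)
rename = {"Data_Value": "stroke_deaths_per_100k", "X9002_I": "smoking_index"}

rows = load_columns(sample_csv, rename)
print(rows)
```

Skipping incomplete rows is a design choice for this sketch; for a real analysis you may prefer to count and report them rather than drop them silently.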
R² is the coefficient of determination, which measures the strength of the correlation. The range is from 0.000 (no correlation) to 1.000 (perfect correlation). Because it is a squared value, R² does not show whether the relationship is positive or negative; the slope of the trendline shows the direction.
Exactly how R² should be evaluated depends on the type of data being studied. In the natural sciences, values above 0.600 are often expected. However, in the social sciences, where relationships often involve the complex interplay of ambiguous factors, values in the 0.200s or 0.300s can be considered meaningful for further investigation.
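For a linear trendline on two variables, the coefficient of determination is the square of the Pearson correlation coefficient r. The following is a minimal sketch of that calculation using made-up numbers, not the tutorial's data; it is not a description of how Google Sheets computes the value internally. It also shows that two relationships with opposite directions can have the same coefficient of determination.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with 2+ values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Made-up values for illustration
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 5]
y_falling = [5, 4, 5, 4, 2]

r_rising = pearson_r(x, y_rising)
r_falling = pearson_r(x, y_falling)

print(round(r_rising, 3), round(r_rising ** 2, 3))
print(round(r_falling, 3), round(r_falling ** 2, 3))
```

Both pairs give a squared value of 0.6, but r is positive for the first and negative for the second, so the sign has to be read from r or from the slope of the trendline.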
In this example, the fairly strong positive correlation (R² = 0.417) between the smoking index (100 = US average) and annual stroke deaths per 100,000 population follows the expectation that areas where there are higher levels of smoking would also be areas where more people die of strokes.
However, the fairly strong negative correlation (R² = 0.360) between fast food dinners and stroke deaths shows the unexpected finding that areas where large numbers of people have fast food for dinner also have low rates of death by stroke. This may be less a declaration of the health value of fast food than a sign of confounding variables, such as highly active lifestyles and careers that are associated both with better health and with inadequate time to cook at home or sit down for dinner.
Problems With Geographic Correlation
Simple correlation between geographic variables is a very useful tool for exploratory data analysis, but the technique has numerous weaknesses that require more sophisticated statistical techniques to overcome.
The Post Hoc Fallacy
A logical fallacy is an often plausible argument using false or invalid inference.
Assuming that correlation represents causation is commonly called the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this). When two things simply occur together rather than one after the other, the same error is also known as cum hoc ergo propter hoc (with this, therefore because of this).
For example, neighborhood percentages of college graduates correlate with neighborhood household income. But using this to assert that college graduates have higher incomes because of what they learned in college would be a fallacy that ignores the complex role that higher education plays in our society.
Correlation points to possible causal relationships, but does not prove them.
Confounding: The Third Variable Problem
Two variables that correlate with each other may individually have causation rooted in a third variable. In complex systems, these indirect relationships are often hard to identify and are referred to as confounding.
For example, national per capita health expenditures are strongly correlated with per capita electricity use. In this case the electricity use is not caused by health expenditures or vice versa. Both variables are correlated with economic development - a third variable that better reflects the causes of increased electricity use and increased health care expenditures.
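The third variable problem can be illustrated with a small simulation. In this sketch, all values are synthetic and the coefficients are arbitrary assumptions: a hypothetical development score drives both electricity use and health spending, and neither of those two affects the other. A partial correlation, which removes the linear effect of the third variable from both, is one common way to check whether a correlation survives once the confounder is accounted for.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: development drives both variables, plus independent noise
development = [rng.uniform(0, 10) for _ in range(200)]
electricity = [2 * z + rng.gauss(0, 1) for z in development]
health = [3 * z + rng.gauss(0, 1) for z in development]

r_eh = pearson_r(electricity, health)
r_ez = pearson_r(electricity, development)
r_hz = pearson_r(health, development)
r_eh_given_z = partial_r(r_eh, r_ez, r_hz)

print("electricity vs health:", round(r_eh, 3))
print("same, controlling for development:", round(r_eh_given_z, 3))
```

The raw correlation is very strong even though neither variable influences the other, while the partial correlation falls close to zero once development is held constant.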
The Ecological Fallacy
The ecological fallacy is making an assumption about individuals in an area based on aggregated statistics. For example, just because the median household income in a neighborhood is high does not mean that every individual person in that neighborhood lives in a high-income household.
Accordingly, making assumptions about causation in individual situations based on geographic correlation is fallacious. While areas that have above average numbers of smokers also tend to have high levels of stroke mortality, that does not prove that smoking causes strokes. Geographic correlations can point out possible causal relationships that need more investigation. But the causes of health conditions are complex, and drawing a causal connection between smoking and strokes requires much more-careful study of individuals to unpack that complexity and attribute the extent to which individual factors contribute to health conditions.
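The median household income example can be made concrete with a short sketch. The incomes and the 50,000 threshold below are invented for illustration; the point is only that an aggregate statistic can be high while a meaningful share of the households it summarizes are not.

```python
from statistics import median

# Made-up household incomes for one hypothetical neighborhood
incomes = [18_000, 22_000, 95_000, 110_000, 120_000, 135_000, 150_000]

neighborhood_median = median(incomes)

# Share of households below an arbitrary threshold chosen for illustration
threshold = 50_000
low_income_households = [income for income in incomes if income < threshold]
share_low = len(low_income_households) / len(incomes)

print("median:", neighborhood_median)
print("share below threshold:", round(share_low, 2))
```

A map would show this neighborhood as high-income, yet two of its seven households fall below the threshold, so conclusions about any individual household drawn from the aggregate would be unreliable.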
The Modifiable Areal Unit Problem
The modifiable areal unit problem (MAUP) arises when working with data aggregated by areas or regions. If different area boundaries are used, the analysis can yield different results, even if the underlying phenomenon is the same.
For example, county boundaries reflect historical and political processes rather than clear boundaries between areas where health outcomes are different. Therefore, use of these boundaries, which is often necessary to preserve confidentiality, limits the accuracy of analysis that is performed using data aggregated within those boundaries.
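The effect of boundary choice can be demonstrated with a toy example. The sketch below assumes eight small units laid out in a grid of four rows and two columns, with invented x and y values for each unit. The same units are grouped into four zones in two different ways, and the correlation between zone averages is computed for each grouping.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def zone_means(values, zones):
    """Average the unit values within each zone (a zone is a list of unit indexes)."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]


# Made-up unit values; units 0-1 form grid row 0, units 2-3 form row 1, and so on
x = [1, 3, 2, 6, 5, 9, 8, 10]
y = [2, 1, 5, 3, 7, 4, 9, 8]

# Zoning A pairs units side by side; zoning B pairs units above and below
zoning_a = [[0, 1], [2, 3], [4, 5], [6, 7]]
zoning_b = [[0, 2], [1, 3], [4, 6], [5, 7]]

r_units = pearson_r(x, y)
r_zoning_a = pearson_r(zone_means(x, zoning_a), zone_means(y, zoning_a))
r_zoning_b = pearson_r(zone_means(x, zoning_b), zone_means(y, zoning_b))

print(round(r_units, 2), round(r_zoning_a, 2), round(r_zoning_b, 2))
```

With these values, one zoning produces a correlation close to 1 while the other produces a correlation close to that of the individual units, even though the underlying data are identical in both cases.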
Reverse Causation and Simultaneity
Reverse causation is when the assumed cause in a cause-and-effect relationship is actually the effect. For example, if a recreational drug user also has symptoms of mental illness, the assumption that the drug use is causing the mental illness may be reversed - the person may be self-medicating to attempt to ameliorate the symptoms of pre-existing mental illness.
Simultaneity occurs when cause and effect occur together, often in a feedback loop. For example, children from wealthy families often have opportunities to get into more prestigious schools, which improves their chances of being wealthy and passing on those advantages to their own children. Wealth causes improved educational opportunity, which causes wealth.
An issue related to the MAUP is that many human phenomena tend to correlate spatially with themselves. For example, areas of high income often cluster together as a reflection of a variety of social processes, including a desire by wealthy homeowners to maintain property values. That clustering can make observed correlations overestimate or underestimate the strength of the actual relationship between the phenomena under investigation.
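The tendency of a variable to correlate with itself across neighboring areas can be quantified. One common measure is Moran's I, which this tutorial does not otherwise use; the sketch below is a minimal version that assumes binary weights (1 for neighbors, 0 otherwise) and areas arranged in a simple chain, with made-up values. Values near +1 indicate clustering of similar values, and values near -1 indicate that neighbors tend to be dissimilar.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights: w_ij = 1 if j is listed as a neighbor of i."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(len(js) for js in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    squared = sum(d * d for d in dev)
    if total_weight == 0 or squared == 0:
        raise ValueError("need at least one neighbor pair and non-constant values")
    return (n / total_weight) * (cross / squared)


def chain_neighbors(n):
    """Neighbors for n areas in a row: each area borders the one before and after."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


# Made-up values: low values cluster at one end, high values at the other
clustered = [1, 1, 1, 5, 5, 5]
# Same values, arranged so that neighbors always differ
alternating = [1, 5, 1, 5, 1, 5]

i_clustered = morans_i(clustered, chain_neighbors(6))
i_alternating = morans_i(alternating, chain_neighbors(6))

print(round(i_clustered, 3), round(i_alternating, 3))
```

Real analyses would derive the neighbor lists from shared polygon boundaries or distances rather than a chain, and would usually test the result for statistical significance.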
In geospatial demographic data, people are commonly given a location at the place where they live. However, most people follow daily life paths that take them to many places in the region around their home. Many people have jobs that require them to travel, and sometimes live, far from their official place of residence or country of citizenship.
Accordingly, spatial analysis that assumes people are static entities within fixed regions can be inaccurate. This can be especially problematic when dealing with issues like disease vectors or environmental toxins, where exposure can occur outside the home, but the effects of that exposure are geolocated at a home address.
Reducing phenomena to relationships between two variables often fails to capture the complex web of causes that influence a studied effect.
Correlation analysis of this kind is simple linear regression, where a simple function based on a line gives the closest fit between the dependent and independent variables.
y = β x + ε
Multiple linear regression involves creating a mathematical function fitting multiple independent variables to a single dependent variable.
y = β0 x0 + β1 x1 + ... + ε
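A multiple linear regression of this form can be fitted by ordinary least squares. The following is a minimal sketch using only the Python standard library, solving the normal equations with Gaussian elimination. It treats x0 as a constant 1 so that the first coefficient acts as an intercept, which is an assumption about how the equation above is meant to be read. The data are made up so that y is exactly 1 + 2*x1 + 3*x2, which makes the fitted coefficients easy to verify.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are not modified
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("variables are linearly dependent; no unique fit")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [b0, b1, ...], where b0 multiplies a constant 1."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up observations of two independent variables (x1, x2)
rows = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [3, 4, 6, 8, 9]  # generated as 1 + 2*x1 + 3*x2 with no noise

coefficients = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coefficients])
```

With real data the fit will not be exact, and the residuals (the ε term) carry the part of the dependent variable that the chosen independent variables do not explain.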
When dealing with individual (rather than aggregated) data, the Neyman-Rubin causal model involves finding pairs of people who are identical in all considered characteristics other than the characteristic that is suspected of being a cause, and observing them over time to see if the characteristic of interest affects the probability of the studied effect.
These and a wide variety of other statistical techniques can be used to investigate relationships between variables, but these techniques require tools more complex than ArcGIS Online and Google Sheets.