Geographic Correlation and Causation
Correlation is the strength of the relationship between two variables. The measurement of correlation is one of the most common and useful tools in statistics.
Causation refers to a cause-and-effect relationship between two things. Correlation can be used to find causal relationships, but causation is a more complex concept than a simple binary relationship. This tutorial introduces correlation and the issues that distinguish correlation from causation.
Correlation and X/Y Scatter Charts
X/Y Scatter Charts With the Google Public Data Explorer
One simple and effective way to visualize the presence or absence of correlation between two variables is with an X/Y scatter chart that places one variable on the x-axis and the other variable on the y-axis.
The Google Public Data Explorer (GPDE) web app provides the capability for creating interactive visualizations of data from a variety of public sources. These examples use country-level indicator data collected by the World Bank.
One powerful feature of the GPDE is the ability to create X/Y scatter charts that can be used to visualize geographic correlation between two variables.
For this example, we will plot GDP per capita against mortality rate, under-five, which is the number of children per 1,000 children who do not make it to their fifth birthday.
- From the GPDE home screen, select World Development Indicators and Explore the data
- Select an indicator of interest in the left panel. Note that some indicators have only limited years of data and/or data for only a small number of countries and you may need to dig around a bit to find an indicator with enough information to be useful
- Select the bubble chart icon at the right top of the screen
- Find the indicator you want to compare to, click on it and select Y axis
- Select Compare by and Country to get bubbles for all countries
- If the bubbles tend to bunch up on an axis, you might change the axis to a Log scale from the settings. This bunching is common when using absolute values with countries, since there are many small countries but only a handful of very large ones, so the values are spread unevenly across the axis
- Select Region and Color by this to add some geographic context and aesthetic zest to the graph
- When comparing by country, if you want to highlight one or more countries, select them in the country list at the bottom of the left panel
- To share a GPDE graph with someone, click the icon in the top right corner that looks like a chain link, and copy the (long) URL given in the Paste link in email or IM box
This X/Y scatter chart shows a clear negative correlation between childhood mortality and GDP per capita. The wealthier the country, the lower the chances that a child will die before their fifth birthday.
Animating the chart shows the general worldwide trend toward improving child mortality over the past half century.
A positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
For example, there is a positive geographic correlation between GDP per capita and electricity consumption per capita. Countries at a higher level of development use more electricity for things like air conditioning, home appliances, street lights, etc.
A negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
Using our prior example, there is a negative geographic correlation between GDP per capita and mortality of children under the age of five. Wealthier countries tend to have better nutrition, medical care, and social order than poorer countries. The wealthier the country, the lower the chance that a child will die before the age of five.
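The direction of a correlation can also be expressed as a number. One common measure is the Pearson correlation coefficient, which is positive when two variables rise together, negative when one falls as the other rises, and near zero when there is no linear pattern. The sketch below computes it by hand for made-up values. The five countries and their numbers are hypothetical and are not taken from the GPDE.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for five hypothetical countries (not World Bank data).
gdp_per_capita = [1, 5, 10, 30, 50]          # thousands of US dollars
electricity_use = [0.2, 1.0, 2.5, 6.0, 9.0]  # MWh per person
child_mortality = [90, 60, 35, 8, 4]         # deaths per 1,000 children

r_pos = pearson_r(gdp_per_capita, electricity_use)
r_neg = pearson_r(gdp_per_capita, child_mortality)

print(f"GDP vs electricity use: r = {r_pos:.2f}")  # positive
print(f"GDP vs child mortality: r = {r_neg:.2f}")  # negative
```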
When two variables with no correlation are plotted on the two axes of an X/Y scatter chart, the points form a diffuse cloud.
For example, there is no correlation between GDP per capita and the proportion of seats held by women in the national legislature. There are wealthy countries such as Sweden that have a high percentage of female legislators and there are poor countries like Rwanda that also have a high percentage of female legislators.
In the positive and negative correlation examples given above, the correlation between the two variables is fairly strong, as indicated by the fairly narrow band of dots forming a fairly clear line.
In some cases, the correlation is not quite as strong, meaning that there is a general upward or downward pattern on the X/Y scatter chart, but there are numerous outliers that are exceptions to the trend.
For example, the graph below shows that there is a weak correlation between GDP per capita and the adolescent fertility rate (births per 1,000 women ages 15-19, a measure of teen pregnancy). While women in wealthy countries generally wait until their 20s or 30s to have their children, women in poorer countries often begin having children much earlier in life, reflecting characteristics like the traditions of that country, education levels for women, the ability of women to control their own destiny, etc.
However, there are a number of notable exceptions, such as the Sub-Saharan country of Burundi, which is poor but has an adolescent fertility rate similar to that of the wealthy United States.
A graph of weak correlation often exhibits heteroskedasticity, where the correlation is weaker on one side of the graph than the other. The dots tend to spread out on one side of the graph more than they do on the other side.
Causation means that there is a cause (or causes) that result(s) in an effect. For example, dropping a ball causes it to fall.
While correlation can be used to help find a cause or causes for some phenomena, they are two separate things. Correlation is not causation. The following table shows distinctions between correlation and causation:
|Correlation|Causation|
|---|---|
|Correlation is empirical. A correlation is a summary of observations in the material world|Causation is rational. A cause is an idea or interpretation in our head based on observations|
|Correlation makes statements about specific things that happened in the past at a specific place or set of places|Causation involves looking for patterns in observations and offering general explanations about the present or predictions about what might happen in the future and/or at different locations under the same circumstances|
|Correlations represent regularities, things that are observed coinciding in time or in space|Causation can be expressed as a counterfactual. If the cause had not happened, the effect either would not have happened, or would have had less of a chance of happening|
Issues Differentiating Correlation From Causation
In some simple situations, the effect resulting from a cause is immediate and clear, such as dropping a ball (cause) resulting in the ball falling to the ground (effect). These are called deterministic causal relationships. The cause always determines the effect, and the relationship between cause and effect is readily observable as a correlation.
However, in most real-world situations, especially those involving people, effects have multiple causes, causes do not always result in effects, and/or effects follow causes by varying lengths of time (if at all). Causes increase the probability of effects, so these are called probabilistic causal relationships.
As a geographic example: Proximity to a former industrial site (brownfield) that is contaminated with carcinogenic polycyclic aromatic hydrocarbons correlates with increased rates of cancer for people who live close to that site. However, there are people who live close to that site who do not get cancer, and there are people who do not live close to any source of that carcinogen who will get that same type of cancer. In addition, carcinogens often take years to show their negative effects, and different genetic characteristics make some people more vulnerable than others to the effects of that carcinogen.
Techniques more sophisticated than simple correlation are needed for clearly identifying probabilistic causes and separating all the myriad factors that could also be causes. For example, the Neyman-Rubin causal model involves finding pairs of people that are identical by all considered characteristics other than the characteristic that is suspected of being a cause, and observing them over time to see if the characteristic of interest affects the probability of the studied effect.
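The pairing idea described above can be sketched in a few lines. This is a simplified illustration of matching on made-up records, not a full implementation of the Neyman-Rubin model. The records, the two matching characteristics (age group and smoking), and the function name `matched_pair_differences` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (age_group, smoker, lives_near_site, developed_cancer)
people = [
    ("20-39", False, True, 0), ("20-39", False, False, 0),
    ("20-39", True, True, 1), ("20-39", True, False, 0),
    ("40-59", False, True, 0), ("40-59", False, False, 0),
    ("40-59", True, True, 1), ("40-59", True, False, 1),
    ("60+", False, True, 1), ("60+", False, False, 0),
    ("60+", True, True, 1), ("60+", True, False, 1),
    ("60+", True, True, 0),  # no unexposed partner left, so this one is dropped
]

def matched_pair_differences(records):
    """Pair each exposed person with an unexposed person who shares every
    other recorded characteristic, and return the outcome differences."""
    exposed = defaultdict(list)
    unexposed = defaultdict(list)
    for age_group, smoker, near_site, cancer in records:
        key = (age_group, smoker)
        (exposed if near_site else unexposed)[key].append(cancer)
    differences = []
    for key, outcomes in exposed.items():
        # zip stops at the shorter list, so unmatched people are left out.
        for exposed_outcome, control_outcome in zip(outcomes, unexposed.get(key, [])):
            differences.append(exposed_outcome - control_outcome)
    return differences

differences = matched_pair_differences(people)
estimated_effect = mean(differences)
print(f"{len(differences)} matched pairs, average difference = {estimated_effect:.2f}")
```

A positive average difference in such a sketch suggests that, among otherwise similar people, those near the site developed cancer more often. It only accounts for the characteristics that were recorded, so unmeasured differences between people can still distort the result.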
Chains of Causation
Effects are commonly the result of chains of causation. The causes most directly associated with the effect are called proximate causes while the causes further up the chain are called ultimate causes or root causes.
Since correlations are simple comparisons of two values, correlation will not identify these chains of causation, often leading to an oversimplified understanding of what exactly is going on. For example, consider one chain of causation behind childhood mortality:
- High levels of childhood death are often caused by diarrhoea, which results in fatal levels of under-nutrition and dehydration (proximate cause)
- Childhood diarrhoea can be caused by dysentery, which is a bacterial infection
- Dysentery bacteria are transmitted through contaminated food and water
- Water contamination with bacteria-infested human waste is caused by inadequate water management
- Inadequate water management is caused by a variety of factors, such as uncontrolled urban population growth, inadequate financial resources, and ineffectual or corrupt governments that divert resources to personal projects and defense
- Corrupt government structures were often set up by colonial powers to keep the colonies subservient as a source of raw materials and inexpensive labor for those colonial powers rather than responsive to the needs of their own citizens (ultimate cause)
In cases like this, correlations may provide evidence for proximate causes, but may not identify what are often more important ultimate causes that need to be addressed to solve problematic effects further down the chain of causation.
Because causal relationships are often probabilistic and part of complex chains of causation, the absence of a correlation does not mean there is no causal relationship:
- Causes may not always result in specific effects. For example, smoking and lung disease (probabilistic causation)
- Causes may take time to manifest as an effect. For example, environmental carcinogen exposure and cancer diagnosis (temporal lag)
- Causes may result in effects that occur in different geographical areas. For example: A West Nile virus infection that was contracted from a mosquito bite in a recreational area may be diagnosed in a separate residential area (spatial lag)
Confounding: The Third Variable Problem
Two variables that correlate with each other may individually have causation rooted in a third variable. In complex systems, these indirect relationships are often hard to identify and are referred to as confounding.
For example, per capita health expenditure is strongly correlated with per capita electricity use. In this case the electricity use is not caused by health expenditures or vice versa. Both of those variables are correlated with economic development, a third variable that better reflects the causes of increased electricity use and increased health care expenditures.
Absolute vs Relative Values
When looking for correlations between characteristics in geographic areas like countries that have a wide range of sizes, you will generally want to compare relative rather than absolute values.
Comparisons of countries using absolute values will often show correlation, but that correlation is between the size of the countries rather than the phenomena being analyzed. This is a specific case of confounding.
For example, when comparing absolute values of overall GNI (gross national income, or GDP plus income from overseas activities) to population, you see a strong correlation that implies that more-populous countries are also wealthier countries.
However, that comparison does not consider how many people that wealth is distributed over. In analyzing the lives of people in a country, the overall wealth of the country is usually less important than how much of that wealth is available to individual people and families in that country.
If you compare relative values of GDP per capita with population, you see that there is no correlation. There are large countries where the residents (on average) are poor and small countries where the people (on average) are rich. The population of a country is not related to how wealthy individual people in a country are.
Asymmetry: Which is the Cause and Which is the Effect?
Correlation shows the regularity of the relationship between two variables, but does not indicate which is the cause and which is the effect. This can often be seen as the classic chicken and egg problem: which came first, the chicken or the egg?
Using the correlation between GDP per capita and childhood death as an example, does poverty cause childhood death or does childhood death cause poverty? Societies respond to high levels of childhood death by having more children to increase the chances of having some children survive, and those high birth rates place burdens on families that prevent more-active participation in economic growth. In such a case, we could say that wealth and childhood survival are mutually constitutive.
Irrelevance: Spurious Correlations
There are numerous instances where correlations between variables are clearly either accidents or have a root cause that is so indirect as to be meaningless. Indeed, Tyler Vigen has devoted a website to celebrating these statistical oddities.
The Problem of Induction
By offering an explanation in the form of a cause and effect relationship, we are making an inference that when that cause or set of causes exists, the effect will occur or has an increased probability of occurring.
But while correlation can tell us that this was true in a specific set of places in the past, that does not offer any logical proof that it has to continue to be true in the future or in a different set of locations, since we cannot directly observe all possible locations in all possible futures.
This is the problem of induction identified by the Scottish philosopher David Hume (1711 - 1776).
This is similar to the statement, you cannot prove a negative. While you can offer empirical evidence from the past that something has never happened, you cannot offer direct evidence that something will never happen in a future you haven't experienced yet.
One way around this problem is the concept of falsification proposed by the Austrian philosopher Karl Popper (1902 - 1994).
While you cannot prove a hypothesis will always be true, you only have to find one situation where a hypothesis is false to show that it is not true and needs to be modified or abandoned. If you fail to falsify a theory, you corroborate that theory. Corroboration does not prove a hypothesis is true, but multiple corroborations increase your confidence that your theory may be applicable in other times and places. This approach is the basis of the concept of the null hypothesis used in inferential statistics.
The Post Hoc Fallacy
A logical fallacy is an often plausible argument using false or invalid inference.
Assuming that correlation represents causation is the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this).
For example, Asians tend to do significantly better on the math section of the SAT than other ethnic groups. But inferring from this observed correlation that such success is caused by the genetic characteristics of Asian brains would be a post hoc fallacy because it ignores the complex set of social causes (such as wealth distribution, school quality, social norms) that result in some ethnic groups having better math skills on average than other ethnic groups.
As another example, a baseball player notices that when he doesn't shave, he gets more hits when he comes up to bat. He therefore interprets the past correlation between not shaving and hits as a causal relationship. The past correlation was likely spurious, although any future, confounded causation may be psychological, since the increasingly bearded player will come to the plate with more confidence and aggressiveness based on his mystical belief in that causal relationship.
The Post Hoc Fallacy and the Ecological Fallacy
The post hoc fallacy can follow from the ecological fallacy, where an assumption about individuals in an area is made based on information about the group in that area.
For example: an ecological fallacy would be assuming that everyone (or almost everyone) living in a high crime area is a criminal. A post hoc fallacy would be assuming that there is some inherent characteristic of people in that area that makes them prone to criminal behavior. A more nuanced explanation would involve interrelated chains of causation that consider probabilistic factors like poverty, poor preparation for limited job opportunities, negative role models, etc.
The Post Hoc Fallacy and the Exception Fallacy
Likewise, the post hoc fallacy can follow from the exception fallacy, where an assumption about a group is made based on a handful of exceptional individuals.
For example: Suppose the last two military veterans a young manager has hired have been difficult to work with. A post hoc fallacy would be to assume that something about military service causes veterans to be poor employees. However, other causes of those past problems may include poor hiring practices that do not fully evaluate how well candidates match the job, underdeveloped management skills, or simply bad luck in finding two veterans that were not suited to the jobs they were assigned.
Why Does Causation Matter?
In contemporary life we are constantly bombarded with numerous streams of narratives and contradictory information. Sorting out these narratives and recognizing fallacies is easier when you have a clearer understanding of what causation is and how it can be evaluated.
In many areas of life like management and law, determining responsibility for actions and activities is important to hold members of an organization or society accountable, and promote justice.
Forecasting and Planning
Understanding the causes of the effects that impact our lives can help us anticipate what those effects could be in the future. This can aid us both collectively and individually in planning for possible futures.
Understanding causation can often allow us to intervene and prevent undesirable effects or promote desirable effects. These interventions can be at the individual level, up to the level of public policy. For example, in health, understanding the etiology of a disease is essential for determining how that disease can be prevented or mitigated.