Geographic Correlation Analysis With The Google Public Data Explorer
Correlation is the strength of the relationship between two variables. Measuring correlation is one of the most common and useful tools in statistics, and it can be used with "what is where" data to explore "why is it there."
X/Y Scatter Charts With the Google Public Data Explorer
One simple and effective way to visualize the presence or absence of correlation between two variables is with an X/Y scatter chart that places one variable on the x-axis and the other variable on the y-axis.
The Google Public Data Explorer (GPDE) web app provides the capability for creating interactive visualizations of data from a variety of public sources. These examples use country-level indicator data collected by the World Bank.
One powerful feature of the GPDE is the ability to create X/Y scatter charts that can be used to visualize geographic correlation between two variables.
For this example, we will plot GDP per capita against the under-five mortality rate, which is the number of children per 1,000 live births who do not make it to their fifth birthday.
- From the GPDE home screen, select World Development Indicators and Explore the data
- Select an indicator of interest in the left panel. Note that some indicators have only limited years of data and/or data for only a small number of countries and you may need to dig around a bit to find an indicator with enough information to be useful
- Select the bubble chart icon at the right top of the screen
- Find the indicator you want to compare to, click on it and select Y axis
- Select Compare by and Country to get bubbles for all countries
- If the bubbles bunch up along an axis, try changing that axis to a Log scale in the settings. Bunching is common when plotting absolute values by country, because there are many small countries and only a handful of very large ones, so the values span several orders of magnitude
- Select Region and Color by this to add some geographic context and aesthetic zest to the graph
- When comparing by country, if you want to highlight one or more countries, select them in the country list at the bottom of the left panel
- To share a GPDE graph with someone, click the icon in the top right corner that looks like a chain link, and copy the (long) URL from the "Paste link in email or IM" box
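The log-scale step above can be illustrated outside the GPDE with a minimal Python sketch. The population figures are made-up values chosen to span several orders of magnitude (they are not World Bank data), and `axis_positions` is a hypothetical helper that mimics how a chart axis places values between its minimum and maximum.

```python
import math

# Hypothetical country populations (made-up values, not World Bank data).
populations = [50_000, 300_000, 2_000_000, 10_000_000, 80_000_000, 1_400_000_000]


def axis_positions(values):
    """Scale values to the 0..1 range, as a chart axis from min to max would."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


linear_positions = axis_positions(populations)
log_positions = axis_positions([math.log10(v) for v in populations])

# Count the points that land in the first 10% of each axis.
bunched_linear = sum(1 for p in linear_positions if p < 0.1)
bunched_log = sum(1 for p in log_positions if p < 0.1)

print(f"linear axis: {bunched_linear} of {len(populations)} points in the first 10%")
print(f"log axis:    {bunched_log} of {len(populations)} points in the first 10%")
```

With these illustrative values, five of the six points crowd into the first tenth of the linear axis, while only one does on the log axis.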
This X/Y scatter chart shows a clear negative correlation between childhood mortality and GDP per capita. The wealthier the country, the lower the chances that a child will die before their fifth birthday.
Animating the chart shows the general worldwide trend toward lower child mortality over the past half century.
Positive vs Negative Correlation
A positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
For example, there is a positive geographic correlation between GDP per capita and electricity consumption per capita. Countries at a higher level of development use more electricity for things like air conditioning, home appliances, street lights, etc.
A negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
Using our prior example, there is a negative geographic correlation between GDP per capita and mortality of children under the age of five. Wealthier countries tend to have better nutrition, medical care, and social order than poorer countries. The wealthier the country, the lower the chance that a child will die before the age of five.
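The direction of a correlation can also be summarized as a single number. The sketch below computes the Pearson correlation coefficient, one common measure (the GPDE itself does not report it), which is positive when two variables rise together and negative when one falls as the other rises. All values are made up for illustration and are not World Bank figures; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values for six hypothetical countries.
gdp_per_capita = [1, 2, 5, 10, 20, 40]           # thousands of dollars
electricity_use = [0.2, 0.5, 1.5, 3.0, 6.5, 11]  # MWh per person
child_mortality = [90, 70, 40, 20, 8, 4]         # deaths per 1,000

r_electricity = pearson_r(gdp_per_capita, electricity_use)
r_mortality = pearson_r(gdp_per_capita, child_mortality)

print(f"GDP vs electricity use: r = {r_electricity:+.2f}")  # positive sign
print(f"GDP vs child mortality: r = {r_mortality:+.2f}")    # negative sign
```

Pearson's r measures how close the points are to a straight line, so a relationship that follows a curve may be better captured after transforming an axis (for example with a log scale) or with a rank-based measure.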
Strong vs Weak vs No Correlation
When the relationship between two variables is extremely strong, the dots on an X/Y scatter chart will cluster tightly into a fairly clear line. An example of strong correlation is the aforementioned positive geographic correlation between GDP per capita and electricity consumption per capita.
In some cases, the correlation is not quite as strong, meaning that there is a general upward or downward pattern on the X/Y scatter chart, but there are numerous outliers that are exceptions to the trend.
For example, the graph below shows that there is a weak negative correlation between GDP per capita and the adolescent fertility rate (births per 1,000 women ages 15-19, a measure of teen pregnancy). While women in wealthy countries generally wait until their 20s or 30s to have their children, women in poorer countries often begin having children much earlier in life, reflecting characteristics like the traditions of that country, education levels for women, the ability of women to control their own destiny, etc.
However, there are a number of notable exceptions, such as the Sub-Saharan African country of Burundi, which is poor but has an adolescent fertility rate similar to that of the wealthy United States.
A graph of weak correlation often exhibits heteroskedasticity, where the scatter of the dots around the overall trend is wider on one side of the graph than on the other.
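A rough way to check for this uneven spread is to fit a straight line and compare how far the points sit from it on each half of the chart. The sketch below does this with an ordinary least-squares fit. The data are synthetic, constructed so that the scatter grows with x, and `fit_line` is a hypothetical helper, so treat this as an illustration rather than a formal statistical test.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Synthetic data: y alternates 10% above and below x, so the absolute
# scatter grows as x grows.
xs = list(range(1, 13))
ys = [x * (1.1 if i % 2 == 0 else 0.9) for i, x in enumerate(xs)]

slope, intercept = fit_line(xs, ys)
residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]

# Average distance from the fitted line on each half of the x-axis.
half = len(xs) // 2
spread_left = sum(residuals[:half]) / half
spread_right = sum(residuals[half:]) / (len(xs) - half)

print(f"mean distance from trend, left half:  {spread_left:.2f}")
print(f"mean distance from trend, right half: {spread_right:.2f}")
```

A clearly larger spread on one half, as in this constructed example, is the pattern that the term heteroskedasticity describes.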
When two variables with no correlation are plotted on the two axes of an X/Y scatter chart, the points form a diffuse cloud.
For example, there is no correlation between GDP per capita and the proportion of seats held by women in the national legislature. There are wealthy countries such as Sweden that have a high percentage of female legislators and there are poor countries like Rwanda that also have a high percentage of female legislators.
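The strong, weak, and no-correlation cases can be told apart numerically by the magnitude of the Pearson correlation coefficient. In the sketch below, the three datasets are made up to mimic a tight line, a trend with outliers, and a diffuse cloud. The cutoffs of 0.7 and 0.3 in the hypothetical `describe_strength` helper are an assumed rule of thumb, not a standard, and different fields draw these lines differently.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe_strength(r, strong=0.7, weak=0.3):
    """Label a coefficient using assumed rule-of-thumb cutoffs."""
    if abs(r) >= strong:
        return "strong"
    if abs(r) >= weak:
        return "weak"
    return "none"


x = list(range(1, 11))
# Made-up y values for each pattern.
tight_line = [2 * v + d for v, d in
              zip(x, [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.3, 0.1, 0.0])]
trend_with_outliers = [3, 9, 2, 8, 4, 12, 5, 6, 14, 7]
diffuse_cloud = [5, 1, 9, 4, 8, 8, 4, 9, 1, 5]

labels = {}
for name, y in [("tight line", tight_line),
                ("trend with outliers", trend_with_outliers),
                ("diffuse cloud", diffuse_cloud)]:
    r = pearson_r(x, y)
    labels[name] = describe_strength(r)
    print(f"{name:>20}: r = {r:+.2f} ({labels[name]})")
```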
Correlation vs. Causation
While correlation is interesting to people who work with data, we are usually more interested in causation, which refers to a cause-and-effect relationship between things. Once you understand a cause (or causes), you can change the effects by changing the causes.
For example, if you want to reduce rates of breast cancer and find that lifestyle choices (cause) seem to be behind those higher rates of breast cancer (effect) in a specific area, you can target that area with public health initiatives to increase awareness, change lifestyles, and reduce cancer rates.
While correlation can be used to help find a cause or causes for some phenomena, correlation is a mathematical relationship that may or may not reflect a simple cause-and-effect relationship, hence the common adage "correlation is not causation." Causes are almost always complex and require complex analysis to define and interpret.
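One way a correlation can arise without a direct cause-and-effect link is when a third factor drives both variables. The simulation below is a minimal sketch of that situation under stated assumptions: `x` and `y` never influence each other, both depend on a shared `confounder`, and all values are randomly generated rather than real data. It then uses the standard first-order partial correlation formula to show that the association largely disappears once the shared factor is accounted for.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# The shared factor drives both x and y; x and y never affect each other.
confounder = [rng.gauss(0, 1) for _ in range(n)]
x = [z + rng.gauss(0, 1) for z in confounder]
y = [z + rng.gauss(0, 1) for z in confounder]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, confounder)
r_yz = pearson_r(y, confounder)

# Partial correlation of x and y, holding the confounder fixed.
partial_xy = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"correlation of x and y:               {r_xy:+.2f}")
print(f"after accounting for the confounder:  {partial_xy:+.2f}")
```

In real data the shared factor is rarely known or measured this cleanly, which is part of why establishing causes requires more careful analysis than a scatter chart alone.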
Treating simple correlation as if it proves causation is an example of the cum hoc ergo propter hoc fallacy ("with this, therefore because of this"), a close relative of the post hoc fallacy. This fallacy is dangerous because it can lead to falsely assigning credit (or blame) for a positive (or negative) effect, and to unnecessary, misleading, or dangerous actions that do little to achieve the desired outcomes and might even be counterproductive.
For example, assuming that a correlation between ethnicity and crime rates indicates that specific ethnic groups are genetically prone to criminal activity can lead to scapegoating that makes the situation worse by exacerbating the underlying causes of such associations, such as poverty, social exclusion, and the economic legacies of past discrimination.