# Geographic Correlation Analysis With The Google Public Data Explorer

Correlation is the strength of the relationship between two variables.
The measurement of correlation is one of the most common and useful
tools in statistics and can be used with *what is where* data to
explore *why is it there*.

## X/Y Scatter Charts With the Google Public Data Explorer

One simple and effective way to visualize the presence or absence
of correlation between two variables is with an **X/Y scatter chart** that places one
variable on the x-axis and the other variable on the y-axis.

The Google Public Data Explorer (GPDE) web app lets you create interactive visualizations of data from a variety of public sources. These examples use country-level indicator data collected by the World Bank.

One powerful feature of the GPDE is the ability to create X/Y scatter charts that can be used to visualize geographic correlation between two variables.

For this example, we will plot GDP per capita against *mortality rate, under-five*, which is the number of children per 1,000 live births who die before their fifth birthday.

- From the GPDE home screen, select *World Development Indicators* and *Explore the data*
- Select an indicator of interest in the left panel. Note that some indicators have only limited years of data and/or data for only a small number of countries, and you may need to dig around a bit to find an indicator with enough information to be useful
- Select the **bubble chart** icon at the top right of the screen
- Find the indicator you want to compare to, click on it and select **Y axis**
- Select *Compare by* and *Country* to get bubbles for all countries
- If the bubbles tend to bunch up on an axis, you might change the axis to a **Log** scale from the settings. This bunching is common when using absolute values with countries since there are a large number of small countries but a handful of very large countries, and size is distributed unevenly across that range of sizes
- Select *Region* and *Color by this* to add some geographic context and aesthetic zest to the graph
- When comparing by country, if you want to highlight one or more countries, select them in the country list at the bottom of the left panel
- To share a GPDE graph with someone, click the icon in the top right corner that looks like a chain link, and copy the (long) URL given in the **Paste link in email or IM** box
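
The effect of the **Log** scale step can be sketched outside the GPDE. The short Python example below is only an illustration: the values are made up (they are not World Bank data), and the helper names `axis_positions` and `share_in_bottom_tenth` are hypothetical. It compares how many points are squeezed into the lowest tenth of an axis on a linear scale versus a log scale.

```python
import math

# Made-up, GDP-like values for illustration only (not World Bank data):
# many small economies and a couple of very large ones.
values = [2, 6, 8, 12, 20, 45, 300, 4000, 20000]

def axis_positions(data, log_scale=False):
    """Map each value to a 0..1 position along a chart axis."""
    scaled = [math.log10(v) for v in data] if log_scale else list(data)
    lo, hi = min(scaled), max(scaled)
    return [(s - lo) / (hi - lo) for s in scaled]

def share_in_bottom_tenth(positions):
    """Fraction of points that land in the lowest 10% of the axis."""
    return sum(p <= 0.1 for p in positions) / len(positions)

linear = axis_positions(values)
logged = axis_positions(values, log_scale=True)

# On the linear axis most points bunch up near zero; the log axis spreads them out.
print(share_in_bottom_tenth(linear))
print(share_in_bottom_tenth(logged))
```

With these invented values, most of the points sit in the bottom tenth of the linear axis, while the log axis separates them.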

This X/Y scatter chart shows a clear negative correlation between childhood mortality and GDP per capita. The wealthier the country, the lower the chances that a child will die before their fifth birthday.

Animating the chart shows the general worldwide trend toward improving child mortality over the past half century.

## Positive vs Negative Correlation

### Positive Correlation

A **positive correlation** means that as one variable goes up, the other
goes up as well. When two variables with a positive correlation are plotted on
the two axes of an X/Y scatter chart, the points form a rough line or curve
upward from left to right.

For example, there is a positive geographic correlation between GDP per capita and electricity consumption per capita. Countries at a higher level of development use more electricity for things like air conditioning, home appliances, street lights, etc.

### Negative Correlation

A **negative correlation** means that as one variable goes up, the other
goes down. When two variables with a negative correlation are plotted on the
two axes of an X/Y scatter chart, the points form a rough line or curve
downward from left to right.

Using our prior example, there is a negative geographic correlation between GDP per capita and mortality of children under the age of five. Wealthier countries tend to have better nutrition, medical care, and social order than poorer countries. The wealthier the country, the lower the chance that a child will die before the age of five.
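
The direction of a correlation can also be expressed as a number. One common choice, which is an assumption here since the GPDE charts themselves do not report it, is Pearson's correlation coefficient *r*: it is positive when the points slope upward and negative when they slope downward. The sketch below uses invented values for five hypothetical countries, not real indicator data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented numbers for five hypothetical countries (not real indicator values).
gdp_per_capita = [1, 2, 4, 8, 16]            # thousands of dollars
electricity_use = [0.3, 0.7, 1.5, 3.1, 6.0]  # rises as GDP rises
child_mortality = [90, 60, 35, 15, 5]        # falls as GDP rises

r_positive = pearson_r(gdp_per_capita, electricity_use)
r_negative = pearson_r(gdp_per_capita, child_mortality)

print(r_positive > 0, r_negative < 0)  # prints: True True
```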

## Strong vs Weak vs No Correlation

### Strong Correlation

When the relationship between two variables is strong, the
dots on an X/Y scatter chart cluster tightly along a
fairly clear line. An example of **strong correlation** is the
aforementioned positive geographic correlation between GDP per
capita and electricity consumption per capita.

### Weak Correlation

In some cases, the correlation is not quite as strong, meaning
that there is a general upward or downward pattern on the X/Y scatter
chart, but there are numerous **outliers** that are exceptions
to the trend.

For example, there is a weak negative correlation between GDP per capita and the adolescent fertility rate (births per 1,000 women ages 15-19, a measure of teen pregnancy). While women in wealthy countries generally wait until their 20s or 30s to have children, women in poorer countries often begin having children much earlier in life, reflecting characteristics like the traditions of that country, education levels for women, and the ability of women to control their own destiny.

However, there are a number of notable exceptions, such as the Sub-Saharan country of Burundi, which is poor but has an adolescent fertility rate similar to that of the wealthy United States.

A graph of weak correlation often exhibits **heteroskedasticity** where the
correlation is weaker on one side of the graph than the other. The dots tend to
spread out on one side of the graph more than they do on the other side.
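
One rough way to check for this uneven spread is to fit a straight line through the points and compare how far the points stray from it on each side of the chart. The sketch below is a simplified illustration rather than a formal statistical test; the function name `residual_spread_by_half` is hypothetical and the points are invented.

```python
import statistics

def residual_spread_by_half(xs, ys):
    """Fit a least-squares line, then return the standard deviation of the
    residuals for the lower-x half and the upper-x half of the points."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Sort by x so the first half holds the smaller x values.
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    mid = len(residuals) // 2
    return statistics.pstdev(residuals[:mid]), statistics.pstdev(residuals[mid:])

# Invented points: tight on the left, widely scattered on the right.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.0, 4.1, 3.0, 8.5, 4.5, 10.5]

left, right = residual_spread_by_half(xs, ys)
print(left < right)  # prints: True
```

A much larger spread on one side than the other is the pattern described above.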

### No Correlation

When two variables with **no correlation** are plotted on the two axes of
an X/Y scatter chart, the points form a diffuse cloud.

For example, there is no correlation between GDP per capita and the proportion of seats held by women in the national legislature. There are wealthy countries such as Sweden that have a high percentage of female legislators and there are poor countries like Rwanda that also have a high percentage of female legislators.
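
The strong, weak, and no-correlation cases can be told apart numerically by the size of a correlation coefficient. The sketch below assumes Pearson's *r* as the measure and uses cutoffs of 0.7 and 0.3 that are purely illustrative, not a standard; the data are invented and `describe_strength` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread

def describe_strength(r, strong=0.7, weak=0.3):
    """Label the size of r using illustrative cutoffs (assumptions, not a standard)."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= weak:
        return "weak"
    return "none"

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
tight_line = [2, 4, 6, 8, 10, 12]  # points fall on a line
loose_trend = [3, 1, 4, 2, 6, 3]   # upward drift with exceptions
cloud = [4, 1, 5, 5, 1, 4]         # no visible pattern

for name, ys in [("tight_line", tight_line),
                 ("loose_trend", loose_trend),
                 ("cloud", cloud)]:
    print(name, describe_strength(pearson_r(xs, ys)))
```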

## Correlation vs. Causation

While correlation is interesting in its own right, we are usually more interested in causation, which refers to a cause-and-effect relationship between things. Once you understand a cause (or causes), you can change the effects by changing the causes.

For example, if you want to reduce rates of breast cancer and find that lifestyle choices (cause) seem to be behind those higher rates of breast cancer (effect) in a specific area, you can target that area with public health initiatives to increase awareness, change lifestyles, and reduce cancer rates.

While correlation can be used to help find a cause or causes for some
phenomena, correlation is a mathematical relationship that may or may not
reflect a simple cause-and-effect relationship, resulting in the common adage
**correlation is not causation**. Causes are almost always complex and
require complex analysis to define and interpret.
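
A small simulation can show how two variables end up correlated without either one causing the other. In the sketch below, which uses randomly generated numbers rather than real data, a hidden third factor drives both measurements. The variable names and the noise levels are assumptions chosen only for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread

rng = random.Random(42)  # fixed seed so the run is repeatable

# A hidden factor drives both measurements; neither one influences the other.
hidden = [rng.gauss(0, 1) for _ in range(2000)]
a = [h + rng.gauss(0, 0.5) for h in hidden]
b = [h + rng.gauss(0, 0.5) for h in hidden]

# The two measurements are clearly correlated...
r_ab = pearson_r(a, b)

# ...but once the hidden factor is subtracted out, the relationship vanishes.
a_rest = [x - h for x, h in zip(a, hidden)]
b_rest = [y - h for y, h in zip(b, hidden)]
r_rest = pearson_r(a_rest, b_rest)

print(round(r_ab, 2), round(r_rest, 2))
```

Changing `a` directly would do nothing to `b` here, even though the two are strongly correlated.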

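Treating simple correlation as if it proves causation is the
*cum hoc ergo propter hoc* fallacy, a close relative of the *post hoc* fallacy,
which assumes that because one event followed another, the first caused the second.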
This fallacy is dangerous because it can lead to falsely assigning
credit (or blame) for a positive (or negative) effect and lead to
unnecessary, misleading, or dangerous actions that will do little to achieve the
desired outcomes, and might even be counterproductive.

For example, assuming that a correlation between ethnicity and crime rates means specific ethnic groups are genetically prone to criminal activity can lead to scapegoating. That scapegoating makes the situation worse by exacerbating the underlying causes of such associations, such as poverty, social exclusion, and the economic legacies of past discrimination.