Visualizations of data make it possible to see patterns that indicate general trends and relationships across time or space. Identifying these patterns can then guide further exploration that helps us understand our data and, more importantly, the phenomena that the data represent.
Many of the examples in this tutorial are from the Google Public Data Explorer. R code for the simulated curves is here.
Time-series line graphs show changes in a variable over a period of time. Trends observed in time-series graphs are temporal patterns because they are related to time rather than space.
Patterns in line graphs are commonly described based on mathematical formulas for curves that roughly model their shapes.
Linear patterns form something close to a straight line.
For example, population has grown in a fairly linear fashion since 1960.
With most human or environmental phenomena, patterns of change have some irregularity. For example, there has been a nearly linear increase in the number of secure internet servers since 2001, although there was a notable spike and dip around the late 2000s financial crisis.
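One common way to check whether a series is roughly linear is to fit a straight line by ordinary least squares and see how closely the points follow it. The sketch below assumes a small made-up series (not real population or server data), and `fit_line` is a hypothetical helper name used only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative series: a steady rise of about 2 units per year
# with small irregularities, as in most real-world data.
years = [2000, 2001, 2002, 2003, 2004, 2005]
values = [10.0, 12.3, 13.8, 16.1, 18.2, 19.9]

slope, intercept = fit_line(years, values)
print("estimated change per year:", round(slope, 2))
```

The fitted slope summarizes the average change per year. Irregularities like spikes and dips show up as points that sit noticeably above or below the fitted line.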
Exponential patterns reflect a consistent growth rate. Since growth compounds over successive years, the curve slopes upward at an increasing rate.
For example, air travel worldwide has grown exponentially since 1970.
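The compounding described above can be sketched with a constant yearly growth rate. The starting value, the 5% rate, and the function name `compound` are assumptions chosen for illustration, not figures from the air travel data.

```python
def compound(start, rate, years):
    """Values for year 0..years when `start` grows by a constant fraction `rate` each year."""
    return [start * (1 + rate) ** t for t in range(years + 1)]


# Hypothetical series: 5% growth per year for 30 years.
series = compound(100.0, 0.05, 30)

# The growth *rate* is constant (each year is 1.05 times the previous one)...
ratios = [b / a for a, b in zip(series, series[1:])]

# ...but the absolute yearly increase keeps getting larger, which is why
# the curve bends upward.
increases = [b - a for a, b in zip(series, series[1:])]

print("first yearly increase:", round(increases[0], 2))
print("last yearly increase:", round(increases[-1], 2))
```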
Logarithmic curves are the mathematical inverse of exponential curves. They start with a steep upward slope, then gradually level off.
For example, the number of tractors per 100 square kilometers of arable (farmable) land increased swiftly during the green revolution of the 1960s, but has leveled off as mechanized farming has become the norm around the world in countries that can afford such high-energy techniques.
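A short sketch can make the leveling-off concrete. It uses the natural logarithm on a handful of arbitrary x values (an illustrative choice, not the tractor data) to show that each doubling of x adds the same amount to y, so the gain per unit of x keeps shrinking.

```python
import math

# Arbitrary x values that double at each step.
xs = [1, 2, 4, 8, 16, 32]
ys = [math.log(x) for x in xs]

# Each doubling of x adds the same amount to y: log(2).
gains = [b - a for a, b in zip(ys, ys[1:])]

# The same gain is spread over a wider and wider x interval,
# so the slope (gain per unit of x) falls at every step.
slopes = [g / (x2 - x1) for g, x1, x2 in zip(gains, xs, xs[1:])]

# The logarithm undoes the exponential: exp(log(x)) gives back x.
round_trip = [math.exp(y) for y in ys]

print("slopes:", [round(s, 3) for s in slopes])
```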
Sigmoid curves are commonly called S-curves because they roughly resemble the shape of the letter "S." Sigmoid curves commonly represent the life cycle of a phenomenon, starting with slow growth following the invention of a technology, exponential growth as that technology is deployed, then a logarithmic leveling off at a plateau as the technology matures or the market saturates.
For example, nuclear power saw significant growth worldwide in the 1970s and 1980s, but plateaued in the 1990s as high costs and growing public concerns made construction of new plants increasingly difficult.
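One widely used formula for an S-curve is the logistic function, sketched below. This is only one of several sigmoid formulas, and the plateau, growth rate, and midpoint values are made-up parameters for illustration, not estimates from the nuclear power data.

```python
import math


def logistic(t, plateau, rate, midpoint):
    """Logistic S-curve: near 0 early on, steepest at `midpoint`, approaching `plateau` late."""
    return plateau / (1 + math.exp(-rate * (t - midpoint)))


# Hypothetical technology adoption over 40 years, saturating at 100 units.
curve = [logistic(t, plateau=100.0, rate=0.5, midpoint=20) for t in range(41)]

# Year-over-year growth: small at first, largest around the midpoint,
# then shrinking again as the curve approaches the plateau.
steps = [b - a for a, b in zip(curve, curve[1:])]

print("start:", round(curve[0], 2))
print("midpoint:", round(curve[20], 2))
print("end:", round(curve[40], 2))
```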
Many human and environmental phenomena follow regular or irregular cycles of increases and decreases. While some of these cycles are highly predictable (such as seasonal temperature cycles), other cycles can be more erratic.
For example, growth in gross domestic product (GDP, the total amount of economic activity in a country) follows business cycles where the economy booms, then slows down, then returns to growth again.
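A highly predictable cycle, such as seasonal temperature, can be approximated with a sine wave. The sketch below uses made-up values for the mean, amplitude, and a 12-month period, and `seasonal` is a hypothetical function name. Erratic cycles such as business cycles do not repeat this neatly.

```python
import math


def seasonal(month, mean=10.0, amplitude=12.0, period=12):
    """Idealized seasonal value: a sine wave around `mean` that repeats every `period` months."""
    return mean + amplitude * math.sin(2 * math.pi * month / period)


# Three years of hypothetical monthly temperatures.
temps = [seasonal(m) for m in range(36)]

# In a perfectly regular cycle, every value repeats exactly one period later.
repeat_gaps = [abs(temps[m] - temps[m + 12]) for m in range(24)]

print("highest value:", round(max(temps), 1))
print("lowest value:", round(min(temps), 1))
```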
Thematic maps are commonly used to find patterns that indicate the presence or absence of some meaningful influence on the geospatial distribution of a phenomenon. For example, disease hot-spots where large numbers of cases are clustered together may indicate the presence of some negative environmental characteristic. On a more positive note, clusters of specific demographics of people may be used as a guide for businesses seeking to locate retail outlets targeting those demographics.
While there is an infinite variety of patterns, they can be grouped into three very broad categories: clustered, regular, and random.
A clustered pattern means that there are clear groupings of high and low values. This is referred to as spatial autocorrelation: the data correlates with itself (auto) in space (spatial).
For example, in the northeast United States, there are clear clusters of high income counties in the wealthy suburbs of New Jersey (adjacent to NYC) and Virginia/Maryland (adjacent to Washington, DC), while there are obvious clusters of low income in Appalachia (West Virginia, Eastern Kentucky).
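Spatial autocorrelation is often summarized with a statistic called Moran's I. The sketch below computes it for a small grid of values, assuming that cells sharing an edge are neighbors (rook adjacency) with equal weight. The grids are toy data, not the county income map, and `morans_i` is a hypothetical helper name.

```python
def morans_i(grid):
    """Global Moran's I for a rectangular grid, using rook adjacency
    (cells that share an edge are neighbors, each pair weighted 1)."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    # Total squared deviation from the mean.
    denom = sum((v - mean) ** 2 for v in values)
    cross = 0.0        # sum of products of deviations over neighboring pairs
    weight_total = 0   # number of (directed) neighbor pairs
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_total += 1
    return (n / weight_total) * (cross / denom)


# Toy grid with high values grouped on the left and low values on the right.
clustered = [[1, 1, 0, 0] for _ in range(4)]

# Toy checkerboard: every cell is surrounded by the opposite value.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print("clustered:", round(morans_i(clustered), 3))
print("checkerboard:", round(morans_i(checkerboard), 3))
```

Clearly positive values indicate clustering, values near zero indicate no spatial structure, and negative values indicate that neighbors tend to be dissimilar, as in a regular alternating arrangement.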
A random pattern means that there is no obvious arrangement to a distribution. For example, this is the distribution of cropland in Central Illinois. Almost all the agriculture in this part of the state is devoted to alternating crops of corn and soybeans. The choice of whether to plant corn or soybeans is determined by the history of a particular field and market conditions, so there is no broad pattern to which areas are corn and which are soybeans.
A regular pattern means that the shapes or values appear in some kind of consistent geometric relationship such as a grid or spiral. For example, counties in Iowa were laid out in the 19th century in the Public Land Survey System, which was based on numeric intervals of latitudes and longitudes rather than any existing, irregular human or environmental features on the ground. The occasional (but regular) jagged edges reflect adjustments for the curvature of the earth.
One simple and effective way to visualize the relationship between two variables is with an X/Y scatter chart that places one variable on the x-axis and the other variable on the y-axis. The extent to which change in one variable is associated with change in the other variable is called correlation. Variables with strong correlation form a fairly clear curve (usually a line).
A positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
For example, there is a positive geographic correlation between GDP per capita and electricity consumption per capita. Countries at a higher level of development use more electricity for things like air conditioning, home appliances, street lights, etc.
A negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
Using our prior example, there is a negative geographic correlation between GDP per capita and mortality of children under the age of five. Wealthier countries tend to have better nutrition, medical care, and social order than poorer countries. The wealthier the country, the lower the chance that a child will die before the age of five.
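The direction and strength of a straight-line relationship are commonly summarized with the Pearson correlation coefficient, which ranges from -1 (perfect negative) through 0 (none) to +1 (perfect positive). The sketch below uses small made-up data sets rather than the GDP figures, and `pearson_r` is a hypothetical helper name. Note that this coefficient only measures how closely the points follow a line, not a more general curve.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y_rising = [2.1, 3.9, 6.2, 7.8, 10.1]    # goes up as x goes up
y_falling = [9.8, 8.1, 6.0, 4.2, 1.9]    # goes down as x goes up

r_positive = pearson_r(x, y_rising)
r_negative = pearson_r(x, y_falling)

print("rising:", round(r_positive, 3))
print("falling:", round(r_negative, 3))
```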
In the positive and negative correlation examples given above, the correlation between the two variables is strong, as indicated by the narrow band of dots forming a clear line.
In some cases, the correlation is not quite as strong, meaning that there is a general upward or downward pattern on the X/Y scatter chart, but there are numerous outliers that are exceptions to the trend.
For example, the graph below shows that there is a weak correlation between GDP per capita and the adolescent fertility rate (births per 1,000 women ages 15-19, a measure of teen pregnancy). While women in wealthy countries generally wait until their 20s or 30s to have their children, women in poorer countries often begin having children much earlier in life, reflecting characteristics like the traditions of that country, education levels for women, the ability of women to control their own destiny, etc.
However, there are a number of notable exceptions, such as the Sub-Saharan country of Burundi, which is poor but has an adolescent fertility rate similar to that of the wealthy United States.
A graph of weak correlation often exhibits heteroskedasticity, where the dots spread out more widely around the trend on one side of the graph than on the other, so the relationship is looser at that end of the range.
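A rough way to check for this uneven spread is to fit a line, then compare how far the points scatter around it on the left half of the graph versus the right half. The sketch below is an informal comparison rather than a formal statistical test. The data are constructed so that the scatter grows with x, and the names `fit_line`, `spread_low`, and `spread_high` are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Constructed data: y follows 2 * x, with an alternating up/down deviation
# whose size is 10% of x, so the scatter widens toward the right.
xs = list(range(1, 41))
ys = [2 * x + (1 if x % 2 == 0 else -1) * 0.1 * x for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Compare the spread of residuals in the left and right halves of the x range.
half = len(xs) // 2
spread_low = statistics.pstdev(residuals[:half])
spread_high = statistics.pstdev(residuals[half:])

print("spread, left half:", round(spread_low, 2))
print("spread, right half:", round(spread_high, 2))
```

A right-half spread that is much larger than the left-half spread suggests that the dots fan out as x increases.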
When two variables with no correlation are plotted on the two axes of an X/Y scatter chart, the points form a diffuse cloud.
For example, there is no correlation between GDP per capita and the proportion of seats held by women in the national legislature. There are wealthy countries such as Sweden that have a high percentage of female legislators and there are poor countries like Rwanda that also have a high percentage of female legislators.