Geospatial Data Accuracy and Lying With Maps
Maps are charismatic and imply absolute, objective truth. However, when you critically evaluate how maps are created, you can see the subjectivities that are embedded in almost every element of the map's design, as well as the underlying data itself.
This tutorial addresses geospatial data issues associated with accuracy, reliability, and deception.
All data is a simplified representation (a model) of the complex real world. Accuracy is the extent to which that model conforms to the real world.
Kimerling et al (2016, 278) note five types of accuracy associated with geospatial data:
- Positional Accuracy: How close the coordinates are to actual features or areas on the ground. For example, coordinates captured by inexpensive GPS receivers in cellphones can often deviate significantly from the actual locations where they were captured
- Attribute Accuracy: How closely the attribute variables represent actual characteristics on the ground. For example: sampled species population data based on sightings may differ significantly from actual counts on the ground
- Conceptual Accuracy: How well the data representation and visualization techniques reflect the real world. For example, if missing values are not clearly marked on a map, viewers might make erroneous assumptions about those areas based on values in surrounding areas
- Logical Consistency: How consistent the data representation is across the data set. For example: If different line types are used for the same type of road, or the same line type is used for different types of roads, the viewer may not be able to clearly interpret the map
- Temporal Accuracy: How well the data reflects conditions on the ground at a particular point in time
  - Temporal stability is how long characteristics on the ground remain unchanged. For example, buildings usually have high temporal stability while living organisms (such as migratory birds) have very little
  - Currency is how up-to-date the data is. Depending on the temporal stability of conditions on the ground, data captured at a prior time may not reflect current conditions, and should be presented as historical rather than current
  - Mapping Period is the period of time over which the data was gathered. In many situations (such as surveys of people), significant time is needed to capture the full data set. If temporal stability is low and conditions are constantly changing during the mapping period, the resulting data will have inaccuracies
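Positional accuracy, the first type listed above, can be measured when a trusted reference location is available. The sketch below is one possible approach, not a standard procedure: it uses the haversine formula on a spherical Earth to estimate how far a GPS reading falls from a reference point. The coordinates and the function name `haversine_m` are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical values: a surveyed reference point and a phone GPS reading of it.
reference = (40.000000, -83.000000)
phone_fix = (40.000090, -83.000000)

error_m = haversine_m(*reference, *phone_fix)
print(f"Positional error: {error_m:.1f} m")
```

A difference of 0.00009 degrees of latitude works out to roughly 10 meters on the ground, which is enough to place a point on the wrong side of a street.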
Similarly, there are a number of sources of inaccuracy:
- Factual inaccuracies: Mistakes in capture or visualization. Examples: missing features, mislabeled features
- Data source inadequacies: Use of inappropriate data for a specified purpose. Examples: Out-of-date data, incomplete data
- Processing artifacts: Errors introduced when the data is digitized or migrated from one form to another. Examples: Information lost when numbers are rounded in spreadsheets
- Natural variation: Variability in characteristics that is not represented during the capture or presentation of data. Example: categorizing states as red or blue based on the party that won the electoral college, which hides the mix of votes within each state
- Trap fictions: Non-existent features that are intentionally added by cartographers to make it possible to detect subsequent plagiarism of that data. Example: Non-existent "trap" streets that are placed on city maps by cartographers
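The rounding example given for processing artifacts applies directly to coordinates. The sketch below estimates how far a point moves when its latitude and longitude are rounded, as can happen when data passes through a spreadsheet. It assumes about 111,320 meters per degree of latitude, which is only an approximation, and the point and the function name `rounding_shift_m` are hypothetical.

```python
import math

METERS_PER_DEGREE = 111_320  # approximate length of one degree of latitude

def rounding_shift_m(lat, lon, decimals):
    """Approximate distance in meters that a point moves when its
    coordinates are rounded to the given number of decimal places."""
    dlat = round(lat, decimals) - lat
    dlon = round(lon, decimals) - lon
    north = dlat * METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east = dlon * METERS_PER_DEGREE * math.cos(math.radians(lat))
    return math.hypot(north, east)

lat, lon = 39.961234, -82.998765  # hypothetical point
for decimals in (5, 3, 1):
    shift = rounding_shift_m(lat, lon, decimals)
    print(f"{decimals} decimal places: moved about {shift:.1f} m")
```

Five decimal places keeps the point within about a meter, while one decimal place can move it several kilometers.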
Inaccuracies have a variety of causes, including:
- Inadequate time
- Inadequate funding
- Ideological subjectivity and bias
- Inadequate knowledge, skill or experience
Certainty is how well you understand the accuracy of your data. Being uncertain about the accuracy of your data limits the confidence with which you can present that data. Likewise, mistaken certainty can lead to conflict when decisions made on the basis of invalid data collide with reality.
Note that certainty and accuracy are not necessarily connected. You can be certain that your data is inaccurate, and you can be uncertain whether your data is accurate.
Optimally, we strive for both accuracy and certainty, but there are few times where that is completely possible.
There are a number of approaches to dealing with uncertainty about data accuracy:
- Further investigate the source of the data for skill and conflicts of interest
- Compare to similar authoritative data, such as an older data set of known accuracy
- Look for internal logical consistency
- Use a large, redundant data set. This is the approach often used with big data, where individual data points are often uncertain, but an average of a large number of data points can help separate the signal from the noise
- Go to original source data to reconstruct derived data
- Ground truth the data by physically going to locations and verifying that a subset of the data matches reality
- Present your data with caveats, such as legend notes or error bars
- Find an alternative data source
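The large, redundant data set approach in the list above can be illustrated with a small simulation. This is a sketch under strong assumptions: every reading has independent, unbiased Gaussian error, and the true value and noise level are invented. Averaging does not help when errors share a common bias.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 100.0  # hypothetical true value on the ground
NOISE_SD = 15.0     # hypothetical spread of individual measurement errors

def mean_of_noisy_readings(n):
    """Average n simulated readings with independent, unbiased Gaussian noise."""
    readings = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n)]
    return statistics.mean(readings)

for n in (10, 1_000, 100_000):
    estimate = mean_of_noisy_readings(n)
    print(f"n={n:>7}: mean={estimate:.2f}, error={abs(estimate - TRUE_VALUE):.2f}")
```

Under these assumptions the typical error of the average shrinks in proportion to the square root of the number of readings, so a hundred times more data gives roughly ten times less error.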
Authoritative data is produced and/or supplied by a known authority, usually a government or governing body of some kind.
The implication is that either the data can be reliably trusted to be accurate, or that it represents formally-sanctioned information that should (or must) be used.
Data that is authoritative may not be either accurate or certain. For example, official economic statistics released by totalitarian regimes are often surrounded by uncertainty, and assumed to be inaccurate in a way that makes the regime look more effective than it actually is.
Sampling and Margin of Error
Because it is often impossible or impractical to gather complete information about some phenomenon (such as soil types or voting preferences), random samples are commonly used with inferential statistics to estimate values for the phenomenon as a whole.
Because a sample is rarely an exact miniature of the whole, values calculated from a sample may differ from the true values. Statistical calculations based on the sampled values and the number of people sampled are used to find a range of values that we can be 95% confident contains the actual value for the full population. The distance from the estimate to either end of that range is called the margin of error, and it expresses the amount of uncertainty associated with values created from sampled data.
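As a rough illustration of that calculation, the sketch below computes a 95% margin of error for a proportion. It assumes a simple random sample and the normal approximation (z = 1.96), which real surveys with complex sampling designs do not strictly follow. The poll numbers and the function name `margin_of_error` are hypothetical.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion from a simple random sample.
    z=1.96 corresponds to roughly 95% confidence under a normal approximation."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 520 of 1,000 respondents favor a measure.
n = 1000
p_hat = 520 / n
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"Estimate: {p_hat:.1%} +/- {moe:.1%} (range {low:.1%} to {high:.1%})")
```

Here the margin is about three percentage points, so the range runs from just under 49% to just over 55%, and the poll alone cannot establish that a majority favors the measure.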
For example, the American Community Survey (ACS) is performed by the US Census Bureau on an ongoing basis to capture a wide variety of data about American people and households. Unlike the decennial census, which is mandated by Article I, Section 2 of the US Constitution to capture information about every person in the US, the ACS is given to randomly selected households in communities across the US.
The margin of error increases as the number of people sampled decreases, or as the number of people in the population being estimated decreases. For example, estimates of income in sparsely-populated rural counties or estimates of the number of native speakers of obscure languages can have especially high margins of error.
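The effect of sample size and of rare characteristics can be sketched with the same simple-random-sample approximation. This is not how the Census Bureau computes ACS margins of error. The shares and sample sizes are invented, and the normal approximation becomes unreliable for the smallest combinations shown.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical shares of a population with some characteristic:
# a common one (30%) and a rare one (1%).
for p_hat in (0.30, 0.01):
    for n in (10_000, 1_000, 100):
        moe = margin_of_error(p_hat, n)
        print(f"share={p_hat:.0%} n={n:>6}: +/- {moe:.2%} "
              f"({moe / p_hat:.0%} of the estimate)")
```

The absolute margin grows as the sample shrinks, and for the rare characteristic the margin becomes large relative to the estimate itself, which is why counts for small groups in small places deserve extra caution.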
The animation below demonstrates how to view the margin of error for American Community Survey data provided through American FactFinder.
Lying With Maps
Even if the data used to create a map is accurate, certain and authoritative, design decisions made by the cartographer can dramatically influence how that data is interpreted by readers. While the data represented by a map may be factually correct, cartographic choices can be manipulated to inspire interpretations that may not be consistent with the facts, or which represent a specific ideological perspective on the facts.
Below are some ways that maps can be used to lie as defined by Monmonier (2005) and Kimerling et al (2016).
- Selective truth: Maps always involve generalization and abstraction, and details must be omitted to make visualizations readable, so the choice of what to leave out shapes the story the map tells
- Spurious correlations: Maps of two different variables are commonly placed side-by-side to show how values for those two variables seem to be related (correlated) to each other across an area. However, correlation is not causation. An apparent relationship may not be meaningful, and the absence of a visible mapped relationship does not mean that none would appear if the data were analyzed in greater detail
- Cut points: Choropleths of quantitative variables are often colored with a limited number of colors, requiring some method of classification that associates a range of values with a specific color. Depending on the distribution of values in a variable, some classification methods can exaggerate differences that are not significant, or hide differences that should be considered important
- Colors: Colors carry rich symbolic meanings, although those meanings are not necessarily universal across different societies. Bold color palettes can reinforce negative narratives, as when characteristics like poverty or climate change are mapped in a color like red that is commonly associated with danger
- Projection distortion: Representing the three-dimensional world on a two-dimensional display or piece of paper requires making some objects larger or smaller relative to each other than they are in reality. Because size conveys power, projections like the Mercator that increase the size of land masses farther from the equator make developing, equatorial countries in Africa and Latin America seem smaller relative to northern countries, reinforcing old imperialist conceptions of the world
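The cut points item in the list above can be made concrete by classifying the same values two ways. The sketch below compares equal-interval and quantile classification on a small, made-up, skewed data set. The helper names are hypothetical, and GIS packages may compute their class breaks somewhat differently.

```python
import statistics

def equal_interval_breaks(values, k):
    """Upper class bounds that split the data range into k equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)] + [hi]

def quantile_breaks(values, k):
    """Upper class bounds that put roughly the same number of values in each class."""
    return statistics.quantiles(values, n=k, method="inclusive") + [max(values)]

def class_counts(values, breaks):
    """Number of values falling in each class (upper bound inclusive)."""
    counts = [0] * len(breaks)
    for v in values:
        for i, upper in enumerate(breaks):
            if v <= upper:
                counts[i] += 1
                break
    return counts

# Made-up skewed values, e.g. population density for twelve areas.
values = [2, 3, 3, 4, 5, 6, 8, 9, 12, 15, 40, 400]

for name, breaks in (("equal interval", equal_interval_breaks(values, 4)),
                     ("quantile", quantile_breaks(values, 4))):
    print(f"{name:>14}: breaks={breaks} counts={class_counts(values, breaks)}")
```

With equal intervals, eleven of the twelve areas share one color and the single outlier gets its own, hiding most of the variation. With quantiles, each color covers three areas, but areas with values of 15 and 400 look identical. Neither map is wrong, yet they tell different stories.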
Volunteered Geographic Information
Volunteered Geographic Information (VGI) is geographic data that comes from the general public. It can be passively collected, such as the location information captured from cellphone calls, or actively contributed by volunteers, such as with OpenStreetMap, a sort of Wikipedia of geographic information.
For example, while ArcGIS Online is completely controlled by ESRI, it connects a community of GIS students and professionals who make their data available to that community. The VGI in ArcGIS Online is usually just some variant of publicly-available data that you could upload yourself (if you know where to look), although occasionally there is an original dataset.
VGI can be of widely varying quality, and you should verify the source when using volunteered data that does not come from ESRI.