Geospatial Data

Geospatial data indicates what is where. One of the special things about geospatial data is that it comes in those two parts: what and where.

Attribute (what) data and location (where) data are fundamentally different things and have different characteristics.

A distinction can be drawn between data, information and knowledge with data being raw facts or numbers and information being the interpretation of data that forms the basis of knowledge. However, the three terms are often used interchangeably and interpreted differently by different people, so you should use caution and consider context when you hear or use these words.

Location Data

Latitude and Longitude

The Earth is a spheroid: it is round like a ball or sphere but flattened slightly by the centrifugal force of its rotation. The Earth is about 24,900 miles in circumference and around 27 miles wider than it is tall.

To specify locations on the surface of the earth, we use angles that describe where those locations are relative to the center of the earth. Angles across the surface of the earth are measured in degrees, which are subdivisions of circles where one degree represents 1/360th of a circle.

While we think of and experience distances on the planet in terms of length (feet, miles, kilometers) rather than angles (degrees), the earth is lumpy and it is difficult to reliably and consistently measure distance and location across undulating terrain. Even if the Earth's surface were perfectly smooth, manipulating distances across a three-dimensional surface requires fiendishly complex calculations, so specifying locations in degrees is simpler and more accurate than specifying locations in distances.

Since the earth is a three-dimensional object, two angles are required to specify locations on the surface of the earth.

Latitude is the angle that tells you how far you are north or south of the equator. The equator is zero degrees latitude and, commonly, negative numbers are south, positive numbers are north. The range is from negative 90 degrees at the South Pole to positive 90 degrees at the North Pole. Lines on maps are drawn east/west around the planet to show latitude.

Latitude (Blue Marble from NASA)

Longitude is the angle that tells you how far you are east or west from the prime meridian in Greenwich, England (longitude 0.0). Think LONG (itude) stretching across your body. Negative numbers up to negative 180 are west of England and positive numbers up to positive 180 are east of England. Lines on maps are drawn north/south between the poles to show longitude.

Longitude (Blue Marble from NASA)

The Prime Meridian and the International Date Line

Latitude is based on specific physical characteristics of the earth. Zero degrees is set at the equator, the imaginary line around the center of the spinning earth. The axis of the Earth's rotation also happens to point at Polaris (the North Star), a very bright star in the northern sky. It is possible to find your latitude in the northern hemisphere by measuring the angle of Polaris above the horizon using a device like a sextant or astrolabe.

Finding your latitude using Polaris (Wikipedia 2006)

Longitude is more difficult to find than latitude since there are no clear geological or celestial features to define where zero degrees longitude should be located.

Early naval navigators had to carry clocks on their ships to keep track of the time at a location of known longitude. When the sun was directly overhead (12:00 noon), the navigators could tell how many degrees east or west they were of that known longitude by multiplying the number of hours difference between their local 12:00 noon and the 12:00 noon at that known location by 15 degrees (360 degrees rotation of the planet divided by 24 hours in one rotation). Longitude has an interesting history.
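The clock arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sign convention (negative for west, matching the decimal convention used later in this tutorial) are assumptions, and the corrections that real celestial navigation requires are ignored.

```python
DEGREES_PER_HOUR = 360 / 24  # the earth turns 15 degrees per hour

def longitude_from_noon(reference_time_at_local_noon, reference_longitude=0.0):
    """Estimate longitude from the reference clock reading at local noon.

    reference_time_at_local_noon: hours (0-24) shown on the clock that keeps
    time at the known location, read when the sun is highest locally.
    """
    # If the reference clock already reads past 12:00, local noon came later
    # than at the reference location, so the ship is to the west (negative).
    hours_difference = 12.0 - reference_time_at_local_noon
    return reference_longitude + hours_difference * DEGREES_PER_HOUR

# Local noon occurs when the Greenwich clock reads 17:00 (5 pm)
print(longitude_from_noon(17.0))  # -75.0, i.e. 75 degrees west
```

A five-hour difference corresponds to 5 x 15 = 75 degrees, which is why an error of only a few minutes in the ship's clock translated into a large error in position.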

The prime meridian is a north-south line that specifies where longitude zero degrees is located. Greenwich, England was chosen as the prime meridian by agreement at a conference of 25 nations called by US president Chester Arthur in Washington, DC in 1884. This was a time when Britain was at the height of its imperial and maritime dominance and this conference simply formalized what was already a common practice on printed nautical charts. The United States government had already started mandating use of the Greenwich meridian in 1850 for nautical purposes.

The prime meridian monument at the Royal Observatory, Greenwich (Daniel Case via Wikimedia Commons)

On the opposite side of the planet is the International Date Line. Unlike the prime meridian, this line is not exactly -180 (or +180) degrees longitude, but weaves its way around different islands to deal with political and economic considerations.

The International Date Line (Wikimedia Commons)

Coordinates

A specific location on the surface can be specified with a longitude and a latitude. This pair of numbers is called a set of coordinates. The challenge with geographic coordinates is that there are multiple ways of expressing them.

The traditional form of coordinates is degrees, minutes, seconds (DMS), with a minute being 1/60th of a degree, a second being 1/60th of a minute, North (N) or South (S) meaning north or south of the equator, and East (E) or West (W) meaning east or west of the prime meridian.

38° 53' 50" N, 77° 02' 12" W

The traditional degrees-minutes-seconds method of specifying coordinates is clear and easy to read by humans, but that type of notation is cumbersome for computers to work with and may not be understood by all GIS software.

With geospatial technology, coordinates are generally specified as decimal coordinates, with positive latitudes meaning north of the equator and negative longitudes being west of the prime meridian.

Decimal coordinates contain only numbers, decimal points, signs, and (sometimes) a comma separator.

38.897192, -77.03896
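Converting between the two notations is simple arithmetic. The sketch below converts the DMS example above into signed decimal degrees; the function name is hypothetical, and the inputs are assumed to be already split into numeric parts rather than parsed from text.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds plus N/S/E/W to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in the decimal convention
    return -value if hemisphere in ("S", "W") else value

latitude = dms_to_decimal(38, 53, 50, "N")
longitude = dms_to_decimal(77, 2, 12, "W")
print(round(latitude, 6), round(longitude, 6))  # 38.897222 -77.036667
```

The result differs slightly from the decimal example above because the two examples were not derived from each other; both fall near the same place.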

An additional challenge with decimal coordinates is that although latitude is usually given as the first number (lat-long), sometimes longitude is the first number (long-lat), which is consistent with the convention in geometry where coordinates on a two-dimensional grid are specified as X,Y. If you are plotting coordinates and your points are appearing in unexpected places, this may be due to reversed lat/long.

For example, the WKT (well-known text) representation of a point places the x (longitude) value first:

POINT(-77.03896 38.897192)
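Because the order differs between conventions, it can help to make the swap explicit in code. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it handles only simple two-dimensional points, not the full WKT grammar.

```python
import re

def point_to_wkt(latitude, longitude):
    """Format a latitude/longitude pair as a WKT point (x = longitude first)."""
    return f"POINT({longitude} {latitude})"

def wkt_to_lat_lon(wkt):
    """Parse a simple 2-D WKT point and return (latitude, longitude)."""
    match = re.fullmatch(r"\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*", wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    longitude, latitude = float(match.group(1)), float(match.group(2))
    # A latitude outside -90..90 is a strong hint that the order is reversed
    if not -90 <= latitude <= 90:
        raise ValueError("latitude out of range; are the values reversed?")
    return latitude, longitude

print(point_to_wkt(38.897192, -77.03896))            # POINT(-77.03896 38.897192)
print(wkt_to_lat_lon("POINT(-77.03896 38.897192)"))  # (38.897192, -77.03896)
```

The range check only catches reversed pairs whose longitude is beyond 90 degrees east or west, so it is a hint rather than a guarantee; plotting the point remains the more reliable test.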

With GIS software, it is generally best to stick with latitude, longitude order. You can verify that your coordinates are lat/long by typing them into the search bar in Google Maps.

38.897192,-77.03896
(The White House 2016)

Elevation

In many situations it is adequate to reference locations with two dimensions (X and Y, or longitude and latitude) since maps and computer displays are in two dimensions. However, we live in a three-dimensional world. Elevation (altitude or Z) is sometimes given as a third coordinate.

Unlike latitude and longitude, which are angles given in degrees, elevations represent linear distance above a reference surface, given in feet or meters. While elevation is often specified as height above mean sea level, different applications use different reference points. GPS uses height above the reference ellipsoid (see below).

Geospatial data that includes elevation is usually called 2.5-dimensional (2.5-D) rather than 3-dimensional (3-D) because for each X/Y (longitude/latitude) coordinate there is only one Z (elevation) value. 2.5-D data cannot fully represent structures like buildings that have multiple levels or floors stacked on top of each other at different elevations for each X/Y coordinate.

A digital elevation model (DEM) for the State of Colorado as an example of 2.5-D data (USGS)
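The one-Z-per-X/Y limitation of 2.5-D data can be illustrated with a toy grid. The elevation values and the building below are hypothetical, and a real DEM would normally be read from a raster file rather than typed in as a list.

```python
# A tiny elevation grid: rows run north to south, columns west to east.
# Values are hypothetical elevations in meters.
elevation_grid = [
    [1610, 1625, 1640],
    [1605, 1618, 1633],
    [1600, 1612, 1627],
]

def elevation_at(row, column):
    """Return the single Z value stored for a grid cell."""
    return elevation_grid[row][column]

print(elevation_at(1, 2))  # 1633

# A two-story building at one cell needs two Z values for the same X/Y,
# which a single-valued grid cannot hold; a list of 3-D points can.
building_floors = [(1, 2, 1633.0), (1, 2, 1636.5)]  # (row, column, z)
z_values_at_cell = [z for row, column, z in building_floors if (row, column) == (1, 2)]
print(len(z_values_at_cell))  # 2
```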

Datums

Reality rarely conforms exactly to simple mathematical models, so the topic of elevation also raises an issue with exactly how the pure, spherical angles of latitude and longitude reflect the rugged, three-dimensional reality of the earth. In addition to the planet being slightly flattened from a perfect sphere by the force of rotation, the surface is covered with lumpy mountains and oceans. If you want to be very precise about locations (such as when engineering a bridge or dropping a bomb on someone), these lumps can result in errors that can cause problems.

This challenge is addressed in cartography in three stages:

There are a number of different datums that are used in different parts of the world, for different purposes, and at different times. These different datums reflect:

Datums you will commonly encounter in the US include:

Profile view of the relationship between the earth, a geoid, and an ellipsoid
Global view of the relationship between the earth, a geoid, and an ellipsoid

Data Models and Geometry

There are two broad models for storing geospatial data: vector and raster. Of the other types of models, point clouds are increasingly being used in GIS. They are called models because they are simplified representations of the objects or phenomena they are used to represent.

Vector Data

Vector data stores locations as discrete geometric objects: points, lines or polygons.

Points vs Lines vs Polygons

Areas are occasionally represented with centroids, which are points at the geometric center of the area (the average position of all the points in it). For example, the political boundaries of cities are best represented with an area, but on a large map covering an entire nation, individual cities may be represented with points to make the map easier to read.

Examples of Points, Lines and Polygons To Model Features in Denver, CO (Base Map from OpenStreetMap)
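To show how a polygon can be reduced to a centroid point, here is a minimal sketch that uses the standard area-weighted (shoelace) formula. It assumes a simple, non-self-intersecting polygon whose vertices are planar X/Y values; applying it directly to latitude and longitude is only a rough approximation over small areas, and GIS software typically offers its own centroid tools.

```python
def polygon_centroid(vertices):
    """Return the (x, y) area centroid of a simple polygon.

    vertices: list of (x, y) tuples in order around the ring, without
    repeating the first vertex at the end.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # signed area contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    return cx / (3 * twice_area), cy / (3 * twice_area)

rectangle = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(rectangle))  # (2.0, 1.0)
```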

Raster Data

Raster data is stored in a grid of regularly-spaced pixels of attribute data that cover an area of interest. Raster data is most useful for representing data about areas where there are unclear boundaries, such as with elevation, temperature or amounts of vegetation. The best known type of raster data is photographic image data. The digital elevation model described above is another example of geospatial data stored in rasters.

Raster Satellite Imagery of Denver (Google)
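Because a raster is a regular grid, finding the pixel that covers a given location is a matter of division. The sketch below assumes a north-up grid of square cells measured in degrees; the function name, grid origin and cell size are hypothetical, and real raster files carry this georeferencing information in their own headers.

```python
import math

def cell_for_coordinate(longitude, latitude, west, north, cell_size):
    """Return the (row, column) of the raster cell containing a coordinate.

    west, north: coordinates of the raster's upper-left corner.
    cell_size: width and height of each square cell, in degrees.
    """
    column = math.floor((longitude - west) / cell_size)
    row = math.floor((north - latitude) / cell_size)  # rows count downward
    return row, column

# Hypothetical 0.01-degree grid whose upper-left corner is at 40.0 N, 105.5 W
print(cell_for_coordinate(-104.985, 39.745, west=-105.5, north=40.0, cell_size=0.01))
# (25, 51)
```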

Although many different types of data can be stored as rasters, data about discrete objects with clear boundaries is usually more appropriately and accurately stored as vector rather than raster. GIS software allows conversion between raster and vector, although the conversion process between the two models often involves inaccuracy and uncertainty.

Point Clouds

An emerging third type of model is the point cloud, which stores geospatial information as a collection of points in three-dimensional space (latitude, longitude and elevation). Point clouds are commonly captured using an aerial laser scanning technique called lidar. Unlike vector and raster data, which are analogous to a flat two-dimensional map, point clouds can be used to more faithfully represent structures and topography, albeit at the cost of greater storage and processing demands.

Elevation Point Cloud of Downtown Spokane, WA (USGS)

Attributes

Spatial locations by themselves aren't particularly useful or interesting. You need some what to go with the where.

Attributes are text or numbers that represent the what part of what is where in geospatial data.

Attributes are also sometimes referred to as fields, variables, or columns.

Variable Types

Variables can be classified by the types of information they represent. The classification is important because it determines what kind of analysis can be performed on that attribute, and what kinds of charts or maps are appropriate for visualizing that attribute.

The classification scheme derived below is an elaboration of the levels of measurement commonly used in statistics (Stevens 1946).

Classes of Attributes

Variable Aspects

Geospatial variables have different aspects that work together to represent objects or characteristics on the surface of the earth.

Quantitative variables always have a unit, which is a name that indicates what the numbers represent.

A number without a unit is meaningless. For example, the number 70,000 means nothing by itself, but $70,000 in household income makes that number useful.

Quantitative and qualitative variables have some or all of the following aspects:

Variable Descriptors

A descriptor is "a word or phrase (such as an index term) used to identify an item (such as a subject or document) in an information retrieval system" (Merriam-Webster 2023).

For the purposes of this tutorial, a variable descriptor is a short phrase that describes the different aspects of a variable so the user can understand exactly what the variable values represent.

Data Considerations

Geospatial vs. Non-Geospatial Data

Geospatial data is distinguished by having multiple where, while non-geospatial data has one where or no where.

Because space and time are tied together, geospatial data also has a when (temporal dimension).

Geospatial (multiple where) | Non-Geospatial (zero or one where)
Names of the different birthplaces of 2022 Chicago Cubs players | Names of 2022 Chicago Cubs players
Number of Chinese restaurants by US metropolitan area in 2022 | Number of Chinese restaurants in Chicago in 2022
Annual rainfall at each airport in Illinois in 2022 | Annual rainfall for each year between 1972 and 2022 at Chicago O'Hare airport
Median household income by state in 2022 | Median household income in Illinois in 2022

Accuracy vs Precision

The precision with which you express a number should be in keeping with the amount of accuracy that your data possesses.

Example:

Precision is often used to deceive people into thinking that you have a better understanding of reality than you actually do. For example:

Spatial Accuracy vs Precision

The world is a thing of infinite, wondrous complexity.

When we try to understand that world as numbers, we have to simplify. Our measuring devices and techniques cannot be exactly accurate. We have to round numbers so our precision reflects our accuracy.

With latitudes and longitudes in degrees, the number of decimal places you use (precision) reflects the accuracy of that location on the ground. That accuracy can be expressed as +/- a distance on the ground.

The table below shows the approximate distances for each fraction of a degree in Manhattan:

Degrees | Latitude | Longitude
0.1 | 6.91 miles | 5.23 miles
0.01 | 3,648 feet | 2,764 feet
0.001 | 365 feet | 276.4 feet
0.0001 | 36.5 feet | 27.6 feet
0.00001 | 43.8 inches | 33.2 inches
0.000001 | 4.38 inches | 3.32 inches
0.0000001 | 11.1 millimeters | 7.02 millimeters
0.00000001 | 1.11 millimeters | 0.702 millimeters
Approximate distances for each fraction of a degree in Manhattan at 40.75, -74
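Distances like these can be approximated with a short calculation. The sketch below assumes a perfectly spherical earth with a mean radius of 6,371 km, which is a simplification (see the discussion of datums above), and the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean radius of a spherical earth
METERS_PER_MILE = 1609.344

def meters_per_degree(latitude):
    """Return approximate (north-south, east-west) meters per degree."""
    per_degree_latitude = 2 * math.pi * EARTH_RADIUS_M / 360
    # Meridians converge toward the poles, so a degree of longitude shrinks
    # by the cosine of the latitude.
    per_degree_longitude = per_degree_latitude * math.cos(math.radians(latitude))
    return per_degree_latitude, per_degree_longitude

lat_m, lon_m = meters_per_degree(40.75)
for decimals in (1, 3, 5):
    fraction = 10 ** -decimals
    print(f"{fraction:.{decimals}f} degrees: "
          f"{lat_m * fraction:,.2f} m north-south, "
          f"{lon_m * fraction:,.2f} m east-west")
```

Under these assumptions, 0.1 degree at latitude 40.75 works out to roughly 6.91 miles north-south and 5.23 miles east-west. The east-west distance is shorter because lines of longitude draw closer together away from the equator.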

Primary vs Secondary Data

Primary data is data you capture yourself.

Secondary data is data you recycle from someone else.

Primary data is often more expensive to obtain since you will have to do work to get it, such as by surveying with GPS devices or conducting surveys with human subjects. But if you're doing something novel, you will likely need novel data that you get yourself rather than from someone else.

Population vs Sampled Data

Generally, the more data you have, the better your conclusions when you analyze that data. If you have population data for everything you are studying, that is ideal.

An example of population data is the census conducted every ten years, where the US Federal Government attempts to find and count every person in the country and get basic information on who they are and where they live (what is where).

However, in many if not most research situations, it is too difficult or expensive to capture complete data. For example, if you are running a congressional campaign, you cannot survey every voter in your congressional district to find out how they are planning to vote.

In such cases it is usually adequate to capture a sample of the data and then use statistical techniques to make an inference from that sample about the full population. The use of sampling rather than a full census introduces uncertainty about whether your sample reflects the overall population.

However, there is always some uncertainty when gathering data (especially about humans), and the uncertainty with sampled data can be quantified and considered as part of the analytical process. The uncertainty associated with sampled data is expressed as a margin of error, which is a range of values above and below the estimated value. Margins of error are typically given at a 95% confidence level, which loosely means that we can be 95% certain that the actual value falls within the margin of error above and below the value estimated from the sample.
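As an illustration of how a margin of error is quantified, the sketch below uses the common textbook formula for a proportion estimated from a simple random sample. The poll numbers are hypothetical, and real surveys often use more elaborate sample designs that change the calculation.

```python
import math

Z_95 = 1.96  # normal-distribution multiplier for a 95% confidence level

def margin_of_error(sample_proportion, sample_size, z=Z_95):
    """Margin of error for a proportion from a simple random sample."""
    standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
    return z * standard_error

# Hypothetical poll: 52% of 1,000 sampled voters favor a candidate
moe = margin_of_error(0.52, 1000)
print(f"52% +/- {moe * 100:.1f} percentage points")  # 52% +/- 3.1 percentage points
```

Because the sample size sits under a square root, quadrupling the sample only halves the margin of error, which is one reason very large samples are rarely worth their cost.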

Individual vs Aggregated Data

Individual data is data about individual persons or objects.

Aggregated data is data that combines data from groups of individuals (often based on location in different geographic areas) into a smaller set of numbers, usually averages or medians. An example of aggregated data is US census data that is aggregated by census tract or county in order to preserve the privacy of individual census respondents.

As with sampling, aggregation introduces uncertainty as important individual distinctions can be lost when people are combined into groups and summarized.
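The loss of detail can be seen in a small example. The household records and tract names below are hypothetical; the sketch simply groups individual values by area and keeps one median per area.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual records: (census tract, household income in dollars)
households = [
    ("Tract A", 30_000), ("Tract A", 70_000), ("Tract A", 110_000),
    ("Tract B", 68_000), ("Tract B", 70_000), ("Tract B", 72_000),
]

def median_income_by_area(records):
    """Aggregate individual incomes into one median per area."""
    incomes = defaultdict(list)
    for area, income in records:
        incomes[area].append(income)
    return {area: median(values) for area, values in incomes.items()}

print(median_income_by_area(households))
# {'Tract A': 70000, 'Tract B': 70000}
```

Both tracts report the same median even though incomes in Tract A are spread far more widely than in Tract B, a distinction that disappears once the data is aggregated.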

One issue with aggregated data is the ecological fallacy, which occurs when you make assumptions about individuals based on aggregated data. For example, states are often classified as red states or blue states based on whether the majority of the voters in that state vote Republican or Democratic, respectively. However, even in very red Utah, Democratic President Obama got 25% of the vote in the 2012 election, so assuming that everyone you meet in Utah is conservative is incorrect.

2012 Electoral College Results Choropleth

The opposite of the ecological fallacy is the exception fallacy, where an assumption is made about a group based on a few exceptional individuals. For example, if you meet a tall basketball player from Ohio, the assumption that everyone in Ohio is tall would be incorrect.

LeBron James with the Cleveland Cavaliers in 2014 (Keith Allison via Wikimedia Commons)

Structured vs Unstructured Data

Structured data is strictly organized so that for every field or record in your data, you know what it represents. If your data is organized in a table with columns and rows, it is probably structured. Most geospatial data that you deal with in conventional geographic information systems is structured.

Unstructured data is data that is not clearly organized in a way that it can be simply processed by computers. Examples of unstructured data include text like Facebook posts, text messages or tweets. There is meaningful data there, but turning it into something useful requires analysis (often by humans or complex computer algorithms) to give it some kind of structure. Contemporary societies generate tremendous amounts of unstructured data, and the analysis and use of that data is an area of heavy research (and investment) called big data.

Metadata

If you don't know what you have, then you don't have it.

Metadata is data about your data. Typical items kept in metadata include:

Students often confuse data with metadata. The following table contrasts data with metadata that could be associated with that data:

Data | Metadata
Patient names and addresses | The name of the computer disk file where current patient information is kept
Names of participants in a drug trial | The names of the technicians who recorded the drug trial results
Test scores | The range of dates when the test scores were captured
Hourly temperatures | The location of the sensor where those temperatures were recorded
Incidence rates for diseases by state | The names of the agencies that collected the disease data

While simply remembering what you have is adequate for small projects with short-term needs (like class projects), failure to document your data may make the data useless if you ever want to reuse that data in the future.

However, because the creation of metadata is usually separate from the capture and processing of the data itself, and is more about the future than getting the task at hand done, it is common for metadata to be missing or skeletal. While it is often possible to look at the data and make a reasonable guess as to what it represents and where it came from, a few minutes adding metadata can save a great deal of effort in the future for you and for other people who use your data.

Philosophical Considerations

Rene Descartes, 1648 (Frans Hals via Wikimedia Commons)

X/Y coordinates are called Cartesian coordinates after the French philosopher Rene Descartes (1596–1650). Descartes is also remembered for his statement Cogito ergo sum (I think, therefore I am) as evidence of our own existence.

Cartesian coordinates combine the principles of geometry codified by the Greek philosopher Euclid (c. 300 BC) with the techniques of algebra codified by the Persian mathematician Muḥammad ibn Mūsā al-Khwārizmī (c. 780–850) to create a useful and powerful synthesis that is a foundation for contemporary geospatial technology.

Descartes' coordinate system was part of a broader philosophical quest to find objective knowledge about the world that is independent of our perspective and preconceptions. Descartes believed that the only essential properties of matter were geometric, so an objective understanding of the world could be had through a sufficiently advanced geometry (Shand 2002, p. 72).

Descartes' philosophy, to some extent, is a foundation for the geospatial technology that is based on Descartes' way of representing the world in his coordinate system. This is also the foundation of a critique of geospatial technology: that reducing the world to X/Y coordinates obscures not only important qualitative and subjective meanings but also the political and economic forces (and associated moral questions) underlying the development and use of geospatial technology.

Geospatial technologies, with their vivid visualizations and massive scope, are charismatic, and charisma can deceive.