# Geospatial Data

Geospatial data indicates ** what is where**. One of the
special things about geospatial data is that it comes in those two
parts:

*what*and

*where*.

**Attribute** (what) data and **location** (where)
data are fundamentally different things and have different characteristics.

A distinction can be drawn between data, information and knowledge with data being raw facts or numbers and information being the interpretation of data that forms the basis of knowledge. However, the three terms are often used interchangeably and interpreted differently by different people, so you should use caution and consider context when you hear or use these words.

## Location Data

### Latitude and Longitude

The Earth is a **spheroid** - it is round like a ball or sphere
but flattened by slightly by the centrifugal force of rotation.
The Earth is 24,900 miles circumference, but around 27 miles wider than it is tall.

To specify locations on the surface of the earth, we use angles that describe where those locations are relative to the center of the earth. Angles are measured in degrees, with one degree representing 1/360th of a circle.

While we think of and experience distances on the planet in terms of length (feet, miles, kilometers) rather than angles (degrees), the earth is lumpy and actual distances across the surface of the planet depend on how high you are (distance from the center of the earth). Even if the Earth's surface were perfectly smooth, manipulating distances across a three-dimensional surface requires fiendishly complex calculations. Specifying locations in degrees is a simpler and more accurate than specifying locations in distances.

Since the earth is a three-dimensional object, two angles are required to specify locations on the surface of the earth.

**Latitude** is the angle that tells you how far you are north or south
of the Equator. The Equator is zero degrees latitude and, commonly, negative
numbers are south, positive numbers are north. The range is from negative 90
degrees at the South Pole to positive 90 degrees at the North Pole. Lines on
maps are drawn east/west around the planet to show latitude.

**Longitude** is the angle that tells you how far you are east or west from the Prime
Meridian in Greenwich, England (longitude 0.0) - think LONG stretching across your body.
Negative numbers up to negative 180 are west of England.
Positive numbers up to positive 180 are east of England. Lines on maps are drawn
north/south between the poles to show longitude.

### Prime Meridian and the International Date Line

Latitude is based on specific physical characteristics of the earth. Zero degrees is set at the Equator, the imaginary line around the center of the spinning earth. The axis of the Earth's rotation also happens to point at Polaris (the North Star), a very bright star in the northern sky. It is possible to find your latitude in the northern hemisphere by measuring the angle of Polaris above the horizon using a device like a sextant or astrolabe.

However, longitude is more difficult to define since there are no clear geological or celestial markers to define where zero degrees should be located.

Early naval navigators had to carry clocks on their ships to keep track of the time at a location of known longitude. When the sun was directly overhead (12:00 noon), the navigators could tell how many degrees east or west they were of that known longitude by multiplying the number of hours difference between their local 12:00 noon and the 12:00 noon at that known location by 15 degrees (360 degrees rotation of the planet divided by 24 hours in one rotation). Longitude has an interesting history, dramatized in this episode of Nova.

Greenwich, England was chosen as zero degrees longitude, or the **Prime
Meridian**, by agreement at
a conference of 25 nations called by US president
Chester Arthur in Washington, DC in 1884. This was a time when Britain was at
the height of its imperial and maritime dominance and this conference simply
formalized what was already a common practice on printed nautical charts.
The United States government had started mandating use of the Greenwich
meridian in 1850 for nautical purposes.

On the opposite side of the planet is the **International Date Line**.
Unlike the Prime Meridian, this line is not exactly -180 (or +180)
degrees longitude, but weaves its way around different islands
to deal with political and economic considerations.

### Coordinates

A specific location on the surface can be specified with a
longitude and a latitude. This pair of numbers is called a
set of **coordinates**.

A challenge with geographic coordinates is that there are
multiple ways of expressing them. The traditional form
is in **degrees, minutes, seconds and direction**, with a minute
being 1/60th of a degree and a second being 1/16th of a minute.
North or South means north or south of the Equator and
East or West means east or west of the Prime Meridian.

38° 53' 50" N, 77° 02' 12" W

While that traditional method of specifying coordinates
is clear and easy to read by humans, that type of notation
is cumbersome for computers to work with. With geospatial
technology, coordinates are generally specified as **decimal coordinates**,
with positive latitudes meaning North of the Equator
and negative longitudes being West of the Prime Meridian.

38.897192, -77.03896

An additional challenge with decimal coordinates is that although latitude is usually given as the first number (lat-long), sometimes longitude is the first number (long-lat). This is consistent with the convention in geometry where coordinates on a two-dimensional grid are specified as X,Y. When using coordinates from an unfamiliar source, you should make sure you know which order the coordinates are given in.

-77.03896, 38.897192

With GIS software, it is generally best to stick with
*decimal lat-long coordinates* that are easily stored and used in numeric data tables.
You can verify that your coordinates are lat/long by typing them
into the search bar in Google Maps.

### Elevation

In many situations it is adequate to reference locations with two
dimensions (X and Y, or longitude and latitude) since maps and computer
displays are in two dimensions. However, we live in a three-dimensional world.
**Elevation** (altitude or Z) is sometimes given as a third coordinate.

Unlike latitude and longitude which are angles given in degrees, elevations represent linear distance from the center of the earth, and are usually given in feet or meters. Exactly what elevation zero is depends on While elevation is often specified as height above mean sea level, different applications use different reference points. GPS uses height above the reference ellipsoid (see below).

Geospatial data that includes elevation is usually called **2.5-D**
rather than 3-D because for each X/Y (longitude/latitude) coordinate
there is only one Z (elevation) value. 2.5D data cannot fully represent
structures like buildings that have multiple levels or floors stacked
on top of each other at different elevations for each X/Y coordinate.

### Datums

Reality rarely quantifies exactly to simple mathematical models, so the topic of elevation also raises an issue with exactly how the pure, spherical angles of latitude and longitude reflect the rugged, three-dimensional reality of the earth. In addition to the planet being slightly flattened from a perfect sphere by the force of rotation, the surface is covered with lumpy mountains and oceans. If you want to be very precise about locations (such as when engineering a bridge or dropping a bomb on someone) these lumps can result in errors that can cause problems.

This challenge is addressed in cartography in three stages:

- An estimate of the lumpy shape of the earth is called a
**geoid** - The geoid is used to create a smooth mathematical estimate called
a
**reference ellipsoid** - The ellipsoid is described with a coordinate system and set
of reference points called a
**geodetic datum** - That datum is then used to convert latitude, longitude, and elevation to specific locations on the surface of the earth with high accuracy

There are a number of different datums that are used in different parts of the world, for different purposes, and at different times. These different datums reflect:

- Fitting to specific parts of the earth
- Improvements in measurement technology
- Changes in the earth's topography, such as the shifting of tectonic plates

Datums you will commonly encounter include:

**World Geodetic System of 1984 (WGS 84)**: The datum used by GPS**North American Datum of 1983 (NAD 83)**: A datum specifically tailored to closely fit North America**North American Datum of 1927 (NAD 27)**: An obsolete datum you will find used on older maps

## Data Models and Geometry

There are two broad models for storing geospatial data: Vector and Raster.
They are called **models** because they are simplified representations
of the objects or phenomena they are used to represent.

### Vector Data

**Vector** data stores locations as discrete geometric objects: points,
lines or polygons.

- Points are represented with a single coordinate pair (latitude and longitude). Points useful for representing objects like vehicles or smartphone locations that occupy little or no area
- Polygons are represented with a collection of coordinate pairs that define the outside boundary of an area. Polygons are useful for representing things like property boundaries or city boundaries that define an area
- Lines are represented by a sequence of coordinate pairs. Lines are useful for representing things like roads or paths that are long and narrow

Areas are occasionally represented with **centroids**, which
are points in that are mathematically equidistant from all parts
of the area. For example, the political boundaries of cities are best represented
with an area, but on a large map covering an entire nation, individual
cities may be represented with points to make the map easier to read.

### Raster Data

**Raster** data is stored in a grid of regularly-spaced
pixels of attribute data that cover an area of interest. Raster data is most useful for representing
data about areas where there are unclear boundaries, such as with elevation,
temperature or amounts of vegetation. The best known type of data is photographic
image data. The digital elevation model described above is another example
of geospatial data stored in rasters.

Although many different types of data can be stored as rasters, data about discrete objects with clear boundaries is usually more appropriately and accurately stored as vector rather than raster. GIS Software allows conversion between raster and vector, although the conversion process between the two models often involves inaccuracy and uncertainty.

### Point Clouds

An emerging third type of model is the **point cloud**, which stores
geospatial information as a collection of points in three-dimensional space
(latitude, longitude and elevation). Point clouds are commonly captured using
a aerial laser scanning technique called **Lidar**. Unlike vector and
raster data that is analogous to a flat two-dimensional map, point clouds can
be used to more-faithfully represent structures and topography, albeit at the
cost of greater storage and processing demands.

## Attributes

Spatial locations by themselves aren't particularly useful or interesting.
You need some *what* to go with the *where*.

Attributes are the *what* part of *what is where.*
Attributes are also sometimes referred to as *fields*, *variables*,
or *columns*.

Geospatial data is based on the **georelational model** that
connects location data (geo) with attribute data - representations
of what is there, often stored in relational databases.

### General Classifications of Attribute Data

**Quantitative data**represents characteristics as quantities or numeric values**Nominal**: Names or descriptions. Usually represented in storage as character**strings**.**Alphabetic (Character)**- Examples:- Place names (New York City)
- Institutions and organizations (The Louvre)
- Structures (The Empire State Building)
- Personal names (Barack Obama)
- Internet URLs (http://michaelminn.net)
- Descriptions (Concert hall built by industrialist Andrew Carnegie and opened in 1891)

**Numeric**: Nominal data can be made of numbers, but unlike quantitative data, you cannot meaningfully perform numeric operations like adding or averaging on these numbers. Examples:- Athletic jersey numbers
- Social Security numbers

**Categorical**: Names used to divide individual things into groups. Categorical data is often considered a specific type of nominal data.**Dichotomous**: Categories with only two possible values. Examples:- Yes / No
- High / Low
- Present / Absent

**Multichotomous**: Categories with more than two possible values. Examples:- Political affiliation (Democratic, Republican, Green, Independent, etc.)
- Religion (Muslim, Christian, Buddhist, etc.)
- Geographic Regions (Northeast, Southwest, Northwest, etc.)

**Quantitative data**represents qualities of places, people or things that are too subjective, ambiguous or irregular to be easily converted to simple numbers**Ordinal**: Ranks items in order. Examples:- Likert scales (On a scale of 1 to 5 how do you feel about...)
- Poll rankings ( AFI list of Top 100 Movies: #1 Citizen Kane, #2, Casablanca, #3 The Godfather...)
- Completion order (Rank by number of games won: #1 Gonzaga, #2 Baylor, #3 Kansas...)

**Absolute**: Numbers that stand alone and do not depend on other numbers for their meaning**Counts**: Numeric characteristics that can be counted with no fractional parts. Also referred to as**discrete**data. Often represented in storage as**integers**, although**real**numbers can be used when the possible values exceed the maximum value that can be represented with an integer. Examples:- Population (there are no half people)
- Number of votes
- Number of days

**Amounts**: Numeric characteristics that are measured. These characteristics do not necessarily fall exactly on integer boundaries and are represented with numbers that have decimal points. Also referred to as**continuous**data. Represented in storage with**real**numbers. Examples:- Household income
- Gross domestic product
- Latitude / Longitude
- Temperature

**Relative**: Numbers that have meaning only in reference to other numbers. Created by performing mathematical operations on absolute values.**Central tendency**: Mean (average), median, mode, per-capita. Examples:**Proportion**: The fraction that something is of a whole. Examples:**Spatial Density**: The amount of something within a particular unit of area. Examples:- Residents per square mile in each zip code
- Bushels of corn harvested per acre of farmland

**Growth**: Percentage change from a prior time. Examples:**Index**: Values representing different characteristics that are combined into a single number. Examples:

There are techniques called **mixed methods** for converting (coding)
qualitative data to quantitative values. A common example is the
Likert scale that asks survey participants to rate some quality
or feeling on a numeric scale of one to five. However, as with
all data conversions, this introduces uncertainty about whether
the numbers fully and meaningful capture the characteristics
being studied.

### Primary vs Secondary Data

**Primary data** is data you capture yourself.

**Secondary data** is data you recycle from someone else.

Primary data is often more expensive to obtain since you will have to do work to get it, such as by surveying with GPS devices or conducting interviews with human subjects. But if you're doing something novel, you will likely need novel data that you get yourself rather than from someone else.

### Population vs Sampled Data

Generally, the more data you have, the better your conclusions
when you analyze that data. If you have **population** data
for everything you are studying, that is ideal.

An example of population data is the census conducted every ten years, where the US Federal Government attempts to find and count every person in the country and get basic information on who they are and where they live (what is where).

However, in many if not most research situations, it is too difficult or expensive to capture complete data. For example, if you are running a congressional campaign, you cannot survey every voter in your congressional district to find out how they are planning to vote.

In such cases it is usually adequate to capture a **sample**
of the data and then use statistical techniques to make an
**inference** from that sample about the full population. The use
of sampling rather than a full census introduces uncertainty
about whether your sample reflects the overall population.
However, there is always some uncertainty when gathering data (especially
about humans) and the uncertainty with sampled data can be quantified
and considered as part of the analytical process.

### Individual vs Aggregated Data

**Individual data** is data about individual persons or objects.

**Aggregated data** is data that combines data from groups of individuals
(often based on location in different geographic areas) into a smaller
set of numbers, usually averages or medians. An example of aggregated data is
US census data that is aggregated by census tract or county in order
to preserve the privacy of individual census respondents.

As with sampling, aggregation introduces uncertainty as important individual distinctions can be lost when people are combined into groups and summarized.

One issue with aggregated data is the
**ecological fallacy**, when you make assumptions about individuals
based on aggregated data. For example, states are often classified
as red states or blue states based on whether the majority of the
voters in that state vote Republican or Democratic, respectively.
However, even in very red Utah, Democratic President Obama got
25% of the vote in the 2012, so assuming that everyone you
meet in Utah is conservative is incorrect.

The opposite of the ecological fallacy is the
**exception fallacy**, where an assumption is made about
a group based on a few exceptional individuals.
For example, if you meet a tall basketball player from
Ohio, the assumption that everyone in Ohio is tall
would be incorrect.

### Structured vs Unstructured Data

**Structured data** is strictly organized so that for every
field or record in your data, you know what it represents.
If your data is organized in a table with columns and rows,
it is probably structured. Most geospatial data that you deal with
in conventional geographic information systems is structured.

**Unstructured data** is data that is not clearly organized
in a way that it can be simply processed by computers.
Examples of unstructured data is text like Facebook posts, text messages or tweets.
There is meaningful data there, but turning it into something useful
requires analysis (often by humans or complex computer algorithms)
to give it some kind of structure. Contemporary societies
generate tremendous amounts of unstructured data and the
analysis and use of that data is an area of heavy research
(and investment) called
big data.

### Accuracy vs Precision

The precision with which you express a number should be in keeping with the amount of accuracy that your data possesses.

**Accuracy** is how close your count or measurement reflects
the reality that you are counting or measuring.

**Precision** is the amount of accuracy that you express
with the numbers you use to represent your data. Usually what
this means is the number of **significant digits** used. For example:

- 1.2345123 is more precise than 1.2
- 45,000,000 is less precise than 45,244,391.224

For example:

- The measured temperature on three days was 73, 76 and 78 degrees (two significant digits)
- A calculator might calculate and display the average (mean) temperature as 75.666666667
- Since you have only two significant digits in your source data, you do not have 11 digits of accuracy implied by the 11 digits of precision
- Rounding and writing the mean as 76 degrees or 76 2/3 degrees would keep the precision and accuracy in harmony

Precision is often used to **deceive** people into thinking
that you have a better understanding of reality than you
actually do. For example:

- You say that 12,456 people attended your campaign rally
- The assumption from that statement is that you have an accurate account of attendance and that you are a confident, well-informed candidate worthy of election
- However, that number is just describing a rough estimate between 5,000 and 15,000 since people come and go from campaign rallies and often are not being accurately counted when they enter or leave an area
- Your highly precise statement is deceptive because you don't really have a count as accurate as the precision of your statement implies

### Spatial Accuracy vs Precision

The world is thing of infinite, wondrous complexity.

When we try to understand that world as numbers, we have to simplify. Our measuring devices and techniques cannot be exactly accurate. We have to round numbers so our precision reflects our accuracy.

With latitudes and longitudes in degrees, the number of decimal places you use (precision) reflects the accuracy of that location on the ground. That accuracy can be expressed as +/- a distance on the ground.

The table below shows the approximate distances for each fraction of a degree in Manhattan:

Degrees | Latitude | Longitude |
---|---|---|

0.1 | 6.91 miles | 5.23 miles |

0.01 | 3,648 feet | 2,764 feet |

0.001 | 365 feet | 276.4 feet |

0.0001 | 36.5 feet | 27.6 feet |

0.00001 | 43.8 inches | 33.2 inches |

0.000001 | 4.38 inches | 3.32 inches |

0.0000001 | 11.1 millimeters | 7.02 millimeters |

0.00000001 | 1.11 millimeters | 0.702 millimeters |

### Levels of Measurement

A concept from statistics that you may deal with when working with geospatial data is levels of measurement, which is similar to the general categorization given above. The most commonly-used division of levels was developed by psychologist Stanley Smith Stevens, and has four general levels with a handful of specific sub-levels:

- Nominal
- Dichotomous
- Categorical

- Ordinal
- Strongly ordered
- Weakly ordered

- Interval: Difference between values is meaningful but the value of zero is arbitrary, so absolute values cannot be directly compared (Examples: year, time of day, temperature)
- Ratio: Value of zero is meaningful (Examples: distance, amounts, area)

## Metadata

If you don't know what you have, then you don't have it.

**Metadata** is data about your data. Typical items kept
in metadata include:

- Name of the data set
- Overall description of what the data set contains
- Dates of capture, processing, analysis, and/or backup
- Information about the variables (names, ranges, units, etc)
- Information about the location data
- Notes or caveats about the quality and accuracy of the data
- Source and contact information for the people who collected the data

While simply remembering what you have is adequate for small projects with short-term needs (like class projects), failure to document your data may make the data useless if you ever want to reuse that data in the future.

- Metadata helps keeping track of your inventory of data when you have a number of different data sets
- Metadata aids in sharing data by making it possible for other people to know exactly what is in your data set
- Metadata can be used as a reminder if you need to use your data in the future and have forgotten details about what the variables mean or where you got the data
- If you are collecting data for published research, you need to be able to provide details about your data so the quality of the research based on that data can be judged.

However, because the creation of metadata is usually separate from the capture and processing of the data itself, and is more about the future than getting the task at hand done, it is common for metadata to be missing or skeletal. While it is often possible to look at the data and make a reasonable guess as to what it represents and where it came from, a few minutes adding metadata can save a great deal of effort in the future for you and for other people who use your data.

## Philosophical Considerations

X/Y coordinates are called **Cartesian Coordinates**
after the French philosopher Rene Descartes (1596 - 1650).
Descartes is also remembered for is statement *Cogito ergo sum*
(I think therefore I am) as evidence of our own existence.

Cartesian coordinates combines the principles of geometry codified by the Greek philosopher Euclid (c. 300 BC) with the techniques of algebra codified by the Arab mathemetician Muḥammad ibn Mūsā al-Khwārizmī (c. 780–850) to create a useful and powerful synthesis that is a foundation for contemporary geospatial technology.

Descartes coordinate system was part of a broader philosophical quest to find objective knowledge about the world that is independent of our perspective and preconceptions. Descartes believed that the only essential properties of matter were geometric, so an objective understanding of the world could be had through a sufficiently advanced geometry (Shand 2002, pp 72).

Descartes philosophy, to some extent, is a foundation for the geospatial technology that is based on Decartes way of representing the world in his coordinate system. This is also the foundation of a critique of geospatial technology that reducing the world to X/Y coordinates obscures not only important qualitative and subjective meanings, but also obscures the political and economic forces (and associated moral questions) underlying the development and use of geospatial technology.

Geospatial technologies with their vivid visualizations and massive scope, are charismatic, and charisma can deceive.