# Geospatial Data

Geospatial data indicates what is where. One of the special things about geospatial data is that it comes in those two parts: what and where.

Attribute (what) data and location (where) data are fundamentally different things and have different characteristics.

A distinction can be drawn between data, information and knowledge with data being raw facts or numbers and information being the interpretation of data that forms the basis of knowledge. However, the three terms are often used interchangeably and interpreted differently by different people, so you should use caution and consider context when you hear or use these words.

## Location Data

### Latitude and Longitude

The Earth is a spheroid - it is round like a ball or sphere but flattened by slightly by the centrifugal force of rotation. The Earth is 24,900 miles circumference, but around 27 miles wider than it is tall.

To specify locations on the surface of the earth, we use angles that describe where those locations are relative to the center of the earth. Angles are measured in degrees, with one degree representing 1/360th of a circle.

While we think of and experience distances on the planet in terms of length (feet, miles, kilometers) rather than angles (degrees), the earth is lumpy and actual distances across the surface of the planet depend on how high you are (distance from the center of the earth). Even if the Earth's surface were perfectly smooth, manipulating distances across a three-dimensional surface requires fiendishly complex calculations. Specifying locations in degrees is simpler and more accurate than specifying locations in distances.

Since the earth is a three-dimensional object, two angles are required to specify locations on the surface of the earth.

Latitude is the angle that tells you how far you are north or south of the Equator. The Equator is zero degrees latitude and, commonly, negative numbers are south, positive numbers are north. The range is from negative 90 degrees at the South Pole to positive 90 degrees at the North Pole. Lines on maps are drawn east/west around the planet to show latitude.

Longitude is the angle that tells you how far you are east or west from the Prime Meridian in Greenwich, England (longitude 0.0) - think LONG stretching across your body. Negative numbers up to negative 180 are west of England. Positive numbers up to positive 180 are east of England. Lines on maps are drawn north/south between the poles to show longitude.

### Prime Meridian and the International Date Line

Latitude is based on specific physical characteristics of the earth. Zero degrees is set at the Equator, the imaginary line around the center of the spinning earth. The axis of the Earth's rotation also happens to point at Polaris (the North Star), a very bright star in the northern sky. It is possible to find your latitude in the northern hemisphere by measuring the angle of Polaris above the horizon using a device like a sextant or astrolabe.

However, longitude is more difficult to define since there are no clear geological or celestial markers to define where zero degrees should be located.

Early naval navigators had to carry clocks on their ships to keep track of the time at a location of known longitude. When the sun was directly overhead (12:00 noon), the navigators could tell how many degrees east or west they were of that known longitude by multiplying the number of hours difference between their local 12:00 noon and the 12:00 noon at that known location by 15 degrees (360 degrees rotation of the planet divided by 24 hours in one rotation). Longitude has an interesting history, dramatized in this episode of Nova.

Greenwich, England was chosen as zero degrees longitude, or the Prime Meridian, by agreement at a conference of 25 nations called by US president Chester Arthur in Washington, DC in 1884. This was a time when Britain was at the height of its imperial and maritime dominance and this conference simply formalized what was already a common practice on printed nautical charts. The United States government had started mandating use of the Greenwich meridian in 1850 for nautical purposes.

On the opposite side of the planet is the International Date Line. Unlike the Prime Meridian, this line is not exactly -180 (or +180) degrees longitude, but weaves its way around different islands to deal with political and economic considerations.

### Coordinates

A specific location on the surface can be specified with a longitude and a latitude. This pair of numbers is called a set of coordinates.

A challenge with geographic coordinates is that there are multiple ways of expressing them. The traditional form is in degrees, minutes, seconds and direction, with a minute being 1/60th of a degree and a second being 1/16th of a minute. North or South means north or south of the Equator and East or West means east or west of the Prime Meridian.

38° 53' 50" N, 77° 02' 12" W

While that traditional method of specifying coordinates is clear and easy to read by humans, that type of notation is cumbersome for computers to work with. With geospatial technology, coordinates are generally specified as decimal coordinates, with positive latitudes meaning north of the Equator and negative longitudes being west of the Prime Meridian.

38.897192, -77.03896

An additional challenge with decimal coordinates is that although latitude is usually given as the first number (lat-long), sometimes longitude is the first number (long-lat). This is consistent with the convention in geometry where coordinates on a two-dimensional grid are specified as X,Y. When using coordinates from an unfamiliar source, you should make sure you know which order the coordinates are given in.

-77.03896, 38.897192

With GIS software, it is generally best to stick with decimal lat-long coordinates that are easily stored and used in numeric data tables. You can verify that your coordinates are lat/long by typing them into the search bar in Google Maps.

### Elevation

In many situations it is adequate to reference locations with two dimensions (X and Y, or longitude and latitude) since maps and computer displays are in two dimensions. However, we live in a three-dimensional world. Elevation (altitude or Z) is sometimes given as a third coordinate.

Unlike latitude and longitude which are angles given in degrees, elevations represent linear distance from the center of the earth, and are usually given in feet or meters. Exactly what elevation zero is depends on While elevation is often specified as height above mean sea level, different applications use different reference points. GPS uses height above the reference ellipsoid (see below).

Geospatial data that includes elevation is usually called 2.5-D rather than 3-D because for each X/Y (longitude/latitude) coordinate there is only one Z (elevation) value. 2.5D data cannot fully represent structures like buildings that have multiple levels or floors stacked on top of each other at different elevations for each X/Y coordinate.

### Datums

Reality rarely quantifies exactly to simple mathematical models, so the topic of elevation also raises an issue with exactly how the pure, spherical angles of latitude and longitude reflect the rugged, three-dimensional reality of the earth. In addition to the planet being slightly flattened from a perfect sphere by the force of rotation, the surface is covered with lumpy mountains and oceans. If you want to be very precise about locations (such as when engineering a bridge or dropping a bomb on someone) these lumps can result in errors that can cause problems.

This challenge is addressed in cartography in three stages:

• An estimate of the lumpy shape of the earth is called a geoid
• The geoid is used to create a smooth mathematical estimate called a reference ellipsoid
• The ellipsoid is described with a coordinate system and set of reference points called a geodetic datum
• That datum is then used to convert latitude, longitude, and elevation to specific locations on the surface of the earth with high accuracy

There are a number of different datums that are used in different parts of the world, for different purposes, and at different times. These different datums reflect:

• Fitting to specific parts of the earth
• Improvements in measurement technology
• Changes in the earth's topography, such as the shifting of tectonic plates

Datums you will commonly encounter include:

• World Geodetic System of 1984 (WGS 84): The datum used by GPS
• North American Datum of 1983 (NAD 83): A datum specifically tailored to closely fit North America
• North American Datum of 1927 (NAD 27): An obsolete datum you will find used on older maps

## Data Models and Geometry

There are two broad models for storing geospatial data: Vector and Raster. They are called models because they are simplified representations of the objects or phenomena they are used to represent.

### Vector Data

Vector data stores locations as discrete geometric objects: points, lines or polygons.

• Points are represented with a single coordinate pair (latitude and longitude). Points useful for representing objects like vehicles or smartphone locations that occupy little or no area
• Polygons are represented with a collection of coordinate pairs that define the outside boundary of an area. Polygons are useful for representing things like property boundaries or city boundaries that define an area
• Lines are represented by a sequence of coordinate pairs. Lines are useful for representing things like roads or paths that are long and narrow

Areas are occasionally represented with centroids, which are points in that are mathematically equidistant from all parts of the area. For example, the political boundaries of cities are best represented with an area, but on a large map covering an entire nation, individual cities may be represented with points to make the map easier to read.

### Raster Data

Raster data is stored in a grid of regularly-spaced pixels of attribute data that cover an area of interest. Raster data is most useful for representing data about areas where there are unclear boundaries, such as with elevation, temperature or amounts of vegetation. The best known type of data is photographic image data. The digital elevation model described above is another example of geospatial data stored in rasters.

Although many different types of data can be stored as rasters, data about discrete objects with clear boundaries is usually more appropriately and accurately stored as vector rather than raster. GIS Software allows conversion between raster and vector, although the conversion process between the two models often involves inaccuracy and uncertainty.

### Point Clouds

An emerging third type of model is the point cloud, which stores geospatial information as a collection of points in three-dimensional space (latitude, longitude and elevation). Point clouds are commonly captured using a aerial laser scanning technique called Lidar. Unlike vector and raster data that is analogous to a flat two-dimensional map, point clouds can be used to more-faithfully represent structures and topography, albeit at the cost of greater storage and processing demands.

## Attributes

Spatial locations by themselves aren't particularly useful or interesting. You need some what to go with the where.

Attributes are text or numbers that represent the what part of what is where. Attributes are also sometimes referred to as fields, variables, or columns.

Geospatial data commonly uses the georelational model, which connects location data (geo) with attribute data (representations of what is there), often stored in relational database tables. While attribute columns are visible to (and changeable by) users, the spatial location columns are often hidden from the user's view and can only be changed using GIS software.

### General Classifications of Attribute Data

Attributes can be classified by the types of information they represent. The data class of an attribute is important because it determines what kind of analysis can be performed on that attribute, and what kinds of charts or maps are appropriate for visualizing that attribute.

The classification scheme derived below is an elaboration of the levels of measurement commonly used in statistics.

• Qualitative data represents qualities of places, people or things that are too subjective, ambiguous or irregular to be easily converted to simple numbers
• Nominal: Names or descriptions. Usually represented in storage as character strings.
• Alphabetic - Examples:
• Place names (New York City)
• Institutions and organizations (The Louvre)
• Structures (The Empire State Building)
• Personal names (Barack Obama)
• Internet URLs (http://michaelminn.net)
• Descriptions (Concert hall built by industrialist Andrew Carnegie and opened in 1891)
• Numeric: Nominal data can be made of numbers, but unlike quantitative data, you cannot meaningfully perform numeric operations like adding or averaging on these numbers. Examples:
• Athletic jersey numbers
• Social Security numbers
• Categorical: Names used to divide individual things into groups. Categorical data is often considered a specific type of nominal data.
• Dichotomous: Categories with only two possible values. Examples:
• Yes / No
• High / Low
• Present / Absent
• Multichotomous: Categories with more than two possible values. Examples:
• Political affiliation (Democratic, Republican, Green, Independent, etc.)
• Religion (Muslim, Christian, Buddhist, etc.)
• Geographic Regions (Northeast, Southwest, Northwest, etc.)
• Quantitative data represents characteristics as quantities or numeric values
• Ordinal: Ranks items in order. Examples:
• Likert scales: On a scale of 1 to 5 how do you feel about...
• Poll rankings ( AFI list of Top 100 Movies: #1 Citizen Kane, #2, Casablanca, #3 The Godfather...)
• Completion order (Rank by number of games won: #1 Gonzaga, #2 Baylor, #3 Kansas...)
• Absolute: Numbers that stand alone and do not depend on other numbers for their meaning
• Counts: Numeric characteristics that can be counted with no fractional parts. Also referred to as discrete data. Often represented in storage as integers, although real numbers can be used when the possible values exceed the maximum value that can be represented with an integer. Examples:
• Population (there are no half people)
• Number of votes
• Number of days
• Amounts: Numeric characteristics that are measured. These characteristics do not necessarily fall exactly on integer boundaries and are represented with numbers that have decimal points. Also referred to as continuous data. Represented in storage with real numbers. Examples:
• Household income
• Gross domestic product
• Latitude / Longitude
• Temperature
• Relative: Numbers that have meaning only in reference to other numbers. Created by performing mathematical operations on absolute values.
• Central tendency: A single central (middle) value used to summarize a set of numbers. Examples: mean (average), median, mode, per-capita. Examples:
• Proportion: The fraction that something is of a whole. Examples:
• Spatial Density: The amount of something within a particular unit of area. Examples:
• Residents per square mile in each zip code
• Bushels of corn harvested per acre of farmland
• Growth: Percentage change from a prior time. Examples:
• Index: Values representing different characteristics that are combined into a single number. Examples:

### Units

When an attribute is a number, it needs a unit, which is "a determinate quantity (as of length, time, heat, or value) adopted as a standard of measurement." A number without a unit has no meaning in geospatial data. If you don't know what the unit for an attribute is, you don't know what the numbers mean and you cannot perform any useful analysis with that data.

For example, the number 42 by itself is meaningless. However, 42 people, 42 dollars, or 42 average new cancer cases per year are numbers that clearly describe what is where. In this example, people, dollars, and average new cancer cases per year are examples of units.

With geospatial data, units should include clarifying components:

• Mathematical component: What mathematical unit is used?
• Categorical component: What is the phenomena being measured?
• Spatial component: What area is being covered by each of the values?
• Temporal component: What range of time is covered by each value?

For example, if you have data for Stroke mortality per 100K residents by state for 2016:

• Mathematical component: per 100,000 residents
• Categorical component: stroke mortality (rate of death by stroke)
• Spatial component: by state
• Temporal component: in the year 2016

### Primary vs Secondary Data

Primary data is data you capture yourself.

Secondary data is data you recycle from someone else.

Primary data is often more expensive to obtain since you will have to do work to get it, such as by surveying with GPS devices or conducting interviews with human subjects. But if you're doing something novel, you will likely need novel data that you get yourself rather than from someone else.

### Population vs Sampled Data

Generally, the more data you have, the better your conclusions when you analyze that data. If you have population data for everything you are studying, that is ideal.

An example of population data is the census conducted every ten years, where the US Federal Government attempts to find and count every person in the country and get basic information on who they are and where they live (what is where).

However, in many if not most research situations, it is too difficult or expensive to capture complete data. For example, if you are running a congressional campaign, you cannot survey every voter in your congressional district to find out how they are planning to vote.

In such cases it is usually adequate to capture a sample of the data and then use statistical techniques to make an inference from that sample about the full population. The use of sampling rather than a full census introduces uncertainty about whether your sample reflects the overall population. However, there is always some uncertainty when gathering data (especially about humans) and the uncertainty with sampled data can be quantified and considered as part of the analytical process.

### Individual vs Aggregated Data

Individual data is data about individual persons or objects.

Aggregated data is data that combines data from groups of individuals (often based on location in different geographic areas) into a smaller set of numbers, usually averages or medians. An example of aggregated data is US census data that is aggregated by census tract or county in order to preserve the privacy of individual census respondents.

As with sampling, aggregation introduces uncertainty as important individual distinctions can be lost when people are combined into groups and summarized.

One issue with aggregated data is the ecological fallacy, when you make assumptions about individuals based on aggregated data. For example, states are often classified as red states or blue states based on whether the majority of the voters in that state vote Republican or Democratic, respectively. However, even in very red Utah, Democratic President Obama got 25% of the vote in the 2012, so assuming that everyone you meet in Utah is conservative is incorrect.

The opposite of the ecological fallacy is the exception fallacy, where an assumption is made about a group based on a few exceptional individuals. For example, if you meet a tall basketball player from Ohio, the assumption that everyone in Ohio is tall would be incorrect.

### Structured vs Unstructured Data

Structured data is strictly organized so that for every field or record in your data, you know what it represents. If your data is organized in a table with columns and rows, it is probably structured. Most geospatial data that you deal with in conventional geographic information systems is structured.

Unstructured data is data that is not clearly organized in a way that it can be simply processed by computers. Examples of unstructured data is text like Facebook posts, text messages or tweets. There is meaningful data there, but turning it into something useful requires analysis (often by humans or complex computer algorithms) to give it some kind of structure. Contemporary societies generate tremendous amounts of unstructured data and the analysis and use of that data is an area of heavy research (and investment) called big data.

## Accuracy vs Precision

The precision with which you express a number should be in keeping with the amount of accuracy that your data possesses.

Accuracy is how close your count or measurement reflects the reality that you are counting or measuring.

Precision is the amount of accuracy that you express with the numbers you use to represent your data. Usually what this means is the number of significant digits used. For example:

• 1.2345123 is more precise than 1.2
• 45,000,000 is less precise than 45,244,391.224

For example:

• The measured temperature on three days was 73, 76 and 78 degrees (two significant digits)
• A calculator might calculate and display the average (mean) temperature as 75.666666667
• Since you have only two significant digits in your source data, you do not have 11 digits of accuracy implied by the 11 digits of precision
• Rounding and writing the mean as 76 degrees or 76 2/3 degrees would keep the precision and accuracy in harmony

Precision is often used to deceive people into thinking that you have a better understanding of reality than you actually do. For example:

• You say that 12,456 people attended your campaign rally
• The assumption from that statement is that you have an accurate account of attendance and that you are a confident, well-informed candidate worthy of election
• However, that number is just describing a rough estimate between 5,000 and 15,000 since people come and go from campaign rallies and often are not being accurately counted when they enter or leave an area
• Your highly precise statement is deceptive because you don't really have a count as accurate as the precision of your statement implies

### Spatial Accuracy vs Precision

The world is thing of infinite, wondrous complexity.

When we try to understand that world as numbers, we have to simplify. Our measuring devices and techniques cannot be exactly accurate. We have to round numbers so our precision reflects our accuracy.

With latitudes and longitudes in degrees, the number of decimal places you use (precision) reflects the accuracy of that location on the ground. That accuracy can be expressed as +/- a distance on the ground.

The table below shows the approximate distances for each fraction of a degree in Manhattan:

## Metadata

If you don't know what you have, then you don't have it.

Metadata is data about your data. Typical items kept in metadata include:

• Name of the data set
• Overall description of what the data set contains
• Dates of capture, processing, analysis, and/or backup
• Information about the variables (names, ranges, units, etc)
• Information about the location data
• Notes or caveats about the quality and accuracy of the data
• Source and contact information for the people who collected the data

Students often confuse data with metadata. The following table contrasts data with metadata that could be associated with that data:

DataMetadata
Patient names and addressesThe name of the computer disk file where current patient information is kept
Names of participants in a drug trialThe names of the technicians who recorded the drug trial results
Test scoresThe range of dates when the test scores were captured
Hourly temperaturesThe location of the sensor where those temperatures were recorded
Incidence rates for diseases by stateThe names of the agencies that collected the disease data

While simply remembering what you have is adequate for small projects with short-term needs (like class projects), failure to document your data may make the data useless if you ever want to reuse that data in the future.

• Metadata helps keeping track of your inventory of data when you have a number of different data sets
• Metadata aids in sharing data by making it possible for other people to know exactly what is in your data set
• Metadata can be used as a reminder if you need to use your data in the future and have forgotten details about what the variables mean or where you got the data
• If you are collecting data for published research, you need to be able to provide details about your data so the quality of the research based on that data can be judged.

However, because the creation of metadata is usually separate from the capture and processing of the data itself, and is more about the future than getting the task at hand done, it is common for metadata to be missing or skeletal. While it is often possible to look at the data and make a reasonable guess as to what it represents and where it came from, a few minutes adding metadata can save a great deal of effort in the future for you and for other people who use your data.

## Philosophical Considerations

X/Y coordinates are called Cartesian Coordinates after the French philosopher Rene Descartes (1596 - 1650). Descartes is also remembered for is statement Cogito ergo sum (I think therefore I am) as evidence of our own existence.

Cartesian coordinates combines the principles of geometry codified by the Greek philosopher Euclid (c. 300 BC) with the techniques of algebra codified by the Arab mathemetician Muḥammad ibn Mūsā al-Khwārizmī (c. 780–850) to create a useful and powerful synthesis that is a foundation for contemporary geospatial technology.

Descartes coordinate system was part of a broader philosophical quest to find objective knowledge about the world that is independent of our perspective and preconceptions. Descartes believed that the only essential properties of matter were geometric, so an objective understanding of the world could be had through a sufficiently advanced geometry (Shand 2002, pp 72).

Descartes philosophy, to some extent, is a foundation for the geospatial technology that is based on Decartes way of representing the world in his coordinate system. This is also the foundation of a critique of geospatial technology that reducing the world to X/Y coordinates obscures not only important qualitative and subjective meanings, but also obscures the political and economic forces (and associated moral questions) underlying the development and use of geospatial technology.

Geospatial technologies with their vivid visualizations and massive scope, are charismatic, and charisma can deceive.