Geospatial Data Storage

Digital Data

In contemporary geographic information systems, geospatial data is stored as digital data.

As the name implies, digital data consists of digits or numbers. Internally, digital electronic technology represents data as binary signals (bits) that are either on or off. This binary representation allows a high level of flexibility and accuracy in the representation and processing of data.

For historical reasons, bits are clumped into groups of eight that are called bytes. If you run through all the possible combinations of eight bits, you will find that a byte can have 256 different values (the numbers 0 to 255). This is enough for a single byte to represent any letter, digit, or punctuation mark used in English (and, with a suitable encoding, the alphabets of many other languages), so a five-character word like Hello requires five bytes to store.
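The arithmetic above can be checked directly in R. This is a minimal sketch using only base R functions; the variable names are chosen for illustration.

```r
# Eight on/off bits give 2^8 possible combinations
values_per_byte <- 2^8
print(values_per_byte)

# Raw bytes for the word "Hello": one byte per character for plain English text
hello_bytes <- charToRaw("Hello")
print(hello_bytes)
print(length(hello_bytes))

# The eight bits of the first byte (the letter H), most significant bit first
h_bits <- paste(rev(as.integer(intToBits(as.integer(hello_bytes[1]))[1:8])), collapse = "")
print(h_bits)
```

The raw bytes are printed in hexadecimal, so the letter H appears as 48 (decimal 72).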

To improve speed, modern computers process multiple bytes at one time as words. Although older computers and some mobile devices use 32-bit words (four bytes), most contemporary laptops and desktops use 64-bit words (eight bytes).

The amount of storage in a computer or storage device is usually measured by the number of bytes that it can store. Because storage devices can store trillions or quadrillions of bytes, Greek prefixes (kilo-, mega-, giga-, tera-) are used to make referring to large numbers of bytes easier. However, because this is digital data, the prefixes conventionally count in powers of two rather than powers of ten, so the decimal equivalents look a bit sloppy: a kilobyte is 2^10 = 1,024 bytes rather than an even 1,000.
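As a rough illustration of the power-of-two convention, the following base R sketch tabulates the number of bytes behind each prefix. The choice of five prefixes and the column names are assumptions made for this example.

```r
# Each step up in prefix multiplies the size by 2^10 (1,024)
prefixes <- c("kilo", "mega", "giga", "tera", "peta")
exponents <- seq(10, 50, by = 10)

sizes <- data.frame(
  unit = paste0(prefixes, "byte"),
  power_of_two = paste0("2^", exponents),
  bytes = format(2^exponents, big.mark = ",", scientific = FALSE, trim = TRUE),
  stringsAsFactors = FALSE
)

print(sizes, row.names = FALSE)
```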

Storage Media

Digital data is stored on a variety of physical media, depending on how quickly the data needs to be accessed, how much data needs to be stored, and whether the data needs to continue to exist when the digital device is turned off or rebooted.

Considerations When Choosing Storage Formats and Platforms

A number of factors need to be considered when choosing the appropriate storage hardware and formats for a project. Those needs are driven by the organizational size and mission: What are you ultimately trying to accomplish with the data?

  1. Number of readers:
    • How many people need to access the data?
    • How quickly do they need access to the data?
  2. Number of editors:
    • How many people capture, process and maintain the data?
    • Will multiple people be working on the data at the same time?
  3. Frequency of change:
    • How often is the data changed?
    • How quickly do changes need to be available to users?
  4. Volume and types of data:
    • How much data exists?
    • How much data will exist?
    • How many different types of data need to be kept together?
    • How will needs grow or shrink over time?
  5. Access security:
    • Who needs access to the data?
    • Who should be kept out of the data?
    • Do federal or state regulations require restricting access to the data?
    • How do the costs of a security breach balance against the costs of security?
  6. Availability security:
    • What would happen if this data were lost or destroyed?
    • Who will perform backups?
    • Does this data need to survive this project?
  7. Cost:
    • Will this be compatible with existing processes?
    • What are the set-up and maintenance costs for storage?
    • What can we afford in terms of both capital investment and manpower?
    • Do managers or co-workers have a preconceived bias against a technology?

Storage Formats

Geospatial data can be stored in a number of different types of digital files on the physical media described above. The following are file types you will commonly encounter when performing basic GIS.

Graphical Maps

Gerardus Mercator's 1569 World Map (Wikipedia)

CSV

Geospatial data can be stored in simple table formats like comma-separated values (CSV) files, with columns of latitude and longitude on each row alongside the attributes of the feature at that location. However, this approach is largely limited to points. For lines (like roads) and areas (like neighborhoods or census tracts), you need to save data in a specialized geospatial data file format.

The simplicity of a CSV file also has an advantage in its potential for preserving data. File formats that are more complex (especially proprietary formats) will become obsolete as technology changes. But data in a CSV file will likely be readable for generations to come.

Example CSV File
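To make the layout concrete, here is a small sketch that writes a CSV file of points and reads it back in base R. The file name, column names, and city coordinates (approximate) are hypothetical values chosen for illustration, not data from this tutorial.

```r
# Hypothetical point data: one row per location, with a name and coordinates
csv_text <- c(
  "NAME,LATITUDE,LONGITUDE",
  "Denver,39.7392,-104.9903",
  "Chicago,41.8781,-87.6298",
  "Los Angeles,34.0522,-118.2437"
)

# Write the rows to a temporary file so the example leaves nothing behind
csv_path <- file.path(tempdir(), "example-points.csv")
writeLines(csv_text, csv_path)

# Read the file back: each row becomes one point with its attributes
cities <- read.csv(csv_path, stringsAsFactors = FALSE)
print(cities)
```

Because the file is plain text, it can also be opened and checked in any text editor or spreadsheet program.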

Shapefiles

The shapefile is a geospatial data file format that was developed by ESRI in the early 1990s. While the age of this format is reflected in its numerous limitations (such as a column name length limit of 10 characters), this format is supported by a wide variety of GIS software and is still commonly used by municipal governments for distributing geospatial data, including Denver, Chicago, Los Angeles, and New York, among many others.

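The shapefile is actually a collection of at least three (and usually more) separate files that store the locational data, the characteristics associated with those locations, and other information about the data. Some common files associated with a shapefile include (listed by the file extension):

  • .shp: the feature geometry (the locational data)
  • .shx: an index into the geometry file
  • .dbf: the attribute table (the characteristics associated with each location)
  • .prj: the coordinate system and projection information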

For convenience, all these files are usually compressed into a single ZIP file for distribution on websites and servers.

A Listing of the Different Files in a Shapefile
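Because a shapefile only works when its component files travel together, it can be useful to check that none are missing after unzipping. The sketch below is one way to do that in base R; the function name check_shapefile and the demo folder are hypothetical, and the check covers only the three required files (.shp, .shx, .dbf) plus the optional .prj file.

```r
# Extensions of the three files every shapefile needs, plus the optional projection file
required_extensions <- c("shp", "shx", "dbf")
optional_extensions <- c("prj")

# Report which component files exist for a shapefile's base name in a folder
check_shapefile <- function(folder, base_name) {
  extensions <- c(required_extensions, optional_extensions)
  paths <- file.path(folder, paste0(base_name, ".", extensions))
  data.frame(
    extension = extensions,
    required = extensions %in% required_extensions,
    present = file.exists(paths),
    stringsAsFactors = FALSE
  )
}

# Demonstration with empty placeholder files in a temporary folder
demo_folder <- file.path(tempdir(), "shapefile-demo")
dir.create(demo_folder, showWarnings = FALSE)
invisible(file.create(file.path(demo_folder, paste0("example.", required_extensions))))

parts <- check_shapefile(demo_folder, "example")
print(parts)
```

The placeholder files are empty, so this only demonstrates the file-name check, not a readable shapefile.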

Single-User Geodatabases

Geodatabases are an organized way to keep similar data together. When a database is being used by a single user, ArcMap provides two different single-user database file formats. These are proprietary data file formats that are designed to fully support the features of ESRI software, and support for them in other software is limited.

The recommended single-user database format is the file geodatabase, which was introduced in 2006. Like a shapefile, the file geodatabase is a collection of different files, with all files kept in a folder that has a .gdb extension.

Example of Multiple Feature Classes in a File Geodatabase

A personal geodatabase is a Microsoft Access database .mdb file containing geospatial data. This format was originally introduced in 1999 and has some significant limitations, including a 2 GB table size limit. ESRI recommends using file geodatabases rather than personal geodatabases, although you may occasionally find older data in this format.

An open-source file format analogous to the proprietary ESRI Personal Geodatabase is SpatiaLite, which is an extension of the SQLite self-contained database file format.

Geospatial File Formats For the Web

Google Earth/Maps exchanges geospatial data in the Keyhole Markup Language (KML) format, which is based on Extensible Markup Language (XML). KML is designed for the web and contains information on how the geospatial data should be displayed on a web map like Google Maps, or in Google Earth. Since KML was designed for simple web mapping, it is not particularly good for storing complex attribute data. KML can be imported into ArcMap with the KML To Layer tool and exported with the Layer To KML tool. Most GIS software can read KML files, but shapefiles are usually preferred for serious analysis or when working with data sets of any significant size.
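To show what the XML structure of KML looks like, here is a minimal sketch in base R that builds a KML document containing a single placemark. The function name make_kml_placemark and the approximate coordinates for Spokane are assumptions for illustration, and the sketch does not escape special XML characters in the name.

```r
# Build the lines of a minimal KML document with one placemark.
# KML writes coordinates as longitude,latitude (not latitude,longitude).
make_kml_placemark <- function(name, longitude, latitude) {
  c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<kml xmlns="http://www.opengis.net/kml/2.2">',
    '  <Placemark>',
    sprintf('    <name>%s</name>', name),
    sprintf('    <Point><coordinates>%.4f,%.4f</coordinates></Point>', longitude, latitude),
    '  </Placemark>',
    '</kml>'
  )
}

kml_lines <- make_kml_placemark("Spokane", -117.4260, 47.6588)
writeLines(kml_lines)
```

Passing a file path as the second argument to writeLines would save the document as a .kml file.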

The GPS Exchange Format (GPX) is another XML-based format that is commonly exported by GPS tracker apps in smartphones to store GPS points. GPX files can be imported into ArcMap using the GPX To Features tool.

GeoJSON is an extension to JavaScript Object Notation (JSON) that is used for data displayed in web maps. Although ArcMap can convert to and from GeoJSON using the Features To JSON and JSON To Features tools, respectively, GeoJSON is primarily of value to web map programmers.
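For comparison with the other formats, the sketch below builds a single GeoJSON point feature as a text string in base R. The function name make_geojson_point and the example coordinates are hypothetical, and the name is inserted without escaping, so this is only suitable for simple text.

```r
# Build a GeoJSON Feature for one point.
# GeoJSON lists coordinates as [longitude, latitude].
make_geojson_point <- function(name, longitude, latitude) {
  sprintf(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [%.4f, %.4f]}, "properties": {"name": "%s"}}',
    longitude, latitude, name
  )
}

feature <- make_geojson_point("Spokane", -117.4260, 47.6588)
cat(feature, "\n")
```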

AutoCAD

You may occasionally see geospatial data stored in the files used by the engineering drafting program AutoCAD. However, AutoCAD is a general-purpose drafting program for objects of all sizes, and the proprietary file format often does not contain adequate coordinate or attribute information to allow data to be transferred directly into GIS software.

AutoCAD

Raster Data Formats

Remotely-sensed raster data from satellites and other aerial platforms is stored in a wide variety of file formats.

These file formats are specialized to raster data and are discussed in much greater practical detail in classes or tutorials on remote sensing.

MODIS NDVI For the USA

Historical geospatial data is frequently gleaned from historical maps that have been scanned into raster images. The image files can use the standard JPEG, PNG or TIFF file formats used for photographs.

When digitizing historical information, it is possible to georeference maps. This involves interactively assigning contemporary coordinates to locations on a historical map and then using mathematical transformations (resampling) to reshape that map so that historical locations can be found with contemporary geospatial technology. After georeferencing, the images are often saved to a GeoTIFF file that preserves the georeferencing information and allows the image to be loaded just like any other raster data.
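The mathematical transformation at the heart of georeferencing can be sketched with a simple affine fit. The example below assumes four hypothetical control points that match pixel positions on a scanned map to made-up map coordinates, and it fits the transformation by least squares with base R's lm function. It covers only the coordinate conversion, not the resampling of the image itself, and real georeferencing tools may use other transformation types.

```r
# Hypothetical control points: pixel positions on the scan (px_col, px_row)
# matched to contemporary map coordinates (x, y)
gcp <- data.frame(
  px_col = c(100, 900, 100, 900),
  px_row = c(100, 100, 700, 700),
  x      = c(1000, 1800, 1000, 1800),
  y      = c(5000, 5000, 4400, 4400)
)

# Fit an affine transformation by least squares:
#   x = a0 + a1 * px_col + a2 * px_row
#   y = b0 + b1 * px_col + b2 * px_row
fit_x <- lm(x ~ px_col + px_row, data = gcp)
fit_y <- lm(y ~ px_col + px_row, data = gcp)

# Convert a pixel position on the scan to map coordinates
pixel_to_map <- function(px_col, px_row) {
  pixel <- data.frame(px_col = px_col, px_row = px_row)
  c(x = unname(predict(fit_x, pixel)), y = unname(predict(fit_y, pixel)))
}

center <- pixel_to_map(500, 400)
print(center)
```

With these control points the fit is a shift plus a flip of the vertical axis, since image rows count downward while map y coordinates count upward. Real control points rarely fit perfectly, and the leftover residuals are one measure of the imperfection described in the next paragraph.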

As with geocoding, the process is imperfect as there are almost always errors and inaccuracies in the work of map makers (both past and present) and no simplified representation can ever perfectly capture complex, ever-changing reality.

Downtown Spokane Washington, 1908

Storage Platforms

Paper

Shared Computer Hard Drive

Laptop and Mobile Device Memory

Personal/Work Desktop Hard Drive

Portable Flash Drive

File Server

Servers are powerful computers that are accessed over a computer network by client users. This client-server model is used for a wide variety of applications in digital computing.

Client-Server Model (Wikipedia)

File servers are centralized servers that laptop and desktop computers can connect to so that files appear as if they were stored on the local hard drive. They are best for personal use or when working in a small group or organization. On PCs, file servers can be mounted as lettered drives (e.g., N:) and handled through the organization's conventional IT management structure.

Network Drive Example

Web Server

When you type a URL into a web browser or click on a hyperlink, your client sends a request to a server for a web page. The server that responds to that request is called a web server, and the response includes the text, images, videos, and other elements needed to display a web page.

Web pages often contain web maps that users can interact with. While the page containing the web map is provided by a web server, the contents of the maps themselves require specialized geospatial data servers.

However, a general-purpose web server can provide access to geospatial data files, such as through city open data portals, for use with desktop GIS software.

Geospatial Data Server

When working with large, complex data sets that change frequently and are shared among multiple users, you need a more robust way of storing your data.

Geospatial data servers are specialized servers that use software like ESRI's ArcGIS Server. They store data in databases that make that data accessible to multiple editors for maintenance. Geospatial data servers also render this data into maps that can be displayed on web clients as web maps.

The Cloud

Organizations have traditionally maintained racks of servers for handling different types of data. The challenge with this approach is that servers are expensive to build, operate and maintain. Also, organizations must provide enough server capacity to handle peak loads (such as at 9am when employees arrive at work and check their e-mail). This full capacity is unused most of the time, making conventional server deployment an inefficient use of resources.

One approach to increasing the efficiency of server utilization is server virtualization. This practice is commonly referred to as cloud computing or just The Cloud.

Collections of servers are shared among multiple organizations and/or groups within organizations. As the needs of these organizations ebb and flow, server capacity is given to those who need more capacity and taken away from those who don't. What is unique about server virtualization is that this dynamic allocation is handled automatically by software on the servers, making it possible to quickly respond to changing conditions.

Cloud servers can be spread out among server farms at multiple geographic locations. This improves reliability because if there is a problem at one server farm, the other server farms in the cloud can pick up the slack. Also, data can be duplicated in multiple locations, which makes it possible to serve multiple users more quickly and provides a backup in case there is a hardware failure in any one server farm.

The Cloud is now used by most companies with large internet presences, such as Google, Netflix, and Amazon. For smaller organizations, The Cloud can free the organization from the cost, hassle and unreliability of internal servers.

Cloud Storage

Cloud storage is probably the aspect of The Cloud that is most visible to the general public. Services like OneDrive, Dropbox, Google Drive and Box provide a centralized, instantly available location for storing and backing up personal files.

Infrastructure As a Service

Infrastructure as a Service (IaaS) is when a cloud company makes a virtual server available through the Internet that can function as if it were a server in your own office. You are responsible for maintaining the software, but the cloud company maintains the hardware, and can even allocate more server capacity to you when needed.

For example, Amazon sells cloud services that use the same infrastructure as its sales operations, and although Amazon Web Services only made up 7% of Amazon's 2014 revenue, it accounted for 37% of its profit. A company can set up an EC2 instance on Amazon Web Services and run ArcGIS for Server in The Cloud just as though it were on a server in its own office.

Software As a Service

Software as a Service (SaaS) is when you use an interactive website from a cloud provider much like you would use a piece of software actually running on your computer or mobile device. Google Docs is an example of SaaS. More specifically with geospatial data, popular SaaS services like ArcGIS Online, Carto, and Mapbox make it possible to easily create and maintain complex and attractive web maps without having to install, learn and use complex desktop and server software.

Data As a Service

Data as a Service (DaaS) is available when web mapping portals like Google Maps, OpenStreetMap, and Bing make their geospatial data available to developers through Application Programming Interfaces (APIs) for use in web and mobile applications.

Security has been a persistent question hanging over cloud services. A bigger issue, however, has been the increasing dependence of a large number of companies on a small number of cloud providers, which has produced spectacular failures when those providers have technical problems.

The Digital Dark Ages

In the developed world, we capture and store almost everything that can be stored: security video, electronic communications, smartphone photos of events momentous and trivial.

Almost none of that data will survive us.

Although storage becomes cheaper every year, technology changes every year. Data must be migrated from old storage media and file formats, or it is lost to physical degradation or technological obsolescence.

Data in The Cloud never has a permanent physical home. The Cloud is a performance and requires constant flows of capital and resources to stay in operation. Changes in the economics of The Cloud will necessitate loss of some of that data. Which data will be lost to time?

Contrast the impermanence of the digital with papyrus text from 2500 BC or clay tablets from as far back as 3300 BC.

While security camera video from an ATM where there has been no criminal activity may not be something that should outlive us, your grandchildren may want to see some of those thousands of baby pictures that you took of your son in the first year of his life. You should plan accordingly.

Geospatial Data Storage in ArcMap

  1. CSV files (0:10)
  2. Shapefiles (2:50)
  3. File Geodatabases (4:30)
  4. Saving Data in a New File Geodatabase (5:50)
  5. Servers (6:45)
  6. Web Feature Service (7:15)
  7. KML and Google Maps (7:50)
  8. ArcCatalog (9:20)

Geospatial Data Storage in R

Plotting Points From a CSV File

The CSV file used in this example is 2016-cracker-barrel.csv.

# The R GDAL library contains GDAL functions for loading spatial data
# RGDAL loads the sp library, which contains spatial functions
library(rgdal)

# Read the CSV file with latitudes and longitudes
data = read.csv("2016-cracker-barrel.csv", stringsAsFactors=FALSE)

# Convert the data to a SpatialPointsDataFrame
coords = data.frame(data$LONGITUDE, data$LATITUDE)
wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
restaurants = SpatialPointsDataFrame(coords, data, proj4string = wgs84)

# Simple x/y plot of the points
plot(restaurants, pch=16, col="navy")

# Load and plot an OpenStreetMap base map covering the area
upperLeft = c(max(data$LATITUDE) + 0.2, min(data$LONGITUDE) - 0.2)
lowerRight = c(min(data$LATITUDE) - 0.2, max(data$LONGITUDE) + 0.2)

library(OpenStreetMap)
osm = openmap(upperLeft, lowerRight, type="osm")
plot(osm)

# Transform the projection from WGS84 to Spherical Mercator to plot the points
spherical_mercator = CRS(paste("+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0",
	"+k=1.0 +units=m +nadgrids=@null +no_defs"))
restaurants = spTransform(restaurants, spherical_mercator)

plot(restaurants, pch=16, col="red", cex=2, add=TRUE)

# Optional save to a shapefile
writeOGR(restaurants, ".", "cracker-barrel", "ESRI Shapefile")

# Optional save to sqlite database file
writeOGR(restaurants, "cracker-barrel.sqlite", "restaurants", driver="SQLite")

Plotting Polygons From a Shapefile

The data for this example is in 2017-state-data.zip.

library(rgdal)

# Load the shapefile (unzip 2017-state-data.zip into the working directory first)
states = readOGR(".", "2017-state-data", stringsAsFactors=FALSE)

# Select only the contiguous 48 states
states = states[((states$ST != "AK") & (states$ST != "HI")),]

# Transform to Albers Equal Area Conic so that the relative areas of the states are preserved
albers = CRS(paste("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0",
	"+ellps=GRS80 +datum=NAD83 +units=m +no_defs"))
states = spTransform(states, albers)

# Plot a red-state/blue-state choropleth
colors = ifelse(states$WIN2012 == "Obama", "navy", "red")
plot(states, col = colors)