# Geospatial Data Storage

## Digital Data

In contemporary geographic information systems, geospatial data is stored as digital data.

As the name implies, digital data consists of digits or numbers. Internally, digital electronic technology represents data as binary signals (bits) that are either on or off. This binary representation allows a high level of flexibility and accuracy in the representation and processing of data.

For historical reasons, bits are grouped into sets of eight called bytes. If you run through all the possible combinations of eight bits, you will find that a byte can have 256 different values (the numbers 0 through 255). This is enough for each byte to represent a single character in English and many other Western languages, so a five-character word like Hello requires five bytes to store.
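The arithmetic above can be checked with a few lines of base R. This is a minimal sketch that assumes the text is plain ASCII, where each character occupies exactly one byte; the variable names are only illustrative.

```r
# Number of distinct values that eight bits can represent
bits_per_byte <- 8
values_per_byte <- 2^bits_per_byte

# Each character of an ASCII string is stored as one byte
word <- "Hello"
word_bytes <- charToRaw(word)      # raw vector holding the byte values
n_bytes <- length(word_bytes)

print(values_per_byte)
print(as.integer(word_bytes))      # numeric value of each byte
print(n_bytes)
```

Text containing characters outside the ASCII range (accented letters, non-Latin scripts) generally needs more than one byte per character, so the byte count and the character count can differ.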

To improve speed, modern computers process multiple bytes at one time as words. Although older computers and some low-power devices use 32-bit words (four bytes), most contemporary laptops, desktops, and smartphones use 64-bit words (eight bytes).

The amount of storage in a computer or storage device is usually measured by the number of bytes that it can store. Because storage devices can store trillions or quadrillions of bytes, Greek prefixes are used to make referring to numbers of bytes easier. However, because this is digital data, powers of two are used, which makes the decimal equivalents look a bit untidy:

• One kilobyte (KB) = 2^10 bytes = 1,024 bytes
• One megabyte (MB) = 2^20 bytes = 1,048,576 bytes = 1,024 KB
• One gigabyte (GB) = 2^30 bytes = 1,073,741,824 bytes = 1,024 MB
• One terabyte (TB) = 2^40 bytes = 1,099,511,627,776 bytes = 1,024 GB
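The following sketch reproduces the unit sizes listed above and converts a raw byte count into the largest unit that fits. The helper `format_bytes()` is a hypothetical name introduced only for this illustration, and the sketch uses base R only.

```r
# Binary multiples: each prefix is 2^10 (1,024) times the previous one
unit_sizes <- c(KB = 2^10, MB = 2^20, GB = 2^30, TB = 2^40)

# Hypothetical helper: express a byte count in the largest unit that fits
format_bytes <- function(n_bytes) {
  fits <- unit_sizes[unit_sizes <= n_bytes]
  if (length(fits) == 0) {
    return(paste(n_bytes, "bytes"))
  }
  unit <- fits[length(fits)]       # largest unit not exceeding n_bytes
  paste(round(n_bytes / unit, 2), names(unit))
}

print(format(unit_sizes, big.mark = ",", scientific = FALSE))
print(format_bytes(1536))          # 1.5 KB
print(format_bytes(5 * 2^30))      # 5 GB
```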

## Storage Media

Digital data is stored on a variety of physical media, depending on how quickly the data needs to be accessed, how much data needs to be stored, and whether the data needs to continue to exist when the digital device is turned off or rebooted.

• Random access memory (RAM) is made with silicon transistors to quickly store and access data that is being actively used. RAM is fast but more expensive than other forms of memory, and the data is lost when the device is turned off or rebooted.
• Magnetic hard disks are spinning platters coated with magnetic material that stores data in magnetic patterns on the disk. Hard drives can store very large amounts of data (in the terabytes), but this data takes longer to access than RAM. Hard drives are a reliable, established technology. Data on a hard drive remains even after the hard drive is powered down, but hard drives do not last forever and will eventually fail, often taking their data with them.
• Flash memory is made with transistors like RAM, but built with a special structure (floating-gate MOSFET) that allows the data to persist even if the power is turned off. Flash memory has become ubiquitous in consumer devices (SD cards, thumb drives, smartphones, etc.) because it has high capacity and has become inexpensive. Solid-state drives built from flash memory are steadily replacing magnetic hard disks because they are faster and use less power. However, flash memory is limited in the number of times it can be written to, so solid-state drives wear out under heavy writing and can fail without warning.
• Optical disks such as compact disks (CDs) and digital versatile disks (DVDs) store bits as indentations in aluminum or chemical films that are then encased in plastic disks. Optical disks have high capacity and are inexpensive to manufacture in bulk. However, they are generally used only for data that will not change for extended periods of time, and they are commonly used to archive and back up data from magnetic and flash drives. It is uncertain how long data on a CD or DVD can be expected to last, and optical disks are rapidly becoming obsolete.
• Magnetic floppy disks store data in a similar manner to magnetic hard disks, except on a removable plastic disk nestled in a protective case. You may occasionally encounter old data stored on floppy disks, although this technology is obsolete and unreliable. You should migrate any important data off these disks and onto a hard drive as soon as possible so the data is not lost to physical degradation.
• Magnetic tape is a roll of plastic film coated with a magnetic material and used to store bits in a similar way to magnetic hard drives. Although tape is one of the oldest technologies for storing digital data, tape drives are still used to back up hard drives for long-term storage.

## Considerations When Choosing Storage Formats and Platforms

A number of factors need to be considered when choosing the appropriate storage hardware and formats for a project. Those needs are driven by the organizational size and mission: What are you ultimately trying to accomplish with the data?

1. Number of readers:
• How many people need to access the data?
• How quickly do they need access to the data?
2. Number of editors:
• How many people capture, process and maintain the data?
• Will multiple people be working on the data at the same time?
3. Frequency of change:
• How often is the data changed?
• How quickly do changes need to be available to users?
4. Volume and types of data:
• How much data exists?
• How much data will exist?
• How many different types of data need to be kept together?
• How will needs grow or shrink over time?
5. Access security:
• Who needs access to the data?
• Who should be kept out of the data?
• Do federal or state regulations require restricting access to the data?
• How do the costs of a security breach balance against the costs of security?
6. Availability security:
• What would happen if this data were lost or destroyed?
• Who will perform backups?
• Does this data need to survive this project?
7. Cost:
• Will this be compatible with existing processes?
• What are the set-up and maintenance costs for storage?
• What can we afford in terms of both capital investment and manpower?
• Do managers or co-workers have a preconceived bias against a technology?

## Storage Formats

Geospatial data can be stored in a number of different types of digital files on the physical media described above. The following are file types you will commonly encounter when performing basic GIS.

### Graphical Maps

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Information is fixed and difficult to change
• Volume and types of data: Restricted by the size of the media. Limited to information that can be visualized
• Cost: Quality maps are expensive to create. Digitization is time-consuming and error-prone

### CSV

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Primarily useful only for a limited number of points
• Cost: Inexpensive. Can be easily edited with any spreadsheet software and rendered on a variety of web and desktop platforms

Geospatial data can be stored in simple table formats like comma-separated values (CSV) files, with columns of latitude and longitude and, on each row, the attributes associated with that location. However, this is largely limited to points. For lines (like roads) and areas (like neighborhoods or census tracts) you need to save data in a specialized geospatial data file format.
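As a minimal sketch of the table layout described above, the following base R code reads a small CSV of points and computes the bounding box that a map of those points would need to cover. The store names, coordinates, and the `SEATS` attribute are made-up values used only for illustration.

```r
# Hypothetical CSV content: one point per row, plus attribute columns
csv_text <- "NAME,LATITUDE,LONGITUDE,SEATS
Store A,39.7392,-104.9903,120
Store B,41.8781,-87.6298,95
Store C,34.0522,-118.2437,150"

# read.csv() can parse literal text passed through the text argument
stores <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(stores)

# Bounding box of the points: longitude is x, latitude is y
bbox <- c(xmin = min(stores$LONGITUDE), ymin = min(stores$LATITUDE),
          xmax = max(stores$LONGITUDE), ymax = max(stores$LATITUDE))
print(bbox)
```

In practice the text would come from a file on disk, read with `read.csv("filename.csv")`.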

The simplicity of a CSV file also has an advantage in its potential for preserving data. File formats that are more complex (especially proprietary formats) will become obsolete as technology changes. But data in a CSV file will likely be readable for generations to come.

### Shapefiles

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Points, lines and polygons. Can be used to store large amounts of data, but this is not preferred
• Cost: Requires specialized GIS software to edit

The shapefile is a geospatial data file format that was developed by ESRI in the early 1990s. While the age of this format is reflected in its numerous limitations (such as a column name length limit of 10 characters), the format is supported by a wide variety of GIS software and is still commonly used for distributing geospatial data by municipal governments, including Denver, Chicago, Los Angeles, and New York, among many others.

The shapefile is actually a collection of at least three (and usually more) separate files that store the locational data, the characteristics associated with those locations, and other information about the data. Some common files associated with a shapefile include (listed by the file extension):

• .shp: Contains the feature geometry (points, lines, polygons)
• .shx: An index file that indicates where specific features are in the .shp file
• .dbf: A dBase IV database file of attributes associated with each of the shapes in the .shp file
• .prj: The coordinate system and projection used by the feature geometry (optional)
• .cpg: The character encoding used by the attributes (optional)
• .qpj: The coordinate system and projection in a format used by QGIS (optional)

For convenience, all these files are usually compressed into a single ZIP file for distribution on websites and servers.
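Because a shapefile is only usable when its required component files travel together, it can be worth checking a download before loading it. The sketch below uses a hypothetical helper, `check_shapefile()`, that inspects a vector of file names for the components listed above. It treats .shp, .shx, and .dbf as the three required files and uses base R only.

```r
# Hypothetical helper: report which shapefile components are present
# for a given base name within a vector of file names
check_shapefile <- function(file_names, base_name) {
  required <- c("shp", "shx", "dbf")
  optional <- c("prj", "cpg", "qpj")
  # Keep only the files that share the shapefile's base name
  matches <- file_names[tools::file_path_sans_ext(file_names) == base_name]
  found <- tolower(tools::file_ext(matches))
  list(
    complete = all(required %in% found),
    missing_required = setdiff(required, found),
    optional_present = intersect(optional, found)
  )
}

files <- c("2017-state-data.shp", "2017-state-data.shx",
           "2017-state-data.dbf", "2017-state-data.prj", "readme.txt")
result <- check_shapefile(files, "2017-state-data")
print(result)
```

The file names for a real folder could come from `list.files()`, and the contents of a ZIP archive can be listed without extracting it using `unzip(zipfile, list = TRUE)$Name`.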

### Single-User Geodatabases

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Supports vector and raster formats as well as additional features like topologies. Can be used to store large amounts of data, but this is not preferred
• Cost: Requires proprietary GIS software to use

Geodatabases are an organized way to keep similar data together. When a database is being used by a single user, ArcMap provides two different single-user database file formats. These are proprietary data file formats that are designed to fully support the features of ESRI software. Some open-source tools (such as GDAL) can read them, but the formats are best supported by ESRI software.

The recommended single-user database format is the file geodatabase, which was introduced in 2006. Like a shapefile, the file geodatabase is a collection of different files, with all files kept in a folder that has a .gdb extension.

A personal geodatabase is a Microsoft Access database .mdb file containing geospatial data. This format was originally introduced in 1999 and has some significant limitations, including a 2 GB size limit. ESRI recommends using file geodatabases rather than personal geodatabases, although you may occasionally find older data in this format.

An open-source file format analogous to the proprietary ESRI personal geodatabase is SpatiaLite, which is an extension of the SQLite self-contained database file format.

### Geospatial File Formats For the Web

• Number of readers: Multi-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Points, lines, and polygons. Can store large amounts of data, but large amounts of data in web maps makes them very slow
• Cost: Can be created with a variety of proprietary and open source products

Google Earth/Maps exchanges geospatial data in the Keyhole Markup Language (KML) format that is based on Extensible Markup Language (XML). KML is designed for the web and contains information on how the geospatial data should be displayed on a web map like Google Maps, or in Google Earth. Since KML was designed for simple web mapping, it is not particularly good for storing complex attribute data. KML can be imported and exported to/from ArcMap using the KML to Layer and Layer to KML tools, respectively. Most GIS software can read KML files, but shapefiles are usually preferred for serious analysis or when working with data sets of any significant size.

The GPS Exchange Format (GPX) is another XML-based format that is commonly exported by GPS tracker apps in smartphones to store GPS points. GPX files can be imported into ArcMap using the GPX To Features tool.

GeoJSON is a format based on JavaScript Object Notation (JSON) that is used for data displayed in web maps. Although ArcMap can convert to and from GeoJSON using the Features To JSON and JSON To Features tools, respectively, GeoJSON is primarily of value to web map programmers.
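To give a sense of what a GeoJSON file contains, the sketch below assembles a small FeatureCollection of two points as text. Base R has no JSON writer, so the strings are built with `sprintf()`. The helper `point_feature()` and the place names are hypothetical, and real projects would more likely use a package that reads and writes GeoJSON directly.

```r
# Hypothetical helper: build a GeoJSON Feature for a single point.
# GeoJSON lists coordinates as [longitude, latitude].
# Assumes the name contains no quotation marks that would need escaping.
point_feature <- function(name, longitude, latitude) {
  sprintf(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [%s, %s]}, "properties": {"name": "%s"}}',
    format(longitude, digits = 10), format(latitude, digits = 10), name
  )
}

# A FeatureCollection wraps any number of features
features <- c(
  point_feature("Denver", -104.9903, 39.7392),
  point_feature("Chicago", -87.6298, 41.8781)
)
geojson <- sprintf('{"type": "FeatureCollection", "features": [%s]}',
                   paste(features, collapse = ", "))
cat(geojson, "\n")
```

Note the coordinate order: longitude comes first, which is the reverse of the latitude/longitude order people commonly use when speaking.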

### CAD Files

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Engineering data
• Cost: Requires expensive proprietary software to maintain. Transferring into the GIS realm can be difficult and time-consuming

You may occasionally see geospatial data stored in the files used by the engineering drafting program AutoCAD. However, AutoCAD is a general-purpose drafting program for objects of all sizes, and its proprietary file format often does not contain adequate coordinate or attribute information to allow data to be transferred directly into GIS software.

### Raster Data Formats

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Constrained by single-editor restriction
• Volume and types of data: Raster data. Can be quite large
• Cost: Can be created with a variety of proprietary and open source products

Remotely-sensed raster data from satellites and other aerial platforms is stored in a wide variety of formats. These file formats are specialized to raster data and are discussed in much greater practical detail in classes or tutorials on remote sensing.

Historical geospatial data is frequently gleaned from historical maps that have been scanned into raster images. The image files can be in the standard JPEG, PNG or TIFF file formats used for photographs.

When digitizing historical information it is possible to georeference maps, which involves interactively assigning contemporary coordinates to locations on a historical map and then using mathematical transformations (resampling) to reshape that map so that historical locations can be found with contemporary geospatial technology. After georeferencing, the images are often saved to a GeoTIFF file that preserves the georeferencing information and allows the image to be loaded just like any other raster data.

As with geocoding, the process is imperfect as there are almost always errors and inaccuracies in the work of map makers (both past and present) and no simplified representation can ever perfectly capture complex, ever-changing reality.
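The transformation step in georeferencing can be sketched with a simple least-squares fit. The example below assumes an affine transformation, which is only one of several transformations GIS software offers, and uses made-up control points that pair pixel positions on a scan with known coordinates. The function `pixel_to_map()` is a hypothetical helper, and the sketch covers only the coordinate mapping, not the resampling of the image itself.

```r
# Hypothetical control points: pixel (col, row) positions on a scanned map
# paired with known contemporary coordinates (longitude, latitude)
control <- data.frame(
  col = c(100, 900, 120, 880),
  row = c(80, 100, 700, 720),
  lon = c(-105.02, -104.94, -105.018, -104.9421),
  lat = c(39.78, 39.778, 39.718, 39.716)
)

# Fit an affine transformation by least squares:
#   lon = a0 + a1 * col + a2 * row
#   lat = b0 + b1 * col + b2 * row
fit_lon <- lm(lon ~ col + row, data = control)
fit_lat <- lm(lat ~ col + row, data = control)

# Convert any pixel position on the scan to estimated coordinates
pixel_to_map <- function(col, row) {
  new <- data.frame(col = col, row = row)
  c(lon = unname(predict(fit_lon, new)),
    lat = unname(predict(fit_lat, new)))
}

estimate <- pixel_to_map(500, 400)
print(estimate)

# Residuals show how far each control point sits from the fitted transform
print(residuals(fit_lon))
print(residuals(fit_lat))
```

Nonzero residuals at the control points are one concrete measure of the imperfection described above: the larger they are, the less well the scanned map agrees with contemporary coordinates.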

## Storage Platforms

### Paper

• Number of readers: Multiple-user (copies and prints)
• Number of editors: Single-editor
• Frequency of change: Difficult to change. Prone to errors and obsolescence
• Volume and types of data: Useful for small amounts of data, but cumbersome with large amounts of data
• Access security: Immune from digital hacking, but requires expensive physical security
• Availability security: Quality paper kept in optimal conditions can survive for centuries and is immune to technological change. Highly vulnerable to physical destruction and loss
• Cost: Hand-written data is often difficult to read. Primary value is as a flexible distribution medium in low-tech situations like public presentations or when users cannot access digital media. Use with digital technology requires expensive and error-prone digitization. Copies are costly if large volumes or high print quality are needed

### Shared Computer Hard Drive

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Should only be used for temporary work storage of data transferred from other media
• Volume and types of data: Limited by storage device and installed software
• Access security: Highly insecure
• Availability security: Should only be used for temporary work storage of data transferred from other media
• Cost: Easiest choice in non-critical situations where no other options are available

### Laptop and Mobile Device Memory

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Highly flexible
• Volume and types of data: Limited by storage device and installed software
• Access security: Vulnerable to electronic hacking and physical theft
• Availability security: Physically flexible, but vulnerable to physical loss and device malfunction unless synced with server storage
• Cost: Mobile devices are ubiquitous

### Personal/Work Desktop Hard Drive

• Number of readers: Single-user
• Number of editors: Single-editor
• Frequency of change: Highly flexible
• Volume and types of data: Limited by storage device and installed software
• Access security: Vulnerable to electronic hacking and physical theft
• Availability security: Physically immobile. Vulnerable to physical loss and device malfunction
• Cost: Costs for purchase, maintenance, and real estate

### Portable Flash Drive

• Number of readers: Single-user at one time, multiple sequential users
• Number of editors: Single-editor at one time, multiple sequential users
• Frequency of change: Highly flexible
• Volume and types of data: Highly flexible
• Access security: Vulnerable to physical theft
• Availability security: Vulnerable to physical loss and device malfunction. Virus vector
• Cost: Inexpensive and highly mobile

### File Server

• Number of readers: Multi-user. Private access only
• Number of editors: Multi-user when using data formats that support concurrent access
• Frequency of change: Highly flexible
• Volume and types of data: Highly flexible
• Access security: Vulnerable to hacking and physical theft
• Availability security: Secure with adequate physical protection and offsite backup
• Cost: Requires capital and support cost

Servers are powerful computers that are accessed over a computer network by client users. This client-server model is used for a wide variety of applications in digital computing.

File servers are centralized servers that laptop and desktop computers can connect to so that files appear as if they were stored on the local hard drive. They are best for personal use or when working in a small group or organization. On PCs, file servers can be mounted as lettered drives (e.g., N:) and handled through the organization's conventional IT management structure.

### Web Server

• Number of readers: Multi-user, public access
• Number of editors: Single-user
• Frequency of change: Highly flexible
• Volume and types of data: Limited to static file formats
• Access security: Vulnerable to hacking and physical theft
• Availability security: Secure with adequate physical protection and offsite backup
• Cost: Capital and support cost dependent on amount of user traffic

When you type a URL into a web browser or click on a hyperlink, your client sends a request to a server for a web page. The server that responds to that request is called a web server, and the response includes the text, images, videos, and other elements needed to display a web page.

Web pages often contain web maps that users can interact with. While the page containing the web map is provided by a web server, the contents of the maps themselves require specialized geospatial data servers.

However, general-purpose web servers can provide access to geospatial data files, such as through city open data portals, for use with desktop GIS software.

### Geospatial Data Server

• Number of readers: Multi-user
• Number of editors: Multi-user
• Frequency of change: Highly flexible
• Volume and types of data: Highly flexible
• Access security: Vulnerable to hacking and physical theft
• Availability security: Secure with adequate physical protection and offsite backup
• Cost: Requires significant capital and support cost

When working with large, complex data sets that change frequently and are shared among multiple users, you need a more-robust way of storing your data.

Geospatial data servers are specialized servers that use software like ESRI's ArcGIS Server. They store data in databases that make that data accessible to multiple editors for maintenance. Geospatial data servers also render this data into maps that can be displayed on web clients as web maps.

## The Cloud

Organizations have traditionally maintained racks of servers for handling different types of data. The challenge with this approach is that servers are expensive to build, operate and maintain. Also, organizations must provide enough server capacity to handle peak loads (such as at 9am when employees arrive at work and check their e-mail). This full capacity is unused most of the time, making conventional server deployment an inefficient use of resources.

One approach to increasing the efficiency of server utilization is server virtualization. This practice is commonly referred to as cloud computing or just The Cloud.

Collections of servers are shared among multiple organizations and/or groups within organizations. As the needs of these organizations ebb and flow, server capacity is given to those who need more capacity and taken away from those who don't. What is unique about server virtualization is that this dynamic allocation is handled automatically by software on the servers, making it possible to quickly respond to changing conditions.

Cloud servers can be spread out among server farms at multiple geographic locations. This improves reliability because if there is a problem at one server farm, the other server farms in the cloud can pick up the slack. Also, data can be duplicated in multiple locations, which both serves multiple users more quickly and provides a backup in case there is a hardware failure in any one server farm.

The Cloud is now used by nearly all companies with large internet presences, such as Google, Netflix, and Amazon. For smaller organizations, The Cloud can free the organization from the cost, hassle and unreliability of internal servers.

### Cloud Storage

Cloud storage is probably the aspect of The Cloud that is most visible to the general public. Services like OneDrive, Dropbox, Google Drive and Box provide a centralized, instantly available location for storing and backing up personal files.

### Infrastructure As a Service

Infrastructure as a Service (IaaS) is when a cloud company makes a virtual server available through the Internet that can function as if it were a server in your own office. You are responsible for maintaining the software, but the cloud company maintains the hardware, and can even allocate more server capacity to you when needed.

For example, Amazon sells cloud services that use the same infrastructure as their sales operations, and although Amazon Web Services only made up 7% of Amazon's 2014 revenue, it accounted for 37% of its profit. A company can set up an EC2 instance on Amazon Web Services and run ArcGIS for Server in The Cloud just as though it were on a server in their office.

### Software As a Service

Software as a Service (SaaS) is when you use an interactive website from a Cloud provider much like you would use a piece of software actually running on your computer or mobile device. Google Docs is an example of SaaS. More specifically with geospatial data, popular SaaS services like ArcGIS Online, Carto, and MapBox make it possible to easily create and maintain complex and attractive web maps without having to install, learn and use complex desktop and server software.

### Data As a Service

Data as a Service (DaaS) is available when web mapping portals like Google Maps, OpenStreetMap, and Bing make their geospatial data available to developers through Application Programming Interfaces (APIs) for use in web and mobile applications.

Security has been a persistent question hanging over cloud services, although a bigger issue has been increasing dependence by a larger group of companies on a smaller group of cloud providers, yielding spectacular failures when these providers have technical problems.

## The Digital Dark Ages

In the developed world, we capture and store almost everything that can be stored: security video, electronic communications, smartphone photos of events momentous and trivial.

Almost none of that data will survive us.

Although storage becomes cheaper every year, technology changes every year. Data must be migrated from old storage media and file formats, or it is lost to physical degradation or technological obsolescence.

Data in The Cloud never has a permanent physical home. The Cloud is a performance and requires constant flows of capital and resources to stay in operation. Changes in the economics of The Cloud will necessitate loss of some of that data. Which data will be lost to time?

Contrast the impermanence of the digital with papyrus text from 2500 BC or clay tablets from as far back as 3300 BC.

While security camera video from an ATM where there has been no criminal activity may not be something that should outlive us, your grandchildren may want to see some of those thousands of baby pictures that you took of your son in the first year of his life. You should plan accordingly.

## Geospatial Data Storage in ArcMap

1. CSV files (0:10)
2. Shapefiles (2:50)
3. File Geodatabases (4:30)
4. Saving Data in a New File Geodatabase (5:50)
5. Servers (6:45)
6. Web Feature Service (7:15)
7. KML and Google Maps (7:50)
8. ArcCatalog (9:20)

## Geospatial Data Storage in R

### Plotting Points From a CSV File

The CSV file used in this example is 2016-cracker-barrel.csv.

```r
# The rgdal library contains GDAL functions for loading spatial data
# rgdal loads the sp library, which contains spatial functions
library(rgdal)

# Read the CSV file with latitudes and longitudes
data = read.csv("2016-cracker-barrel.csv", stringsAsFactors=F)

# Convert the data to a SpatialPointsDataFrame
coords = data.frame(data$LONGITUDE, data$LATITUDE)
wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
restaurants = SpatialPointsDataFrame(coords, data, proj4string = wgs84)

# Simple x/y plot of the points
plot(restaurants, pch=16, col="navy")

# Load and plot an OpenStreetMap base map covering the area
upperLeft = c(max(data$LATITUDE) + 0.2, min(data$LONGITUDE) - 0.2)
lowerRight = c(min(data$LATITUDE) - 0.2, max(data$LONGITUDE) + 0.2)

library(OpenStreetMap)
osm = openmap(upperLeft, lowerRight, type="osm")
plot(osm)

# Transform the projection from WGS84 to Spherical Mercator to plot the points
spherical_mercator = CRS(paste(
	"+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0",
	"+k=1.0 +units=m +nadgrids=@null +no_defs"))
restaurants = spTransform(restaurants, spherical_mercator)

plot(restaurants, pch=16, col="red", cex=2, add=T)

# Optional save to a shapefile
writeOGR(restaurants, ".", "cracker-barrel", "ESRI Shapefile")

# Optional save to sqlite database file
writeOGR(restaurants, "cracker-barrel.sqlite", "restaurants", driver="SQLite")
```

### Plotting Polygons From a Shapefile

The data for this example is in 2017-state-data.zip.

```r
library(rgdal)

# Load the shapefile
states = readOGR(".", "2017-state-data", stringsAsFactors=F)

# Select only the contiguous 48 states
states = states[((states$ST != "AK") & (states$ST != "HI")),]

# Transform to Albers Equal Area Conic to make cartographically proper
albers = CRS(paste(
	"+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0",
	"+ellps=GRS80 +datum=NAD83 +units=m +no_defs"))
states = spTransform(states, albers)

# Plot a red-state/blue-state choropleth
colors = ifelse(states$WIN2012 == "Obama", "navy", "red")
plot(states, col = colors)
```