CyberGISX
This tutorial demonstrates how to create a basic Jupyter notebook that uses spatial data on the CyberGISX platform, which is provided by the CyberGIS Center for Advanced Digital and Spatial Studies and the CyberInfrastructure and Geospatial Information Laboratory (CIGI) at the University of Illinois.
Background
CyberGIS is a conceptual framework that merges cyberinfrastructure and geographic information systems to facilitate computationally intensive and collaborative spatial analysis and modeling (Wang 2010).
The term cyberinfrastructure emerged in the 1990s with, perhaps, the earliest appearance in a comment by Jeffrey Hunker (then director of the Critical Infrastructure Assurance Office) at a 1998 press conference on Presidential Decision Directive NSC 63: Critical Infrastructure Protection (The White House 1998):
One of the key conclusions of the President's commission that laid the intellectual framework for the President's announcement today was that while we certainly have a history of some real attacks, some very serious, to our cyber-infrastructure, the real threat lay in the future. And we can't say whether that's tomorrow or years hence. But we've been very successful as a country and as an economy in wiring together our critical infrastructures. This is a development that's taken place really over the last 10 or 15 years -- the Internet, most obviously, but electric power, transportation systems, our banking and financial systems.
The term was more clearly defined in a 2003 NSF report (Atkins et al. 2003):
The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function. Although good infrastructure is often taken for granted and noticed only when it stops functioning, it is among the most complex and expensive thing that society creates.
The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.
For the purposes of CyberGIS in an academic institution, a more specific definition of cyberinfrastructure was developed in 2009 by the EDUCAUSE Campus Cyberinfrastructure Working Group and Coalition for Academic Scientific Computation (EDUCAUSE 2009, Stewart et al. 2010):
Cyberinfrastructure consists of computational systems, data and information management, advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable knowledge breakthroughs and discoveries not otherwise possible.
The CyberGIS Center for Advanced Digital and Spatial Studies "was established in 2013 as a cross- and trans-disciplinary center engaging a number of units at the University of Illinois at Urbana-Champaign and diverse partners in the US and world" (CyberGIS Center 2021).
In 2014, the CyberGIS Center received a National Science Foundation major research instrumentation grant to establish the ROGER (Resourcing Open Geospatial Education and Research) cyberGIS supercomputer (Wikipedia 2021). The physical ROGER was later supplanted by the cloud-based Virtual ROGER, which is integrated with the Keeling compute cluster operated by the U of I School of Earth, Society, and Environment (SESE) (CyberGIS Center 2018).
CyberGISX is an integrated development and sharing platform running on Virtual ROGER that provides support for geospatial software and applications.
Creating a New Session
UIUC students and faculty can log in to the CyberGISX Hub using their U of I NetID and password. Users from more than 4,000 additional universities, research institutes, and academic organizations in the US and worldwide can quickly register and start using CyberGISX with their institutional credentials.
Create the Notebook
CyberGISX uses Jupyter notebooks as the programming interface.
A notebook is an interactive interface that allows you to integrate programming code with documentation, analysis, and visualizations.
- Jupyter notebooks were developed by Project Jupyter, which was spun off from the IPython interactive computing project in 2014.
- The name Jupyter is a portmanteau formed from the names of the three core languages supported by the project: Julia, Python, and R (Wikipedia 2021).
- Notebooks are an incarnation of the concept of literate programming proposed by pioneering computer scientist Donald Knuth (1984). Knuth argued that programming methodology had progressed to the point where programs should be considered "works of literature," and that programmers should shift from "imagining that our main task is to instruct a computer what to do," to concentrating "rather on explaining to human beings what we want a computer to do."
To create a new Jupyter notebook in CyberGISX:
- On the Home Page, create a New Notebook and select Python 3.
- Right-click the new Untitled.ipynb notebook file and Rename it to something meaningful (Minn 2023 State Energy).
Markdown Cells
Jupyter notebooks are composed of cells, which are individual sections of the notebook that can contain programming code (Python) or text (markdown).
To add a heading to your notebook:
- Change the cell type to Markdown.
- Add the top level heading by preceding your text with a pound (#) sign.
- Add additional text as needed.
- Run the markdown cell to see what it will look like when rendered.
# State Energy Profiles

Michael Minn

28 August 2023
Code Cells
New cells for code can be added by clicking the plus sign (+) on the toolbar.
The following code loads state energy profile data that will be used in this tutorial.
- GeoPandas is a Python package for working with geospatial data.
- You will also need the matplotlib package to plot() maps.
- You can read geospatial data from a file or web source into a GeoDataFrame object using the read_file() function.
- For this example, we will use a GeoJSON file (2019-state-energy.geojson) of state-level energy production and consumption data from the US Energy Information Administration (EIA).
- You can list the available fields (columns) and their types using the GeoDataFrame info() method.
- Press the Run button to run the code.
import geopandas
import matplotlib.pyplot as plt

states = geopandas.read_file("https://michaelminn.net/tutorials/data/2019-state-energy.geojson")

states.info()
RangeIndex: 53 entries, 0 to 52
Data columns (total 42 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ST                       53 non-null     object
 1   Name                     53 non-null     object
 2   GEOID                    53 non-null     object
 3   AFFGEOID                 53 non-null     object
 4   Square.Miles.Land        53 non-null     float64
 5   Square.Miles.Water       53 non-null     float64
 6   State.Name               51 non-null     object
 ...
 36  CO2.Per.Capita.Tonnes    51 non-null     float64
 37  Renewable.Standard.Type  51 non-null     object
 38  Renewable.Standard.Name  51 non-null     object
 39  Renewable.Standard.Year  38 non-null     float64
 40  Senators.Party           50 non-null     object
 41  geometry                 53 non-null     geometry
dtypes: float64(33), geometry(1), object(8)
memory usage: 17.5+ KB
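For readers unfamiliar with the format, a GeoJSON file is a JSON document in which each feature pairs a geometry with a dictionary of properties, and those properties become the attribute fields listed by info(). The following is a minimal sketch using only the Python standard library. The state code, coordinates, and values are made up for illustration and are not taken from the EIA file.

```python
import json

# A hypothetical one-feature GeoJSON document; all values are illustrative only
text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"ST": "XX", "Name": "Example State", "GDP.B": 100.0},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-90.0, 40.0], [-88.0, 40.0], [-88.0, 42.0],
                         [-90.0, 42.0], [-90.0, 40.0]]]
      }
    }
  ]
}
"""

collection = json.loads(text)
feature = collection["features"][0]

# The properties correspond to the attribute columns of a GeoDataFrame
field_names = sorted(feature["properties"])

# The geometry corresponds to the value in the geometry column
geometry_type = feature["geometry"]["type"]
ring = feature["geometry"]["coordinates"][0]

print(field_names)
print(geometry_type, "with", len(ring), "coordinate pairs")
```

Note that the polygon ring repeats its first coordinate pair at the end, which is how GeoJSON closes a polygon.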
Mapping with GeoPandas
Mapping Categorical Attributes
Methods are functions that are associated with specific classes of objects. While functions can stand alone, methods, as the name implies, perform actions on or based on the contents of the classed objects that they are called with.
Methods are foundational to object-oriented programming, where the complexity of operations on objects is hidden from programmers so they can focus on the high-level objectives of the program rather than the low-level details of operations on complex objects.
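The difference between a method and a stand-alone function can be sketched in a few lines of plain Python. The Rectangle class and rectangle_area() function are hypothetical names invented for this illustration and are not part of GeoPandas.

```python
class Rectangle:
    """A simple class of objects that each store a width and a height."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        # A method: it works on the contents of the object it is called with
        return self.width * self.height


def rectangle_area(width, height):
    # A stand-alone function: everything it needs is passed as arguments
    return width * height


shape = Rectangle(3, 4)

print(shape.area())           # method call on an object
print(rectangle_area(3, 4))   # function call
```

In the same way, plot() in the map code is a method called on the states GeoDataFrame, while read_file() is a function called with a file name.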
- Choropleth maps can be created from a GeoDataFrame using the plot() method.
- The attribute used to color the areas should be specified as the first argument.
- The set_axis_off() method of the matplotlib axes object returned by plot() turns off the axis scale around the map, which is unnecessary with a projected map.
- In order to display the plot, you need the pyplot show() function.
states.plot("Renewable.Standard.Type", legend=True)
plt.show()
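The map code above does not call set_axis_off(). The following is a minimal sketch of one way it could be used, assuming that the GeoPandas plot() method returns the matplotlib axes object the map was drawn on. The two-square GeoDataFrame and its kind column are made up so the example does not depend on the tutorial data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt
import geopandas
from shapely.geometry import box

# Two hypothetical square areas with a categorical attribute
squares = geopandas.GeoDataFrame(
    {"kind": ["mandatory", "voluntary"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326")

# plot() returns the matplotlib axes object that contains the map
axis = squares.plot("kind", legend=True)

# Hide the axis lines, ticks, and labels around the map
axis.set_axis_off()

plt.show()
```

With the tutorial data, the same pattern would be to keep the value returned by states.plot() in a variable and call set_axis_off() on it before plt.show().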
Categorical Color Map
A variety of predefined colormaps are available that can give a map more descriptive colors. Choose one by passing the name of the colormap in the cmap parameter of plot().
states.plot("Renewable.Standard.Type", legend=True, cmap="coolwarm")
plt.show()
Mapping Quantitative Attributes
The plot() method can also be used to map quantitative attributes, although, again, you may need additional parameters to get what you want.
This creates a map of per-capita energy use in each state in millions of BTUs. The default plot uses the viridis colormap, a purple-blue-green-yellow color ramp:
states.plot("Consumption.Per.Capita.MM.BTU", legend=True)
plt.show()
Quantitative Color Map
You can choose a colormap with the cmap parameter and get easier-to-read classified colors by passing scheme="naturalbreaks" to plot(). The scheme parameter requires the mapclassify package.
states.plot("Consumption.Per.Capita.MM.BTU", legend=True,
    cmap="coolwarm_r", scheme="naturalbreaks")
plt.show()
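A classification scheme turns continuous values into a small number of classes, each drawn with one color. Natural breaks places the class boundaries to fit clusters in the data. As a simpler stand-in, the sketch below computes equal-interval breaks in plain Python to show the general idea. The values are invented, and this is not the algorithm that scheme="naturalbreaks" runs.

```python
import bisect

# Hypothetical per-capita consumption values (million BTU)
values = [190, 210, 250, 300, 320, 410, 480, 650, 900, 960]
classes = 5

# Equal-interval upper bound for each class
low, high = min(values), max(values)
width = (high - low) / classes
upper_bounds = [low + width * (i + 1) for i in range(classes)]

# Each value gets the index of the first class whose upper bound
# is greater than or equal to the value
assigned = [min(bisect.bisect_left(upper_bounds, v), classes - 1)
            for v in values]

print(upper_bounds)
print(assigned)
```

With these skewed values most observations fall in the lowest class and one class is empty, which is the kind of result that data-driven schemes such as natural breaks are meant to avoid.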
Selection
There will likely be situations where you only want to use a specific selection of features from a geospatial data set.
GeoPandas GeoDataFrames are extensions of Pandas DataFrames, and rows can be selected by attribute using the same techniques used in Pandas.
northeast = states[states.ST.isin(
    ['ME', 'VT', 'NH', 'CT', 'RI', 'NY', 'NJ', 'MA', 'PA'])]
northeast.plot("Consumption.Per.Capita.MM.BTU", legend=True,
    cmap="coolwarm", scheme="naturalbreaks")
plt.show()
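The selection above works by building a boolean mask, one True or False value per row, and using it to index the table. A minimal sketch with plain Pandas follows. The table and its Population values are invented for illustration.

```python
import pandas as pd

# Hypothetical attribute table; values are illustrative only
table = pd.DataFrame({
    "ST": ["IL", "NY", "TX", "VT"],
    "Population": [12.7, 19.5, 29.0, 0.6]})

# isin() returns a boolean Series: True where ST is in the list
mask = table["ST"].isin(["NY", "VT"])

selected = table[mask]    # rows where the mask is True
excluded = table[~mask]   # the ~ operator inverts the mask

# A comparison on a numeric column produces a mask in the same way
large = table[table["Population"] > 15]

print(selected["ST"].tolist())
print(excluded["ST"].tolist())
print(large["ST"].tolist())
```

The same expressions work on a GeoDataFrame, and the selected rows keep their geometry.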
Projections
The Earth exists in three dimensions but, other than globes, most representations of the earth, such as printed maps or web maps, are two-dimensional. A projection is a set of mathematical transformations used to represent the three-dimensional world in two dimensions.
By default, the plot() method plots the geospatial data using an equirectangular projection that may be undesirable depending on what part of the world you are mapping and what you are using the map for.
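As an illustration of what a set of mathematical transformations means in practice, the equirectangular projection simply scales longitude and latitude into x and y distances. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6,371 km and a standard parallel at the equator. The function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of a spherical Earth (an approximation)


def equirectangular(longitude, latitude, standard_parallel=0.0):
    """Project longitude and latitude in degrees to x and y in kilometers."""
    x = (EARTH_RADIUS_KM * math.radians(longitude)
         * math.cos(math.radians(standard_parallel)))
    y = EARTH_RADIUS_KM * math.radians(latitude)
    return x, y


# One degree of longitude is drawn at the same width at every latitude
x_equator, _ = equirectangular(1, 0)
x_north, _ = equirectangular(1, 60)
print(round(x_equator, 1), round(x_north, 1))

# On the globe the meridians converge, so the true ground distance
# of one degree of longitude at 60 degrees north is only about half as wide
true_km = EARTH_RADIUS_KM * math.radians(1) * math.cos(math.radians(60))
print(round(true_km, 1))
```

This east-west stretching at higher latitudes is one reason a different projection may be preferable for a map of the US.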
The to_crs() method can be used to reproject a geospatial object to a new projection. The parameter will accept an EPSG code, an ESRI WKID, or a proj-string that describes the desired projection.
For this map of the US, we use an ESRI WKID for a Lambert conformal conic projection centered on the continental US.
continental = states[~states.ST.isin(['AK', 'HI'])]
continental = continental.to_crs("ESRI:102009")
continental.plot("Consumption.Per.Capita.MM.BTU", legend=True,
    cmap="coolwarm", scheme="naturalbreaks")
plt.show()
Correlation
The pandas package upon which geopandas is built is used for data manipulation and analysis. While there are specialized analysis functions that take advantage of the unique characteristics of spatial data, simple non-spatial functions can be used for data exploration.
For example, there is a field for gross domestic product (GDP.B), which represents the total amount of economic activity in the state. Greater economic activity is generally associated with higher energy use. To examine whether that is true at the state level, we can plot an X/Y scatter chart between the two attributes to see if the plot shows a correlation.
- For this call to the pyplot scatter() function, we pass the two attributes.
- To make small and large states more visible together, we use logarithmic scales for both the x and y axes (yscale and xscale).
- The chart shows a fairly clear line pattern from lower left to upper right, indicating there is a positive correlation.
plt.scatter(states["GDP.B"], states["Consumption.Total.B.BTU"])
plt.ylabel("2019 Energy Consumption (Billion BTU)")
plt.xlabel("2019 GDP ($B)")
plt.yscale("log")
plt.xscale("log")
plt.show()
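The visual impression from a scatter chart can be complemented with a number. The following is a minimal sketch of Pearson's correlation coefficient computed on log-transformed values, to match the logarithmic axes of the chart. The GDP and energy figures are invented, and pearson() is a hypothetical helper written out only to show the arithmetic.

```python
import math

# Hypothetical GDP ($B) and energy consumption values for five states
gdp = [35, 80, 250, 900, 3100]
energy = [140, 400, 900, 4100, 7800]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    spread_a = math.sqrt(sum((p - mean_a) ** 2 for p in a))
    spread_b = math.sqrt(sum((q - mean_b) ** 2 for q in b))
    return covariance / (spread_a * spread_b)


# Take logarithms first so the coefficient describes the log-log chart
log_gdp = [math.log10(v) for v in gdp]
log_energy = [math.log10(v) for v in energy]

r = pearson(log_gdp, log_energy)
print(round(r, 3))
```

A value near +1 corresponds to points rising from lower left to upper right. With the tutorial data, the corr() method of a Pandas Series performs the same calculation on two columns.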
Linear Model
We can use the OLS.fit() method from the statsmodels module to create a simple bivariate linear model for the relationship between GDP and energy consumption.
import statsmodels.api as sm

y = states["Consumption.Total.B.BTU"]
x = states[["GDP.B"]]

model = sm.OLS(y, x, missing="drop").fit()

model.summary()
R² is a value from zero (no correlation) to one (perfect correlation). The adjusted R² value of 0.779 indicates a strong correlation, as expected.
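Because x is passed to OLS() without a constant column, the fitted model has no intercept: it is a line through the origin, and statsmodels reports the uncentered form of R² for such models. The arithmetic can be sketched in plain Python. The numbers below are invented for illustration and are unrelated to the state data.

```python
# Hypothetical predictor (x) and response (y) values
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Least-squares slope for a line through the origin: sum(xy) / sum(x^2)
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Residual sum of squares for the fitted line
residual = sum((b - slope * a) ** 2 for a, b in zip(x, y))

# Uncentered R^2 compares the residuals with the raw sum of squares of y
r2_uncentered = 1 - residual / sum(b * b for b in y)

# Centered R^2 compares the residuals with the variation around the mean of y
mean_y = sum(y) / len(y)
r2_centered = 1 - residual / sum((b - mean_y) ** 2 for b in y)

print(round(slope, 3), round(r2_uncentered, 4), round(r2_centered, 4))
```

The two forms of R² are not directly comparable. If an intercept is wanted in statsmodels, a constant column can be added to the predictors with sm.add_constant() before fitting.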
Regression Line
Finally, we can plot model predictions as a regression line across the scatter chart.
plt.scatter(states["GDP.B"], states["Consumption.Total.B.BTU"])

y_model = model.predict(x)
plt.plot(x, y_model, color="maroon")
plt.text(50, 7e6, "R^2 = " + str(round(model.rsquared_adj, 3)))

plt.ylabel("2019 Energy Consumption (Billion BTU)")
plt.xlabel("2019 GDP ($B)")
plt.yscale("log")
plt.xscale("log")
plt.show()
Finishing Up
Render
Rendering a notebook is the process of transforming a notebook and the computed results into a format that can be read outside of the Jupyter interface.
Jupyter notebooks are commonly rendered into HTML for viewing in web browsers, or PDF files for printing.
- File, Save and Export Notebook As..., HTML
- You may want to upload the rendered HTML to your CyberGISX folder to keep everything together.
Log Out
When you are finished with your notebook, log out.
Reopening a Notebook
To reopen a notebook, find it in the directory list on the left side of the CyberGISX screen and double-click to reopen.
Sharing with OneDrive
CyberGISX kernels are local to the CyberGISX environment, so if you want to share your notebook or associated data over the Internet, you need to put it on a server.
Sharing a Notebook on OneDrive
- Right click on the notebook and Download the notebook to your local storage.
- Upload the file to OneDrive.
- Share with Anyone with the link.
Sharing HTML on OneDrive
- File, Save and Export Notebook As..., HTML
- Upload the file to OneDrive
Sharing with GitHub
GitHub is an internet hosting service for software developers that uses the Git version control software (Wikipedia 2023).
GitHub provides a wide variety of sophisticated features for collaborating on complex software projects, but some easily accessible features can be useful to users with limited development experience. GitHub is integrated into CyberGISX for sharing notebooks and data.
New Account
You can create a new GitHub account by clicking the Sign in link on the GitHub.com home page and then clicking the Create an account link.
New Repository
Collections of related project files are kept together in GitHub repositories.
- To create a new repository, navigate to the Repositories page and select New.
- Give the repository a meaningful name (cybergisx).
- Click Create repository.
Sharing Notebooks
- Click the Restart the kernel and rerun the whole notebook button to make sure the notebook runs from the top and contains graphics.
- Save the notebook to update the file.
- Right click on the notebook and Download.
- Upload the file to GitHub:
  - For a new repository, click the uploading an existing file link to upload the file from your local machine.
  - For an existing repository, click the Add file button.
- Commit the changes to the repository.
- Click the file name to view the file.
- Copy permalink to get a link you can share with others.
Updating Files
To upload a new version of a file:
- Re-run, save, and download your notebook.
- Add file and Upload files using a file with the same name.
- Commit the changed file.
- Git tracks changes between versions of the same file, and you can see prior versions of a file on its History page.
Sharing Data Files
If you have a data file that you wish to share so people who have your notebook can access that data through the internet, you can upload it to GitHub and get a shared link to incorporate in your notebook.
- Use Add file and Upload files to upload the data file. GitHub will create simple renderings of geospatial data files.
- Copy the shared link from the Download button.
- Incorporate that link in your notebook code.