Exploring Data in ArcGIS Pro
When you begin working with a new data set, your first step is usually to explore that data to find out what is in the data and whether it will meet the needs of your project. ArcGIS Pro provides a number of visualization and analysis tools to facilitate exploration of data.
This tutorial covers techniques for exploring geospatial data with descriptive statistics in ArcGIS Pro.
Loading and Mapping Data
ArcGIS Pro is fundamentally a mapping program, so the first step in exploring data in ArcGIS Pro is to load it and map it.
The example data used for this tutorial is a collection of country-level energy indicators for the year 2019 from the US Energy Information Administration, the World Bank, and others. It is available as the Minn 2019 World Energy Indicators feature layer from the University of Illinois ArcGIS Online organization, and as a GeoJSON file on this website.
Mapping permits visual analysis of geospatial data for answering general "where" questions.
- What regions of the world have the highest per capita energy consumption? (North America, Western Europe, the Global North)
- What regions of the world have the lowest per capita energy consumption? (Sub-Saharan Africa, South Asia)
In ArcGIS Pro, under Analysis, run the Export Features tool to export the feature service data into the project geodatabase.
- Input Features: Search ArcGIS Online for the desired layer (Minn 2019 World Energy Indicators).
- Output Features: Provide a meaningful name (World Energy).
British Thermal Units
Energy values in this data set are expressed in BTUs. The British thermal unit (BTU) is an energy unit commonly used in American energy literature which is equivalent to the amount of heat needed to raise the temperature of one pound of water at sea level by one degree Fahrenheit.
- The BTU is slightly larger than the kilojoule that is commonly used for measuring amounts of energy in scientific literature.
- MM is a common abbreviation for million also used in energy literature.
- For context, US per capita energy consumption in 2022 was around 301 MM BTU (EIA 2023).
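As a rough check on the unit sizes above, the following sketch converts millions of BTU to gigajoules. The conversion factor of about 1.055 kJ per BTU is an approximation (BTU definitions differ slightly), and the function name is purely illustrative.

```python
# Approximate conversion factor; BTU definitions vary slightly.
KJ_PER_BTU = 1.05506


def mm_btu_to_gigajoules(mm_btu):
    """Convert millions of BTU (MM BTU) to gigajoules (GJ)."""
    kilojoules = mm_btu * 1_000_000 * KJ_PER_BTU
    return kilojoules / 1_000_000  # 1 GJ = 1,000,000 kJ


# US per capita consumption figure cited above (301 MM BTU)
print(round(mm_btu_to_gigajoules(301), 1))
```

Under this factor, 301 MM BTU works out to roughly 318 GJ.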
Metadata
When working with a data set for the first time, you should investigate whether metadata is available for the data set that will answer questions about the data like:
- What time period does the data cover?
- What were the original sources for the data?
- Who published the data?
- When was the data published?
- What do the different data fields represent?
- What units are used by the different data fields?
- Are there any legal restrictions on reuse or redistribution of the data?
- Are there warnings about possible problems with the quality or validity of the data?
Because metadata is created separately from the data and is tedious to create and maintain, metadata is often incomplete and out-of-date.
When working with an ArcGIS Online feature layer, you will need to go into ArcGIS Online, create a map, add the layer, and then view the information page.
Attribute Table
You can view the attribute table for a feature class by right-clicking on the layer in the Contents pane and selecting Attribute Table.
The total number of features in the feature class is listed below the table.
To view details on the attribute data types, click the menu icon at the top right of the attribute table and select Fields View.
The Fields View can be used to answer questions like:
- What fields are available in the data?
- What are the data types of the available fields?
ArcGIS Pro has a variety of different types of fields, and when you are creating new feature classes or modifying existing feature classes, you will need to decide which types to use.
- Long integer: These are 32-bit fields used for representing quantitative values that are whole numbers (counts with no decimal part).
- Short integer: These are 16-bit fields also used for representing small whole numbers when conserving storage space is important, or if the data comes from an old source. You will generally avoid this type on modern computers.
- Float: These are 32-bit floating point fields that can be used with decimal numbers.
- Double: These are high precision 64-bit floating point fields that can also be used with decimal numbers. Double may give marginally higher performance on contemporary machines, at the cost of additional storage requirements with large data sets.
- Text: These are strings of characters that can be used to represent any data type, although conversion to one of the quantitative types above will be needed when mapping quantitative data in ArcGIS Pro.
- Date: These are used to represent dates and times. Although internally stored as numbers, these should only be used for date or time information.
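The practical effect of the field widths above can be sketched in plain Python. This is an illustration only, not ArcGIS Pro code: the helper names are hypothetical, and the integer ranges assume signed 16-bit and 32-bit storage.

```python
import struct

# Signed integer ranges implied by the field widths described above
SHORT_MIN, SHORT_MAX = -(2 ** 15), 2 ** 15 - 1
LONG_MIN, LONG_MAX = -(2 ** 31), 2 ** 31 - 1


def smallest_integer_type(values):
    """Return the narrowest integer field type that can hold all values."""
    lo, hi = min(values), max(values)
    if SHORT_MIN <= lo and hi <= SHORT_MAX:
        return "Short"
    if LONG_MIN <= lo and hi <= LONG_MAX:
        return "Long"
    return "too large for Long"


def as_float32(x):
    """Round-trip a Python float (64-bit) through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(smallest_integer_type([12, 250, 31000]))   # fits in 16 bits
print(smallest_integer_type([1_400_000_000]))    # needs 32 bits
print(as_float32(16_777_217.0))                  # precision lost in 32 bits
```

The last line shows why Double is the safer choice for large or precise decimal values: a 32-bit float cannot distinguish 16,777,217 from 16,777,216.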
Distribution
The distribution of a variable is the manner in which the values are spread across the range of possible values.
- Distributions are commonly summarized with a measure of central tendency like mean or median.
- The mean (commonly called average) is the sum of all values divided by the number of values.
- Normal distributions are further summarized with standard deviation that indicates how far values are spread away from the mean.
- Skew is the extent to which values are asymmetrically distributed around the mean compared to a normal distribution. Skew will be negative if the left tail of the distribution is longer, and positive if the right tail of the distribution is longer. The skewness of a normal distribution is zero.
- Kurtosis is a measurement of how flat or peaked the distribution is compared to a normal distribution. When measured as excess kurtosis (kurtosis minus 3), it is positive for a sharply peaked distribution and negative for a distribution that is flatter than a normal distribution. The excess kurtosis of a normal distribution is zero.
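The summary measures above can be sketched with Python's standard library. The skewness and kurtosis functions below use simple population moment formulas, with kurtosis reported as excess kurtosis (kurtosis minus 3) so that a normal distribution scores zero. Other software, possibly including ArcGIS Pro, may apply sample-size adjustments and report slightly different numbers. The sample values are made up for illustration.

```python
import statistics


def skewness(values):
    """Population skewness: third central moment / variance**1.5."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical per capita values with one very high outlier
sample = [5, 20, 35, 54, 60, 75, 90, 120, 700]
print(statistics.fmean(sample), statistics.stdev(sample))
print(skewness(sample), excess_kurtosis(sample))
```

The single high value gives the sample a long right tail, so its skewness comes out positive.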
A quantile is the value below which a given percentage of the values in a distribution fall.
- The 0% quantile is the minimum value.
- The 50% quantile is the median where 50% of the values in the distribution are at or below the median value.
- The 100% quantile is the maximum value.
- Medians are often preferred over means because medians give a clearer value for central tendency than means, which can be distorted by skew and outliers.
- Quantiles that divide the distribution into four groups are called quartiles and quantiles that divide the distribution into five groups are called quintiles.
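A minimal sketch of quartiles, and of why the median resists outliers, using Python's statistics module. The values are made up, and the "inclusive" interpolation method is one of several conventions, so other tools may give slightly different cut points.

```python
import statistics

# Hypothetical values, already sorted for readability
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Three cut points divide the data into four groups (quartiles)
quartiles = statistics.quantiles(values, n=4, method="inclusive")
print(quartiles)  # [3.0, 5.0, 7.0]

# Replacing the largest value with an extreme outlier moves the mean
# but leaves the median unchanged.
with_outlier = values[:-1] + [900]
print(statistics.mean(values), statistics.median(values))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```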
Calculate Statistics
Descriptive statistics available in ArcGIS Pro allow you to answer questions about individual data fields.
- What is the range of values (minimum and maximum)?
- What is the central value (mean or median)?
- How many features are missing values (Nulls)?
- How close is the distribution to normal (skew and kurtosis)?
To view descriptive statistics for selected attributes:
- In the Contents pane, select the layer.
- Select Data, Data Engineering.
- Drag fields into the viewing area, or just select Add all fields and calculate.
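For readers who want to see what such a summary involves, here is a small stand-alone sketch that answers the range, central value, and Null questions above for one field. It does not call ArcGIS Pro. The function name and the sample values are hypothetical, with None standing in for a Null.

```python
import statistics


def describe_field(values):
    """Summarize one attribute field, treating None as a Null value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
    }


# Hypothetical field values; None stands in for a Null in the table
field = [12.5, None, 54.0, 301.0, None, 3.5, 88.0]
for name, value in describe_field(field).items():
    print(f"{name}: {value}")
```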
Histograms
Histograms are charts commonly used to visualize distributions by showing bars with the number of entries in different ranges of values.
Histograms are used to answer the question, "How are the values in a variable distributed across the range of possible values?"
- A normal distribution has most of the values clustered in the middle of the histogram around the mean (central tendency) value.
- A left skewed distribution has most values clustered in the higher part of the range, with a tail of lower values extending to the left on the histogram.
- A right skewed distribution has most values clustered in the lower part of the range, with a tail of higher values extending to the right on the histogram.
The histogram below shows that half of all countries had per capita energy use at or below about 54 MM BTU in 2019 (the median), with a handful of very high energy countries in the right tail (right skewed).
- In the Contents pane, select the layer.
- Select Data, Visualize, Create Chart, Histogram
- Number: MM_BTU_per_Capita
- Statistics: Mean, Median and Std.Dev.
- Under General properties, remove the unnecessary title to leave more space for the visualization.
- If desired, Export As Graphic.
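The counting behind a histogram can be sketched as follows. This assumes equal-width bins spanning the minimum to the maximum value, which is a common convention but not necessarily the exact binning ArcGIS Pro uses. The values are made up, and the function assumes they are not all identical.

```python
def histogram_counts(values, bin_count):
    """Count values in equal-width bins spanning min..max.

    Assumes the values are not all equal (bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for v in values:
        # The maximum value belongs in the last bin, not a new one
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    return counts


# Hypothetical right-skewed values
values = [2, 5, 8, 12, 15, 20, 30, 45, 60, 100, 250, 400]
for i, count in enumerate(histogram_counts(values, 4)):
    print(f"bin {i}: {'#' * count}")
```

With these values, most entries land in the first bin and a few trail off to the right, which is the right-skewed shape described above.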
Rank Order Bar Plot
The rows with the highest and lowest values in the distribution can be displayed by viewing the Attribute Table and right-clicking on the field you want to use to sort the table.
A rank order graphic or table displays the values in a variable sorted highest to lowest.
To create a bar plot showing the highest and lowest values in rank order:
- Copy and Paste the data layer in the Contents pane to duplicate it (Top Countries).
- Under Properties, add a Definition Query to list only the highest and lowest values based on thresholds you view in the attribute table (> 400 for top, < 1.57 for low)
- Under Data, Visualize, Create Chart, create a Bar Chart.
- The Category is the feature name.
- Summary is None.
- Add the Numeric Field to plot (MM_BTU_per_Capita)
- Select Label bars
- Sort Y-axis Descending.
- Under Axis, increase the X-axis character limit if needed to fully display the feature names.
- If needed, Export the chart.
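The filter-then-sort logic of the steps above can be sketched outside ArcGIS Pro as follows. The thresholds (400 and 1.57) are the ones used in the definition query, but the country names and values are placeholders, not rows from the data set.

```python
# Hypothetical (name, MM BTU per capita) pairs, not values from the data set
rows = [
    ("Country A", 620.0),
    ("Country B", 1.2),
    ("Country C", 54.0),
    ("Country D", 410.5),
    ("Country E", 0.9),
    ("Country F", 301.0),
]


def rank_extremes(rows, high_threshold, low_threshold):
    """Keep rows above the high or below the low threshold, sorted descending."""
    kept = [r for r in rows if r[1] > high_threshold or r[1] < low_threshold]
    return sorted(kept, key=lambda r: r[1], reverse=True)


ranked = rank_extremes(rows, high_threshold=400, low_threshold=1.57)
for name, value in ranked:
    print(f"{name:<10} {value:>8.2f}")
```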
Group Comparison
If you have a categorical variable that divides your features into groups, you have a variety of options for exploring the differences between the groups.
In this example, the WB_Income_Group field is used to group values by World Bank income classification.
Category Counts
To create a chart of counts for the number of rows associated with the different values of a categorical variable, select Data, Visualize, Create Chart, Bar Chart.
- Category or Date: WB_Income_Group
- Aggregation: Count
- Check Label bars
- Under General, unclick Chart title to remove the redundant chart title and leave more space for the bars.
- If needed, Export the chart as a PNG file to import to a document as a figure.
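The Count aggregation is equivalent to tallying how often each category value appears. A minimal sketch with made-up rows:

```python
from collections import Counter

# Hypothetical WB_Income_Group values for a handful of rows
income_groups = [
    "High income",
    "Low income",
    "Upper middle income",
    "High income",
    "Lower middle income",
    "High income",
    "Low income",
]

counts = Counter(income_groups)
for group, count in counts.most_common():
    print(f"{group}: {count}")
```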
Row Aggregation
You can also use bar charts to summarize field values associated with groups of rows defined by a categorical variable.
- Select Data, Visualize, Create Chart, Bar Chart
- Category or Date: WB_Income_Group
- Aggregation: Average
- Numeric field(s): MM_BTU_per_Capita
- Check Label bars
- Under General, unclick Chart title to remove the redundant chart title and leave more space for the bars.
- If needed, Export the chart as a PNG file to import to a document as a figure.
Sorted Tables
If you need a table rather than a chart to compare groups, the Summary Statistics tool can be used to create sorted summary tables that can be exported to reports.
- Input Table: World_Energy
- Output Table: World_Energy_Statistics
- Statistics Fields:
- MM_BTU_per_Capita, Max
- MM_BTU_per_Capita, Mean
- MM_BTU_per_Capita, Median
- MM_BTU_per_Capita, Min
- Case Field: WB_Income_Group
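Conceptually, the case field splits the rows into groups and each statistic is computed per group. The sketch below imitates that grouping in plain Python. It is not the Summary Statistics tool itself, and the rows are made up.

```python
import statistics
from collections import defaultdict


def summarize_by_group(rows):
    """Compute max, mean, median, and min of the values in each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        group: {
            "max": max(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "min": min(values),
        }
        for group, values in sorted(groups.items())  # sorted by group name
    }


# Hypothetical (WB_Income_Group, MM_BTU_per_Capita) rows
rows = [
    ("High income", 300.0),
    ("High income", 150.0),
    ("High income", 210.0),
    ("Low income", 2.0),
    ("Low income", 6.0),
]

summary = summarize_by_group(rows)
for group, stats in summary.items():
    print(group, stats)
```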
Distribution Comparison
Box and whisker plots (box plots) display distributions as quartile boxes with whiskers and outlier dots showing the full range of values.
Box and whisker plots are useful for comparing the distributions of a quantitative variable for different groups defined by a categorical variable.
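The numbers behind a box plot can be sketched as follows. This assumes the common convention that whiskers extend to the most extreme values within 1.5 times the interquartile range of the box and that anything beyond is drawn as an outlier dot. Charting tools differ in the exact rule and quartile method, so treat this as an illustration with made-up values.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical values with one extreme high value
box = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(box)
```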
Correlation
An initial investigation of relationships often starts by looking for bivariate correlation between pairs of attributes.
Correlation is "a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone" (Merriam-Webster 2022). While it is important to always remember that correlation and causation are two different things, correlation analysis is a very powerful exploratory technique for determining whether there is a relationship between two variables.
The strength of a correlation is measured using the coefficient of determination, which is more commonly called R-squared. This can be written as R2, R squared, R-squared, or R^2.
Evaluation of R2 to determine whether correlation should be considered strong or not depends on the type of phenomena being studied.
- The range of R2 is from 0.000 (no correlation) to 1.000 (perfect correlation).
- Values of less than 0.100 can usually be considered to represent no meaningful correlation.
- In the social sciences where relationships often involve the complex interplay of ambiguous factors, values as low as the 0.200s or 0.300s can be considered strongly correlated enough to merit further investigation.
- In the natural sciences, values above 0.600 are often expected from variables that are strongly correlated.
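For two variables and a straight-line fit, R2 equals the square of the Pearson correlation coefficient r. The sketch below computes both from first principles. The function name and the sample values are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired values
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}, R2 = {r * r:.3f}")
```

Because R2 is a square, it does not show direction. The sign of r indicates whether the relationship is positive or negative.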
Scatter Plots
Correlation can be visualized by plotting the two variables on an x/y scatter chart and looking for an upward or downward pattern of dots diagonally across the chart.
- Negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve downward from left to right.
- Positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X/Y scatter chart, the points form a rough line or curve upward from left to right.
- When there is no correlation, the X/Y scatter chart dots have no clear upward or downward pattern.
To create an X-Y scatter chart in ArcGIS Pro:
- Select the layer in the Contents pane, and under Data, Visualize, Scatter configure the chart.
- Choose the variables to compare (MM_BTU_per_Capita and GDP_per_Capita_PPP_Dollars).
- If your fields are highly skewed (visible as most dots clustered in a corner), changing the axis to Log (logarithmic transformation) will make the presence or absence of a pattern clearer. In some cases (like this), using log axes will cause the regression line to be visually misaligned with the points, so removing the line may be less confusing.
- Display the R2 value on the chart.
- Remove the redundant Chart Title.
- If needed, Export the chart.
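The effect of a logarithmic transformation can be sketched numerically. In the made-up data below, y grows as the square root of x, so the raw relationship is curved, while the log-transformed values fall on a straight line. Note that a log axis on a chart only changes how points are displayed. Here the values themselves are transformed, which is an assumption made for illustration.

```python
import math


def r_squared(xs, ys):
    """Square of the Pearson correlation between two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical skewed data: each x is ten times the previous one
xs = [1, 10, 100, 1000, 10000]
ys = [math.sqrt(x) for x in xs]

raw_r2 = r_squared(xs, ys)
log_r2 = r_squared([math.log10(x) for x in xs], [math.log10(y) for y in ys])
print(f"raw R2 = {raw_r2:.3f}, log-log R2 = {log_r2:.3f}")
```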
Local Bivariate Relationships
The Local Bivariate Relationships tool visualizes the variability of the relationship between two variables across geographic space.
The Post Hoc Fallacy
While correlation may be interesting, what we are usually more interested in is causation. Correlation is a simple mathematical relationship between two variables, but causation means that there is a material cause-and-effect relationship between the two phenomena we are measuring with our variables.
Correlation is empirical (based on observation), and causation is rational (based on reason). When we observe two phenomena occurring together and we observe that there is some mechanism connecting the two phenomena, we use reason and logic to tie those two phenomena together in a cause and effect relationship.
Assuming that correlation proves causation is the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this). A logical fallacy is "an often plausible argument using false or invalid inference."
For example, the correlation between per capita energy use and level of nuclear power production could lead to a fallacious inference that the use of nuclear power increases energy use.
However, both the presence of nuclear power and high per capita energy use are more likely linked through a third variable: early economic development. Early development results in higher energy use and in the development of complex and expensive nuclear technology, which is economically out of reach of poorer countries and is actively blocked from proliferation by wealthy countries for political reasons.
Correlation points to possible causal relationships, but does not prove them, and there are a variety of logical arguments to show how making a simple assumption that correlation is causation will lead you astray. Determining whether there is a cause-and-effect relationship requires more sophisticated techniques and domain knowledge beyond simple mathematical correlation.
Experience Builder
You can share results of exploratory data analysis using a web app created with ArcGIS Experience Builder.
New Experience
- From your ArcGIS Online Content page, select Create app and choose Experience Builder
- On your Experience Builder home page, select Create new
- Create using the Blank scrolling template, which provides a flexible grid that can be used to reliably lay out designs that display cleanly on mobile devices (mobile first design).
- Drag a Text box onto the canvas.
- Drag the right edge of the box to the full width of the canvas.
- Double-click the box and add your title (Energy vs. GDP).
- Center the text.
- Increase the font size and bold the text.
- If desired, change the font color and Style the Background color.
- Style the Height to Auto so the box stays a fixed height based on the title text.
- Select the block containing the text box and Style the Height to Auto.
Web Map
- If needed, publish your data as an ArcGIS Online feature layer.
- In ArcGIS Online, create a web map with that layer (Minn 2019 World Energy Indicators).
- Drag a Map widget into the app.
- Click on the map and in the side bar Content pane, click Select map
- Click Add new data and find the web map you created above, then click the added map to make it the Source for your map widget.
- Click the Data icon on the left side of the app builder, click the Feature Layer, and select View for empty selection at the bottom of the side panel. This will provide default values for the dynamic text configured below.
- In the Map pane on the right side of the app builder, under Content change the Initial view to Custom and Modify.
- Zoom and pan the initial view so the mapped area fills the block. Note that this area will be narrower when viewed on a mobile device, so the area shown in the middle of the view should be able to stand on its own.
- Enable the Select tools.
- Turn off Enable pop-up.
Dynamic Text
Dynamic text boxes change as you select objects on the map.
- Drag a Text widget into the block below the map.
- Drag the edge of the box to fill half the width of the app.
- Double click the box and add the title of your variable (MM BTU per Capita)
- Select the text and center it, enlarge the font (18 pt), and bold the text.
- Click Connect to data and Select data to use your web map.
- On the second line of your text box, click the Dynamic Content icon.
- Select Statistics
- Choose Selected features
- Select AVERAGE
- Select your analysis variable
- Insert into the text box.
- Style the height of your box to Automatic.
- Repeat for your second variable (GDP per Capita PPP)
- Switch into Live View to verify your fields change when you select features.
Charts
- Drag a Chart widget into the block below the dynamic text boxes.
- Drag the right edge to fill the width of the app.
- Select data from your web map.
- Select the chart type (Scatter plot).
- Under Data and Variables, select the variables to compare on the chart (MM_BTU_per_Capita and GDP_per_Capita_PPP).
- Under Axes label your axes.
- Under Tools, turn off Selection & Zoom because scatter charts and maps cannot be linked with actions.
Publish
Publishing your app makes it visible to users on the internet.
- Click the Save icon to save your changes.
- Click the Publish button at the top of the page.
- Click the ellipsis (...) at the top of the page and Change share settings to Everyone.
- Click the ellipsis (...) at the top of the page and Copy published item link.