Creating Subsets of Data in ArcGIS Pro
You will often run into occasions when you have a large collection of features, but only want to map or perform analysis on a subset of those features, such as for a particular area. ArcGIS Pro has a variety of different (and redundant) methods for subsetting data, which can add confusion when deciding how to subset data for a particular application.
This tutorial will cover some means of selecting subsets of data in ArcGIS Pro.
Example Data
The example data for this tutorial are three feature classes created in a project geodatabase using data from the US Census Bureau (USCB).
A feature class is a collection of features each having the same spatial representation (point, line, or polygon) and the same set of attributes (ESRI 2023). The three sets of USCB features will be loaded into three separate feature classes.
A geodatabase is a collection of feature classes, rasters, and/or tables held in a common file system folder (file geodatabase) or relational database management system (ESRI 2023).
A project geodatabase is a file geodatabase specific to each ArcGIS Pro project that is included in a project directory and is used as the default geodatabase for new feature classes created by tools used in the project.
Census Tracts
The US Census Bureau (USCB) is the US federal government agency responsible for collecting data about people and the economy in the United States. The Census Bureau has its roots in Article I, section 2 of the US Constitution, which mandates an enumeration of the entire US population every ten years (the decennial census) in order to set the number of members from each state in the House of Representatives and Electoral College (USCB 2017). The Census Act of 1840 established a central office for conducting the decennial census, and that office became the Census Bureau under the Department of Commerce and Labor in 1903 (USCB 2021).
Demographic data is "the statistical characteristics of human populations (such as age or income)." Etymologically, the word is a combination of the Greek words dêmos (people) and graphein (write) - literally, writing about people (Merriam-Webster 2020).
Census tracts are organizational boundaries used for USCB data collection that are drawn to roughly align with neighborhood borders. Ideally, each tract contains 4,000 residents, although the number of residents can vary depending on area (USCB 2019).
The Minn 2015-2019 ACS Tracts feature layer in the University of Illinois ArcGIS Online organization provides basic demographic data for census tracts and counties in the US from the 2015-2019 American Community Survey five-year estimates.
To avoid issues with access speed and availability, we use the Export Features tool to copy the data from the feature service into a new feature class in the project geodatabase (US_Tracts).
Roads
County-level road data is sourced from the USCB TIGER/Line Shapefiles All Roads download (County_Roads).
Places
USCB places data from the USCB TIGER/Line Shapefiles contains boundaries of municipalities and unincorporated areas (State_Places).
Filters vs. Definition Queries
There are two broad approaches to subsetting based on attributes: filters and definition queries.
A definition query isolates a subset of displayed features on a map layer, but leaves the source feature class intact.
Definition queries are set on layer properties in maps and are most appropriate when you just need a simple subset for mapping or basic analysis.
This example shows how to use a definition query to subset Mount Pulaski, IL from the State_Places feature class using the NAME field.
A filter isolates data as it is being moved or processed into a new feature class by a tool.
- Filters are available in many ArcGIS Pro analysis and data management tools.
- Filters are more appropriate than definition queries when working with sequences of tools.
- Filters are your only option when adding tools to ModelBuilder diagrams.
- The examples in this tutorial use filters.
This example shows how to use a filter to subset Mount Pulaski, IL from the State_Places feature class using the Select tool with an expression on the NAME field, creating a new feature class in the project geodatabase (City_Boundary).
ModelBuilder
ModelBuilder is a visual programming language in ArcGIS Pro that lets you use a graphical editor to create custom tools that automate complex, tedious, or repetitive tasks that follow consistent step-by-step sequences of operations (workflows).
ModelBuilder is useful when working with tool filters because you can easily diagnose filter problems and re-run tools without having to completely re-enter the tool and filter parameters.
This example demonstrates creating a new ModelBuilder diagram and adding an Export Features tool using the same filter from the interactive example above.
Behind the scenes, ModelBuilder creates Python code that uses the ArcPy API. While most users never need to see this code, you can export and examine it if needed for diagnostics or to submit for an assignment.
You can view and copy the code under ModelBuilder, Send To Python Window.
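As a rough idea of what such a script can look like, here is a short hand-written sketch (not actual ModelBuilder output) that runs Export Features with the NAME filter from the earlier example. The geodatabase path passed to the function is a placeholder you would supply, and the arcpy import is deferred until the function is called because arcpy is only available inside an ArcGIS Pro Python environment.

```python
import os


def name_filter(name):
    """Build a WHERE clause that matches one exact NAME value."""
    escaped = name.replace("'", "''")  # double any single quotes inside the text
    return f"NAME = '{escaped}'"


def export_city_boundary(gdb, name="Mount Pulaski"):
    """Copy the matching place from State_Places into City_Boundary."""
    import arcpy  # deferred: requires an ArcGIS Pro Python environment

    in_features = os.path.join(gdb, "State_Places")
    out_features = os.path.join(gdb, "City_Boundary")
    # The third argument is the filter expression (where_clause).
    arcpy.conversion.ExportFeatures(in_features, out_features, name_filter(name))
    return out_features


print(name_filter("Mount Pulaski"))
```

Calling export_city_boundary with the path to your project geodatabase would run the tool. The print line only shows the filter expression that would be passed to it.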
Filter Expressions
Filter expressions are used to define the criteria for subsetting features. Expressions are used in both tool filters and definition queries.
Aside from simple exact value matches like the NAME example above, expressions can be created to perform a variety of different comparisons.
Quantitative Ranges
You can create subsets based on ranges of quantitative variables.
For this example, we Export Features for census tracts with a median household income below the 2015-2019 Illinois median $65,886 (Low_Income_Tracts).
Multiple Conditions Combined with AND
To filter by multiple conditions, you can use the AND operator to subset only features that meet two or more different conditions.
For this example, we Export Features for census tracts that are in Illinois (ST = IL) and have a median household income less than $65,886 (Low_Income_Tracts).
Partial Text Match
You can create subsets based on partial matches of names. This can be useful when names may contain additional text, such as directions (east, west) or unpredictable suffixes like "Road" vs. "Rd."
For this example, we use an expression to Select places across the US that contain the word "Mount" in their name (Mount_Places).
Begins With
Begins with conditions match text that begins with a search string.
One application of begins with is subsetting census tracts in specific counties based on the GEO_ID field included in USCB data.
- For census tracts, the GEO_ID begins with 1400000US, followed by the five-digit FIPS code for the county (two-digit state code plus three-digit county code), followed by the six-digit tract code.
- GEO_ID and FIPS codes for different geographies are explained further in this tutorial on Geospatial Data from the US Census Bureau.
- For this example, we subset the census tracts for Logan County, Illinois, USA (FIPS code 17107), where the begins with comparison string is 1400000US17107.
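The comparison string can also be assembled from the state and county FIPS codes, which helps when you need the same subset for several counties. This is a minimal sketch: the function names are made up for illustration, and the field name is a parameter because the actual GEO_ID field name depends on how your tract data was prepared.

```python
TRACT_PREFIX = "1400000US"  # prefix for census tract GEO_IDs, as described above


def county_tract_prefix(state_fips, county_fips):
    """Return the GEO_ID prefix shared by every tract in one county."""
    # Pad with leading zeros so state is two digits and county is three.
    return f"{TRACT_PREFIX}{state_fips:0>2}{county_fips:0>3}"


def begins_with_clause(field, prefix):
    """Build a 'begins with' expression using the SQL % wildcard."""
    return f"{field} LIKE '{prefix}%'"


logan = county_tract_prefix("17", "107")  # Logan County, Illinois
print(logan)
print(begins_with_clause("GEO_ID", logan))
```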
Multiple Conditions Combined with OR
You can add additional clauses connected with the OR operator to match multiple criteria.
For this example, we subset interstates and major roads that have a RTTYP (route type code) of I (interstate), U (US highway), or S (state highway) to create a new feature class of highways (County_Highways).
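When the list of acceptable values grows, typing each OR clause by hand gets tedious. The sketch below builds the expression from a list of route type codes. The helper names are illustrative, and the second form assumes your data source supports the standard SQL IN operator.

```python
def any_of_clause(field, values):
    """Join one equality test per value with OR."""
    tests = [f"({field} = '{value}')" for value in values]
    return " OR ".join(tests)


def in_list_clause(field, values):
    """A more compact equivalent using the SQL IN operator."""
    quoted = ", ".join(f"'{value}'" for value in values)
    return f"{field} IN ({quoted})"


highway_types = ["I", "U", "S"]  # interstate, US highway, state highway
print(any_of_clause("RTTYP", highway_types))
print(in_list_clause("RTTYP", highway_types))
```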
Subsetting Based on Proximity
The distinguishing characteristic of geospatial data is the where component, and ArcGIS Pro can be used to subset data based on location relative to features in other layers.
Select by Location
The Select Layer by Location tool selects features based on their proximity to features in another layer. Because this tool only selects features, you also need to run the Export Features tool to copy those features to a new feature class.
For this example, we create a feature class of census tracts (Interstate_Tracts) within one mile of interstates, which could be useful for analyzing the health effects of exposure to high levels of auto and truck pollution.
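The select-then-export sequence can be sketched in ArcPy as follows. This is an illustration under assumptions rather than the exact steps used in the example: it assumes the interstates are taken from County_Roads with an RTTYP filter, the layer names are arbitrary, and the arcpy import is deferred because it only works inside an ArcGIS Pro Python environment.

```python
import os


def export_tracts_near_interstates(gdb, distance="1 Miles"):
    """Select tracts within a distance of interstates, then export them."""
    import arcpy  # deferred: requires an ArcGIS Pro Python environment

    tracts = os.path.join(gdb, "US_Tracts")
    roads = os.path.join(gdb, "County_Roads")
    out_features = os.path.join(gdb, "Interstate_Tracts")

    # Selections work on layers, so make a layer for each input.
    arcpy.management.MakeFeatureLayer(tracts, "tracts_layer")
    arcpy.management.MakeFeatureLayer(roads, "interstates_layer", "RTTYP = 'I'")

    # Select the tracts that fall within the search distance of an interstate.
    arcpy.management.SelectLayerByLocation(
        "tracts_layer", "WITHIN_A_DISTANCE", "interstates_layer", distance
    )

    # Export Features copies only the selected features of a layer.
    arcpy.conversion.ExportFeatures("tracts_layer", out_features)
    return out_features
```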
Clipping
Clipping subsets the features contained within (or outside) the boundaries defined by one or more polygons in another layer.
In this example, we use the Clip tool to subset roads (County_Roads) within Mount Pulaski, IL (City_Boundary) into a new feature class (City_Roads).
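For reference, the same clip can be sketched as a single ArcPy call. The function name is made up for illustration, the feature class names follow the example above, and the arcpy import is deferred because it requires an ArcGIS Pro Python environment.

```python
import os


def clip_city_roads(gdb):
    """Clip County_Roads to the City_Boundary polygon."""
    import arcpy  # deferred: requires an ArcGIS Pro Python environment

    roads = os.path.join(gdb, "County_Roads")
    boundary = os.path.join(gdb, "City_Boundary")
    out_features = os.path.join(gdb, "City_Roads")

    # Arguments: input features, clip features, output feature class.
    arcpy.analysis.Clip(roads, boundary, out_features)
    return out_features
```

Unlike a selection, clipping cuts roads at the boundary, so a road that crosses the city limit is kept only up to the edge of the polygon.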
Subsets by Drawing Selections
In some cases, the subsetting criteria may be primarily visual or too amorphous to encode in a formal filter or location expression.
In such cases, you can subset data by manually drawing selections and then using the Export Features tool to export those selected features into a new feature class (Chicago_Metro_Tracts).
This technique is inherently arbitrary and difficult to reproduce or automate, and should be used only when there is absolutely no reasonable way to create a filter expression.
For this example, we display selected tracts around Springfield, IL.
SQL Expressions
Enterprise GIS commonly stores geospatial data in the same types of enterprise relational databases that are ubiquitous in information technology.
Structured query language (SQL) is a language used to interact with relational databases. ArcGIS Pro generally hides SQL behind the user interface, which eliminates the need for most users to know SQL. However, developers and other users who work on the information technology side of GIS need some basic familiarity with SQL.
Filter and query expressions can be specified in ArcGIS Pro using the same syntax as SQL WHERE clauses. SQL can sometimes be easier to work with than dialog combo boxes when expressions incorporate multiple comparisons.
For example, this expression subsets features from the Places feature class that have a NAME of Mount Pulaski.
NAME = 'Mount Pulaski'
Note that text values used for comparison must be enclosed in single quotation marks, but numeric values can be used as they are.
ALAND >= 100000000
SQL can be used in both definition queries and tool filters by selecting the SQL switch in the dialog.
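If you build expressions in a script, the quoting rule above is easy to get wrong. The following sketch shows one way to format values, assuming the standard SQL convention of doubling a single quote that appears inside a text value. The function names are illustrative.

```python
def sql_literal(value):
    """Format a Python value for use in a WHERE clause."""
    if isinstance(value, str):
        # Text goes in single quotes; an embedded quote is doubled.
        return "'" + value.replace("'", "''") + "'"
    return str(value)  # numbers are used as they are


def comparison(field, operator, value):
    """Combine a field name, an operator, and a formatted value."""
    return f"{field} {operator} {sql_literal(value)}"


print(comparison("NAME", "=", "Mount Pulaski"))
print(comparison("ALAND", ">=", 100000000))
print(comparison("NAME", "=", "O'Fallon"))
```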
Comparison Operators
Comparison operators are used to specify selection criteria.
- = equal
- > greater than
- < less than
- >= greater than or equal
- <= less than or equal
- LIKE partial string search
- IS NULL value is missing
- IS NOT NULL value is present
Logical Operators
Logical operators are used to combine comparisons.
- AND
- OR
- NOT
Examples
These examples are based on the dialog subsets demonstrated above.
Census tracts with a median household income below the 2015-2019 Illinois median $65,886
Median_Household_Income < 65886
Census tracts in Illinois and median household income less than $65,886
(ST = 'IL') AND (Median_Household_Income < 65886)
Counties across the US that contain the word "Pulaski" in their name
NAME LIKE '%Pulaski%'
Census tracts in Logan County, IL (FIPS code 17107)
FactFinder_GEO_ID LIKE '1400000US17107%'
Interstates and major roads that have a RTTYP (route type code) of I (interstate) or U (US highway)
(RTTYP = 'I') OR (RTTYP = 'U')
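To experiment with expressions like these outside ArcGIS Pro, you can try them against a small table in SQLite, which is included with Python. This is only a stand-in for a geodatabase: the table holds a handful of illustrative place names rather than real feature data, and details such as the case sensitivity of LIKE vary between database systems.

```python
import sqlite3

# Illustrative place names only, standing in for the NAME field of a feature class.
names = ["Lincoln", "Mount Pulaski", "Mount Vernon", "Pulaski"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (NAME TEXT)")
conn.executemany("INSERT INTO places VALUES (?)", [(n,) for n in names])


def subset(where_clause):
    """Return the names that satisfy a WHERE clause, sorted alphabetically."""
    sql = f"SELECT NAME FROM places WHERE {where_clause} ORDER BY NAME"
    return [row[0] for row in conn.execute(sql)]


print(subset("NAME LIKE '%Pulaski%'"))                     # contains
print(subset("NAME LIKE 'Mount%'"))                        # begins with
print(subset("(NAME = 'Lincoln') OR (NAME = 'Pulaski')"))  # either value
print(subset("NOT (NAME LIKE 'Mount%')"))                  # everything else
```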