Classification in ArcGIS Pro

ArcGIS Pro provides several ways of allocating ranges of numbers to categories (classification methods). Which method you should use depends both on the characteristics of the variable you are mapping and on the story you are trying to tell.

The differences in the maps created with these different classification methods can sometimes be subtle, but can also be quite dramatic depending on the distribution of values in your variable.

This tutorial will walk through five steps to choose a classification scheme for a choropleth or graduated symbol map:

  1. Create the Map
  2. Analyze the Distribution
  3. Define the Audience and Intent
  4. Choose the Classification Scheme
  5. Publish

Create the Map

Classification is relevant primarily with choropleths (graduated colors), although it is also a lesser consideration for graduated symbols like bubble maps.

For this tutorial, we will use the Minn 2014-2018 ACS States feature service from the University of Illinois organization that contains a variety of useful variables from the American Community Survey. This data is also available as a GeoJSON file here.

By default, this feature service opens with a display of median household income and an unclassed colors symbology. The video shows how to change the classification to Natural Breaks (Jenks).

Alternative Classification Schemes

Analyze the Distribution

A distribution is the manner in which a variable's values are spread across the range of values. In ArcGIS Pro, the distribution of a variable you are symbolizing can be viewed in the histogram display on the classification dialog. The bars represent the number of features at different values, and those bars are overlaid by lines indicating where the active classification scheme draws the classification boundaries for colors / sizes.

ArcGIS Pro Classification Histogram
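To make the idea of histogram bars concrete, here is a minimal sketch of how values can be counted into equal-width bins. It is not how ArcGIS Pro builds its histogram; the function name `histogram_counts` and the sample values are made up for illustration.

```python
def histogram_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical values: most are low, one is much higher.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
for count in histogram_counts(sample, bins=3):
    print("#" * count)
```

The long first bar and short last bar are the kind of shape the classification dialog lets you inspect before choosing where class boundaries should fall.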

The Normal Distribution

While there are dozens of mathematically defined distributions that have a variety of applications in statistics, one very common distribution found in social and environmental geospatial data is the normal distribution (commonly called the "bell curve"), where values are distributed symmetrically around a central value and become less frequent farther from it.

Figure
Normal Distribution: Percent Adults With Bachelor's Degrees by State

Skew

Perfectly normal distributions only occur in the abstract world of mathematics, and there are a variety of ways in which real-world distributions vary from the mathematical ideal. Those variations can guide the selection of classification schemes used to visualize your data.

One common variation is skew, where the clump of values is higher or lower than the middle of the range of values.

A good example is median household income, where most households clump around a middle range, but a handful of households have high or very high incomes, which skews the distribution to the right.

Figure
Skewed Normal Distribution: Median Household Income

One extreme form of skew is the geometric distribution where values are clumped at the low end (left) of the distribution with a handful of high values spread out (skewed) to the right.

These types of distribution are common with population counts, where a handful of large areas/countries have high populations, but most areas are small and have low populations. These are often log-normal, meaning that the logarithms of the values form a normal distribution.

Figure
Geometric Distribution: Total Population
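Skew can also be checked numerically. The sketch below computes the moment coefficient of skewness (positive for a longer right tail, zero for a symmetric distribution) and shows how taking logarithms reduces the skew of population-like counts. The function name and the population values are hypothetical, chosen only to illustrate the pattern described above.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up counts: many small areas and a few very large ones.
populations = [1, 2, 2, 3, 4, 5, 8, 12, 30, 100]

raw_skew = skewness(populations)
log_skew = skewness([math.log10(v) for v in populations])
print(raw_skew > log_skew)  # the log-transformed values are less skewed
```

A strongly positive value on the raw data that shrinks after the log transform suggests the variable is closer to log-normal than to normal.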

Kurtosis

Kurtosis is the sharpness of the peak of a centralized distribution. If a distribution has a wide range of values and only a weak cluster around the central value, the kurtosis of that distribution is said to be low. If more of the values in the distribution are clustered around the mean than would be expected with a mathematical normal curve, kurtosis is said to be high.

For example, although there are states with unusually high numbers of children (Utah) and unusually high numbers of seniors (Vermont), lifespans are fairly similar around the US, so median age by state has high kurtosis with a sharp peak around the average of 38.3.

Figure
High Kurtosis: Median Age

An extreme example of low kurtosis is the uniform distribution, where values are spread fairly evenly across the range of values with no significant clusters.

Uniform distributions are rare with social or environmental data. However, rank orders that force quantitative values into a sequence of ordinal numbers are uniform. For example, the distribution of the rank orders of percent of population that has a college degree is uniform.

Figure
Uniform Distribution: States Ranked in Order By Income
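The contrast between a sharp peak and a uniform rank order can be expressed with excess kurtosis, which is zero for a mathematical normal curve, positive for a distribution with a sharper peak and heavier tails, and negative for a flat one. This is a minimal sketch with a hypothetical function name and made-up median ages; it is not part of ArcGIS Pro.

```python
def excess_kurtosis(values):
    """Fourth central moment / variance ** 2, minus 3 (the normal-curve value)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Made-up median ages: tightly clustered with one low and one high outlier.
peaked = [30, 37, 38, 38, 38, 38, 38, 39, 46]

# Rank orders 1..50 are uniform by construction.
ranks = list(range(1, 51))

print(excess_kurtosis(peaked) > 0)  # sharper than a normal curve
print(excess_kurtosis(ranks) < 0)   # flatter than a normal curve
```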

Multimodality

Distributions often have multiple clumps of values. Even distributions with a dominant central clump may have smaller secondary clumps on the left or right tails. Such distributions with multiple clumps are called multimodal.

Most real-world geospatial data is at least a little bit lumpy. In evaluating these clumps, the question becomes whether those clumps represent something meaningful (such as a certain class of people or region of the country), or whether they are simply a random accident.

An example of a multimodal distribution is the percent of veterans by state. While there is a sharp peak (high kurtosis) around the mean of 8.2%, there is a smaller clump of states with low percentages (5% - 6%) of veterans.

Figure
Multimodal Distribution: Percent of Veterans Living in Each State

Define the Audience and Intent

As with cartography in general, your choice of classification scheme will be influenced by the potential audience for the map, by their needs in reading the map, and by your intention for creating the map. While there are cartographic conventions that prescribe and proscribe certain practices, maps have different needs that can be met in different ways.

Toward that end, there are two broad considerations in choosing a classification scheme: understandability and focus.

Understandability

What information do you want users to be able to glean from the map?

You need to understand whether users will be using the map to find specific information about areas or simply need to get a general impression of the spatial distribution of the phenomena represented by the variable.

If users need to understand specific information, such as wanting to find the range of values for a specific city or county, the category boundaries should be clear, memorable numbers whose relationship to each other is easy for the user to understand.

If users are just using the map to get a general impression of where things are, the boundaries can be determined by other aesthetic or ideological considerations.

Cartographic literature based on perceptual studies usually recommends five to nine classes as the maximum number of choropleth classes that map users can distinguish and comprehend (Declercq 1995, Mersey 1990). ArcGIS Pro usually defaults to five.

Focus

What information do you want to emphasize for readers of your map?

Some classification schemes will highlight the areas with the highest and/or lowest values, while others will create classes that cause a more uniform distribution of colors.

When your data is sharply skewed or has extreme outliers, you need to consider whether it is important to highlight those areas or to create a more even distribution of colors/sizes.

You need to decide whether you intend the classifications to represent clearly defined categories of areas (such as classes of income or levels of crime - low, medium, high), or whether they will simply be viewed relative to each other.

This is especially important with multimodal data. If the clumps represent important groupings that need to be accented, the clumps need to be in separate visual categories. However, if the clumps are just statistical accidents, emphasizing those clumps can create an impression of differences that the data does not support.

Choose the Classification Scheme

Natural Breaks (Jenks)

This scheme uses an algorithm to seek clumps of values that are clustered together in order to form categories that may reflect meaningful groupings of areas. It is named after the developer of the algorithm, George Jenks.

Natural breaks is the default in ArcGIS Pro and is a safe generic choice with most distributions if you are in a hurry and the map just needs to give a general impression of the distribution of values across the areas.

For the purposes of this tutorial, other classification schemes will be compared to natural breaks.

One problem with natural breaks classification arises when the data contains clusters of values that are not actually meaningful groupings. Natural breaks classification will place those clusters together and create a false visual impression that does not reflect the phenomenon being visualized.

The example below is a map of the percentage of adults with bachelor's degrees by state. This variable has a normal distribution.

Figure
Natural Breaks: Percent of Adults With Bachelor's Degrees by State
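The clump-seeking idea behind natural breaks can be sketched as an optimization: choose class boundaries on the sorted values so that the total squared deviation of values from their own class mean is as small as possible. The dynamic-programming version below is one exact way to do that on small datasets. It is an illustrative sketch with a hypothetical function name and made-up values, and it should not be read as the implementation ArcGIS Pro uses.

```python
def natural_breaks(values, k):
    """Return the upper bound of each of k classes, chosen to minimize
    the within-class sum of squared deviations."""
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the squared deviation of any slice in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: lowest total cost of splitting data[:j] into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # start[c][j]: index where the last of those c classes begins.
    start = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    start[c][j] = i

    # Walk back through the stored split points to recover the upper bounds.
    breaks = []
    j = n
    for c in range(k, 0, -1):
        breaks.append(data[j - 1])
        j = start[c][j]
    return breaks[::-1]


# Made-up values with three obvious clumps.
print(natural_breaks([1, 2, 3, 10, 11, 12, 50, 51, 52], 3))  # [3, 12, 52]
```

Note that the algorithm finds clumps whether or not they mean anything, which is exactly the risk described above.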

Quantile

Quantile classification creates groupings so that there is an equal number of values in each grouping. This effectively creates a map showing the rank order of a variable. If you use the ArcGIS default of five categories, you are dividing the features into five ranked categories by percentile (quintiles), each containing roughly 20% of the features.

Assuming the areas are of comparatively similar size, quantile classification distributes the colors of a choropleth evenly across the map and can blur clusters that occur in the data.

This example is a map of median household income by state. This variable has a skewed normal distribution because most American households are clumped around the mean, but a handful of wealthy / expensive states extend the right tail of the curve.

Figure
Natural Breaks (Left) vs. Quantile (Right) Classification: Median Household Income by State
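As a rough illustration of the equal-count idea, the sketch below sorts the values and places each class boundary at the value whose rank closes out that class's share of the features. The function names and income figures are hypothetical, and tied values (which can make class counts unequal) are not handled specially.

```python
from bisect import bisect_left


def quantile_breaks(values, k):
    """Upper bound of each of k classes holding (nearly) equal counts."""
    data = sorted(values)
    n = len(data)
    # Class c ends at the value whose 1-based rank is ceil(c * n / k).
    return [data[(c * n + k - 1) // k - 1] for c in range(1, k + 1)]


def class_index(value, breaks):
    """Index of the first class whose upper bound is >= value."""
    return bisect_left(breaks, value)


# Made-up median incomes in thousands of dollars, skewed to the right.
incomes = [45, 48, 50, 52, 55, 58, 60, 65, 80, 95]
breaks = quantile_breaks(incomes, 5)
counts = [sum(1 for v in incomes if class_index(v, breaks) == c) for c in range(5)]
print(breaks)  # [48, 52, 58, 65, 95]
print(counts)  # [2, 2, 2, 2, 2]
```

The top class spans 65 to 95 while the others span only a few thousand dollars each, which is how quantile classification evens out the colors at the cost of hiding how far the high values extend.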

Equal Interval / Defined Interval

Equal interval classification divides the range of values evenly by the number of categories to create evenly spaced categories.

Defined interval classification is like equal interval except the cartographer can specify the numeric width of classes and the software will adjust the range and number of classes accordingly.

This example of median age has a distribution with high kurtosis.

Figure
Natural Breaks (Left) vs. Equal Interval (Right) Classification: Median Age by State

Likewise, for a map of population (a highly skewed log-normal distribution), the equal interval classification clearly focuses on how much more highly populated the largest states are relative to the rest of the country, as opposed to the natural breaks classification, which creates aggregations that blur those distinctions.

Figure
Natural Breaks (Left) vs. Equal Interval Classification (Right): Population by State
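Both interval schemes reduce to simple arithmetic. In the sketch below, equal interval divides the data range by the number of classes, while defined interval starts from a chosen class width and lets the number of classes follow from the data. Aligning the defined-interval boundaries to multiples of the width is an assumption made here for readability, and the function names and values are hypothetical.

```python
import math
from bisect import bisect_left


def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal numeric width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * c for c in range(1, k)] + [hi]


def defined_interval_breaks(values, width):
    """Upper bounds at multiples of `width` that cover the data range."""
    lo, hi = min(values), max(values)
    first = math.floor(lo / width) + 1
    last = max(first, math.ceil(hi / width))
    return [width * c for c in range(first, last + 1)]


# Made-up population counts, heavily skewed to the right.
populations = [1, 2, 2, 3, 4, 5, 8, 12, 30, 100]
breaks = equal_interval_breaks(populations, 5)
counts = [0] * 5
for v in populations:
    counts[bisect_left(breaks, v)] += 1
print(counts)  # [8, 1, 0, 0, 1]

# Made-up median ages with a class width of 5 years.
print(defined_interval_breaks([30.5, 36, 38, 38.5, 39, 44.9], 5))  # [35, 40, 45]
```

With skewed data, most features land in the lowest class and some classes are empty, which is what makes the few largest values stand out on the map.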

Geometric Division

Geometric classification performs an equal interval classification on a logarithmic scale. This scheme is most appropriate for data with a highly skewed log-normal distribution.

This example uses population of states.

Figure
Natural Breaks (Left) vs. Geometric Classification (Right): Population by State
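Taking the description above literally, an equal interval classification on a logarithmic scale can be sketched as follows. This follows the simplified description in this tutorial; the software's own geometric classification may differ in detail. The function name and values are hypothetical, and the sketch assumes all values are positive.

```python
import math


def log_interval_breaks(values, k):
    """Upper bounds of k classes that are equally wide on a log10 scale."""
    lo, hi = min(values), max(values)
    if lo <= 0:
        raise ValueError("a logarithmic scale requires positive values")
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / k
    # Convert each evenly spaced log boundary back to the original units.
    return [10 ** (log_lo + step * c) for c in range(1, k)] + [hi]


# Made-up population counts spanning three orders of magnitude.
for upper in log_interval_breaks([1, 3, 8, 20, 75, 300, 1000], 3):
    print(round(upper, 2))
```

Each class covers the same ratio between its lower and upper bound (a factor of ten in this example) rather than the same numeric width.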

Standard Deviation

Standard deviation classification chooses classes based on the mean and standard deviation.

This is the most statistically rigorous classification scheme, although it only has that rigor if used with data that is normally distributed. The algorithm does not transform to compensate for skew, kurtosis, or multimodality.

The example below is a map of the percentage of adults with bachelor's degrees by state. This variable has a normal distribution.

Figure
Natural Breaks (Left) vs. Standard Deviation Classification (Right): Percent of Adults With Bachelor's Degrees by State
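A minimal sketch of this scheme follows. It assumes that the middle class is centered on the mean, so boundaries fall at the mean plus or minus 0.5, 1.5, 2.5 (and so on) standard deviations, and that the population standard deviation is used; both are assumptions made for illustration, as are the function name and sample values.

```python
import statistics


def std_dev_breaks(values, interval=1.0):
    """Interior class boundaries spaced `interval` standard deviations
    apart, with the middle class centered on the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    lo, hi = min(values), max(values)
    step = interval * sd
    breaks = []
    # Step outward from the mean until the data range is covered.
    below = mean - step / 2
    while below > lo:
        breaks.append(below)
        below -= step
    above = mean + step / 2
    while above < hi:
        breaks.append(above)
        above += step
    return sorted(breaks)


# Made-up values with mean 5 and standard deviation 2.
print(std_dev_breaks([2, 4, 4, 4, 5, 5, 7, 9]))  # [4.0, 6.0, 8.0]
```

Because the boundaries depend only on the mean and standard deviation, a skewed variable produces lopsided classes, which is why the scheme suits normally distributed data.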

Continuous Color Scheme

You can avoid the pitfalls of classification by not classifying at all and using a continuous color scheme instead. This is similar to equal interval classification, and you can use a logarithmic or exponential transformation if your data is heavily skewed.

Figure
Natural Breaks Classification (Left) vs. Continuous (Unclassed) Colors (Right): Median Household Income by State
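The idea of unclassed colors can be sketched as a linear interpolation between two colors, with an optional logarithmic transformation for skewed data. This is an illustration only: the function name, the RGB endpoints, and the value ranges are hypothetical, and ArcGIS Pro's color ramps are not limited to two-color linear blends.

```python
import math


def continuous_color(value, lo, hi, low_rgb, high_rgb, log_scale=False):
    """Blend between two RGB colors according to where `value` falls
    between `lo` and `hi`, optionally on a log10 scale."""
    if log_scale:
        # Requires positive numbers; compresses the long right tail.
        value, lo, hi = math.log10(value), math.log10(lo), math.log10(hi)
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    return tuple(round(a + (b - a) * t) for a, b in zip(low_rgb, high_rgb))


light, dark = (0, 0, 0), (200, 100, 50)  # hypothetical ramp endpoints
print(continuous_color(50, 0, 100, light, dark))                      # (100, 50, 25)
print(continuous_color(100, 1, 10000, light, dark, log_scale=True))   # (100, 50, 25)
```

On the linear scale the midpoint color goes to the midpoint value (50 of 100), while on the log scale it goes to the geometric midpoint (100 between 1 and 10,000), which spreads the colors more evenly across skewed data.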