Exploring Data in R

The methods you use to analyze data in R will be determined by the characteristics of the data and the question(s) you seek to answer with that data.

This tutorial will introduce techniques for exploring and determining the characteristics of data, along with methods for answering different types of questions with different types of data.

Exploratory Data Analysis

The idealized world of the scientific method is question-driven, with the collection and analysis of data determined by questions and hypotheses:

The Scientific Method

However, in some cases, especially at the early stages of research, a data-driven approach may be more appropriate. This reverses the question-driven process, with the data coming first and animating a process of exploratory data analysis, which is the approach outlined in this tutorial:

Exploratory Data Analysis

Example Data

As example data, this tutorial will use a table of anonymized individual responses from the CDC's Behavioral Risk Factor Surveillance System. The BRFSS is a "system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services" (CDC 2019).

Guidance on how to download and process this data is available here...

A CSV file with the selected variables used in this tutorial is available here...

The Behavioral Risk Factor Surveillance System

Metadata

Metadata is data about your data. If there is documentation of what your data contains, working with the data will be much easier than if you have to figure out what everything is.

Ideally, your metadata should include information about the variables in the data set, such as variable names, descriptions, data types, units, and valid ranges of values.

Metadata may also include information about how, when, and by whom the data was collected, as well as any known limitations of the data.

Metadata is often an afterthought when data is captured, processed, and maintained. Accordingly, metadata is often cryptic, incomplete, and inaccurate. But it is better than nothing.

The BRFSS has a codebook that gives the survey questions associated with each variable, and the way that responses are encoded in the variable values.

BRFSS codebook

Import

Tabular data can be read into a data frame using the read.csv() function. Data frames are analogous to spreadsheets, with columns for variables and rows for individual observations (in this case, survey responses).

responses = read.csv("2020-brfss-survey-responses.csv", as.is=T)

The dim() function shows the dimensions of the data frame in number of rows (survey responses in this example) and number of columns (variables on each response in this example).

dim(responses)

	[1] 401958     49

The names() function shows the available variables. Consult the metadata for documentation of what the variables represent and what the ranges of values are.

names(responses)

	 [1] "FIPS"      "X_URBSTAT" "X_AGEG5YR" "DISPCODE"  "NUMADULT"  "SEXVAR"   
	 [7] "GENHLTH"   "PHYSHLTH"  "MENTHLTH"  "HLTHPLN1"  "EXERANY2"  "SLEPTIM1" 
	[13] "CVDINFR4"  "CVDCRHD4"  "CVDSTRK3"  "ASTHMA3"   "CHCSCNCR"  "CHCOCNCR" 
	[19] "CHCCOPD2"  "HAVARTH4"  "ADDEPEV3"  "CHCKDNY2"  "DIABETE4"  "DIABAGE3" 
	[25] "LASTDEN4"  "RMVTETH4"  "MARITAL"   "EDUCA"     "RENTHOM1"  "VETERAN3" 
	[31] "EMPLOY1"   "CHILDREN"  "INCOME2"   "PREGNANT"  "WEIGHT2"   "HTIN4"    
	[37] "DEAF"      "BLIND"     "SMOKDAY2"  "STOPSMK2"  "AVEDRNK3"  "DRNK3GE5" 
	[43] "SEATBELT"  "HADMAM"    "PSATEST1"  "COLNSCPY"  "HIVTST7"   "HIVRISK5" 
	[49] "ST"       
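Beyond names(), the built-in str() and summary() functions give a quick overview of each variable's type, sample values, and distribution; you could call str(responses) or summary(responses) on the full data frame. A minimal sketch, using a small simulated data frame in place of the survey data:

```r
# A small simulated data frame standing in for the survey responses
df = data.frame(HTIN4 = c(64, 67, NA, 70), SEXVAR = c(1, 2, 1, 2))

# str() compactly shows each variable's type and first values
str(df)

# summary() shows quartiles, means, and NA counts for each variable
summary(df)
```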

Subsetting

You will commonly need to create subsets of your data that select specific rows or columns from your data frame.

To select a single column (variable) from a data frame, follow the name of the data frame with a dollar sign and the name of the variable.

For example, this subsets the HTIN4 variable into a vector.

heights = responses$HTIN4

quantile(heights, na.rm=T)

  0%  25%  50%  75% 100% 
  36   64   67   70   95 

Data frames can also be treated like two-dimensional arrays, with subsets taken by following the data frame name with square brackets containing row and column selections.

For example, this statement selects the HTIN4 variable for only rows where the SEXVAR is one (male).

men = responses[responses$SEXVAR == "1", "HTIN4"]

quantile(men, na.rm=T)

  0%  25%  50%  75% 100% 
  36   68   70   72   95 

This particular BRFSS data set often mixes special codes in with quantitative values. Subsetting can remove those codes so you have only the quantitative values.

sleep = responses[responses$SLEPTIM1 <= 24, "SLEPTIM1"]

quantile(sleep, na.rm=T)
  0%  25%  50%  75% 100% 
   1    6    7    8   24 

The %in% operator can be used with the NOT (!) operator to remove specific values.

health = responses[!(responses$PHYSHLTH %in% c(77, 99, NA)), "PHYSHLTH"]

health[health == 88] = 0

quantile(health)

  0%  25%  50%  75% 100% 
   0    0    0    2   30 

Distributions

A frequency distribution is "an arrangement of statistical data that exhibits the frequency of the occurrence of the values of a variable" (Merriam-Webster 2022).

The type of distribution will dictate what kinds of statistical techniques you should use on your data.

Histograms

A basic way of visualizing a distribution is a histogram, which is a bar graph that shows the number of values within different ranges of values.

In R, the hist() function can be used to create a quick histogram. In this data, the HTIN4 variable is height in inches.

hist(responses$HTIN4)
Histogram of height in inches

Density Plots

An alternative visualization is a density plot, which shows clearer detail by smoothing the histogram in a manner similar to a continuous density distribution.

plot(density(responses$HTIN4, bw=1, na.rm=T), lwd=3, col="navy")
Density plot of height in inches

Normal Distributions

The normal distribution is a type of distribution that deserves special attention because many variables in both the social and environmental sciences follow it. Accordingly, many common statistical methods assume that variables are normally distributed, and you should use caution when selecting analytical methods to make sure your method is appropriate for your distribution.

The frequency distribution curve for the normal distribution is often called the bell curve because of its bell-like shape.

R contains functions for analyzing and simulating a variety of distributions. The dnorm() function returns the probability density of the normal distribution at the given values, which can be used to plot the idealized normal curve.

plot(dnorm(seq(-4, 4, 0.1)), type="l", lwd=3, col="navy")
The normal curve

The height variable roughly follows a normal distribution, in which values tend to cluster around a central (mean) value, with fewer values as you get further above or below the mean.

In the height density curve above, you may notice a slight notch on the right side of the curve. This is because men and women have separate normal distributions of height, and adding them together creates an asymmetrical curve.

men = responses[responses$SEXVAR == "1", "HTIN4"]

women = responses[responses$SEXVAR == "2", "HTIN4"]

plot(density(women, bw=1, na.rm=T), lwd=3, col="darkred", main="")

lines(density(men, bw=1, na.rm=T), lwd=3, col="navy")
Separate density plots for men (blue) and women (red)

Skew

A normal distribution is symmetrical in that there are an equal number of values above and below a central value, and the values change in a similar manner as you move above or below that central value.

Many distributions are skewed to one side or the other. For example, in this dataset most people report sleeping around seven hours per night, but a limited number of people claim to sleep considerably more than that, which stretches the right tail of the curve far to the right.

Note that values above 24 hours per day are used as codes for uncertain or refused to answer, and those codes should be removed before analyzing the distribution.

sleep = responses[responses$SLEPTIM1 <= 24, "SLEPTIM1"]

hist(sleep)
Distribution of hours of sleep per night

Modality

The mode of a distribution is the most common value in the distribution. With continuous variables, the mode is the peak of the density curve.
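For a continuous variable, the location of that peak can be estimated programmatically from the output of density(). A minimal sketch, using simulated heights in place of the survey variable:

```r
set.seed(42)

# Simulated heights in inches, standing in for a survey variable
heights = rnorm(10000, mean = 67, sd = 4)

# density() returns estimated x and y values; the mode is the x value
# at the maximum estimated density
d = density(heights, bw = 1)
mode_estimate = d$x[which.max(d$y)]

print(mode_estimate)
```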

Distributions like the normal distribution are unimodal because they have a single peak, around which smaller numbers of higher and lower values cluster.

Multimodal distributions have more than one prominent peak value.

For example in this data, the number of days over the past 30 days that respondents indicated that they felt their physical health was not good clusters into two modes representing generally well people (zero or one day) and people with chronic illnesses (all 30 days).

Note that this data set represents zero days with the code "88," and also includes codes for uncertain (77) and refused to answer (99).

health = responses[!(responses$PHYSHLTH %in% c(77, 99, NA)), "PHYSHLTH"]

health[health == 88] = 0

hist(health)
Multimodal distribution of people by number of days over the past 30 days that felt their physical health was not good

Measures of Central Tendency

When working with a variable, we often wish to summarize the variable with a measure of central tendency, which is a single value that represents the center of the distribution.

There are three primary types of central tendency values: mean, median, and mode.

Mean

The arithmetic mean of a variable is the sum of all values divided by the number of values. In common parlance, arithmetic means are often simply called means or averages.

In R, the mean() function calculates the arithmetic mean of the vector or field passed to it. The na.rm=T parameter is often helpful since mean() will return NA if any values within the variable are NA, and na.rm=T instructs mean() to ignore NA values.

x = mean(responses$HTIN4, na.rm=T)

print(x)

	[1] 66.99402

plot(density(responses$HTIN4, bw=1, na.rm=T), lwd=3, col="navy", main="")

abline(v=x, col="darkred", lwd=3)
A density plot of height (in inches) with a line for the mean

Median

A median is a central tendency value where half of the values in the distribution are below the median and half are above the median.

Medians are often used rather than means with non-normal or heavily skewed distributions where outliers can make the mean give a mistaken understanding of the phenomenon. For example, if a billionaire developer walks into a room of construction workers, the mean income will jump dramatically, while the median will stay largely the same.
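The effect of such an outlier can be demonstrated with a toy example, in which one extreme value moves the mean dramatically while leaving the median untouched:

```r
# Ten construction workers each earning $50,000
incomes = rep(50000, 10)

# A billionaire walks into the room
incomes = c(incomes, 1e9)

print(mean(incomes))    # jumps to roughly $91 million
print(median(incomes))  # remains $50,000
```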

In a symmetrical distribution like the normal distribution, the mean and median should be nearly identical. But in skewed distributions, the mean will be biased in the direction of the outliers.

sleep = responses[responses$SLEPTIM1 <= 24, "SLEPTIM1"]

xmean = mean(sleep, na.rm=T)

xmedian = median(sleep, na.rm=T)

plot(density(sleep, na.rm=T, bw=0.5), xlim=c(4, 12), col="gray", lwd=3, main="")

abline(v=xmean, col="navy", lwd=3)

abline(v=xmedian, col="darkred", lwd=3)

text(6.2, 0.1, "Median")

text(7.8, 0.1, "Mean")
Mean vs. median hours of sleep per night

Mode

The mode of a distribution is the most common value in the distribution.

This concept is most clearly applicable with variables that have a limited number of possible values, such as categorical or ordinal variables.

Likert scales convert qualitative responses into ordinal quantitative values.

For example, the SEATBELT question is a Likert scale based on respondent ratings of their seatbelt usage on a scale of one (always) to five (never). Additional codes are included for people who were unsure, who never drive, or who refused to answer the question.

The table() function tallies frequency for each value in the distribution.

The mode for this distribution is clearly 1 (always wear a seat belt).

seatbelts = table(responses$SEATBELT)

print(seatbelts)

	     1      2      3      4      5      7      8      9 
	327249  26763  10195   3942   6632    263    976    427 
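The mode can also be extracted programmatically: which.max() finds the position of the largest tally in a table, and names() recovers the corresponding value (as a character string). A minimal sketch with a short vector of Likert-style responses:

```r
# A short vector of Likert-style responses (1 = always ... 5 = never)
ratings = c(1, 1, 2, 1, 3, 1, 2, 5)

# Tally each value, find the largest tally, and recover its value
mode_value = names(which.max(table(ratings)))

print(mode_value)  # [1] "1"
```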

Dividing by the number of responses and multiplying by 100 gives percentages that can be visualized as a bar graph with the barplot() function.

barplot(seatbelts * 100 / nrow(responses))
Seat belt usage Likert scale responses

While this visualization shows a clear dominance of people who respond that they always wear seat belts, keep in mind that with responses that involve politically-sensitive issues or imply a moral judgement, people often exhibit social-desirability bias in which they (intentionally or unintentionally) respond with the answer they think they are supposed to give rather than an answer that accurately reflects what they actually do in their lives (Krumpal 2013).

Measures of Variation

In order to fully describe a distribution, you need not only a central value, but also a measure of how the values in the distribution vary across their range.

Range

The simplest measure of variation with a quantitative value is the range() of values.

Note that ranges tell you nothing about the frequency of values, so outliers can cause range to give a distorted picture of what the data actually shows.

range(responses$HTIN4, na.rm=T)

	[1] 36 95

For categorical variables, the unique() function shows all possible values.

unique(responses$MARITAL)

	[1]  3  1  2  4  5  9  6 NA

A table() from a categorical variable can give you the frequency (count of rows) in each category. For coded variables like this marriage status variable, you will need to consult the metadata to find the meaning of each code.

table(responses$MARITAL)

	     1      2      3      4      5      6      9 
	207302  51939  43646   7975  72051  15261   3772 

Quantiles

A quantile is a set of values that divide a frequency distribution into classes (bins) that each contain the same percentage of the total population.

Quantiles are especially useful for describing non-normal distributions.

Quantiles with commonly used numbers of classes have their own names. Quartiles are the three values that divide a distribution into four groups. Quintiles are the four values that divide a distribution into five groups.
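Quantile breaks for any number of classes can be calculated by passing a sequence of probabilities in the probs parameter of the quantile() function. A minimal sketch, using simulated sleep hours in place of the survey variable:

```r
set.seed(7)

# Simulated hours of sleep per night, standing in for the survey variable
sleep = round(rnorm(1000, mean = 7, sd = 1.5))

# probs = seq(0, 1, 0.2) gives the minimum, the four quintile breaks, and the maximum
quintiles = quantile(sleep, probs = seq(0, 1, 0.2), na.rm = T)

print(quintiles)
```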

Quantiles are commonly visualized as box-and-whisker plots using the boxplot() function. Groups of side-by-side box plots provide a way of clearly comparing distributions. By default, box-and-whisker plots in R show the median as the center line, the interquartile range as the box, whiskers extending to the most extreme values within 1.5 times the interquartile range beyond the box, and any values beyond the whiskers as individual outlier points (Ford 2015).

boxplot(HTIN4 ~ SEXVAR, data = responses)

Percentiles

A percentile is "a value on a scale of 100 that indicates the percent of a distribution that is equal to or below it" (Merriam-Webster 2022).

Percentiles can be calculated by passing a probability in the prob parameter of the quantile() function.

For example, this calculates the 25th percentile of height, which means that 25% of male respondents were at or below 68 inches in height.

men = responses[responses$SEXVAR == "1", "HTIN4"]

quantile(men, prob=0.25, na.rm=T)

	25% 
	 68 

Going the other direction and asking what percent of respondents meet a condition requires a bit of calculation: the number of rows in a subset that meet the condition, divided by the total number of rows, and multiplied by 100 to get a percentage. For example, this calculates the percent of male respondents taller than five feet (60 inches).

over_five_feet = 100 * sum(men > 60, na.rm=T) / length(men)

print(over_five_feet)

	[1] 94.27646

Standard Deviation

With normally-distributed data, the width of the bell curve is specified by the standard deviation. Standard deviation is represented by the lower-case Greek letter sigma (σ), which looks like a lower-case English "b" that's had too much to drink.

The number of standard deviations that a value departs from the mean is called the z-score. The z-score of the mean is zero.

Converting values to z-scores is called standardizing, which allows comparison between values from different distributions.
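Standardizing can be done by subtracting the mean and dividing by the standard deviation (R's built-in scale() function performs the same calculation). A minimal sketch with simulated values:

```r
set.seed(1)

# Simulated heights in inches
x = rnorm(1000, mean = 67, sd = 4)

# z-scores: departures from the mean in units of standard deviations
z = (x - mean(x)) / sd(x)

# Standardized values have a mean of zero and a standard deviation of one
print(mean(z))
print(sd(z))
```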

The area under the bell curve between a pair of z-scores gives the percentage of values associated with that range of values. For example, the area between one standard deviation below the mean and one standard deviation above the mean represents around 68.2 percent of the values.
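Those areas can be calculated with the pnorm() function, which returns the cumulative area under the normal curve below a given z-score; the area between two z-scores is the difference of two pnorm() calls:

```r
# Area under the standard normal curve between z = -1 and z = 1
within_one_sd = pnorm(1) - pnorm(-1)

print(within_one_sd)  # [1] 0.6826895
```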

Percentages of Values Within A Normal Distribution

The sd() function can be used to calculate the standard deviation for a set of values. For example, this plots the density curve for height, along with vertical lines for the mean, one standard deviation below the mean (z = -1) and one standard deviation above the mean (z = 1).

height = responses[!is.na(responses$HTIN4), "HTIN4"]

xmean = mean(height)

print(xmean)

	[1] 66.99402

xstdev = sd(height)

print(xstdev)

	[1] 4.188529

plot(density(height, bw=1, na.rm=T), lwd=3, col="navy", main="")

abline(v=xmean, col="navy", lwd=3)

abline(v=xmean - xstdev, col="darkred", lty=2, lwd=3)

abline(v=xmean + xstdev, col="darkred", lty=2, lwd=3)
Height density plot with means and standard deviations

Coefficient of Variation

One challenge with the standard deviation is that it is an absolute value that needs to be assessed in the context of the mean. This makes it difficult to compare the standard deviations for two different distributions.

One solution is to use the coefficient of variation, which is standard deviation divided by the mean, making it represent a percentage variation.

For example, with our height data, the coefficient of variation for women is slightly higher than that for men, which indicates that women's heights vary slightly more than men's heights.

men = responses[responses$SEXVAR == 1, "HTIN4"]

women = responses[responses$SEXVAR == 2, "HTIN4"]

male.covar = sd(men, na.rm=T) / mean(men, na.rm=T)

print(male.covar)

	[1] 0.04415672

female.covar = sd(women, na.rm=T) / mean(women, na.rm=T)

print(female.covar)

	[1] 0.04505871

Skew and Kurtosis

Skew is the extent to which values are asymmetrically distributed around the mean. The skewness() function from the e1071 library can be used to calculate a value that will be negative if the left tail of the distribution is longer, and positive if the right tail of the distribution is longer. The skewness of a normal distribution is zero.

Kurtosis is a measurement of how flat or peaked the distribution is. The kurtosis() function from the e1071 library can be used to calculate a value that is positive for a sharply peaked distribution and negative for a distribution that is flatter than a normal distribution. The kurtosis of a normal distribution is zero.

For the example skewed distribution of hours of sleep per night, the positive skewness() corresponds to the long right tail and the high kurtosis() corresponds to the sharp peak around seven hours / night.

library(e1071)

sleep = responses[responses$SLEPTIM1 <= 24, "SLEPTIM1"]

print(skewness(sleep, na.rm=T))

	[1] 0.7551364

print(kurtosis(sleep, na.rm=T))

	[1] 8.574684

In contrast, the largely normal distribution for height has very small values of both skewness and kurtosis.

height = responses$HTIN4

print(skewness(height, na.rm=T))

	[1] 0.08109344

print(kurtosis(height, na.rm=T))

	[1] -0.08372729

Mapping

When data has some kind of spatial component, a map of the variables can be useful for exploring relationships and patterns.

While this BRFSS survey data has no specific respondent locations (to preserve anonymity), it does have a state (ST) variable that can be used to aggregate values.

Aggregate

Aggregate means "to collect or gather into a mass or whole" (Merriam-Webster 2022).

The aggregate() function can be used to aggregate values by geographic areas.

The FUN parameter indicates what function is used to aggregate values in each geographic area.

This example takes a mean of responses to a question about general health (1 = excellent, 5 = poor).

health = responses[responses$GENHLTH <= 5, c("ST", "GENHLTH")]

genhlth = aggregate(health$GENHLTH, by=list(health$ST), FUN=mean, na.rm=T)

names(genhlth) = c("ST", "GENHLTH")

genhlth = genhlth[order(genhlth$GENHLTH),]

head(genhlth)

	   ST  GENHLTH
	20 MA 2.236222
	8  DC 2.243354
	6  CO 2.275349
	32 NJ 2.289930
	7  CT 2.304615
	45 UT 2.312311


tail(genhlth)

	37 OK 2.617400
	2  AL 2.667730
	18 KY 2.687404
	3  AR 2.730681
	26 MS 2.757623
	50 WV 2.769966

Join

In order to create a map of state data, we need to join the table data to geospatial polygons that indicate where to draw the features.

For this state-level data, we can use a GeoJSON file created with US Census Bureau polygons and ACS data that is available for download here.

library(sf)

states = st_read("2015-2019-acs-states.geojson")

states = states[!(states$ST %in% c("AK", "HI", "PR")),]

states = st_transform(states, crs=3082)

state_genhlth = merge(states, genhlth, by="ST", all.x=T)

plot(state_genhlth["GENHLTH"], pal=colorRampPalette(c("navy", "lightgray", "red")))
Mean responses on a self-reported Likert scale of general health from 1 (excellent) to 5 (poor)

Categorical Data

For mapping categorical variables, we can count the number of specific responses by creating a subset with only those responses and then aggregating using the length() function.

women = responses[(responses$SEXVAR == 2) & (responses$HADMAM == 1), c("ST", "HADMAM")]

mammogram = aggregate(women$HADMAM, by=list(women$ST), FUN=length)

names(mammogram) = c("ST", "HADMAM")

mamstates = merge(states, mammogram, by="ST", all.x=T)

plot(mamstates["HADMAM"], pal=colorRampPalette(c("red", "lightgray", "navy")))
Count of female respondents by state that have ever had a mammogram

The challenge with mapping by absolute counts is that (depending on what you are counting) counts tend to be higher in more populous areas, which often makes a map of counts simply a map of population. With this particular survey data set, there may also be an issue that different states surveyed different percentages of their residents, which will bias the map toward states with more active survey efforts or higher response rates.

Normalization is "a scaling technique method in which data points are shifted and rescaled so that they end up in a range of 0 to 1" (Alam 2020). Aggregated variables are often normalized before mapping in order to make visual comparison of different areas possible.

In this example, we use a custom function to normalize the counts as a proportion of the total responses in each state, multiplied by 100 to get percentages that can be compared across states.

women = responses[(responses$SEXVAR == 2) & !is.na(responses$HADMAM), c("ST", "HADMAM")]

mampercent = function(x) { 100 * sum(x == 1) / length(x) }

mammogram = aggregate(women$HADMAM, by=list(women$ST), FUN=mampercent)

names(mammogram) = c("ST", "MAMPERCENT")

mamstates = merge(states, mammogram, by="ST", all.x=T)

plot(mamstates["MAMPERCENT"], pal=colorRampPalette(c("red", "lightgray", "navy")))
Percent of female respondents by state that have ever had a mammogram

Classification

The choice of what colors to use for maps and what values to associate with those colors is both an aesthetic decision and a determinant of how readers will interpret a map.

With thematic choropleth maps like those used above, classification is the technique used to associate ranges of values with specific colors from the color palette used on the map. Classification choices are critical to the "story" a map will tell, and can be used to "lie" with maps (Monmonier 2018). Far from being objective representations of reality, maps incorporate numerous cartographic choices that reflect the subjectivities and goals of the cartographer.

With the plot() function from the sf library, the breaks parameter can be used to specify the algorithm for choosing the breaks between ranges of values associated with each color. The choice of algorithm will be influenced by the type of distribution. Options include "jenks" (natural breaks), "quantile" (equal numbers of areas per class), "sd" (breaks at standard deviations from the mean), and "equal" (equal-width intervals).

par(mfrow=c(2,2))

palette = colorRampPalette(c("navy", "lightgray", "red"))

plot(state_genhlth["GENHLTH"], pal=palette, breaks="jenks", main="jenks")

plot(state_genhlth["GENHLTH"], pal=palette, breaks="quantile", main="quantile")

plot(state_genhlth["GENHLTH"], pal=palette, breaks="sd", main="sd")

plot(state_genhlth["GENHLTH"], pal=palette, breaks="equal", main="equal")
Comparison of different classification algorithms with mean responses on general health by state (1 = excellent, 5 = poor)

Modifiable Areal Unit Problem

Different choices about the types and boundaries of the areas used to aggregate spatial data can yield radically different results.

The modifiable areal unit problem

The modifiable areal unit problem (MAUP) is "a source of statistical bias that is common in spatially aggregated data, or data grouped into regions or districts where only summary statistics are produced within each district, especially where the districts chosen are not suitable for the data" (wiki.gis.com 2022).

For example, maps of electoral results give a very different impression depending on whether you map by state or by county.

2016 percent Democratic presidential vote aggregated by state
2016 percent Democratic presidential vote aggregated by county

In drawing electoral boundaries, the MAUP is intentionally used in a process called gerrymandering, where specific groups of voters are either "packed" into a small number of homogenous districts or "cracked" across multiple districts in order to reduce the power and representation available to those groups.

There is no universally applicable solution to the MAUP. In cases where accuracy is not essential (such as with data exploration) or possible (such as with data that has been aggregated for privacy reasons), the the uncertainty presented by the MAUP may be acceptable as long as caveats are provided along with the analysis. In other cases, more-sophisticated analytical approaches may be more appropriate.