Exploring Data in R: When To Use What Method

The choice of methods for analyzing data in R is determined by the characteristics of the data and the question(s) you seek to answer with that data.

This tutorial will introduce techniques for exploring and determining the characteristics of data, and methods that can answer different types of questions with those different types of data. This approach proceeds through a sequence of questions:

How Does My Data Relate to Reality?

The idealized world of the scientific method is question-driven, with the collection and analysis of data determined by questions and hypotheses:

The Scientific Method

However, in some cases, especially at the early stages of research, a data-driven approach may be more appropriate. This reverses the question-driven process: the data comes first and animates a process of exploratory data analysis, which is the approach outlined in this tutorial:

Exploratory Data Analysis

Metadata

The first thing you should do when you get new data is document where it came from, before you lose track of this information as your project proceeds and other priorities intervene. You may need this metadata in the future, especially if you plan on sharing or publishing the results of your analysis.

Metadata may include information such as where the data came from, who collected it, when and how it was collected, how each variable is defined and measured, and any licensing or usage restrictions.

Types of Variables

Structured data represents characteristics of entities as variables. This section uses the seven-level typology for levels of measurement proposed by Mosteller and Tukey (1977), which is a more robust alternative to Stanley Stevens' (1946) traditional four-level typology (nominal, ordinal, interval, ratio).

Population vs Sampled Data

Sampled data is an incomplete, but hopefully representative, subset of the broader population being analyzed. Many statistical techniques, such as statistical tests, use the laws of probability to calculate how much certainty we can have that observations made on sampled data represent that broader population.

In choosing what method(s) to use with your data, you should note whether the method is specifically designed for use with sampled data.

Noise and Uncertainty (Fixed X)

Many statistical tests and methods rely on the assumption of fixed X, where the data represents exact values and there is no random "noise" involved (Zuur et al 2007, 53).

In reality, there is always noise and uncertainty associated with data, which layers additional probabilities on top of the analysis and makes the seemingly rigorous p-values and confidence intervals that statistical techniques produce less precise. Indeed, many phenomena are ontologically fuzzy, and the data used to represent those phenomena reflect the subjectivities of the researchers. This is especially true in the social sciences, where the phenomena are extraordinarily complex and research results often cannot be reproduced.

Results of data analysis should always acknowledge and report these uncertainties as well as the implications of these uncertainties.

How Does My Data Relate to the Units of Measurement (Distribution)?

A distribution is the manner in which the values of a variable are spread across the range of possible values. The characteristics of a variable's distribution will dictate the types of analysis that can be performed using that variable.

Example Data

Many of these examples use a collection of US county data that you can download here, and import using the read.csv() function:

> counties = read.csv("2017-county-data.csv", as.is=T)
> names(counties)
 [1] "FIPS"       "NAME"       "AREASQMI"   "ST"         "CBSA"      
 [6] "CBSPOP2013" "POP2013"    "URBAN2013"  "URBAN2006"  "URBAN1990" 
[11] "REGION"     "WIN2012"    "WIN2016"    "DEM2012"    "PCDEM2012" 
[16] "GOP2012"    "PCGOP2012"  "DEM2016"    "PCDEM2016"  "GOP2016"   
[21] "PCGOP2016"  "CHANGEDEM"  "CHANGEGOP"  "PCOBESITY"  "OBESITY"   
[26] "MEDHHINC"   "MEDHOUSE"   "MEDAGE"     "POPSQM2013"

Visualizing Distributions With Histograms and Bar Plots

The quickest way to get a general sense of the characteristics of a continuous variable's distribution is to create a histogram with the hist() function.

For example, this histogram of the percent of obese people in each US county approximately follows a normal curve:

> hist(counties$PCOBESITY)
Histogram of Obesity by County in the USA (CDC, 2013)

The equivalent of a histogram for categorical data is the bar plot, which consists of horizontal or vertical bars whose height or width is determined by the count of values in each category.

To create a bar plot, you must first create a contingency table with the table() function, which counts the number of values in each category. You can then feed that table into the barplot() function to draw the bar plot.

This example is a contingency table and bar plot of the rural-urban classification of counties used by the CDC, ranging from 1 for large central metropolitan areas to 6 for non-core (rural) areas.

The barplot() function returns the x coordinates of the center of each bar. Those x coordinates can be fed into the text() function to label the bars:

> y = table(counties$URBAN2013)
> y

   1    2    3    4    5    6 
  68  368  372  358  641 1333 
> x = barplot(y)
> text(x = x, y = y / 2, labels = y, font=2)
Bar Plot of Rural-Urban Classification By County (CDC 2013)

Normality

Measurements of many phenomena follow a normal distribution, or a bell curve, with values clustered around a central mean and becoming less frequent as the values move higher or lower away from the mean.

Many of the methods used in statistics are based on the assumption that the variables approximate a normal distribution or can be transformed into something close to a normal distribution. Tests that rely on variables having a known distribution are called parametric tests.

> standard_deviations = seq(-3, 3, 0.1)
> norm = dnorm(standard_deviations)
> plot(standard_deviations, norm, type="l")
Example Normal Density Curve

Evaluating Normality With Q-Q-Plots

The normality of a distribution can be visualized using a Q-Q plot generated with the qqnorm() plotting function, which plots the sample values on the y-axis against the quantiles that would be expected from a normal distribution on the x-axis. If the distribution is normal, the dots will fall along the diagonal line drawn with the qqline() function.

qqnorm(counties$PCOBESITY)
qqline(counties$PCOBESITY)
QQ-Plot of Obesity by County in the USA (CDC, 2013)

The mean can be calculated with the mean() function, and the standard deviation (which defines the width of the curve) with the sd() function.

> mean(counties$PCOBESITY, na.rm=T)
[1] 30.94169
> sd(counties$PCOBESITY, na.rm=T)
[1] 4.459513

As we see for the obesity example, the line is close to but not perfectly diagonal, indicating there may be a slight skew (asymmetry) and/or kurtosis (heavier or lighter tails) relative to a normal distribution.

We can compare the median() (the value where half the values are below and half the values are above) to the mean() to detect skew.

> mean(counties$PCOBESITY, na.rm=T)
[1] 30.94169
> median(counties$PCOBESITY, na.rm=T)
[1] 31.2

The skewness() and kurtosis() functions from the e1071 library can also be used to quantify those deviations, with values of zero for a perfectly normal distribution and values farther from zero indicating greater deviation from normality.

The values below indicate that the distribution is slightly skewed to the left (longer left tail) and is slightly more clustered around the mean than a perfectly normal distribution.

> library(e1071)
> skewness(counties$PCOBESITY, na.rm=T)
[1] -0.3890195
> kurtosis(counties$PCOBESITY, na.rm=T)
[1] 1.077061

Normality Tests of Samples

When working with small samples (fewer than around 40 values) from a larger population, there are test functions that assess whether the sample is consistent with a normally-distributed population:

The Shapiro-Wilk test: shapiro.test()

The Kolmogorov-Smirnov test: ks.test()

The Pearson chi-square test: pearson.test() from the nortest library

These tests do not work with large numbers of values, but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used (Ghasemi and Zahediasl 2012).

> # Simulated random sample from a normal distribution
> x = rnorm(30)
> shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.98185, p-value = 0.8722

> ks.test(x, "pnorm", mean=mean(x), sd=sd(x))

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.095656, p-value = 0.9222
alternative hypothesis: two-sided

> library(nortest)
> pearson.test(x)

        Pearson chi-square normality test

data:  x
P = 9.4667, p-value = 0.09184

Transformations

Skewed distributions can often be made normal (or close to normal) using transformations.

For example, median household income is skewed to the right (median below the mean) by a handful of extremely wealthy counties:

> hist(counties$MEDHHINC)
> mean(counties$MEDHHINC, na.rm=T)
[1] 46839.65
> median(counties$MEDHHINC, na.rm=T)
[1] 45114
Histogram of Median Household Income by County in the USA (USCB 2015)

The logarithmic transformation using the log() function is commonly used with distributions like income, where there are a large number of low values (poor people) and fewer high values (rich people). The exp() function can be used to transform the logarithmically-transformed values back to the original units.

> par(mfrow=c(1,2))
> hist(log(counties$MEDHHINC))

> qqnorm(log(counties$MEDHHINC))
> qqline(log(counties$MEDHHINC))

> exp(mean(log(counties$MEDHHINC), na.rm=T))
[1] 45405.38
> exp(median(log(counties$MEDHHINC), na.rm=T))
[1] 45114
Transformed Median Household Income by County in the USA (USCB 2015)

Evaluating Non-Normal Distributions With Quantiles

Some variables are clearly not normally distributed. For example, core-based statistical areas (CBSA) are used by the US Census Bureau to delineate metropolitan areas. The populations of these areas tend to be fairly evenly distributed below around 200,000, although there are a handful of extremely large CBSAs around the country's largest cities. A logarithmic transformation makes the histogram clearer, but does not make the distribution normal.

> par(mfrow=c(1,2))
> hist(log(counties$CBSPOP2013))
> qqnorm(log(counties$CBSPOP2013))
> qqline(log(counties$CBSPOP2013))
> exp(mean(log(counties$CBSPOP2013), na.rm=T))
[1] 260334.6
> exp(median(log(counties$CBSPOP2013), na.rm=T))
[1] 187530
Transformed Populations of Core-Based Statistical Areas in the USA (USCB 2015)

For distributions like this, division of values into quantiles using the quantile() function gives a clearer description. The five quantile values divide the distribution into four parts, with each part having an equal number of elements. The quantile values are the dividing values between parts. The 0% quantile is the minimum value, the 50% quantile is the median, and the 100% quantile is the highest value:

> quantile(counties$CBSPOP2013, na.rm=T)
      0%      25%      50%      75%     100% 
   13200    61029   187530   901700 19831858 

How Does My Data Relate to Other Variables?

Variables are usually captured as related collections, and most statistical work involves either modeling or testing relationships between variables.

Univariate Data

Univariate data is a single variable that stands alone. R stores univariate data as vectors.

Univariate data is commonly analyzed with descriptive statistics like mean/median/mode, standard deviation, normality, etc.

Examples of univariate data include ages, weights, income, etc.
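As a minimal sketch using the example county data loaded earlier, the common descriptive statistics for a single variable (here, median household income) can be calculated directly on the vector:

# univariate descriptive statistics on a single vector
income = counties$MEDHHINC

summary(income)          # minimum, quartiles, median, and mean in one call
sd(income, na.rm=TRUE)   # standard deviation
hist(income)             # visualize the distribution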

Bivariate / Multivariate Data

Bivariate data is a pair of variables that are related to each other. Bivariate comparisons are often useful on their own or as part of the exploratory analysis of multivariate data.

Multivariate data is more than two variables that are related to each other, often with one dependent variable and one or more independent variables that influence the dependent variable.

A common form of multivariate data is cross-sectional data, where a variety of variables are collected for multiple individuals at a single point in time.

Differences between sampled variables are assessed with tests like the t-test, chi-square test, or ANOVA.
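For example, as a minimal sketch using the county data loaded earlier, a two-sample t-test can compare mean obesity rates between large central metropolitan counties (URBAN2013 value of 1) and non-core rural counties (URBAN2013 value of 6):

# two-sample t-test: do large central metro and rural non-core counties
# have different mean obesity rates?
metro = counties$PCOBESITY[counties$URBAN2013 == 1]
rural = counties$PCOBESITY[counties$URBAN2013 == 6]
t.test(metro, rural)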

Relationships between variables are assessed with correlation or regression, as in this example relating median monthly housing costs to median household income across counties:

> plot(log(counties$MEDHOUSE), log(counties$MEDHHINC), col="#00000040")
> linear_model = lm(log(counties$MEDHHINC) ~ log(counties$MEDHOUSE))
> abline(linear_model, col="red2", lwd=3)
> summary(linear_model)

Call:
lm(formula = log(counties$MEDHHINC) ~ log(counties$MEDHOUSE))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65517 -0.09869 -0.00429  0.09568  0.69201 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.732363   0.059120  113.88   <2e-16 ***
log(counties$MEDHOUSE) 0.609686   0.009021   67.58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1572 on 3137 degrees of freedom
  (94 observations deleted due to missingness)
Multiple R-squared:  0.5928,    Adjusted R-squared:  0.5927 
F-statistic:  4567 on 1 and 3137 DF,  p-value: < 2.2e-16
X-Y Scatter Chart Showing Correlation Between Median Household Income and Housing Costs in US Counties (USCB 2015)

Time-Series Data

Time-series data consists of one or more variables that are observed over time.

# This is demonstration data included with R that consists of
# Annual measurements of the level, in feet, of Lake Huron from 1875-1972
plot(LakeHuron)
Example line graph of time-series data: The level of Lake Huron in feet between 1875 and 1972 (Brockwell and Davis 1991)

Spatial Data

Spatial data consists of one or more variables (also called fields or attributes) measured at points, along lines, within areas (polygons), or as grids (rasters) on the earth's surface. Spatial data is commonly visualized as maps.

library(rgdal)
counties = readOGR(".", "2017-county-data")
counties = counties[!(counties$ST %in% c("AK", "HI", NA)),]
usa_albers = CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 
	+ellps=GRS80 +datum=NAD83 +units=m +no_defs")
counties = spTransform(counties, usa_albers)
spplot(counties, zcol="WIN2012", col.regions=c("navy", "red2"))
Map Demonstrating Choropleth Visualization of Spatial Data: Results of the 2012 Election by County (AP via Politico)

Panel Data

Panel data is multidimensional data, usually data that has multiple variables on multiple individuals (cross-sectional data) that are observed at multiple periods of time and/or locations on the surface of the earth.

Analysis of panel data is often performed by generating mixed models. Because of the multidimensionality of the data, these models are often difficult to describe and visualize.
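As a minimal sketch (assuming the lme4 package is installed, and using simulated data rather than any data set from this tutorial), a mixed model with a fixed effect for time and a random intercept for each individual might look like this:

# mixed-model sketch on simulated panel data (hypothetical example)
library(lme4)

panel = data.frame(
	id = rep(1:20, each=5),        # 20 individuals...
	time = rep(1:5, times=20))     # ...each observed at 5 time periods
panel$y = 2 * panel$time + rep(rnorm(20, sd=10), each=5) + rnorm(100, sd=5)

# fixed effect for time, random intercept for each individual
model = lmer(y ~ time + (1 | id), data=panel)
summary(model)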

Heteroskedasticity

Homoskedasticity means that the variance (standard deviation) of a variable is the same throughout its range. The opposite of homoskedasticity is heteroskedasticity, and a heteroskedastic variable can cause inaccuracies when used with parametric tests and poor model fit when used with parametric models.

Heteroskedasticity is of primary interest when comparing groups of sampled data.

For example, the following is a simulated sample of what different people in a downtown area paid for lunch, along with their annual income. People with lower incomes tend to be more budget-conscious, while wealthier people have a broader range of options and can choose both expensive and inexpensive options. When the variables are plotted in comparison to each other on an X-Y scatter plot, heteroskedasticity causes the plot to spread vertically as the values move to the right.

When comparing two different continuous variables like this, the Breusch-Pagan heteroskedasticity test can be run using the bptest() function from the lmtest library. The low p-value in this example (below 0.05 for 95% confidence) means we can reject the null hypothesis of homoskedasticity.

> lunch = c(5, 0, 6, 8, 7, 10, 8, 5, 8, 8,
+           8, 10, 11, 15, 8, 8, 0, 5, 16, 10,
+           8, 9, 25, 8, 0, 40, 25, 8, 10, 10)
> income = c(35000, 34000, 40000, 42000, 50000, 52000, 53000, 53000, 54000, 56000,
+         56000, 60000, 61000, 61000, 65000, 70000, 71000, 71000, 75000, 76000,
+         76000, 80000, 82000, 90000, 100000, 110000, 110000, 120000, 120000, 120000)
> plot(log(income), lunch)
> library(lmtest)
> bptest(lunch ~ log(income))

        studentized Breusch-Pagan test

data:  lunch ~ log(income)
BP = 5.6513, df = 1, p-value = 0.01744
X-Y Scatter Plot Demonstrating Heteroskedasticity

In contrast, if the distribution is homoskedastic, the points in the X-Y scatter will be closer to a line. For this example, we assume that everyone spends 1/100th of one percent of their annual income on lunch each day. The Breusch-Pagan p-value is high, indicating that we cannot reject the null hypothesis of homoskedasticity:

> percentage = income * 0.0001
> plot(log(income), percentage)
> bptest(percentage ~ log(income))

        studentized Breusch-Pagan test

data:  percentage ~ log(income)
BP = 0.29333, df = 1, p-value = 0.5881
X-Y Scatter Plot Demonstrating Homoskedasticity

If the samples are divided into discrete groups (categorical data), the Bartlett test of homogeneity of variances can be run using the bartlett.test() function. The boxplot shows a clear difference in the range of values in each group, which is validated by the low p-value, indicating that we should reject the null hypothesis of homoskedasticity.

> wealthy = ifelse(income > 80000, "High", "Low")
> boxplot(lunch ~ wealthy)
> bartlett.test(lunch, wealthy)

        Bartlett test of homogeneity of variances

data:  lunch and wealthy
Bartlett's K-squared = 18.84, df = 1, p-value = 1.422e-05
Boxplot Demonstrating Heteroskedasticity

How Does My Data Relate to Itself (Independence)?

Many statistical tests and models are based on the assumption that values are independent of each other. Unlike normality and homoskedasticity, determination of independence requires consideration of what the data is measuring.

The opposite of independence is autocorrelation, where values correlate with, or mirror, values separated from them by a given interval (lag).

Temporal Autocorrelation

Time-series data is generally not independent. For example, temperatures over time are usually related to each other - days with similar temperatures tend to cluster together across the regular cycles of seasons.
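As a quick sketch using the LakeHuron demonstration data from the time-series example above, the acf() function plots the correlation of a series with lagged copies of itself; spikes extending beyond the dashed confidence bands indicate significant autocorrelation at those lags:

# autocorrelation function plot of the Lake Huron levels used earlier
acf(LakeHuron)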

Analysis of time-series data often involves decomposition of the data into a long-term trend, a repeating seasonal component, and a residual (irregular) component.

With temporal data, we are often looking for long-term trends or irregularities (deviations from the long-term pattern), so this type of decomposition, with removal of the seasonal component, is useful.

For example, the following code performs a time-series decomposition on historic mean monthly temperatures at Dallas Love Field from the National Weather Service, which can be downloaded here. The CSV file is arranged with columns of months and rows of years (with an annual average in the last column), which requires some rearranging to turn into a univariate time series.

x = read.csv("dallas-avg-temp.csv")
x = x[,c(-1, -14)]
x = ts(as.vector(t(x)), start=c(1976, 1), frequency=12)
plot(stl(x, s.window="periodic"))
Time Series Decomposition of Average Monthly Temperatures at Dallas Love Field, 1977 - 2017 (NWS)

Spatial Autocorrelation

Likewise, spatial data is also not independent. Following Tobler's First Law of Geography (which is more of a heuristic than a law), near things tend to be more related than distant things, and like things and people tend to cluster together. This results in spatial autocorrelation.

While we are often looking for areas of autocorrelation in spatial data (such as crime clusters or hotspots), we are also often looking for relationships of phenomena in space associated with underlying variables (such as the relationship of housing prices to neighborhood income) that necessitate consideration of autocorrelation with techniques like spatial regression.

For example, when creating a simple (non-spatial) bivariate regression model relating median monthly housing costs to median household income, mapping the residuals shows that the model consistently overestimates housing costs in the low-cost Great Plains states, while underestimating housing costs in high-cost coastal megaregions.

> library(rgdal)
> counties = readOGR(".", "2017-county-data")
> counties = counties[!(counties$ST %in% c("AK", "HI", NA)),]
> usa_albers = CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0
+         +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
> counties = spTransform(counties, usa_albers)
> counties = counties[!is.na(counties$MEDHHINC),]
> model = lm(MEDHOUSE ~ MEDHHINC, data=counties)
> summary(model)
Call:
lm(formula = MEDHOUSE ~ MEDHHINC, data = counties)

Residuals:
    Min      1Q  Median      3Q     Max 
-716.60  -85.07   -1.51   85.26  615.17 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.300e+01  1.080e+01  -7.686 2.03e-14 ***
MEDHHINC     1.745e-02  2.241e-04  77.878  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 151.2 on 3104 degrees of freedom
Multiple R-squared:  0.6615,    Adjusted R-squared:  0.6614 
F-statistic:  6065 on 1 and 3104 DF,  p-value: < 2.2e-16

> counties@data$RESIDUALS = residuals(model)
> spplot(counties, zcol="RESIDUALS")
Residuals From Non-Spatial Regression Comparing Monthly Housing Costs to Median Household Income Across Counties (USCB 2015)
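The spatial pattern in these residuals can be quantified with Moran's I. The following is a rough sketch (assuming the spdep package is installed and using the counties object created above) that builds a contiguity-based neighborhood structure from the county polygons and tests the residuals for spatial autocorrelation:

# Moran's I test for spatial autocorrelation in the regression residuals
library(spdep)

neighbors = poly2nb(counties)                      # contiguity neighbors
weights = nb2listw(neighbors, zero.policy=TRUE)    # spatial weights
moran.test(counties$RESIDUALS, weights, zero.policy=TRUE)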

What Methods Should I Use?

Based on the exploration of your data outlined above, you should be able to make a rational decision about the appropriate method(s) to use for evaluating your data.

Descriptive Statistical Methods: Central Tendency and Variation

Minimum / Maximum / Range

Median

Mean

Standard Deviation

Mode

Quantiles
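As a brief sketch using the median age variable from the example county data, each of these descriptive statistics has a corresponding base R function, except for the mode, which is commonly found with a table() idiom:

# descriptive statistics for a continuous variable (median age by county)
min(counties$MEDAGE, na.rm=TRUE)       # minimum
max(counties$MEDAGE, na.rm=TRUE)       # maximum
range(counties$MEDAGE, na.rm=TRUE)     # minimum and maximum together
median(counties$MEDAGE, na.rm=TRUE)    # median
mean(counties$MEDAGE, na.rm=TRUE)      # mean
sd(counties$MEDAGE, na.rm=TRUE)        # standard deviation
quantile(counties$MEDAGE, na.rm=TRUE)  # quartiles

# mode of a categorical variable: the most frequent category
names(which.max(table(counties$URBAN2013)))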

Statistical Tests

Statistical tests are used with sampled data to determine whether an observed data set is so different from what you would expect under a null hypothesis that the null hypothesis should be rejected (McDonald 2014). Statistical tests take the sample size into account when determining how likely the observed data would be if the null hypothesis were true.
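For example, as a minimal sketch using the rural-urban classification tabulated earlier, a chi-square goodness-of-fit test compares the observed counts in each category to the counts expected under a null hypothesis (by default, equal proportions); a very small p-value indicates that the null hypothesis should be rejected:

# chi-square goodness-of-fit test: are counties evenly distributed
# across the six rural-urban categories?
chisq.test(table(counties$URBAN2013))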

Correlation

One-Sample t-Test

Two-Sample t-Test

Mann-Whitney U-Test

Chi Square Test Goodness of Fit

Chi Square Test of Independence

Analysis of Variance (ANOVA)

Kruskal-Wallis One-Way Analysis of Variance

Models

Models are formulas that represent relationships between variables. They are simplified mathematical representations of reality. As George Box famously noted, all models are wrong, but some are useful.

Models are used to describe and explain relationships between variables, to predict unknown or future values, and to simulate how a system might behave under different conditions.

Linear Regression

Logistic Regression
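As a minimal sketch using the county data loaded earlier (with a hypothetical binary outcome constructed from the vote percentage columns), a logistic regression model could estimate the probability that a county voted Republican in 2016 as a function of median household income:

# logistic regression sketch: probability a county voted GOP in 2016
# as a function of (log) median household income
counties$GOPWIN2016 = as.integer(counties$PCGOP2016 > counties$PCDEM2016)

model = glm(GOPWIN2016 ~ log(MEDHHINC), data=counties, family=binomial)
summary(model)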