Correlation, Simple Linear Regression, and X-Y Scatter Charts in R

Correlation is the strength of the relationship between two variables. The measurement of correlation is one of the most common and useful tools in statistics.

Reading CSV Files Into Data Frames

The examples below use data from a variety of sources on US counties in a CSV (comma-separated variable) file you can download here.

CSV files can be loaded with the read.csv() function in R. The colnames() function will list the names of the variables from the column names. The rownames() function lists the names of the rows from column one, in this case the names of countries.

> counties = read.csv("2017-county-data.csv", stringsAsFactors = F)

> colnames(counties)
 [1] "X"          "FIPS"       "ST"         "COUNTYNAME" "CBSA"      
 [6] "CBSAPOP"    "COUNTYPOP"  "URBAN2013"  "URBAN2006"  "URBAN1990" 
[11] "REGION"     "VOTES2012"  "OBAMA"      "ROMNEY"     "WINNER"    
[21] "MEDHOUSE"  

> rownames(counties)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
 [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 

Types of Correlation

A positive correlation means that as one variable goes up, the other goes up as well. When two variables with a positive correlation are plotted on the two axes of an X-Y scatter chart, the points form a rough line or curve upward from left to right.

The plot() function can be used to quickly create X-Y scatter plots. The first parameter is the X-axis and the second parameter is the Y-axis.

For example: Counties with higher annual median household income tend to have higher median monthly housing costs:

> plot(counties$MEDHHINC, counties$MEDHOUSE, col="#00000040")
Median Household Income vs. Median Monthly Housing Cost by County (US Census Bureau, American Community Survey)

A negative correlation means that as one variable goes up, the other goes down. When two variables with a negative correlation are plotted on the two axes of an X-Y scatter chart, the points form a rough line or curve downward from left to right.

For example, wealthier counties tend to have fewer residents with obesity problems:

> plot(counties$MEDHHINC, counties$PCOBESITY, col="#00000040")
Median Household Income vs. Percent Obesity by County (US Census Bureau, Centers for Disease Control)

When two variables with no correlation are plotted on the two axes of an X-Y scatter chart, the points form a diffuse cloud.

For example, there is no correlation between median household income and county size. There are small wealthy counties and there are large poor counties.

> plot(counties$MEDHHINC, counties$COUNTYPOP, col="#00000040", log="xy")
Median Household Income vs. County Population (US Census Bureau)

Correlation Does Not Imply Causation

One major warning about the interpretation of correlation is that correlation does not imply causation. This is also commonly referred to as the post hoc fallacy, from the Latin phrase post hoc ergo propter hoc (after this, therefore because of this).

Causation can be indirect - poor infant mortality is caused by poor nutrition and medical care, which in turn are a reflection of the economic capabilities of parents.

Correlations can also be spurious, such as the strong correlation between iPhone sales in the USA and the number of deaths from falling down stairs. See Tyler Vigen's collections of spurious correlations if you find such things amusing.

Simple Linear Regression

Correlation can be assessed by performing regression.

The term regression was first used by 19th century statistician Francis Galton to describe how the decendants of exceptional individuals tend to regress toward average with each successive generation. The term was later generalized to describe a statistical process for estimating the relationships among variables.

Perhaps the simplest form of model is simple linear regression. This expresses a correlation relationship between two variables x and y as a line formula:

yi = α + Βxi + εi

Software that implements regression usually uses an ordinary least squares (OLS) algorithm that systematically tries different values for the coefficients to get the line to fit the data points as closely as possible and keep the error term as small as possible.

Excel implements this with trendlines on X-Y scatter charts.

R implements this with the lm() function. The first parameter is a formula with the dependent variable (y) and independent variable (x) separated by a tilde (~). You can think of this as y depends on x.

When the variables are plotted on an x-y scatter plot, you can draw a regression line from the model using the abline() function:

> plot(counties$MEDHHINC, counties$MEDHOUSE, col="#00000040")
> model = lm(MEDHOUSE ~ MEDHHINC, data=counties)
> abline(model, col="darkred", lwd=3)
Median Household Income vs. Median Monthly Housing Cost by County (US Census Bureau, American Community Survey)
The summary() function prints summary information about the model.

> summary(model)

lm(formula = MEDHOUSE ~ MEDHHINC, data = counties)

    Min      1Q  Median      3Q     Max 
-712.46  -86.20   -1.02   85.85  617.81 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.873e+01  1.076e+01   -7.32 3.13e-13 ***
MEDHHINC     1.734e-02  2.221e-04   78.08  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 152.4 on 3137 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.6602,    Adjusted R-squared:  0.6601 
F-statistic:  6096 on 1 and 3137 DF,  p-value: < 2.2e-16

The regression coefficients are listed with their estimated values. For the model listing above, the model function is:

MEDHOUSE = -78.73 + (0.01734 * MEDHHINC)

The t value beside each coefficient is the coefficient divided by it's standard error, and that t value is used with a t-distribution to calculate Pr(>|t|), which is the probability that the coefficient is actually zero and the independent variable has no effect on the dependent variable. Low Pr(>|t|) means that the variable is important.

The R-squared is the coefficient of determination that indicates how well the model fits the data. The range is zero to one, and higher values are better.

Other values in the summary are explained here...

Logarithmic Relationships

Many phenomena in both the social and non-human realms are exponential. For example, populations (at least initially) tend to increase exponentially:

Such phenomena are commonly modeled as logarithms, which would be the exponents in the series above.

For example, because there are only a few wealthy counties with high housing costs in the US, the points tend to be bunched on the lower left side of the graph:

Linear X-Y Scatter

By making the axes space numbers logarithmically with the log="xy" parameter, the values are spread out and made more visible:

> plot(counties$MEDHHINC, counties$MEDHOUSE, col="#00000040", log="xy")
Logarithmic X-Y Scatter

When a figure is plotted with logarithmic axes, the abline() function will not draw an appropriate line. In such cases, you will need to create a model, use the predict() function to create modeled Y values for a set of X values you provide, then use the lines() function to draw an appropriate line.

plot(counties$MEDHHINC, counties$MEDHOUSE, col="#00000040", log="xy")
model = lm(MEDHOUSE ~ MEDHHINC, data=counties)
x = seq(0, 200000, 10000)
y = predict(model, newdata=data.frame(MEDHHINC = x))
lines(x, y, lwd=3, col="darkred")
Predicted Regression Line

When the pattern of dots has a general trend upward or downward, but there are a large number of points that do not conform to this relationship, the correlation is weak and is said to exhibit heteroskedacity.

Models With Multiple Independent Variables (Multiple Regression)

Most phenomena have multiple causes, and the spread of residuals (deviation from a straight line) on the plot hints that more than income determines whether someone will be obese or not. While true spatial epidemiology would need to involve a more fine-grained investigation of individuals, there are some neighborhood characteristics that are related to the prevelance of obesity.

The use of multiple variables is one way to control for different characteristics and help isolate characteristics that are more important than others.

Included in the county data are a couple of categorical variables: REGION is the US census region and URBAN2013 is National Center for Health Statistics 2013 rural-urban classification of counties.

A boxplot() displays a box-and-whisker plot, where:

Boxplots of obesity by region and by urban area classification (1 is most urban, 6 is most rural) show the differences in prevalance of obesity in different areas. People who live in dense urban areas tend to be less obese than folks in suburban and rural areas.

> boxplot(counties$PCOBESITY ~ counties$URBAN2013)
Percent Obesity in County by Urban Classification (CDC, NCHS)

And people who live in the West to be less obese than folks from other parts of the country.

> boxplot(counties$PCOBESITY ~ counties$REGION)
Percent Obesity in County by Region (CDC, NCHS)

Because these are categorical variables, they are difficult (but not impossible) to use directly in regression. Since we are only concerned with whether a county is in the West or if it is urban (category 1), we can create dummy variables. The name is a bit deceptive. Dummy variable means that the value is either zero or one.

> counties$WEST = ifelse(counties$REGION == "West", 1, 0)
> counties$URBAN = ifelse(counties$URBAN2013 == 1, 1, 0)

We can now create a model with the three independent variables. This has an R2 of 0.4213, which is far from a perfect fit, but which is much better than the prior one-variable model value of 0.2353.

> model = lm(PCOBESITY ~ MEDHHINC + WEST + URBAN, data=counties)
> summary(model)

lm(formula = PCOBESITY ~ MEDHHINC + WEST + URBAN, data = counties)

     Min       1Q   Median       3Q      Max 
-11.5495  -2.1546  -0.0198   2.1271  12.7662 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.938e+01  2.426e-01 162.346  < 2e-16 ***
MEDHHINC    -1.635e-04  5.068e-06 -32.256  < 2e-16 ***
WEST        -5.462e+00  1.792e-01 -30.472  < 2e-16 ***
URBAN       -3.281e+00  4.182e-01  -7.844 5.93e-15 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.384 on 3102 degrees of freedom
  (41 observations deleted due to missingness)
Multiple R-squared:  0.4251,    Adjusted R-squared:  0.4245 
F-statistic: 764.6 on 3 and 3102 DF,  p-value: < 2.2e-16

Visualizing Multiple Regression

One challenge with visualizing multiple regression is that since there are multiple variables, you cannot simply make an X-Y scatter chart (which uses only two variables). One solution is to create a line graph of the predicted output of the model plotted over the actual dependent variable values.

The predict() function takes the model created with lm() and the independent variables, and outputs a vector of the values from the model function. In this example, we add those values as a column to the data frame, sort the data frame using the order() function, plot() the actual values and then draw a line with the predicted values on top:

model = lm(PCOBESITY ~ MEDHHINC + WEST + URBAN, data=counties)
counties$PREDICTED = predict(model, newdata = counties)
counties = counties[order(counties$PREDICTED), ]
plot(counties$PCOBESITY, col="#00000040")
lines(counties$PREDICTED, type="l", col="red", lwd=3)
Multiple Regression Graph

Improving X-Y Scatter Plot Appearance

The default plot() layout is functional, but of dubious aesthetic value. While the ggplot2 does somewhat better, it is possible to configure the native plot() command with a few extra parameters. This particular layout is clean minimalist.

plot(counties$MEDHHINC, counties$PCOBESITY, col="#00004040", log="x",
	las=1, yaxs="i", fg="white", font.lab=2, 
	xlim=c(20000, 150000), ylim=c(0,50),
	panel.first = grid(nx=NA, ny=NULL, lty=1, col="gray"),
	xlab="Median Household Income ($)", 
	ylab="County Obesity (%)")

abline(a=0, b=0, lwd=3)
Improved Chart Layout

Nonlinear Regression

As the name implies, linear regression assumes that the relationship between the variables is linear. Ideally, the X-Y scatter would form a straight line and the model is simply the formula for that straight line.

However, there are often cases where the relationship is not a line: it is nonlinear.

For example: when a species begins to populate an area, they begin with a handful of individuals that then expand exponentially: two individuals have 8 offspring, who have 64 offspring, who have 512 offspring, and so on.

To model this relationship, you need the exp() function. For example: below is the US population from the decennial census from 1790 to 2010:

pop = c(3,5,7,10,12,17,23,31,39,50,62,76,92,106,123,132,152,181,205,227,250,282,309)
year = seq(1790, 2010, 10)
data = data.frame(year, pop)

The lm() function can handle exponential relationships with the exp() function in the formula, although in this case we need to scale the years by 0.01 to keep the range of values within what lm() can handle. The predict() function is then used to get predicted values from the model for plotting:

plot(data$year, data$pop, 
	pch=19, col="navy", bty="n", fg=NA, las=1, font.lab=2,
	xlim=c(1790,2100), ylim=c(0,500),
	xlab="Year", ylab="US Population (MM)")
abline(a=0, b=0, lwd=3)

model = lm(pop ~ exp(year * 0.01), data=data)
pyears = 1790:2100
lines(x = pyears, y = predict(model, data.frame(year = pyears)), lwd=3, col="#80000080")
Modeled Exponential Population Growth

Although population growth start exponential, eventually that growth levels off as the carrying capacity of an area is reached and resources can no longer support additional population. Long-term, population often follows an s-shaped sigmoid curve. One common type of sigmoid curve is a logistic curve.

For situations like this, you need nls() (nonlinear least squares). This function requires that you specify the model coefficients in your formula AND give estimated starting values that the function will start from in fitting the curve.

nls() supports the SSlogis() (self-starting logistic) function in formulas. The coefficients are the time period (years), the peak population, the point in time when growth is halfway to the maximum, and a scaling parameter that represents the growth rate relative to time. Given some reasonable starting values in the start parameter, nls() function will tweak those coefficients to fit a logistic curve to the data.

The example below fits a logistic curve to historic US population and predicts population out to 2100:

plot(data$year, data$pop, 
   pch=19, col="navy", bty="n", fg=NA, las=1, font.lab=2,
   xlim=c(1790,2200), ylim=c(0,500),
   xlab="Year", ylab="US Population (MM)")
abline(a=0, b=0, lwd=3)

model = nls(pop ~ SSlogis(year, peak, half, scale), data=data, start=list(peak = 600, half=1950, scale=70))
pyears = 1790:2200
lines(x = pyears, y = predict(model, data.frame(year = pyears)), lwd=3, col="#00800080")
Modeled Logistic Population Growth

The summary shows that this crude model predicts a peaking of US population at 475 million people (peak coefficient) and that the halfway point in US population growth was 1982 (half coefficient). Of course, this model does not consider the actual drivers and limits to population growth, so it can only be guide for speculation and, perhaps, more-careful modeling.

> print(summary(model))

Formula: pop ~ SSlogis(year, peak, half, scale)

      Estimate Std. Error t value Pr(>|t|)    
peak   475.804     32.381   14.69 3.52e-12 ***
half  1982.963      6.708  295.60  < 2e-16 ***
scale   47.332      1.978   23.93 3.42e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.005 on 20 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 1.971e-06