Working with Sampled Data in R

We rarely have complete data about something we want to know. For example, we often want to know who will win a political election, but we do not have the time to ask every potential voter who they plan to vote for. And people's attitudes change during a campaign.

However, it is possible to get the opinions of a small number of people and use them to estimate the possible outcomes of the election.

That small number of people is called a sample, and the estimate we make about the broader population is called an inference.

Descriptive statistics is the use of statistical techniques to describe data.

Inferential statistics is the use of statistical techniques to make inferences about a complete population based on a sample from that population.

This tutorial will cover some basic techniques used in inferential statistics.

Inferential vs. descriptive statistics

Types of Sampling

The method used to determine what subset of people or things you base your inference on is called sampling.
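As a minimal sketch of the simplest case, a simple random sample, the base R sample() function can draw members from a list so that each one has the same known chance of being selected. The population_ids vector is a hypothetical stand-in for a real list of population members, and the sizes are assumed for illustration.

```r
# Hypothetical sampling frame: ID numbers for a population of 1,000 students
population_ids = 1:1000

# Fix the random seed so the draw is reproducible
set.seed(2022)

# Simple random sample without replacement: every ID has the same
# known probability (50 / 1000) of being selected
sample_ids = sample(population_ids, size = 50, replace = FALSE)

head(sample_ids)
```

In practice the IDs would then be used to look up the people or things to be measured.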

There are two broad types of sampling techniques, and a number of subtypes within those broad types.

Probability sampling is when any member of the target population has a known probability of being included in the sampled population. Such techniques include:

Nonprobability sampling is used in cases where probability sampling is impractical. While you cannot use statistical techniques to make clear inferences about the broader population from these types of samples, descriptive results from these samples can offer suggestions of research paths that would justify the expense and difficulty of further probability sampling. Some common subtypes of nonprobability sampling include:

Descriptive Statistics

Example Data

These examples use data from this CSV file.

survey = read.csv("2022-sample-data.csv")

dim(survey)
[1] 53  3

This example survey data contains an identifying SurveyID, a quantitative GPA, and a categorical Laptop Type.

names(survey)
[1] "SurveyID"    "GPA"         "Laptop.Type"

Count and Amount Variables

The range() of values as well as the mean() and median() are standard descriptive statistics to run on any quantitative variable.

range(survey$GPA, na.rm=T)

mean(survey$GPA, na.rm=T)

median(survey$GPA, na.rm=T)
[1] 2 4
[1] 3.009143
[1] 3.01

The distribution of values across the range can be assessed with a histogram and/or density plot.

hist(survey$GPA, col="navy")
Showing a distribution in a histogram
plot(density(survey$GPA, na.rm=T), lwd=3, col="navy")
Showing a distribution in a density plot

Categorical Variables

The table() function shows the counts of different values in a categorical variable.

table(survey$Laptop.Type)
 Mac None   PC 
  25    5   23 

A table() can also be visualized as a barplot() with values converted to percents.

percents = table(survey$Laptop.Type) * 100 / nrow(survey)

percents = percents[order(percents, decreasing=T)]

x = barplot(percents, ylim=c(0, 60), col="navy")

text(x, percents + 5, paste0(round(percents), "%"))
Barplot of a categorical variable

Confidence Interval

If we sample a characteristic of a population that can be represented as a continuous variable, we can take the mean of that sample and get the sample mean. But how confident can we be that the sample mean is anywhere close to the population mean?

A curious and wondrous fact about random sampling is that if you repeatedly draw random samples of sufficient size (usually 30 or more observations) from almost any kind of distribution (normal or not), the distribution of the sample means, and therefore of the deviations (errors) between the sample means and the population mean, will be approximately normal.

This is the central limit theorem, which was originally proposed by French mathematician Abraham de Moivre in 1733 and later developed by French mathematician Pierre-Simon Laplace in 1812 and Russian mathematician Aleksandr Lyapunov in 1901.

The implication of the central limit theorem is that you can use a standard deviation and the rules of probability to calculate confidence intervals and assess how reliable your sampling results are. The central limit theorem is one of the most important theorems in statistics.
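The central limit theorem can be checked informally with a small simulation. This is only a sketch under assumed settings: the exponential population, the sample size of 30, and the 2,000 repetitions are choices made for illustration, not part of the survey data.

```r
set.seed(1)

# Skewed population: exponential with mean 1 and standard deviation 1
sample_size = 30

# Draw 2,000 samples of 30 values each and keep the mean of every sample
sample_means = replicate(2000, mean(rexp(sample_size, rate = 1)))

# The sample means center on the population mean of 1 ...
mean(sample_means)

# ... with a spread close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18
sd(sample_means)

# Share of sample means within 1.96 standard errors of the population mean;
# a normal distribution would put about 95% of them in this range
mean(abs(sample_means - 1) <= 1.96 / sqrt(sample_size))
```

Calling hist(sample_means) on the result should show a roughly bell-shaped histogram even though the underlying exponential distribution is strongly skewed.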

For example, within a normal distribution:

68% confidence interval
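The share of a normal distribution that lies within a given number of standard deviations of the mean can be computed with pnorm(), the normal cumulative distribution function. The short sketch below does this for one, two, and three standard deviations.

```r
# Share of a normal distribution within k standard deviations of the mean
k = 1:3
coverage = pnorm(k) - pnorm(-k)

round(coverage, 4)
# [1] 0.6827 0.9545 0.9973
```

So about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.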

Standard Error

Since our primary interest is in the potential amount of error between the sample mean and the population mean, we can use the standard deviation of the sample and the size of the sample to calculate the standard error. The standard error is used to estimate how far your sampled data may differ from the population data.

Standard error for a sample mean (x̄) is calculated as the standard deviation of the population (σ) divided by the square root of the sample size (n). The greater the sample size, the closer the sample mean can be assumed to get to the actual population mean.

σx̄ = σ / √n

Since we rarely know the standard deviation of the population (which is why we're sampling in the first place) we estimate the standard error using the standard deviation of the sample (s):

σx̄ ≈ s / √n
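To see how the standard error shrinks as the sample grows, the sketch below applies the formula to a fixed sample standard deviation. The value of 0.5 and the three sample sizes are hypothetical, chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical sample standard deviation (assumed for illustration)
s = 0.5

# Standard error for increasing sample sizes
n = c(25, 100, 400)
std_error = s / sqrt(n)

std_error
# [1] 0.100 0.050 0.025
```

Because of the square root, quadrupling the sample size only halves the standard error.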

Confidence Intervals For Means

The confidence level you want determines how many standard errors above or below the mean you are willing to accept. The most common confidence levels are 95% and 99% (Investopedia 2022).

In a normal distribution, we know that 95% of the values (2.5 percentile to 97.5 percentile) are within 1.96 standard deviations (a z-score of ±1.96) above or below the mean.

qnorm(c(0.025, 0.975))
[1] -1.959964  1.959964

Accordingly, if we have a random sample, we can be 95% confident that the actual population mean is within 1.96 standard errors (σx̄) above or below the sample mean (x̄). This range is the 95% confidence interval:

x̄ ± 1.96 * σx̄

The confidence interval is sometimes described in terms of its margin of error, which is the distance from the sample mean to either end of the interval (here, 1.96 * σx̄).
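One way to read "95% confident" is that if the sampling were repeated many times, roughly 95% of the resulting intervals would contain the true population mean. The simulation below is a sketch of that idea. The population mean of 3.0, standard deviation of 0.5, and sample size of 50 are assumptions made for illustration and are not taken from the survey file.

```r
set.seed(42)

# Hypothetical population (assumed for illustration): GPA-like values
true_mean = 3.0
true_sd = 0.5
n = 50

# Draw one sample and report whether its 95% interval covers the true mean
covers_true_mean = function() {
    x = rnorm(n, mean = true_mean, sd = true_sd)
    moe = 1.96 * sd(x) / sqrt(n)
    abs(mean(x) - true_mean) <= moe
}

# Repeat the sampling 1,000 times and find the share of intervals that cover
coverage = mean(replicate(1000, covers_true_mean()))
coverage
```

The share should land close to 0.95, although it varies slightly from run to run with different seeds.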

95% Confidence Interval

Using the GPA data in the sample data above, the confidence interval based on standard error can be calculated in R:

gpa = na.omit(survey$GPA)

n = length(gpa)

stderr = sd(gpa) / sqrt(n)

moe = round(1.96 * stderr, 2)

print(paste("The margin of error is", moe))

gpa.mean = round(mean(gpa), 2)

print(paste("The 95% confidence interval is", gpa.mean - moe, " to ", gpa.mean + moe))
[1] "The margin of error is 0.15"

[1] "The 95% confidence interval is 2.86  to  3.16"

The MOE can be visualized with a semitransparent rect() around the mean line.

plot(density(gpa), lwd=3, col="navy")

rect(xleft = gpa.mean - moe, ybottom = -1, xright = gpa.mean + moe, 
	ytop = 1, border=NA, col="#00000020")

abline(v = gpa.mean, lwd=3, col="darkred")
Visualizing a confidence interval

Confidence Intervals for Proportions

If your sample data is dichotomous (e.g. Mac vs PC), or categorical data that can be expressed as dichotomous (e.g. people whose favorite fruit is apple), what you are estimating is the population proportion in each group (x% use Mac, y% use PC). In that case, the confidence interval is:

p ± Z * σp

based on the standard error for proportions:

σp = √(p * (1 - p) / n)

Where:

Z is the z-score for the desired confidence interval (1.96 for a 95% level of confidence)

p is the proportion from the sample

n is the size of the sample

For example, in a survey of 53 students in regard to their laptop operating system:

laptop = table(survey$Laptop.Type, useNA = "no") / nrow(survey)

stderr = sqrt(laptop * (1 - laptop) / nrow(survey))

moe = 1.96 * stderr

paste0("Estimated ", names(laptop), " users = ", 
	round(100 * (laptop - moe)), "% to ", 
	round(100 * (laptop + moe)), "%")
[1] "Estimated Mac users = 34% to 61%" 
[2] "Estimated None users = 2% to 17%"
[3] "Estimated PC users = 30% to 57%" 
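The same calculation can be wrapped in a small function that works from counts alone. This is a sketch: prop_ci() is a hypothetical helper name, not a base R function, and it uses the same normal approximation as the formula above.

```r
# Hypothetical helper: normal-approximation confidence interval for a proportion
prop_ci = function(successes, n, confidence = 0.95) {
    p = successes / n
    z = qnorm(1 - (1 - confidence) / 2)
    moe = z * sqrt(p * (1 - p) / n)
    c(lower = p - moe, estimate = p, upper = p + moe)
}

# 25 of the 53 respondents reported using a Mac
mac = prop_ci(25, 53)

round(100 * mac)
# lower 34, estimate 47, upper 61
```

Passing a different confidence argument, such as prop_ci(25, 53, confidence = 0.99), widens or narrows the interval accordingly.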

arrows() can be used to draw error bars (whiskers) on a barplot().

x = barplot(laptop, col="navy", ylim=c(0, 0.8))

arrows(x0 = x, y0 = laptop - moe, y1 = laptop + moe,
        code=3, angle=90, length=0.1, col="darkgray", lwd=3)
Barplot with confidence interval error bars

Z-Scores For Other Levels of Confidence

The Z-score for a 95% level of confidence is 1.96 standard errors.

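In R, the qnorm() (normal curve quantile function) can be used to calculate the z-score for other confidence levels. The math for the parameter is a bit confusing because qnorm() takes a cumulative probability rather than a confidence level: for a two-sided interval, the excluded probability (1 - confidence) is split evenly between the two tails, so the upper cutoff is at 1 - ((1 - confidence) / 2):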

confidence = c(0.99, 0.95, 0.9, 0.8)

z = qnorm(1 - ((1 - confidence) / 2))

paste0("Z scores for ", confidence * 100, "% confidence interval = ", round(z, 2))
[1] "Z scores for 99% confidence interval = 2.58"
[2] "Z scores for 95% confidence interval = 1.96"
[3] "Z scores for 90% confidence interval = 1.64"
[4] "Z scores for 80% confidence interval = 1.28"

Confidence Intervals With Small Sample Sizes (the t-distribution)

When the sample size is 30 or less, the possibility of large errors increases, and the distribution of errors is better described by a Student's t-distribution than by a normal distribution.

The t-distribution also requires the number of degrees of freedom (the size of the sample minus one) for calculation.

The density graph compares the normal distribution with an analogous t-distribution with 10 degrees of freedom. Note that the tails of the t-distribution are heavier, which represents the additional uncertainty associated with small samples compared to large samples. As degrees of freedom increase, the t-distribution approaches the normal distribution.

Normal Distribution vs a T Distribution With 10 Degrees of Freedom
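The effect of the heavier tails can be seen numerically by comparing the cutoff values for a two-sided 95% interval. The sketch below uses qt() and qnorm(); the particular degrees of freedom are chosen only for illustration.

```r
degrees_freedom = c(5, 10, 30, 100)

# Cutoff values for a two-sided 95% interval (2.5% in each tail)
t_values = qt(0.975, degrees_freedom)
z_value = qnorm(0.975)

round(t_values, 2)
# [1] 2.57 2.23 2.04 1.98

round(z_value, 2)
# [1] 1.96
```

The t cutoff is always larger than the z cutoff, so the interval is wider, and the difference fades as the degrees of freedom increase.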

Confidence Interval for Means

The t value is used instead of the z-score to calculate a margin of error for sample means or proportions:

x̄ ± t * σx̄

The example small data set for this sample is available here.

sample = read.csv("2022-sample-data-small.csv")

degrees_freedom = nrow(sample) - 1

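# qt(0.95, df) leaves 5% in each tail, so this is a two-sided 90% interval;
# use qt(0.975, df) for a two-sided 95% interval
t = qt(0.95, degrees_freedom)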

stderr = sd(sample$GPA) / sqrt(nrow(sample))

moe = round(t * stderr, 2)

mean.gpa = round(mean(sample$GPA), 2)

paste("Estimated average GPA is", mean.gpa - moe, "to", mean.gpa + moe)
[1] "Estimated average GPA is 2.91 to 3.15"
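As a cross-check on the arithmetic, a manually calculated t interval can be compared with the one reported by the built-in t.test() function. This is a sketch using a made-up vector of five GPA values rather than the small sample file. Because qt() takes a cumulative probability, a two-sided 95% interval uses the 0.975 quantile.

```r
# Hypothetical small sample of GPAs (assumed for illustration)
gpa_small = c(2.8, 3.1, 3.4, 2.9, 3.3)

n = length(gpa_small)
std_error = sd(gpa_small) / sqrt(n)

# Two-sided 95% interval: 2.5% in each tail, so use the 0.975 quantile
t_crit = qt(0.975, n - 1)
manual_ci = mean(gpa_small) + c(-1, 1) * t_crit * std_error

# Built-in equivalent from a one-sample t-test
builtin_ci = t.test(gpa_small, conf.level = 0.95)$conf.int

round(manual_ci, 3)
round(as.numeric(builtin_ci), 3)
```

The two results should agree, which makes t.test() a convenient shortcut once the underlying calculation is understood.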

Confidence Interval for Proportions

p ± t * σp

Using the laptop responses in the same small sample, we can see the wider margin of error for proportions associated with a small sample:

sample_size = nrow(sample)

laptop = table(sample$Laptop.Type)

laptop = laptop / sum(laptop)

stderr = sqrt(laptop * (1 - laptop) / sample_size)

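# qt(0.95, df) leaves 5% in each tail, so this is a two-sided 90% interval;
# use qt(0.975, df) for a two-sided 95% interval
t = qt(0.95, sample_size - 1)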

moe = t * stderr

paste0("Estimated ", names(laptop), " users (small sample) = ", 
	round((laptop - moe) * 100, 1), "% - ", 
	round((laptop + moe) * 100, 1), "%")
[1] "Estimated MacOS users (small sample) = 12.3% - 47.7%"  
[2] "Estimated None users (small sample) = -3.4% - 13.4%"   
[3] "Estimated Windows users (small sample) = 46.6% - 83.4%"

How Big a Sample Do I Need to Have the Confidence I Want?

Using simple algebra, we can transform the formulas for margin of error to find the sample size (n) that we need to get the margin of error (E) that we are able to tolerate:

For population mean estimates:

n = (Z * s / E)^2

For population proportion estimates:

n = Z^2 * p * (1 - p) / E^2

Estimating s and p

For the mean formula, you need the sample standard deviation (s), and for the proportion formula you need the sample proportion (p). But since you have not yet done the sampling, you cannot know these values.

There are three imperfect but practical ways to estimate these values:

Note that the examples below presume a random sample from the entire population you are basing the estimate upon. If you are trying to make estimates about Americans, your random sample would need to be drawn from a list of all Americans with 100% participation. Drawing a perfectly random sample from anything other than a trivial or captive population is usually impossible, requiring more sophisticated modeling techniques to adjust the results and compensate for segments of the population that were undersampled.

Example Sample Size

Following the GPA example above, suppose you wish to get an estimate for GPA ±0.1 points (E = 0.1) with 95% confidence (Z = 1.96). For s, use the standard deviation of the GPA values in the existing sample.

Z = 1.96

s = sd(survey$GPA, na.rm=T)

E = 0.1

n = (Z * s / E)^2

paste("Minimum sample size =", ceiling(n))

Following on the proportions example above, suppose you wish to get a better estimate of the percent of students that use Macs as their personal home computers or laptops ±1% (E = 0.01) at a 95% level of confidence (Z = 1.96). Using the estimated proportion of 47% (p = 0.47) from the survey given above:

Z = 1.96

p = 0.47

E = 0.01

n = (Z^2) * p * (1 - p) / (E^2)

paste("Minimum sample size =", ceiling(n))
[1] "Minimum sample size = 9570"
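Both sample size formulas can be wrapped in small functions so that different margins of error are easy to compare. This is a sketch: sample_size_mean() and sample_size_prop() are hypothetical helper names, and the default of p = 0.5 is an assumption used here because p * (1 - p) is largest at 0.5, which gives the most conservative (largest) sample size when no estimate of p is available.

```r
# Hypothetical helpers for the two sample size formulas
sample_size_mean = function(s, E, confidence = 0.95) {
    z = qnorm(1 - (1 - confidence) / 2)
    ceiling((z * s / E)^2)
}

sample_size_prop = function(E, p = 0.5, confidence = 0.95) {
    z = qnorm(1 - (1 - confidence) / 2)
    ceiling(z^2 * p * (1 - p) / E^2)
}

# Proportion near 47%, margin of error of 1 percentage point
sample_size_prop(E = 0.01, p = 0.47)
# [1] 9570

# Conservative choice when no estimate of p is available
sample_size_prop(E = 0.01)
# [1] 9604

# Mean with an assumed standard deviation of 0.5 and margin of error of 0.1
sample_size_mean(s = 0.5, E = 0.1)
# [1] 97
```

The functions round up with ceiling() because a fractional respondent is not possible and rounding down would leave the margin of error slightly too wide.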

Appendix: Calculations in Excel

Confidence Interval for Means

Using a simulated sample of heights from 100 American men in this CSV file, this formula can be used to calculate standard error:

=STDEV(A2:A101) / SQRT(COUNTA(A2:A101))

To calculate margin of error for means, given the z-score and the standard deviation in cells B2 and B3 and the sample size in cell B4:

=B3 * B2 / SQRT(B4)

Confidence Interval for Proportions

Using survey responses from this CSV file, we can calculate the confidence interval for proportions in Excel.

When dealing with categorical data in Excel, you can use the COUNTIF() function to find the number of cells in your sample that match a particular value. The first parameter is the data and the second parameter is the condition. Note that the second parameter must be enclosed in quotes.

For example, with the CSV file linked above, the data in cells A2:A33 is either "Mac" or "PC". This will return the number of Mac users:

=COUNTIF(A2:A33, "Mac")

You can then get the proportion of Mac cells (as a fraction between 0 and 1) with:

=COUNTIF(A2:A33, "Mac") / COUNTA(A2:A33)

If you put your proportion in cell B2 and your sample size in cell B3, the standard error for proportions is:

=SQRT(B2 * (1 - B2) / B3)

If you then put the Z-score for your level of confidence in cell B4 (for 95% confidence, this is 1.96), the Excel formula for margin of error for proportions is:

=B4 * SQRT(B2 * (1 - B2) / B3)

Z-Scores For Other Levels of Confidence

In Excel, z-scores for other levels of confidence can be calculated with the NORMSINV() function. With the confidence level in cell B2 (as a proportion on a scale of 0 to 1):

=NORMSINV(1 - ((1 - B2) / 2))

Confidence Interval for Small Samples

In Excel, the TINV() function can be used to calculate t for use with standard error to calculate a confidence interval.

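To calculate the margin of error for means, given the level of confidence in cell B2, the standard deviation in cell B3, and the sample size in cell B4: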

=TINV(1 - B2, B4 - 1) * B3 / SQRT(B4)

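To calculate the margin of error for proportions, given the proportion in cell B2, the level of confidence in cell B3, and the sample size in cell B4: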

=TINV(1 - B3, B4 - 1) * SQRT(B2 * (1 - B2) / B4)

Excel Confidence Interval Functions

As you might expect, Excel has functions to simplify confidence intervals for means. CONFIDENCE.NORM() can be used to calculate margin of error with sample sizes over 30 and CONFIDENCE.T() can be used with sample sizes of 30 or less. Both take the same parameters:

=CONFIDENCE.NORM(alpha, stdev, sample_size)

=CONFIDENCE.T(alpha, stdev, sample_size)

alpha = 1 minus level of confidence (0.05 for 95% confidence level)
stdev = standard deviation of the sample
sample_size = count of values in the sample

For example, given a level of confidence in cell B2, the standard deviation in cell B3, and the sample size in cell B4, the margin of error (for a sample size of 30 or less) using the t-distribution function is:

=CONFIDENCE.T(1 - B2, B3, B4)

Excel does not have a convenience function for confidence interval for proportions.

Example Sample Size Estimation

Following the male height example above, suppose you wish to get an estimate for female height ±1 inch (E = 1) with 95% confidence (Z = 1.96). For s you need an estimate of the standard deviation of female height (not the mean height), such as a published value. With that estimate in cell B2, the POWER() function is used for the exponent:

=POWER(1.96 * B2 / 1, 2)

Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (E = 0.05) at a 95% level of confidence (Z = 1.96). Using a small-sample estimated proportion of 40% (p = 0.4):

=POWER(1.96, 2) * 0.4 * (1 - 0.4) / POWER(0.05, 2)