Sampling: How Much Can I Know With Only A Limited Amount of Data?

We rarely have complete data about something we want to know. For example, we often want to know who will win a political election, but we do not have the time to ask every potential voter who they plan to vote for. And people's attitudes change during a campaign.

However, it is possible to get the opinions of a small number of people and use them to estimate the possible outcomes of the election.

That small number of people is called a sample, and the estimate we make about the broader population is called an inference. The use of statistical techniques to make inferences is called inferential statistics.

Sampling Concepts

Types of Sampling

The method used to determine what subset of people or things you base your inference on is called sampling.

There are two broad types of sampling techniques, and a number of subtypes within those broad types.

Probability sampling is when any member of the target population has a known probability of being included in the sampled population. Such techniques include:

Nonprobability sampling is used in cases where probability sampling is impractical. While these techniques are not as reliable as probability sampling techniques for making inferences about the broader population, results from these samples can offer suggestions of research paths that would justify the expense and difficulty of further probability sampling. Some common subtypes of nonprobability sampling include:

Central Limit Theorem

If we sample a characteristic of a population that can be represented as a continuous variable, we can take the mean of that sample and get the sample mean. But how confident can we be that the sample mean is anywhere close to the population mean?

A curious and wondrous fact about random sampling is that if you take many random samples and calculate the arithmetic mean of some population characteristic in each one, the deviations (errors) of those sample means from the population mean will approximate a normal distribution, provided each sample is reasonably large.

This is the central limit theorem, which was originally proposed by French mathematician Abraham de Moivre in 1733 and later developed by French mathematician Pierre-Simon Laplace in 1812 and Russian mathematician Aleksandr Lyapunov in 1901. What it means is that you can use standard deviation and the rules of probability to calculate confidence intervals and assess how reliable your sampling results are. The central limit theorem is one of the most important theorems in statistics.
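The theorem can be checked with a small simulation. The sketch below assumes a hypothetical, deliberately skewed population (exponential, with a mean near 10) and an arbitrary seed, sample size, and number of samples; none of these come from the data used elsewhere on this page. If the theorem holds, the spread of the errors should be close to the population standard deviation divided by the square root of the sample size, and roughly 68% of the sample means should fall within that distance of the population mean.

```r
# Simulate the central limit theorem with a deliberately skewed population.
set.seed(42)                                  # arbitrary seed, for reproducibility
population = rexp(100000, rate = 1 / 10)      # hypothetical skewed population, mean near 10

sample_size = 50                              # hypothetical size of each sample
n_samples = 2000                              # hypothetical number of repeated samples

# Draw many random samples and keep the mean of each one
sample_means = replicate(n_samples, mean(sample(population, sample_size)))

# Errors between each sample mean and the population mean
errors = sample_means - mean(population)

# The spread of the errors should be close to sigma / sqrt(n)
observed_spread = sd(errors)
predicted_spread = sd(population) / sqrt(sample_size)

# Share of sample means within one predicted spread (near 0.68 if roughly normal)
within_one = mean(predicted_spread >= abs(errors))

print(round(c(observed = observed_spread,
              predicted = predicted_spread,
              within_one = within_one), 3))
```

Because the samples are random, the exact figures depend on the seed, but the observed and predicted spreads should agree closely even though the population itself is far from normal.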

Within a normal distribution, we know that around 68% of the values are within one standard deviation of the mean, around 95% are within two standard deviations, and 99.7% are within three standard deviations.

Therefore, we can be 68% certain that our sample mean is within one standard deviation of the population mean, 95% certain that it is within two, and 99.7% certain that it is within three. Since what we are dealing with here is a potential amount of error between the sample mean and population mean, this standard deviation of error is called standard error.
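The normal-curve percentages can be checked in R with pnorm(), the cumulative distribution function for the normal curve. This is only a quick sanity check, not part of a sampling workflow:

```r
# Share of a normal distribution within k standard deviations of the mean
k = c(1, 2, 3)
coverage = pnorm(k) - pnorm(-k)
print(round(coverage, 4))
# [1] 0.6827 0.9545 0.9973
```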

68% Confidence Interval

Standard Error

The standard error is used to estimate how far the sample mean deviates from the actual population mean.

Standard error for a sample mean (x̄) is calculated as the standard deviation of the population (σ) divided by the square root of the sample size (n). The greater the sample size, the smaller the standard error, and the closer the sample mean can be assumed to be to the actual population mean.

σx̄ = σ / √n

Since we rarely know the standard deviation of the population (which is why we're sampling in the first place) we estimate the standard error using the standard deviation of the sample (s):

σx̄ ≈ s / √n
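To see the effect of sample size, the sketch below assumes a sample standard deviation of 2.8 inches (an illustrative value) and a few hypothetical sample sizes. Because of the square root, quadrupling the sample size only halves the standard error:

```r
s = 2.8                        # assumed sample standard deviation, in inches
n = c(25, 100, 400, 1600)      # hypothetical sample sizes, each four times the last
standard_error = s / sqrt(n)   # standard error of the mean for each sample size
print(data.frame(n = n, standard_error = standard_error))
# standard errors: 0.56, 0.28, 0.14, 0.07
```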

Uses for standard error are described below.

Confidence Intervals For Means

The level of confidence you want determines how many standard errors above or below the mean you are willing to accept. Commonly used levels of confidence include 90%, 95%, and 99%. Which one of these is acceptable depends on how accurate you need your estimates to be.

In a normal distribution, we know that 95% of the values are within 1.96 standard deviations (a z-score of 1.96) above or below the mean.

Accordingly, if we have a random sample, we can be 95% confident (confidence interval) that the actual population mean is within 1.96 standard errors above or below the sample mean (x̄).

x̄ ± 1.96 * σx̄

Confidence interval is sometimes described in terms of margin of error. Confidence interval is the whole range of values above and below the mean. Margin of error is the difference between the mean and the bottom or top of the confidence interval (z-score times standard error).
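One way to see what "95% confident" means is to repeat the sampling many times against a population whose mean is known, and count how often the interval captures it. The sketch below assumes a hypothetical normal population (mean 69.2 inches, standard deviation 2.8 inches) and an arbitrary seed and number of trials; the share of intervals containing the true mean should come out close to 0.95.

```r
set.seed(1)            # arbitrary seed, for reproducibility
true_mean = 69.2       # hypothetical population mean (inches)
true_sd = 2.8          # hypothetical population standard deviation
n = 100                # size of each sample
trials = 2000          # number of repeated samples

# For each trial: draw a sample, build the 95% interval,
# and record whether the true mean falls inside it
covered = replicate(trials, {
  x = rnorm(n, true_mean, true_sd)
  moe = 1.96 * sd(x) / sqrt(n)
  moe >= abs(mean(x) - true_mean)
})

coverage = mean(covered)   # share of intervals that contain the true mean
print(coverage)
```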

95% Confidence Interval

R Example

Using a simulated sample of heights from 100 American men in this CSV file, the confidence interval based on standard error can be calculated in R:

> heights = read.csv("simulated-male-height.csv")$Inches
> print(heights)
  [1] 63.3 63.6 63.8 64.1 64.3 64.5 64.8 65.0 65.2 65.4 65.6 65.7 65.9 66.0 66.1
 [16] 66.2 66.3 66.5 66.6 66.7 66.8 66.9 67.0 67.1 67.2 67.3 67.4 67.5 67.6 67.7
 [31] 67.7 67.8 67.9 68.0 68.1 68.1 68.2 68.3 68.3 68.4 68.5 68.6 68.6 68.7 68.8
 [46] 68.8 68.9 69.0 69.0 69.1 69.2 69.2 69.3 69.4 69.5 69.5 69.6 69.7 69.8 69.8
 [61] 69.9 70.0 70.1 70.2 70.3 70.3 70.4 70.5 70.6 70.7 70.8 70.9 71.0 71.1 71.2
 [76] 71.3 71.4 71.5 71.6 71.7 71.8 72.0 72.1 72.2 72.3 72.4 72.5 72.7 72.8 73.0
 [91] 73.2 73.4 73.6 73.9 74.1 74.3 74.6 74.8 75.1 75.3

> n = length(heights)
> stderr = sd(heights) / sqrt(n)
> moe = round(1.96 * stderr, 2)
> paste("Estimated average height for men = ", mean(heights), "inches +/-", moe, "inches")
[1] "Estimated average height for men =  69.235 inches +/- 0.56 inches"

Excel Example

Using a simulated sample of heights from 100 American men in this CSV file, this formula can be used to calculate standard error:

=STDEV(A2:A101) / SQRT(COUNTA(A2:A101))

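To calculate the margin of error for means, given the z-score in cell B2, the sample standard deviation in cell B3, and the sample size in cell B4: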

=B3 * B2 / SQRT(B4)

Confidence Intervals for Proportions

If your sample data is dichotomous (e.g. Mac vs PC), or categorical data that can be expressed as dichotomous (e.g. people whose favorite fruit is apple vs. everyone else), what you are estimating is the population proportion in each group (x% use Mac, y% use PC). In that case, the confidence interval is:

p ± Z * σp

based on the standard error for proportions:

σp = √(p * (1 - p) / n)

Where:

Z is the z-score for the desired confidence interval (1.96 for a 95% level of confidence)

p is the proportion from the sample

n is the size of the sample

For example, in a survey of 32 people asking what operating system their home computer or laptop uses, 13 (40.6%) said Mac OS and 19 (59.4%) said PC (Windows). To estimate the proportion of Mac users in the general population:

σp = √(0.406 * (1 - 0.406) / 32) = 0.087

p ± 1.96 * 0.087

40.6% ± 17%

R Example

Using survey responses from this CSV file, we can calculate the confidence interval for proportions in R:

> # Read survey results
> os = read.csv("operating-system-survey.csv")$OS
> print(os)
 [1] Mac PC  PC  Mac PC  Mac PC  Mac PC  PC  PC  PC  Mac PC  PC  PC  Mac Mac PC 
[20] PC  PC  PC  Mac Mac Mac Mac PC  PC  Mac Mac PC  PC 
Levels: Mac PC
> 
> # Get the proportion of respondents who use a Mac
> mac = sum(os == 'Mac') / length(os)
> 
> # Calculate standard error
> stderr = sqrt(mac * (1 - mac) / length(os))
> moe = 1.96 * stderr
> paste0("Estimated Mac users = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users = 40.6% +/- 17%"
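The same steps can be wrapped in a small function so that the interval bounds are returned directly. This is only a sketch: prop_ci() is a hypothetical helper name, it uses qnorm() to look up the z-score for the requested level of confidence, and it clips the bounds to the range 0 to 1, which is a choice made here rather than part of the formula.

```r
# Hypothetical helper: confidence interval for a proportion
# using the normal approximation p +/- z * sqrt(p * (1 - p) / n)
prop_ci = function(successes, n, confidence = 0.95) {
  p = successes / n
  z = qnorm(1 - (1 - confidence) / 2)   # z-score for a two-sided interval
  se = sqrt(p * (1 - p) / n)            # standard error for proportions
  moe = z * se                          # margin of error
  c(estimate = p, lower = max(0, p - moe), upper = min(1, p + moe))
}

# 13 Mac users out of 32 respondents, as in the survey above
ci = prop_ci(13, 32)
print(round(ci, 3))
# estimate 0.406, lower 0.236, upper 0.576
```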

Excel Example

Using survey responses from this CSV file, we can calculate the confidence interval for proportions in Excel.

When dealing with categorical data in Excel, you can use the COUNTIF() function to find the number of cells in your sample that match a particular value. The first parameter is the data and the second parameter is the condition. Note that the second parameter must be enclosed in quotes.

For example, with the CSV file linked above, the data in cells A2:A33 is either "Mac" or "PC". This will return the number of Mac users:

=COUNTIF(A2:A33, "Mac")

You can then get the proportion (percentage) of Mac cells with:

=COUNTIF(A2:A33, "Mac") / COUNTA(A2:A33)

If you put your proportion in cell B2 and your sample size in cell B3, the standard error for proportions is:

=SQRT(B2 * (1 - B2) / B3)

If you then put the Z-score for your level of confidence in cell B4 (for 95% confidence, this is 1.96), the Excel formula for margin of error for proportions is:

=B4 * SQRT(B2 * (1 - B2) / B3)

Z-Scores For Other Levels of Confidence

The Z-score for a 95% level of confidence is 1.96 standard errors.

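In R, the qnorm() (normal curve quantile function) can be used to calculate the z-score for other levels of confidence. The math for the parameter is a bit confusing because qnorm() expects a cumulative probability rather than a confidence level. For a two-sided interval, the area left out of the interval (1 minus the confidence level) is split evenly between the two tails, so the parameter is 1 minus half of that area: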

> confidence = c(0.99, 0.95, 0.9, 0.8)
> z = qnorm(1 - ((1 - confidence) / 2))
> paste0("Z scores for ", confidence * 100, "% confidence interval = ", round(z, 2))
[1] "Z scores for 99% confidence interval = 2.58"
[2] "Z scores for 95% confidence interval = 1.96"
[3] "Z scores for 90% confidence interval = 1.64"
[4] "Z scores for 80% confidence interval = 1.28"

In Excel, z-scores for other levels of confidence can be calculated with the NORMSINV() function. With the level of confidence in cell B2 (as a proportion between 0 and 1):

=NORMSINV(1 - ((1 - B2) / 2))

Confidence Intervals With Small Sample Sizes (the t-distribution)

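When the sample size is 30 or less, the sample standard deviation becomes a less reliable estimate of the population standard deviation, the possibility of large errors increases, and the distribution of errors follows a Student's t-distribution rather than a normal distribution.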

The t-distribution also requires the number of degrees of freedom (the size of the sample minus one) for calculation.

The density graph compares the normal distribution with an analogous t distribution with 10 degrees of freedom. Note that the tails of the t distribution are taller, which represents the additional area of uncertainty associated with small samples compared to large sample sizes. As degrees of freedom increase, the t distribution begins to approximate the normal distribution.

Normal Distribution vs a T Distribution With 10 Degrees of Freedom
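The convergence can also be seen numerically by comparing the critical values for a two-sided 95% interval. The sketch below uses qt() and qnorm() with a handful of arbitrarily chosen degrees of freedom:

```r
# Critical values for a two-sided 95% interval (2.5% in each tail)
df = c(2, 5, 10, 30, 100, 1000)   # arbitrary degrees of freedom to compare
t_crit = qt(0.975, df)            # t-distribution
z_crit = qnorm(0.975)             # normal distribution

print(data.frame(df = df, t = round(t_crit, 3), z = round(z_crit, 3)))
# t: 4.303 2.571 2.228 2.042 1.984 1.962
# z: 1.96 in every row
```

With very small samples the t value is much larger than 1.96, which widens the interval; by 100 degrees of freedom the difference is small.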

The t value is used instead of the z-score to calculate a margin of error for sample means or proportions:

x̄ ± t * σx̄

p ± t * σp

R Examples

Confidence interval for means:

> # Simulated small sample
> set.seed(0)
> heights = rnorm(15, 69.2, 2.8)
> 
> # 95% confidence (two-sided, so 2.5% in each tail)
> degrees_freedom = length(heights) - 1
> t = qt(0.975, degrees_freedom)
> stderr = sd(heights) / sqrt(length(heights))
> moe = t * stderr
> 
> paste("Estimated average height is", round(mean(heights), 1), "inches +/-", round(moe, 1), "inches")
[1] "Estimated average height is 69.5 inches +/- 1.7 inches"
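As a cross-check, R's built-in t.test() function reports a confidence interval for a sample mean. The sketch below regenerates the same simulated sample and compares a manual two-sided 95% interval with the built-in one. Note that a two-sided 95% interval leaves 2.5% in each tail, so the manual calculation uses the 0.975 quantile of the t-distribution.

```r
# Same simulated small sample as above
set.seed(0)
heights = rnorm(15, 69.2, 2.8)

# Manual two-sided 95% interval: 2.5% in each tail, so use the 0.975 quantile
t_crit = qt(0.975, length(heights) - 1)
se = sd(heights) / sqrt(length(heights))
manual = mean(heights) + c(-1, 1) * t_crit * se

# Built-in one-sample t-test; conf.int holds the lower and upper bounds
builtin = as.numeric(t.test(heights, conf.level = 0.95)$conf.int)

print(round(manual, 2))
print(round(builtin, 2))
```

The two printed intervals should be identical, which is a convenient way to confirm that the quantile and degrees of freedom were chosen correctly.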

Confidence interval for proportions: Using the example above of a survey where 40% of respondents indicated using a Mac rather than a PC for their home computer or laptop, we can see the wider margin of error for proportions associated with a small sample:

> mac = 0.4
> 
> # Small Sample (t distribution, two-sided 95%)
> sample_size = 10
> stderr = sqrt(mac * (1 - mac) / sample_size)
> t = qt(0.975, sample_size - 1)
> moe = t * stderr
> 
> paste0("Estimated Mac users (small sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users (small sample) = 40% +/- 35%"
> 
> # Large sample (normal distribution)
> sample_size = 100
> stderr = sqrt(mac * (1 - mac) / sample_size)
> moe = 1.96 * stderr
> 
> paste0("Estimated Mac users (large sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users (large sample) = 40% +/- 9.6%"

Excel Example

In Excel, the TINV() function can be used to calculate t for use with standard error to calculate a confidence interval.

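To calculate the margin of error for means, given the level of confidence in cell B2, the sample standard deviation in cell B3, and the sample size in cell B4:

=TINV(1 - B2, B4 - 1) * B3 / SQRT(B4)

To calculate the margin of error for proportions, given the sample proportion in cell B2, the level of confidence in cell B3, and the sample size in cell B4:

=TINV(1 - B3, B4 - 1) * SQRT(B2 * (1 - B2) / B4)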

Excel Confidence Interval Functions

As you might expect, Excel has functions to simplify confidence intervals for means. CONFIDENCE.NORM() can be used to calculate margin of error with sample sizes over 30 and CONFIDENCE.T() can be used with sample sizes of 30 or less. Both take the same parameters:

=CONFIDENCE.NORM(alpha, stdev, sample_size)

=CONFIDENCE.T(alpha, stdev, sample_size)

alpha = 1 minus level of confidence (0.05 for 95% confidence level)
stdev = standard deviation of the sample
sample_size = count of values in the sample

For example, given a level of confidence in cell B2, the standard deviation in cell B3, and the sample size in cell B4, the margin of error (for a sample size of 30 or less) using the t-distribution function is:

=CONFIDENCE.T(1 - B2, B3, B4)

Excel does not have a convenience function for confidence interval for proportions.

How Big a Sample Do I Need to Have the Confidence I Want?

Using simple algebra, we can transform the formulas for margin of error to find the sample size (n) that we need to get the margin of error (E) that we are able to tolerate:

For population mean estimates:

n = (Z * s / E)²

For population proportion estimates:

n = Z² * p * (1 - p) / E²
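The algebra can be sketched as follows, starting from the margin of error for means and for proportions and solving each for n. E is the margin of error, Z the z-score, s the sample standard deviation, and p the sample proportion:

```latex
% Means: margin of error is Z standard errors
E = Z \cdot \frac{s}{\sqrt{n}}
\;\Rightarrow\; \sqrt{n} = \frac{Z \cdot s}{E}
\;\Rightarrow\; n = \left( \frac{Z \cdot s}{E} \right)^2

% Proportions: square both sides, then solve for n
E = Z \cdot \sqrt{\frac{p (1 - p)}{n}}
\;\Rightarrow\; E^2 = \frac{Z^2 \, p (1 - p)}{n}
\;\Rightarrow\; n = \frac{Z^2 \, p (1 - p)}{E^2}
```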

Estimating s and p

For the mean formula, you need the sample standard deviation (s), and for the proportion formula you need the sample proportion (p). But since you have not done the sampling yet, you cannot know these values.

There are three imperfect but practical ways to estimate these values:

Note that the examples below presume a random sample from the entire population you are basing the estimate upon. If you are trying to make estimates about Americans, your random sample would need to be drawn from a list of all Americans with 100% participation. Drawing a perfectly random sample from anything other than a trivial or captive population is usually impossible, requiring more sophisticated modeling techniques to adjust the results and compensate for segments of the population that were undersampled.

Example Sample Size Estimation in R

Following the examples using male height above, suppose you wish to get an estimate for female height ±1 inch (E = 1) with 95% confidence (Z = 1.96). The formula needs a standard deviation for s, not an average height, so as a rough assumption you use the 2.8 inches used for the simulated male heights above:

> z = 1.96
> s = 2.8
> maxerr = 1
> n = (z * s / maxerr)^2
> paste("Minimum sample size =", ceiling(n))
[1] "Minimum sample size = 31"

Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (E = 0.05) at a 95% level of confidence (Z = 1.96). Using the estimated proportion of 40% from the survey given above:

> z = 1.96
> p = 0.4
> e = 0.05
> n = (z^2) * p * (1 - p) / (e^2)
> paste("Minimum sample size =", ceiling(n))
[1] "Minimum sample size = 369"
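Since the value of p is only a guess before sampling, it can help to see how sensitive the required sample size is to that guess. The sketch below uses a hypothetical helper, sample_size_prop(), and a few arbitrary guesses for p. Because p * (1 - p) is largest at p = 0.5, that guess gives the most conservative (largest) sample size. The last step is a round trip that checks that the margin of error at each computed n does not exceed E.

```r
# Hypothetical helper: minimum sample size for estimating a proportion
sample_size_prop = function(p, e, confidence = 0.95) {
  z = qnorm(1 - (1 - confidence) / 2)   # z-score for a two-sided interval
  ceiling(z^2 * p * (1 - p) / e^2)      # round up to a whole respondent
}

p_guess = c(0.1, 0.25, 0.4, 0.5)        # arbitrary guesses for the proportion
n_needed = sample_size_prop(p_guess, e = 0.05)
print(data.frame(p = p_guess, n = n_needed))
# n: 139, 289, 369, 385

# Round trip: margin of error at the computed sample sizes
moe = qnorm(0.975) * sqrt(p_guess * (1 - p_guess) / n_needed)
print(round(moe, 4))                    # each value should be at or just under 0.05
```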

Example Sample Size Estimation in Excel

Following the examples using male height above, suppose you wish to get an estimate for female height ±1 inch (E = 1) with 95% confidence (Z = 1.96). The formula needs a standard deviation for s, not an average height, so as a rough assumption you use the 2.8 inches used for the simulated male heights above. The POWER() function is used for the exponent:

=POWER(1.96 * 2.8 / 1, 2)

Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (E = 0.05) at a 95% level of confidence (Z = 1.96). Using the small sample estimated proportion of 40% given above:

=POWER(1.96, 2) * 0.4 * (1 - 0.4) / POWER(0.05, 2)