Sampling: How Much Can I Know With Only A Limited Amount of Data?
We rarely have complete data about something we want to know. For example, we often want to know who will win a political election, but we do not have the time to ask every potential voter who they plan to vote for, and people's attitudes change over the course of a campaign.
However, it is possible to get the opinions of a small number of people and make an estimate of the likely outcome of the election.
That small number of people is called a sample, and the estimate we make about the broader population is called an inference. The use of statistical techniques to make inferences is called inferential statistics.
Types of Sampling
The method used to determine what subset of people or things you base your inference on is called sampling.
There are two broad types of sampling techniques, and a number of subtypes within those broad types.
Probability sampling is when any member of the target population has a known probability of being included in the sampled population. Such techniques include:
- Simple random sampling involves randomly selecting members of the population, so that every member has an equal probability of being included. For example, a random number generator (like the Excel RAND() function) can be used to select a subset of numbers from a phone number list that covers the population. In the natural sciences, an example might be placing remote cameras at random locations in an area when trying to get a count of the population of a specific type of animal in that area
- Systematic sampling involves selecting members from the target population at a regular interval. For example, every 10th number from a sorted list of addresses
- Stratified sampling involves taking random samples from different subgroups of the target population and then using known information about the size or characteristics of those subgroups to adjust the results. For example, if studying the opinions about college students toward a university policy, different classes (freshmen, sophomores, etc) or majors might have different opinions about that policy. Separating calculations into different sub-groups can be useful for discerning the differences between those groups
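The first two of these techniques can be sketched in a few lines of Python. This is only an illustration; the phone-number list here is hypothetical, and any seeded random number generator would do:

```python
import random

# Hypothetical population: a list of 1,000 phone numbers (made up for illustration)
population = [f"555-{i:04d}" for i in range(1000)]

random.seed(42)  # fixed seed so the sketch is reproducible

# Simple random sampling: every member has an equal chance of being chosen
simple_sample = random.sample(population, 50)

# Systematic sampling: every 20th entry from the sorted list,
# starting at a random offset within the first interval
start = random.randrange(20)
systematic_sample = population[start::20]

print(len(simple_sample), len(systematic_sample))  # 50 numbers from each technique
```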
Nonprobability sampling is used in cases where probability sampling is impractical. While these techniques are not as reliable as probability sampling techniques for making inferences about the broader population, results from these samples can offer suggestions of research paths that would justify the expense and difficulty of further probability sampling. Some common subtypes of nonprobability sampling include:
- Convenience sampling involves sampling a group of people you can conveniently access. For example, American college students are some of the most studied populations in history. College students tend to be young and from a specific set of social classes, so they cannot be said to be a random sample of the broader population. However, college students are readily accessible to the academic researchers who are their professors, and they are often happy to be research subjects in exchange for pizza or other relatively small amounts of compensation
- Purposive sampling is a variant on convenience sampling that is commonly used when information about a specific subgroup of the broader population is needed. For example, if information about the opinions of men are needed, volunteers might stand on a busy street corner and specifically target only men that pass by for further inquiry (and likely rejection)
- Snowball sampling involves asking sampled subjects for recommendations of other people that might fit a specific profile. For example, if studying a relatively small ethnic group, asking a subject for a recommendation on friends and family members that might participate would make it easier to find other members of that ethnic group
Central Limit Theorem
If we sample a characteristic of a population that can be represented as a continuous variable, we can take the mean of that sample and get the sample mean. But how confident can we be that the sample mean is anywhere close to the population mean?
A curious and wondrous fact about random sampling is that if you take many random samples and calculate the arithmetic mean of some population characteristic for each one, the distribution of those sample means around the population mean will approximate a normal distribution.
This is the central limit theorem, which was originally proposed by French mathematician Abraham de Moivre in 1733 and later developed by French mathematician Pierre-Simon Laplace in 1812 and Russian mathematician Aleksandr Lyapunov in 1901. What it means is that you can use standard deviation and the rules of probability to calculate confidence intervals and assess how reliable your sampling results are. The central limit theorem is one of the most important theorems in statistics.
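The theorem can be seen in a quick simulation. This Python sketch (the exponential population and the sample size of 40 are arbitrary choices for illustration) draws repeated samples from a heavily skewed distribution; the means of those samples still cluster around the population mean in a roughly normal, symmetric way:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Population: an exponential distribution with mean 1.0 (heavily right-skewed)
# Draw 2,000 random samples of size 40 and record each sample mean
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(40))
    for _ in range(2000)
]

# The sample means center on the population mean of 1.0...
print(round(statistics.mean(sample_means), 2))
# ...with a spread close to the theoretical standard error 1 / sqrt(40), about 0.158
print(round(statistics.stdev(sample_means), 2))
```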
Within a normal distribution, we know that around 68% of the values are within one standard deviation of the mean, around 95% are within two standard deviations, and around 99.7% are within three standard deviations.
Therefore, we can be 68% confident that our sample mean is within one standard deviation of the population mean, and 95% confident that it is within two standard deviations. Since what we are dealing with here is a potential amount of error between the sample mean and population mean, this standard deviation of error is called standard error.
The standard error is used to estimate how far the sample mean deviates from the actual population mean.
Standard error for a sample mean (x̄) is calculated as the standard deviation of the population (σ) divided by the square root of the sample size (n). The greater the sample size, the closer the sample mean can be assumed to get to the actual population mean.
σx̄ = σ / √n
Since we rarely know the standard deviation of the population (which is why we're sampling in the first place) we estimate the standard error using the standard deviation of the sample (s):
σx̄ = s / √n
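As a quick illustration of the formula, this Python sketch estimates the standard error from a small made-up sample of heights (the numbers are invented for illustration) and shows how a larger sample would shrink it:

```python
import math
import statistics

# Made-up sample of 10 height measurements in inches (assumed data)
sample = [64.2, 66.5, 67.1, 68.0, 68.4, 69.3, 70.1, 70.8, 72.2, 74.0]

n = len(sample)
s = statistics.stdev(sample)  # sample standard deviation
stderr = s / math.sqrt(n)     # estimated standard error of the mean

print(round(stderr, 2))  # → 0.91

# Quadrupling the sample size halves the standard error,
# because the error shrinks with the square root of n
print(round(s / math.sqrt(4 * n), 2))  # → 0.45
```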
Uses for standard error are described below.
Confidence Intervals For Means
The level of confidence you want determines how many standard errors above or below the mean you are willing to accept. Commonly used levels of confidence include 90%, 95%, and 99%. Which one of these is acceptable depends on how accurate you need your estimates to be.
In a normal distribution, we know that 95% of the values are within 1.96 standard deviations (a z-score of 1.96) above or below the mean.
Accordingly, if we have a random sample, we can be 95% confident that the actual population mean is within 1.96 standard errors above or below the sample mean (x̄). This range is called a confidence interval.
x̄ ± 1.96 * σx̄
Confidence interval is sometimes described in terms of margin of error. Confidence interval is the whole range of values above and below the mean. Margin of error is the difference between the mean and the bottom or top of the confidence interval (z-score times standard error).
Using a simulated sample of heights from 100 American men in this CSV file, the confidence interval based on standard error can be calculated in R:
> heights = read.csv("simulated-male-height.csv")$Inches
> print(heights)
 63.3 63.6 63.8 64.1 64.3 64.5 64.8 65.0 65.2 65.4 65.6 65.7 65.9 66.0 66.1
 66.2 66.3 66.5 66.6 66.7 66.8 66.9 67.0 67.1 67.2 67.3 67.4 67.5 67.6 67.7
 67.7 67.8 67.9 68.0 68.1 68.1 68.2 68.3 68.3 68.4 68.5 68.6 68.6 68.7 68.8
 68.8 68.9 69.0 69.0 69.1 69.2 69.2 69.3 69.4 69.5 69.5 69.6 69.7 69.8 69.8
 69.9 70.0 70.1 70.2 70.3 70.3 70.4 70.5 70.6 70.7 70.8 70.9 71.0 71.1 71.2
 71.3 71.4 71.5 71.6 71.7 71.8 72.0 72.1 72.2 72.3 72.4 72.5 72.7 72.8 73.0
 73.2 73.4 73.6 73.9 74.1 74.3 74.6 74.8 75.1 75.3
> n = length(heights)
> stderr = sd(heights) / sqrt(n)
> moe = round(1.96 * stderr, 2)
> paste("Estimated average height for men =", mean(heights), "inches +/-", moe, "inches")
 "Estimated average height for men = 69.235 inches +/- 0.56 inches"
Using a simulated sample of heights from 100 American men in this CSV file, this Excel formula can be used to calculate standard error:
=STDEV(A2:A101) / SQRT(COUNTA(A2:A101))
To calculate margin of error for means given:
- That standard error in cell B2
- The Z-score for a 95% level of confidence (z = 1.96) in cell B3
=B3 * B2
Confidence Intervals for Proportions
If your sample data is dichotomous (e.g. Mac vs PC), or categorical data that can be expressed as dichotomous (e.g. whether or not a person's favorite fruit is the apple), what you are estimating is the population proportion in each group (x% use Mac, y% use PC). In that case, the confidence interval is:
p ± Z * σp
based on the standard error for proportions:
σp = √(p * (1 - p) / n)
Z is the z-score for the desired confidence interval (1.96 for a 95% level of confidence)
p is the proportion from the sample
n is the size of the sample
For example, in a survey of 32 people asking what operating system their home computer or laptop uses, 13 (40.6%) said Mac OS and 19 (59.4%) said Windows PC. To estimate the proportion of Mac users in the general population:
σp = √(0.406 * (1 - 0.406) / 32) = 0.087
p ± 1.96 * 0.087
40.6% ± 17%
Using survey responses from this CSV file, we can calculate the confidence interval for proportions in R:
> # Read survey results
> os = read.csv("operating-system-survey.csv")$OS
> print(os)
 Mac PC PC Mac PC Mac PC Mac PC PC PC PC Mac PC PC PC Mac Mac PC
 PC PC PC Mac Mac Mac Mac PC PC Mac Mac PC PC
Levels: Mac PC
>
> # Get the proportion of people using a Mac
> mac = sum(os == 'Mac') / length(os)
>
> # Calculate standard error and margin of error
> stderr = sqrt(mac * (1 - mac) / length(os))
> moe = 1.96 * stderr
> paste0("Estimated Mac users = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
 "Estimated Mac users = 40.6% +/- 17%"
Using survey responses from this CSV file, we can calculate the confidence interval for proportions in Excel.
When dealing with categorical data in Excel, you can use the COUNTIF() function to find the number of cells in your sample that match a particular value. The first parameter is the data and the second parameter is the condition. Note that the second parameter must be enclosed in quotes.
For example, with the CSV file linked above, the data in cells A2:A33 is either "Mac" or "PC". This will return the number of Mac users:
=COUNTIF(A2:A33, "Mac")
You can then get the proportion (percentage) of Mac cells with:
=COUNTIF(A2:A33, "Mac") / COUNTA(A2:A33)
If you put your proportion in cell B2 and your sample size in cell B3, the standard error for proportions is:
=SQRT(B2 * (1 - B2) / B3)
If you then put the Z-score for your level of confidence in cell B4 (for 95% confidence, this is 1.96), the Excel formula for margin of error for proportions is:
=B4 * SQRT(B2 * (1 - B2) / B3)
Z-Scores For Other Levels of Confidence
The Z-score for a 95% level of confidence is 1.96 standard errors.
In R, the qnorm() function (the normal curve quantile function) can be used to calculate the z-score for other levels of confidence. The parameter is a bit confusing because qnorm() takes a cumulative probability (the area under the curve to the left of the desired z-score) rather than the confidence level itself:
> confidence = c(0.99, 0.95, 0.9, 0.8)
> z = qnorm(1 - ((1 - confidence) / 2))
> paste0("Z scores for ", confidence * 100, "% confidence interval = ", round(z, 2))
 "Z scores for 99% confidence interval = 2.58"
 "Z scores for 95% confidence interval = 1.96"
 "Z scores for 90% confidence interval = 1.64"
 "Z scores for 80% confidence interval = 1.28"
In Excel, z-scores for other levels of confidence can be calculated with the NORMSINV() function. With the level of confidence in cell B2 (as a proportion between 0 and 1):
=NORMSINV(1 - ((1 - B2) / 2))
Confidence Intervals With Small Sample Sizes (the t-distribution)
When the sample size is 30 or fewer, the possibility of large errors increases and the distribution of errors follows a Student's t-distribution rather than a normal distribution.
The t-distribution also requires the number of degrees of freedom (the size of the sample minus one) for calculation.
The density graph compares the normal distribution with an analogous t distribution with 10 degrees of freedom. Note that the tails of the t distribution are thicker, which represents the additional uncertainty associated with small samples compared to large sample sizes. As degrees of freedom increase, the t distribution begins to approximate the normal distribution.
T is used instead of the z-score to calculate a margin of error for sample means or proportions:
x̄ ± t * σx̄
p ± t * σp
Confidence interval for means:
> # Simulated small sample
> set.seed(0)
> heights = rnorm(15, 69.2, 2.8)
>
> # 95% confidence is two-tailed, so use the 0.975 quantile of the t distribution
> degrees_freedom = length(heights) - 1
> t = qt(0.975, degrees_freedom)
> stderr = sd(heights) / sqrt(length(heights))
> moe = t * stderr
>
> paste("Estimated average height is", round(mean(heights), 1), "inches +/-", round(moe, 1), "inches")
 "Estimated average height is 69.5 inches +/- 1.7 inches"
Confidence interval for proportions: Using the example above of a survey where 40% of respondents indicated using a Mac rather than a PC for their home computer or laptop, we can see the wider margin of error for proportions associated with a small sample:
> mac = 0.4
>
> # Small sample (t distribution); 95% confidence is two-tailed, so use the 0.975 quantile
> sample_size = 10
> stderr = sqrt(mac * (1 - mac) / sample_size)
> t = qt(0.975, sample_size - 1)
> moe = t * stderr
>
> paste0("Estimated Mac users (small sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
 "Estimated Mac users (small sample) = 40% +/- 35%"
>
> # Large sample (normal distribution)
> sample_size = 100
> stderr = sqrt(mac * (1 - mac) / sample_size)
> moe = 1.96 * stderr
>
> paste0("Estimated Mac users (large sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
 "Estimated Mac users (large sample) = 40% +/- 9.6%"
In Excel, the TINV() function can be used to calculate t for use with standard error to calculate a confidence interval.
To calculate the margin of error for means given:
- A level of confidence in cell B2
- The standard deviation in cell B3
- The sample size in cell B4
=TINV(1 - B2, B4 - 1) * B3 / SQRT(B4)
To calculate the margin of error for proportions given the sample proportion in cell B2, the level of confidence in cell B3, and the sample size in cell B4:
=TINV(1 - B3, B4 - 1) * SQRT(B2 * (1 - B2) / B4)
Excel Confidence Interval Functions
As you might expect, Excel has functions to simplify confidence intervals for means. CONFIDENCE.NORM() can be used to calculate margin of error with sample sizes over 30 and CONFIDENCE.T() can be used with sample sizes of 30 or fewer. Both take the same parameters:
=CONFIDENCE.NORM(alpha, stdev, sample_size)
=CONFIDENCE.T(alpha, stdev, sample_size)
alpha = 1 minus level of confidence (0.05 for 95% confidence level)
stdev = standard deviation of the sample
sample_size = count of values in the sample
For example, given a level of confidence in cell B2, the standard deviation in cell B3, and the sample size in cell B4, the margin of error (for a sample size of 30 or fewer) using the t-distribution function is:
=CONFIDENCE.T(1 - B2, B3, B4)
Excel does not have a convenience function for confidence interval for proportions.
How Big a Sample Do I Need to Have the Confidence I Want?
Using simple algebra, we can transform the formulas for margin of error to find the sample size (n) that we need to get the margin of error (E) that we are able to tolerate:
For population mean estimates:
n = (Z * s / E)²
For population proportion estimates:
n = Z² * p * (1 - p) / E²
Estimating s and p
For the mean formula, you need an estimate of the sample standard deviation (s), and for the proportion formula you need the sample proportion (p). But since you have not yet done the sampling, you cannot know these values.
There are three imperfect but practical ways to estimate these values:
- Two-stage sampling design involves doing a preliminary survey to get an estimate of the standard deviation (s) or the proportion (p)
- For means or proportions, when available, you can use results from a prior survey or estimate
- For proportions you can use p = 50% as a worst case scenario. This causes the p * (1 - p) part of the formula to be its maximum possible value, so the calculated sample size is the largest possible value
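The worst-case approach can be sketched as a small Python function (the function name and margins of error below are my own choices for illustration):

```python
import math

def sample_size_for_proportion(z, p, e):
    """Minimum sample size to estimate a proportion p with margin of error e."""
    return math.ceil(z**2 * p * (1 - p) / e**2)

# Worst case: p = 0.5 maximizes p * (1 - p), so it yields the largest sample size.
# 95% confidence (z = 1.96) with a +/- 5% margin of error:
print(sample_size_for_proportion(1.96, 0.5, 0.05))  # → 385

# A prior estimate of p = 0.4 requires slightly fewer respondents:
print(sample_size_for_proportion(1.96, 0.4, 0.05))  # → 369
```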
Note that the examples below presume a random sample from the entire population you are basing the estimate upon. If you are trying to make estimates about Americans, your random sample would need to be drawn from a list of all Americans with 100% participation. Drawing a perfectly random sample from anything other than a trivial or captive population is usually impossible, requiring more sophisticated modeling techniques to adjust the results and compensate for segments of the population that were undersampled.
Example Sample Size Estimation in R
Following the examples using male height above, suppose you wish to estimate female height to within ±1 inch (E = 1) with 95% confidence (Z = 1.96). For s you need an estimate of the standard deviation of female height; here we reuse the 2.8 inches used for the simulated sample above:
> z = 1.96
> s = 2.8
> maxerr = 1
> n = (z * s / maxerr)^2
> paste("Minimum sample size =", ceiling(n))
 "Minimum sample size = 31"
Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (E = 0.05) at a 95% level of confidence (Z = 1.96). Using the estimated proportion of 40% from the survey given above:
> z = 1.96
> p = 0.4
> e = 0.05
> n = (z^2) * p * (1 - p) / (e^2)
> paste("Minimum sample size =", ceiling(n))
 "Minimum sample size = 369"
Example Sample Size Estimation in Excel
Following the examples using male height above, suppose you wish to estimate female height to within ±1 inch (E = 1) with 95% confidence (Z = 1.96). For s, reuse the standard deviation of 2.8 inches from the simulated sample above. The POWER() function is used for the exponent; round the result up to the next whole number:
=POWER(1.96 * 2.8 / 1, 2)
Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (E = 0.05) at a 95% level of confidence (Z = 1.96). Using the estimated proportion of 40% given above:
=POWER(1.96, 2) * 0.4 * (1 - 0.4) / POWER(0.05, 2)