# Sampling: How Much Can I Know With Only A Limited Amount of Data?

We rarely have complete data about something we want to know. For example, we often want to know who will win a political election, but we do not have the time to ask every potential voter whom they plan to vote for, and people's attitudes change during a campaign.

However, it is possible to get the opinions of a small number of people and make an estimate of the possible outcomes of the election.

That small number of people is called a **sample**, and the estimate
we make about the broader population is called an **inference**.
The use of statistical techniques to make inferences
is called **inferential statistics**.

## Sampling Concepts

### Types of Sampling

The method used to determine what subset of people or things you
base your inference on is called **sampling**.

There are two broad types of sampling techniques, and a number of subtypes within those broad types.

**Probability sampling** is when any member
of the target population has a known probability of being included in
the sampled population. Such techniques include:

- **Simple random sampling** involves randomly selecting members of the population so that every member has an equal probability of being chosen. For example, a random number generator (like the Excel RAND() function) can be used to select a subset of numbers from a phone number list that covers the population. In the natural sciences, an example might be placing remote cameras at random locations in an area when trying to get a count of the population of a specific type of animal in that area
- **Systematic sampling** involves selecting members from the target population at a regular interval. For example, every 10th number from a sorted list of addresses
- **Stratified sampling** involves taking random samples from different subgroups of the target population and then using known information about the size or characteristics of those subgroups to adjust the results. For example, if studying the opinions of college students toward a university policy, different classes (freshmen, sophomores, etc.) or majors might have different opinions about that policy. Separating calculations into different subgroups can be useful for discerning the differences between those groups
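The difference between these selection rules is easy to see in code. Here is a minimal Python sketch; the population list, the subgroup labels, and the sample sizes are all invented for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: 100 entries from a phone list
population = [f"person-{i}" for i in range(100)]

# Simple random sampling: every member has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sampling: every 10th member of the sorted list
systematic = population[::10]

# Stratified sampling: a random sample drawn within each subgroup,
# here using invented class-year strata
strata = {
    "freshmen": population[:40],
    "sophomores": population[40:70],
    "juniors": population[70:],
}
stratified = {name: random.sample(group, 5) for name, group in strata.items()}

print(simple)
print(systematic)
print(stratified)
```

In a real stratified design, the per-group results would then be weighted by the known size of each subgroup in the population.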

**Nonprobability sampling** is used in cases where probability sampling is impractical.
While these techniques are not as reliable as probability sampling techniques
for making inferences about the broader population,
results from these samples can offer suggestions of research
paths that would justify the expense and difficulty of further
probability sampling. Some common subtypes of nonprobability sampling include:

- **Convenience sampling** involves sampling a group of people you can conveniently access. For example, American college students are some of the most studied populations in history. College students tend to be young and from a specific set of social classes, so they cannot be said to be a random sample of the broader population. However, college students are readily accessible to the academic researchers who are their professors, and they often are happy to be research subjects in exchange for pizza or other relatively small amounts of compensation
- **Purposive sampling** is a variant on convenience sampling that is commonly used when information about a specific subgroup of the broader population is needed. For example, if information about the opinions of men is needed, volunteers might stand on a busy street corner and specifically target only men that pass by for further inquiry (and likely rejection)
- **Snowball sampling** involves asking sampled subjects for recommendations of other people that might fit a specific profile. For example, if studying a relatively small ethnic group, asking a subject for a recommendation on friends and family members that might participate would make it easier to find other members of that ethnic group

### Central Limit Theorem

If we sample a characteristic of a population that can be
represented as a continuous variable, we can take the mean of that sample
and get the **sample mean**. But how confident can we be that
sample mean is anywhere close to the **population mean**?

A curious and wondrous fact about random sampling is that if you take
multiple random samples and compute the arithmetic mean of some
population characteristic for each, the distribution of those sample means
around the population mean will approximate a **normal distribution**.

This is the **central limit theorem**, which was originally
proposed by French mathematician Abraham de Moivre in 1733 and later
developed by French mathematician Pierre-Simon Laplace in 1812 and
Russian mathematician Aleksandr Lyapunov in 1901. What it means is
that you can use standard deviation and the rules of probability to
calculate
**confidence intervals** and assess how
reliable your sampling results are. The central limit theorem is one
of the most important theorems in statistics.

Within a normal distribution, we know that around 68% of the values are within one standard deviation of the mean, around 95% are within two standard deviations, and around 99.7% are within three standard deviations.

Therefore, we can be 68% certain that our sample mean
is within one standard deviation of the population mean,
95% certain that it is within two standard deviations,
and 99.7% certain that it is within three standard deviations.
Since what we are dealing with here is
a potential amount of error between the sample mean and
the population mean, this standard deviation of the error
is called the **standard error**.

### Standard Error

The **standard error** is used to estimate how far the sample mean
deviates from the actual population mean.

Standard error for a sample mean (x̄) is calculated as the standard deviation of the population (σ) divided by the square root of the sample size (n). The greater the sample size, the closer the sample mean can be expected to be to the actual population mean.

σ_{x̄} = σ / √n

Since we rarely know the standard deviation of the population (which is why we're sampling in the first place) we estimate the standard error using the standard deviation of the sample (s):

σ_{x̄} = s / √n
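As a minimal sketch (in Python, with a made-up sample and a helper name of my own), the estimate is a one-liner:

```python
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean: s / sqrt(n).

    Uses the sample standard deviation because the population
    standard deviation is usually unknown.
    """
    return statistics.stdev(sample) / len(sample) ** 0.5

# Made-up sample of heights in inches
sample = [63.3, 65.0, 67.2, 68.4, 69.1, 70.5, 71.8, 73.0]
print(round(standard_error(sample), 2))
```

Note that because of the square root, quadrupling the sample size only halves the standard error.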

Uses for standard error are described below.

## Confidence Intervals For Means

The **level of confidence** you want determines how
many standard errors above or below the mean you are willing
to accept. Commonly used levels of confidence include 90%,
95%, and 99%. Which one of these is acceptable depends
on how accurate you need your estimates to be.

In a normal distribution, we know that 95% of the
values are within 1.96 standard deviations (a **z-score** of 1.96)
above or below the mean.

Accordingly, if we have a random sample, we can be 95%
confident that the actual population mean is within 1.96
**standard errors** above or below the sample mean (x̄). That range is the
**confidence interval**:

x̄ ± 1.96 * σ_{x̄}

Confidence interval is sometimes described in terms of **margin of error**.
The confidence interval is the whole range of values above and below the
mean; the margin of error is the difference between the mean and the
bottom or top of the confidence interval (the z-score times the standard error).

### R Example

Using a simulated sample of heights from 100 American men in this CSV file, the confidence interval based on standard error can be calculated in R:

```
> heights = read.csv("simulated-male-height.csv")$Inches
> print(heights)
  [1] 63.3 63.6 63.8 64.1 64.3 64.5 64.8 65.0 65.2 65.4 65.6 65.7 65.9 66.0 66.1
 [16] 66.2 66.3 66.5 66.6 66.7 66.8 66.9 67.0 67.1 67.2 67.3 67.4 67.5 67.6 67.7
 [31] 67.7 67.8 67.9 68.0 68.1 68.1 68.2 68.3 68.3 68.4 68.5 68.6 68.6 68.7 68.8
 [46] 68.8 68.9 69.0 69.0 69.1 69.2 69.2 69.3 69.4 69.5 69.5 69.6 69.7 69.8 69.8
 [61] 69.9 70.0 70.1 70.2 70.3 70.3 70.4 70.5 70.6 70.7 70.8 70.9 71.0 71.1 71.2
 [76] 71.3 71.4 71.5 71.6 71.7 71.8 72.0 72.1 72.2 72.3 72.4 72.5 72.7 72.8 73.0
 [91] 73.2 73.4 73.6 73.9 74.1 74.3 74.6 74.8 75.1 75.3
> n = length(heights)
> stderr = sd(heights) / sqrt(n)
> moe = round(1.96 * stderr, 2)
> paste("Estimated average height for men =", mean(heights), "inches +/-", moe, "inches")
[1] "Estimated average height for men = 69.235 inches +/- 0.56 inches"
```

### Excel Example

Using a simulated sample of heights from 100 American men in this CSV file, this formula can be used to calculate standard error:

=STDEV(A2:A101) / SQRT(COUNTA(A2:A101))

To calculate the **margin of error for means** given:

- That standard error in cell B2
- The z-score for a 95% level of confidence (z = 1.96) in cell B3

=B3 * B2

## Confidence Intervals for Proportions

If your sample data is dichotomous (e.g. Mac vs. PC),
or categorical data that can be expressed as dichotomous (e.g. whether or not
a person's favorite fruit is the apple), what you
are estimating is the **population proportion** in
each group (x% use Mac, y% use PC). In that case,
the confidence interval is:

p ± Z * σ_{p}

based on the standard error for proportions:

σ_{p} = √(p * (1 - p) / n)

Where:

- *Z* is the z-score for the desired level of confidence (1.96 for a 95% level of confidence)
- *p* is the proportion from the sample
- *n* is the size of the sample

For example, in a survey of 32 people asking what operating system their home computer or laptop uses, 13 (40.6%) said Mac OS and 19 (59.4%) said PC (Windows). To estimate the proportion of Mac users in the general population:

σ_{p} = √(0.406 * (1 - 0.406) / 32)
= 0.087

p ± 1.96 * 0.087

40.6% ± 17%
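The arithmetic above can be double-checked with a few lines of Python (values taken from the survey example):

```python
from math import sqrt

p = 13 / 32  # sample proportion of Mac users (40.6%)
n = 32       # sample size
z = 1.96     # z-score for a 95% level of confidence

standard_error = sqrt(p * (1 - p) / n)
margin_of_error = z * standard_error

print(round(standard_error, 3))      # 0.087
print(round(margin_of_error * 100))  # 17 (percentage points)
```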

### R Example

Using survey responses from this CSV file, we can calculate the confidence interval for proportions in R:

```
> # Read survey results
> os = read.csv("operating-system-survey.csv")$OS
> print(os)
 [1] Mac PC  PC  Mac PC  Mac PC  Mac PC  PC  PC  PC  Mac PC  PC  PC  Mac Mac PC
[20] PC  PC  PC  Mac Mac Mac Mac PC  PC  Mac Mac PC  PC
Levels: Mac PC
>
> # Get the proportion of people using a Mac
> mac = sum(os == 'Mac') / length(os)
>
> # Calculate standard error
> stderr = sqrt(mac * (1 - mac) / length(os))
> moe = 1.96 * stderr
> paste0("Estimated Mac users = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users = 40.6% +/- 17%"
```

### Excel Example

Using survey responses from this CSV file, we can calculate the confidence interval for proportions in Excel.

When dealing with categorical data in Excel, you can use the COUNTIF() function to find the number of cells in your sample that match a particular value. The first parameter is the data and the second parameter is the condition. Note that the second parameter must be enclosed in quotes.

For example, with the CSV file linked above, the data in cells A2:A33 is either "Mac" or "PC". This will return the number of Mac users:

=COUNTIF(A2:A33, "Mac")

You can then get the proportion (percentage) of Mac users with:

=COUNTIF(A2:A33, "Mac") / COUNTA(A2:A33)

If you put your proportion in cell B2
and your sample size in cell B3,
the **standard error for proportions** is:

=SQRT(B2 * (1 - B2) / B3)

If you then put the Z-score for your level of
confidence in cell B4 (for 95% confidence, this is 1.96), the Excel formula for
**margin of error for proportions** is:

=B4 * SQRT(B2 * (1 - B2) / B3)

## Z-Scores For Other Levels of Confidence

The Z-score for a 95% level of confidence is 1.96 standard errors.

In R, the **qnorm()** function (the normal-curve quantile function) can
be used to calculate the z-score for other levels of confidence.
The math for the parameter is a bit confusing because the probability
excluded by the level of confidence must be split between the two tails
of the distribution:

```
> confidence = c(0.99, 0.95, 0.9, 0.8)
> z = qnorm(1 - ((1 - confidence) / 2))
> paste0("Z scores for ", confidence * 100, "% confidence interval = ", round(z, 2))
[1] "Z scores for 99% confidence interval = 2.58"
[2] "Z scores for 95% confidence interval = 1.96"
[3] "Z scores for 90% confidence interval = 1.64"
[4] "Z scores for 80% confidence interval = 1.28"
```

In Excel, z-scores for other levels of confidence can be calculated with the NORMSINV() function. With the level of confidence in cell B2 (on a scale of 0 to 1):

=NORMSINV(1 - ((1 - B2) / 2))
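For comparison, Python's standard library can compute the same two-tailed z-scores with statistics.NormalDist (no external packages needed); the quantile math mirrors the R and Excel formulas above:

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-tailed z-score: half of the excluded probability in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for confidence in (0.99, 0.95, 0.90, 0.80):
    print(f"{confidence:.0%} confidence: z = {z_for_confidence(confidence):.2f}")
```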

## Confidence Intervals With Small Sample Sizes (the t-distribution)

When **the sample size is 30 or fewer**, the possibility of large errors
increases and the distribution of errors follows a **Student's
t-distribution** rather than a normal distribution.

The t-distribution also requires the number of **degrees of freedom**
(the size of the sample minus one) for calculation.

A density graph comparing the normal distribution with an analogous t-distribution with 10 degrees of freedom shows that the tails of the t-distribution are fatter, representing the additional uncertainty associated with small samples compared to large samples. As the degrees of freedom increase, the t-distribution begins to approximate the normal distribution.

*t* is used instead of the z-score to calculate a margin of error for sample means or proportions:

x̄ ± t * σ_{x̄}

p ± t * σ_{p}

### R Examples

Confidence interval for means:

```
# Simulated small sample
set.seed(0)
heights = rnorm(15, 69.2, 2.8)

# 95% confidence: use the 0.975 quantile so that 2.5% of the
# probability is excluded in each tail
degrees_freedom = length(heights) - 1
t = qt(0.975, degrees_freedom)
stderr = sd(heights) / sqrt(length(heights))
moe = t * stderr

paste("Estimated average height is", round(mean(heights), 1),
    "inches +/-", round(moe, 1), "inches")
```

Confidence interval for proportions: Using the example above of a survey where 40% of respondents indicated using a Mac rather than a PC for their home computer or laptop, we can see the wider margin of error for proportions associated with a small sample:

```
> mac = 0.4
>
> # Small sample (t distribution): 0.975 quantile for 95% two-tailed confidence
> sample_size = 10
> stderr = sqrt(mac * (1 - mac) / sample_size)
> t = qt(0.975, sample_size - 1)
> moe = t * stderr
>
> paste0("Estimated Mac users (small sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users (small sample) = 40% +/- 35%"
>
> # Large sample (normal distribution)
> sample_size = 100
> stderr = sqrt(mac * (1 - mac) / sample_size)
> moe = 1.96 * stderr
>
> paste0("Estimated Mac users (large sample) = ", round(mac * 100, 1), "% +/- ", round(moe * 100, 1), "%")
[1] "Estimated Mac users (large sample) = 40% +/- 9.6%"
```

### Excel Example

In Excel, the **TINV()** function can be used to calculate *t*
for use with standard error to calculate a confidence interval.
Note that TINV() takes a two-tailed probability, so its first
parameter is 1 minus the level of confidence.

To calculate the **margin of error for means** given:

- A level of confidence in cell B2
- The standard deviation in cell B3
- The sample size in cell B4

=TINV(1 - B2, B4 - 1) * B3 / SQRT(B4)

To calculate the **margin of error for proportions** given the proportion in cell B2, the level of confidence in cell B3, and the sample size in cell B4:

=TINV(1 - B3, B4 - 1) * SQRT(B2 * (1 - B2) / B4)

### Excel Confidence Interval Functions

As you might expect, Excel has functions to simplify confidence intervals for means. CONFIDENCE.NORM() can be used to calculate margin of error with sample sizes over 30 and CONFIDENCE.T() can be used with sample sizes of 30 or fewer. Both take the same parameters:

=CONFIDENCE.NORM(alpha, stdev, sample_size)

=CONFIDENCE.T(alpha, stdev, sample_size)

- alpha = 1 minus the level of confidence (0.05 for a 95% level of confidence)
- stdev = the standard deviation of the sample
- sample_size = the count of values in the sample

For example, given a level of confidence in cell B2, the standard deviation in cell B3, and the sample size in cell B4, the margin of error (for a sample size less than 30) using the t-distribution function:

=CONFIDENCE.T(1 - B2, B3, B4)

Excel does not have a convenience function for confidence interval for proportions.

## How Big a Sample Do I Need to Have the Confidence I Want?

Using simple algebra, we can transform the formulas for margin of
error to find the **sample size** (*n*) that we need to get the
margin of error (*E*) that we are able to tolerate:

For **population mean** estimates:

n = (Z * s / E)^{2}

For **population proportion** estimates:

n = Z^{2} * p * (1 - p) / E^{2}
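Both formulas translate directly into code. A minimal Python sketch (the function names are my own):

```python
from math import ceil

def sample_size_for_mean(z, s, e):
    """Minimum n to estimate a mean within +/- e, given a stdev estimate s."""
    return ceil((z * s / e) ** 2)

def sample_size_for_proportion(z, p, e):
    """Minimum n to estimate a proportion within +/- e, given an estimate p."""
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

# A proportion of 40%, a margin of error of 5%, and 95% confidence
print(sample_size_for_proportion(1.96, 0.4, 0.05))  # 369
```

The ceiling is used because the sample size must be a whole number at least as large as the computed *n*.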

### Estimating *s* and *p*

For the mean formula, you need an estimate of the standard deviation (*s*),
and for the proportion formula you need the sample proportion (*p*). But since
you have not yet done the sampling, you cannot know these values.

There are three imperfect but practical ways to estimate these values:

- **Two-stage sampling design** involves doing a preliminary survey to get an estimate of the standard deviation (*s*) or the proportion (*p*)
- For means or proportions, when available, you can **use results from a prior survey or estimate**
- For proportions, you can **use *p* = 50% as a worst-case scenario**. This causes the *p* * (1 - *p*) part of the formula to be its maximum possible value, so the calculated sample size is the largest possible value

Note that the examples below presume a random sample from the entire population you are basing the estimate upon. If you are trying to make estimates about Americans, your random sample would need to be drawn from a list of all Americans with 100% participation. Drawing a perfectly random sample from anything other than a trivial or captive population is usually impossible, requiring more sophisticated modeling techniques to adjust the results and compensate for segments of the population that were undersampled.

### Example Sample Size Estimation in R

Following the examples using male height above, suppose you wish
to estimate female height to within ±1 inch (*E* = 1)
with 95% confidence (*Z* = 1.96). For *s* you use an estimated
standard deviation of 2.8 inches (the same value used for the
simulated heights above):

```
> z = 1.96
> s = 2.8
> maxerr = 1
> n = (z * s / maxerr)^2
> paste("Minimum sample size =", ceiling(n))
[1] "Minimum sample size = 31"
```

Following on the proportions example above, suppose you wish to
get a better estimate of the percent of Americans that use Macs as their
personal home computers or laptops ±5% (*E* = 0.05) at a 95% level of confidence (*Z* = 1.96).
Using the estimated proportion of 40% from the survey given above:

```
> z = 1.96
> p = 0.4
> e = 0.05
> n = (z^2) * p * (1 - p) / (e^2)
> paste("Minimum sample size =", ceiling(n))
[1] "Minimum sample size = 369"
```

### Example Sample Size Estimation in Excel

Following the examples using male height above, suppose you wish
to estimate female height to within ±1 inch (*E* = 1)
with 95% confidence (*Z* = 1.96). For *s* you use an estimated
standard deviation of 2.8 inches. The POWER() function
is used for the exponent:

=POWER(1.96 * 2.8 / 1, 2)

Following on the proportions example above, suppose you wish to get a better estimate of the percent of Americans that use Macs as their personal home computers or laptops ±5% (*E* = 0.05) at a 95% level of confidence (*Z* = 1.96). Using the estimated proportion of 40% given above:

=POWER(1.96, 2) * 0.4 * (1 - 0.4) / POWER(0.05, 2)