# Sampled Data Exercise

We often want to know something about a group of people (such as how voters in a state feel about a political candidate) or about some geographic phenomenon that covers a large area (such as the volume of an underground resource). However, it is often too difficult or expensive to survey each and every member of a group or dig up every square inch of a resource to get full population data.

In cases like this, we take a smaller group (a sample) from the population and use statistical calculations to tell us how much confidence we can have that the data from our sample represents the whole population.

This tutorial describes a class exercise in sampled data that involves taking a class survey on some characteristic or opinion. The class represents a sample that can be used to make an inference about the broader student population.

Note that this particular type of sampling is actually convenience sampling, because we are choosing survey respondents based on whether they are conveniently available.

However, for the purposes of this exercise we will evaluate the results as if this were a random sample of all students in our institution, and we will use that sample to estimate a characteristic of that population.

## Survey Question

During class, you will be asked to form teams of two or three people:

• As a team, you will brainstorm two questions (a primary and a backup) that you could ask each member of the class
• Your question should yield a categorical (dichotomous or multichotomous), count, or amount variable
• Your question should be something that you would not mind being asked by a stranger with uncertain motives. Accordingly, you will probably want to avoid questions on topics like health history, political perspective, religious affiliation, financial status, sexuality, etc.
• We will go around the room and declare our questions so each question is unique in the class. During group discussion you should have a primary question and a backup, in case your primary is taken by an earlier team
• Despite those restrictions, it would be helpful if you can find a question that is at least remotely related to your occupational area
• While these questions may not be particularly profound, they may yield interesting or amusing results

Example questions from past variants of this exercise have included:

1. Laptop operating system: Mac OS, Windows, none (multichotomous)
2. Distance in miles from your home town to school: 0 - x (amount)
3. Number of pets: 0 - x (count)

On a sheet of paper (for submission at the end of class), note the following:

2. The full names of your team members

## Capture

In order to preserve participant privacy, we will anonymize our data, identifying subjects by their home zip code.

During class, at least one member of your team will circulate around the room, gathering the following responses from each class member (including themselves):

1. Their zip code
2. Their response to the question for your team

## Processing

At the conclusion of the capture class, each member of the class should take a smartphone picture of the survey data (if done on paper) or get an e-mail of that data.

At least one member of your team (and preferably all members of the team) should enter the survey responses into a Google Sheets spreadsheet. Your spreadsheet should have the following columns:

1. Their zip code
2. Their response to the question for your team
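
If you later want to work with the same two columns outside of a spreadsheet, the layout can be sketched in Python as shown here. This is only an illustration: the column names `zip_code` and `response` and the three rows are made up, and the helper `load_responses` is hypothetical. The one design point worth noting is that zip codes are kept as text, so a leading zero is not lost.

```python
import csv
import io

# Hypothetical survey data (made-up rows, not real class responses).
# Column names zip_code and response are assumptions for this sketch.
SAMPLE_CSV = """zip_code,response
02139,2
61801,0
10027,1
"""


def load_responses(csv_text):
    """Parse survey rows, keeping zip codes as text so leading zeros survive."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["zip_code"], row["response"]) for row in reader]


rows = load_responses(SAMPLE_CSV)
for zip_code, response in rows:
    print(zip_code, response)
```

In a spreadsheet, the equivalent precaution is to format the zip code column as plain text before entering values.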

## Analysis: Counts or Amounts

### Descriptive Statistics

If your variable is an amount or count, calculate the descriptive statistics for your values:

• Use the count() function to count the number of responses
• Use the max() function to find the maximum value in your responses
• Use the average() function to find the mean value for your responses
• Use the median() function to find the median value in your responses
• Use the min() function to find the minimum value in your responses
• Remove unnecessary significant digits so the displayed precision reflects the accuracy of the data
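
For comparison, the same descriptive statistics can be sketched in Python. The response values below are invented for illustration (they are not class data), and the helper name `describe` is hypothetical.

```python
import statistics

# Hypothetical number-of-pets responses (made-up values for illustration).
responses = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]


def describe(values):
    """Return the same statistics as count(), max(), average(), median(), min()."""
    return {
        "count": len(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
    }


summary = describe(responses)
for name, value in summary.items():
    # Round for display so the precision does not overstate the data's accuracy.
    print(f"{name}: {round(value, 1)}")
```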

### Confidence Interval

Because this is a survey, there is a possibility that our sample does not match the overall population. For this number-of-pets example, we may have accidentally gotten a group of respondents that have an unusually high (or low) number of pets.

Accordingly, you always need to present the results from a sample with a margin of error and a level of confidence that you have in that margin of error.

The range of values around your mean, plus or minus your margin of error, is called the confidence interval.

A commonly-used estimate for the margin of error for means is:

margin_of_error = 1.96 * sample_mean / √sample_count

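The 1.96 is the z-score for a 95% level of confidence, and the mean divided by the square root of the sample count is used here as a rough estimate of the standard error. (The conventional standard error of a mean is the sample standard deviation, stdev() in a spreadsheet, divided by the square root of the sample count.)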
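
As a minimal sketch of the arithmetic, the rough estimate given above (1.96 times the mean, divided by the square root of the sample count) can be written in Python. The mean of 1.1 and the count of 25 are made-up inputs, and the function names are hypothetical.

```python
import math


def mean_margin_of_error(sample_mean, sample_count, z=1.96):
    """Rough margin of error from this tutorial: z * mean / sqrt(count)."""
    return z * sample_mean / math.sqrt(sample_count)


def confidence_interval(center, margin):
    """Return the (low, high) bounds of the interval around center."""
    return center - margin, center + margin


# Hypothetical inputs: a mean of 1.1 pets from 25 respondents.
mean = 1.1
count = 25

margin = mean_margin_of_error(mean, count)
low, high = confidence_interval(mean, margin)
print(f"margin of error: {margin:.2f}")
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

The z-score is a parameter so that a different confidence level could be substituted, although this exercise only uses 1.96.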

You can use spreadsheet formulas to calculate these values for your survey:

• Add a Margin of Error column and use the sqrt() function to calculate using the formula given above: =1.96 * mean / sqrt(count)
• Add a Confidence Interval Low column and subtract the margin of error from the mean to get the low values in the 95% confidence interval
• Add a Confidence Interval High column and add the margin of error to the mean to get the high values in the 95% confidence interval

Interpreting the values in the example, we can say with a 95% level of confidence that the average number of pets owned by students at this school is between 0.69 and 1.43.

### Google Maps Bubble Map

3. Use the zip code column to position the placemarks
4. Use the variable column to title your markers
5. Style the points by your variable column
6. Group by ranges so that different ranges of values display with different icons
7. Optional: Change the icons to bubble icons
8. Give your map a meaningful name
9. Share the map publicly
10. Copy the shared link to give to anyone you want to share the map with

While Google Maps does not directly support graduated bubble maps, bubble icons you can use with a map are given below. Right click on the icon, copy the icon image URL, and paste that URL as a custom icon.

### ArcGIS Online Bubble Map

Alternatively, to map your results in ArcGIS Online:

2. Create a new map in ArcGIS Online
4. For location, select the zip code column
5. For Choose an attribute to show, select your quantitative variable column
6. For Select a drawing style choose Counts and Amounts (Size)
7. If you don't like the default color or want to adjust the size of the bubbles, select OPTIONS

8. Save the map with a meaningful name
9. Share the map with everyone to get a link

## Analysis: Categorical Variables

### Descriptive Statistics

If your variable is categorical, calculate the count and percentage of respondents for each category:

• Add a Responses column with the different possible responses for your variable
• Add a Count column and use the countif() function to get the count of responses in your data that match each possible response category
• Add a Total row and use the sum() function to count the total number of responses
• Add a Percent column and divide the response counts by the total number of counts to get the percent of responses in each category
• Format the cells to display as percents rather than as decimal ratios
• Remove unnecessary significant digits so the displayed precision reflects the accuracy of the data
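
The counting steps above can also be sketched in Python, where `collections.Counter` plays roughly the role of countif(). The responses below are invented for illustration, and the helper name `tabulate` is hypothetical.

```python
from collections import Counter

# Hypothetical laptop operating system responses (made-up values).
responses = [
    "Mac OS", "Windows", "Mac OS", "none",
    "Windows", "Mac OS", "Mac OS", "Windows",
]


def tabulate(values):
    """Map each category to (count, share of total), like countif() and sum()."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


table = tabulate(responses)
for category, (n, share) in table.items():
    # Display shares as whole percents rather than decimal ratios.
    print(f"{category}: {n} ({share:.0%})")
```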

### Confidence Interval

Because this is a survey, there is a possibility that our sample does not match the overall population. For this Mac/PC survey example, we may have accidentally gotten an unusually high number of Mac users or an unusually low number of PC users.

Accordingly, you always need to present the results from a sample with a margin of error and a level of confidence that you have in that margin of error.

The range of values around your percent values plus or minus your margin of error is called the confidence interval.

A commonly-used estimate for the margin of error for proportions is:

margin_of_error = 1.96 * √(percent * (1 - percent) / sample_count)

The 1.96 is a z-score for a 95% level of confidence, and the calculations under the radical are a rough estimate for standard error.
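
As a minimal sketch, the estimate can be written in Python with the radical covering the whole fraction, percent * (1 - percent) / sample_count. The 50% share and the 40 respondents are made-up inputs, and the function name is hypothetical. Note that the percent is entered as a decimal ratio (0.5), not as 50.

```python
import math


def proportion_margin_of_error(percent, sample_count, z=1.96):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(percent * (1 - percent) / sample_count)


# Hypothetical inputs: 50% of 40 respondents chose one category.
percent = 0.5
count = 40

margin = proportion_margin_of_error(percent, count)
low = percent - margin
high = percent + margin
print(f"margin of error: {margin:.1%}")
print(f"95% confidence interval: {low:.1%} to {high:.1%}")
```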

You can use spreadsheet formulas to calculate these values for your survey:

• Add a Margin of Error column and use the sqrt() function to calculate using the formula given above
• Add a Confidence Interval Low column and subtract the margin of error from each percent to get the low values in the 95% confidence interval
• Add a Confidence Interval High column and add the margin of error to each percent to get the high values in the 95% confidence interval

Interpreting the values in the example, we can say with a 95% level of confidence that the percent of students in the broader student population that own Macintosh laptops is between 38% and 70%.

### Google Maps Categorical Map

3. Style the points by your categorical variable column
4. Give your map a meaningful name
5. Share the map publicly
6. Copy the shared link to give to anyone you want to share the map with

### ArcGIS Online Categorical Map

Alternatively, to map your results in ArcGIS Online: