# Exploring Data in R: When To Use What Method

The methods you use to analyze data in R are determined by the characteristics of the data and the question(s) you seek to answer with that data.

This tutorial will introduce techniques for exploring and determining the characteristics of data, and methods that can answer different types of questions with those different types of data. This approach proceeds through a sequence of questions:

- How does my data relate to reality?
- How does my data relate to the units of measurement (distribution)?
- How does my data relate to itself (independence)?
- How does my data relate to other variables?
- What methods should I use?

## How Does My Data Relate to Reality?

The idealized world of the scientific method is **question-driven**,
with the collection and analysis of data determined by questions and hypotheses.

However, in some cases, especially at the early stages of research,
a **data-driven** approach may be more appropriate. This
reverses the question-driven process, with the data coming first
and animating a process of **exploratory data analysis**, which is the approach outlined in this tutorial.

### Metadata

The first thing you should do when you get new data is document where it came from, before you lose track of this information as your project proceeds and other priorities intervene. You may need this metadata in the future, especially if you plan on sharing or publishing the results of your analysis.

Metadata may include:

- The name of the individual(s) or organization(s) that generated the data
- Names and descriptions of individual variables, including units and scales
- Dates for capture and/or compilation
- URL(s) for the source(s) of the data
- Privacy, copyrights, or other restrictions on the use of the data

### Types of Variables

Structured data represents characteristics of entities as **variables**. This section uses
Mosteller and Tukey's (1977) seven-level typology for levels
of measurement, a more robust alternative to
Stanley Stevens' (1946)
traditional four-level typology (nominal, ordinal, interval, ratio).

- Qualitative Data (Nominal Data)
  - **Names** are proper nouns that refer to particular persons, places, or things. Examples: names of people, street addresses.
  - **Grades** (categorical data) are classifications. Examples: states, regions, biological species.
  - **Binaries** (dichotomous data) are grades with only two possible values. Examples: yes/no, high/low, present/absent.

- Quantitative Data (Continuous Data)
  - **Ranks** (ordinal data) are integers that order objects in a list starting with one. Examples: top-ten lists, Likert scales.
  - **Counted fractions** are percentage or proportion values bounded by zero and one, often based on categorical data. Examples: election results, successes vs. failures.
  - **Counts** are non-negative integers representing numbers of discrete items. Examples: population, species richness.
  - **Amounts** are non-negative real numbers. Examples: distance, volume, weight/mass, temperature (Kelvin).
  - **Balances** (interval data) are unbounded positive and negative values. Examples: latitude, longitude, temperature (Fahrenheit or Celsius).

### Population vs Sampled Data

Sampled data is an incomplete, but hopefully representative, subset of the broader population being analyzed. Many statistical techniques, such as tests, use the laws of probability to calculate the amount of certainty that observations made on sampled data represent that broader population.

In choosing what method(s) to use with your data, you should note whether the method is specifically designed for use with sampled data.

### Noise and Uncertainty (Fixed X)

Many statistical tests and methods rely on the assumption of **fixed X**,
where the data represents exact values and there is no random "noise"
involved (Zuur et al 2007, 53).

In reality, there is always noise and uncertainty associated with data, which layers additional probabilities on top of the analysis and makes the precise p-values and confidence intervals that statistical techniques produce less meaningful. Indeed, many phenomena are ontologically fuzzy, and the data used to represent those phenomena reflect the subjectivities of the researchers. This is especially true in the social sciences, where the phenomena are highly complex and research results often cannot be reproduced.

Results of data analysis should always acknowledge and report these uncertainties as well as the implications of these uncertainties.

## How Does My Data Relate to the Units of Measurement (Distribution)?

A **distribution** is the manner in which the values of a variable are
distributed across the range of possible values. The characteristics of a
variable's distribution will dictate the types of analysis that can be performed
using that variable.

### Example Data

Many of these examples use a collection of US county data that
you can download here, and
import using the **read.csv()** function:

```
> counties = read.csv("2017-county-data.csv", as.is=T)
> names(counties)
 [1] "FIPS"       "NAME"       "AREASQMI"   "ST"         "CBSA"
 [6] "CBSPOP2013" "POP2013"    "URBAN2013"  "URBAN2006"  "URBAN1990"
[11] "REGION"     "WIN2012"    "WIN2016"    "DEM2012"    "PCDEM2012"
[16] "GOP2012"    "PCGOP2012"  "DEM2016"    "PCDEM2016"  "GOP2016"
[21] "PCGOP2016"  "CHANGEDEM"  "CHANGEGOP"  "PCOBESITY"  "OBESITY"
[26] "MEDHHINC"   "MEDHOUSE"   "MEDAGE"     "POPSQM2013"
```

### Visualizing Distributions With Histograms and Bar Plots

The quickest way to get a general sense of the characteristics of a
continuous variable distribution is using a **histogram** generated with the **hist()**
function.

For example, this histogram of the percent of obese people across US counties approximately follows a normal curve:

```
> hist(counties$PCOBESITY)
```

The equivalent of a histogram for categorical data is the
**bar plot**, which consists of horizontal or vertical bars whose
height or width is determined by the count of values in each
category.

To create a bar plot, you must first create a **contingency table**
with the **table()** function
that counts the number of values in each category. You can then
feed that table into the **barplot()** function to draw a bar plot.

This example is a contingency table and bar plot of the rural-urban classification of counties used by the CDC, ranging from 1 for large central metropolitan areas to 6 for non-core (rural) areas.

The barplot() function returns the x coordinates of the center of each
bar. Those x coordinates can be fed into the **text()** function
to label the bars:

```
> y = table(counties$URBAN2013)
> y

   1    2    3    4    5    6
  68  368  372  358  641 1333

> x = barplot(y)
> text(x = x, y = y / 2, labels = y, font=2)
```

### Normality

Measurements of many phenomena follow a **normal distribution**, or a
bell curve, with values clustered around a central **mean** and becoming
less frequent as the values move higher or lower away from the mean.

Many of the methods used in statistics are based on the assumption that the
variables approximate a normal
distribution or can be transformed into something close to a normal
distribution. Tests that rely on variables with a known distribution are
called **parametric tests**.

```
> standard_deviations = seq(-3, 3, 0.1)
> norm = dnorm(standard_deviations)
> plot(standard_deviations, norm, type="l")
```

### Evaluating Normality With Q-Q Plots

The normality of a distribution can be visualized using
a **Q-Q plot** generated with the
**qqnorm()** plotting function, which plots the sample values on the y-axis
against the quantiles expected from a normal distribution on the x-axis. If the distribution
is normal, the dots will align diagonally with the line drawn with
the **qqline()** function.

```
qqnorm(counties$PCOBESITY)
qqline(counties$PCOBESITY)
```

The mean can be calculated with the **mean()** function, and
the standard deviation (which defines the width of the curve) with the
**sd()** function.

```
> mean(counties$PCOBESITY, na.rm=T)
[1] 30.94169
> sd(counties$PCOBESITY, na.rm=T)
[1] 4.459513
```

As we see for the obesity example, the line is close
but not perfectly diagonal, indicating there may be a slight **skew** (shift left or right)
and/or **kurtosis** (peakedness or flatness) relative to normal.

We can compare the **median()** (the value where half the values are
below and half the values are above) to the **mean()** to detect skew.

```
> mean(counties$PCOBESITY, na.rm=T)
[1] 30.94169
> median(counties$PCOBESITY, na.rm=T)
[1] 31.2
```

The **skewness()** and **kurtosis()** functions from the
**e1071** library can also be used to quantify those deviations, with zero values
meaning perfectly normal, and higher values indicating higher deviation
from normal.

The values below indicate the distribution is slightly skewed to the left (longer left tail) and is slightly more clustered around the mean than a perfectly normal distribution.

```
> library(e1071)
> skewness(counties$PCOBESITY, na.rm=T)
[1] -0.3890195
> kurtosis(counties$PCOBESITY, na.rm=T)
[1] 1.077061
```

### Normality Tests of Samples

When working with small samples (fewer than 40 or so values) from a larger population, there are test functions that can assess the probability that the sample was drawn from a normally-distributed population:

- **shapiro.test()**: the Shapiro-Wilk normality test. The null hypothesis is that the distribution IS normal, so high p-values (closer to 1) indicate the sample is consistent with a normal distribution, while low p-values (below 0.05) reject normality.
- **ks.test()**: the Kolmogorov-Smirnov test, which compares the sample to a reference distribution (here, "pnorm"). It is interpreted the same way: low p-values reject the null hypothesis of normality.
- **pearson.test()**: the Pearson chi-square normality test from the **nortest** library. Again, low p-values (closer to 0) reject the null hypothesis that the distribution is normal.

These tests do not work well with large numbers of values, but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used (Ghasemi and Zahediasl 2012).

```
> # Simulated random sample from a normal distribution
> x = rnorm(30)
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x
W = 0.98185, p-value = 0.8722

> ks.test(x, "pnorm", mean=mean(x), sd=sd(x))

	One-sample Kolmogorov-Smirnov test

data:  x
D = 0.095656, p-value = 0.9222
alternative hypothesis: two-sided

> library(nortest)
> pearson.test(x)

	Pearson chi-square normality test

data:  x
P = 9.4667, p-value = 0.09184
```

### Transformations

Skewed distributions can often be made normal (or close to normal)
using **transformations**.

For example, median household income is skewed to the right (median below the mean) by a handful of extremely wealthy counties:

```
> hist(counties$MEDHHINC)
> mean(counties$MEDHHINC, na.rm=T)
[1] 46839.65
> median(counties$MEDHHINC, na.rm=T)
[1] 45114
```

The logarithmic transformation using the **log()** function is commonly
used with distributions like income where there are a large number of
low values (poor people) and fewer higher values (rich people). The
**exp()** function can be used to transform the logarithmically-transformed
values back to the original units:

```
> par(mfrow=c(1,2))
> hist(log(counties$MEDHHINC))
> qqnorm(log(counties$MEDHHINC))
> qqline(log(counties$MEDHHINC))
> exp(mean(log(counties$MEDHHINC), na.rm=T))
[1] 45405.38
> exp(median(log(counties$MEDHHINC), na.rm=T))
[1] 45114
```

### Evaluating Non-Normal Distributions With Quantiles

Some variables are clearly not normally distributed. For example, core-based statistical areas (CBSA) are used by the US Census Bureau to delineate metropolitan areas. The populations of these areas tend to be fairly evenly distributed below around 200,000, although there are a handful of extremely large CBSAs around the country's largest cities. A logarithmic transformation makes the histogram clearer, but does not make the distribution normal.

```
> par(mfrow=c(1,2))
> hist(log(counties$CBSPOP2013))
> qqnorm(log(counties$CBSPOP2013))
> qqline(log(counties$CBSPOP2013))
> exp(mean(log(counties$CBSPOP2013), na.rm=T))
[1] 260334.6
> exp(median(log(counties$CBSPOP2013), na.rm=T))
[1] 187530
```

For distributions like this, division of values into quantiles using the
**quantile()** function gives a clearer description. The five quantile values
divide the distribution into four parts, with each part having an
equal number of elements. The quantile values are the dividing values between
parts. The 0% quantile is the minimum value, the 50% quantile is the median,
and the 100% quantile is the highest value:

```
> quantile(counties$CBSPOP2013, na.rm=T)
      0%      25%      50%      75%     100%
   13200    61029   187530   901700 19831858
```

## How Does My Data Relate to Other Variables?

Variables will usually be captured as related collections, and most statistical work involves either modeling or testing relationships between variables.

### Univariate Data

Univariate data is a single variable that stands alone.
R stores univariate data as **vectors**.

Univariate data is commonly analyzed with descriptive statistics like mean/median/mode, standard deviation, normality, etc.

Examples of univariate data include ages, weights, income, etc.

### Bivariate / Multivariate Data

Bivariate data is a pair of variables that are related to each other. Bivariate comparisons are often useful on their own or as part of the exploratory process with multivariate data.

Multivariate data is more than two variables that are related to each other,
often with one **dependent variable** and one or more **independent variables** that
influence the dependent variable.

A common form of multivariate data is **cross-sectional data**,
where a variety of variables are collected for multiple individuals
at a single point in time.

Differences between sampled variables are assessed with tests like the t-test, chi-square test, or ANOVA.

Relationships between variables are assessed with correlation or regression:

```
> plot(log(counties$MEDHOUSE), log(counties$MEDHHINC), col="#00000040")
> linear_model = lm(log(counties$MEDHHINC) ~ log(counties$MEDHOUSE))
> abline(linear_model, col="red2", lwd=3)
> summary(linear_model)

Call:
lm(formula = log(counties$MEDHHINC) ~ log(counties$MEDHOUSE))

Residuals:
     Min       1Q   Median       3Q      Max
-0.65517 -0.09869 -0.00429  0.09568  0.69201

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.732363   0.059120  113.88   <2e-16 ***
log(counties$MEDHOUSE) 0.609686   0.009021   67.58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1572 on 3137 degrees of freedom
  (94 observations deleted due to missingness)
Multiple R-squared:  0.5928,	Adjusted R-squared:  0.5927
F-statistic:  4567 on 1 and 3137 DF,  p-value: < 2.2e-16
```

### Time-Series Data

Time-series data consists of one or more variables that are observed over time.

```
# This is demonstration data included with R that consists of
# annual measurements of the level, in feet, of Lake Huron from 1875-1972
plot(LakeHuron)
```

### Spatial Data

Spatial data consists of one or more variables (also called fields or attributes) measured at points, along lines, within areas (polygons), or as grids (rasters) on the earth's surface. Spatial data is commonly visualized as maps.

```
library(rgdal)
counties = readOGR(".", "2017-county-data")
counties = counties[!(counties$ST %in% c("AK", "HI", NA)),]
usa_albers = CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
counties = spTransform(counties, usa_albers)
spplot(counties, zcol="WIN2012", col.regions=c("navy", "red2"))
```

### Panel Data

Panel data is multidimensional data, usually data that has multiple variables on multiple individuals (cross-sectional data) that are observed at multiple periods of time and/or locations on the surface of the earth.

Analysis of panel data is often performed by generating **mixed models**.
Because of the multidimensionality of the data, these models are often
difficult to describe and visualize.
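As a minimal sketch of a mixed model, the **lme4** package fits models with both fixed and random effects; this example uses lme4's built-in sleepstudy panel data (reaction times for multiple subjects measured over multiple days):

```r
# Panel data: repeated measurements of reaction time for each subject.
# lmer() fits a fixed effect for Days plus a random intercept and
# slope for each Subject.
library(lme4)
model = lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(model)
```

The `(Days | Subject)` term is what distinguishes this from an ordinary regression: each subject gets its own baseline and trend, while the fixed effect estimates the overall relationship.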

### Heteroskedasticity

**Homoskedasticity** means that the standard deviation of a distribution is the
same throughout its range. The opposite of homoskedasticity is **heteroskedasticity**,
and a heteroskedastic distribution can cause inaccuracies when used with
parametric tests and poor fit with parametric models.

Heteroskedasticity is of primary interest when comparing groups of sampled data.

For example, the following is a simulated sample of what different people in a downtown area paid for lunch, along with their annual income. People with lower incomes tend to be more budget-conscious, while wealthier people have a broader range of options and can choose both expensive and inexpensive options. When the variables are plotted in comparison to each other on an X-Y scatter plot, heteroskedasticity causes the plot to spread vertically as the values move to the right.

When comparing two different continuous variables like this, the
Breusch-Pagan heteroskedasticity test can be run using the **bptest()**
function from the **lmtest** library.
The low p-value in this example (below 0.05
for 95% confidence) means we can reject the null hypothesis of homoskedasticity.

```
> lunch = c(5, 0, 6, 8, 7, 10, 8, 5, 8, 8,
+     8, 10, 11, 15, 8, 8, 0, 5, 16, 10,
+     8, 9, 25, 8, 0, 40, 25, 8, 10, 10)
> income = c(35000, 34000, 40000, 42000, 50000, 52000, 53000, 53000, 54000, 56000,
+     56000, 60000, 61000, 61000, 65000, 70000, 71000, 71000, 75000, 76000,
+     76000, 80000, 82000, 90000, 100000, 110000, 110000, 120000, 120000, 120000)
> plot(log(income), lunch)
> library(lmtest)
> bptest(lunch ~ log(income))

	studentized Breusch-Pagan test

data:  lunch ~ log(income)
BP = 5.6513, df = 1, p-value = 0.01744
```

In contrast, if the distribution is homoskedastic, the points in the X-Y scatter will be closer to a line. For this example, we assume that everyone spends 1/100th of one percent of their annual income on lunch each day. The Breusch-Pagan p-value is high, indicating that we cannot reject the null hypothesis of homoskedasticity:

```
> percentage = income * 0.0001
> plot(log(income), percentage)
> bptest(percentage ~ log(income))

	studentized Breusch-Pagan test

data:  percentage ~ log(income)
BP = 0.29333, df = 1, p-value = 0.5881
```

If the samples are divided into discrete groups (categorical data),
the Bartlett test of homogeneity of variances can be run using
the **bartlett.test()** function. The boxplot shows a clear
difference in the range of values in each group, which is validated
by the low p-value, indicating that we should reject the null
hypothesis of homoskedasticity.

```
> wealthy = ifelse(income > 80000, "High", "Low")
> boxplot(lunch ~ wealthy)
> bartlett.test(lunch, wealthy)

	Bartlett test of homogeneity of variances

data:  lunch and wealthy
Bartlett's K-squared = 18.84, df = 1, p-value = 1.422e-05
```

## How Does My Data Relate to Itself (Independence)?

Many statistical tests and models are based on the assumption that values are independent of each other. Unlike normality and homoskedasticity, determination of independence requires consideration of what the data is measuring.

The opposite of independence is **autocorrelation**,
where values correlate with other values separated by a given interval.

### Temporal Autocorrelation

**Time-series** data is generally not independent. For example,
temperatures over time are usually related to each other - days with
similar temperatures tend to cluster together across the regular
cycles of seasons.
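Temporal autocorrelation can be visualized with the base **acf()** function, which plots the correlation of a series with itself at increasing lags; this quick sketch uses R's built-in LakeHuron annual time series:

```r
# Correlogram of annual Lake Huron levels: bars outside the dashed
# confidence bands indicate significant autocorrelation at that lag
acf(LakeHuron)
```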

Analysis of time-series data often involves **decomposition**
of the data into:

- A seasonal (cyclical) component
- A long-term trend
- The residuals, which may relate to associated variables

With temporal data, we are often looking for long-term trends or irregularities (deviations from the long-term pattern), so this type of decomposition with removal of the seasonal component is useful.

For example, this performs a time series decomposition on historic mean monthly temperatures at Dallas Love Field from the National Weather Service, which can be downloaded here. The CSV file is arranged with columns of months and rows of years (with an annual average in the last column), which requires some rearranging to turn into a univariate time series.

```
x = read.csv("dallas-avg-temp.csv")
x = x[,c(-1, -14)]
x = ts(as.vector(t(x)), start=c(1976, 1), frequency=12)
plot(stl(x, s.window="periodic"))
```

### Spatial Autocorrelation

Likewise, **spatial data** is also not independent.
Following
Tobler's First Law of Geography (which is more of a heuristic than a law),
like things and people tend to cluster together. This results in
**spatial autocorrelation**.

While we are often looking for areas of autocorrelation in spatial data (such as crime clusters or hotspots), we are also often looking for relationships of phenomena in space associated with underlying variables (such as the relationship of housing prices to neighborhood income) that necessitate consideration of autocorrelation with techniques like spatial regression.

For example, when creating a simple (non-spatial) bivariate regression model relating median monthly housing costs to median household income, mapping the residuals shows that the model consistently overestimates housing costs in the low-cost Great Plains states, while underestimating housing costs in high-cost coastal megaregions.

```
> library(rgdal)
> counties = readOGR(".", "2017-county-data")
> counties = counties[!(counties$ST %in% c("AK", "HI", NA)),]
> usa_albers = CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96
+    +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
> counties = spTransform(counties, usa_albers)
> counties = counties[!is.na(counties$MEDHHINC),]
> model = lm(MEDHOUSE ~ MEDHHINC, data=counties)
> summary(model)

Call:
lm(formula = MEDHOUSE ~ MEDHHINC, data = counties)

Residuals:
    Min      1Q  Median      3Q     Max
-716.60  -85.07   -1.51   85.26  615.17

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.300e+01  1.080e+01  -7.686 2.03e-14 ***
MEDHHINC     1.745e-02  2.241e-04  77.878  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 151.2 on 3104 degrees of freedom
Multiple R-squared:  0.6615,	Adjusted R-squared:  0.6614
F-statistic:  6065 on 1 and 3104 DF,  p-value: < 2.2e-16

> counties@data$RESIDUALS = residuals(model)
> spplot(counties, zcol="RESIDUALS")
```

## What Methods Should I Use?

Based on the exploration of your data outlined above, you should be able to make a rational decision about the appropriate method(s) to use for evaluating your data.

### Descriptive Statistical Methods: Central Tendency and Variation

#### Minimum / Maximum / Range

- The lowest and highest numbers in a set of values
- Data: A single, continuous variable with any distribution
- R Functions: min(), max(), range()
- Spreadsheet Functions: =MIN(), =MAX()
- Example Question: What are the highest and lowest stream levels observed during the study period?

#### Median

- The center value of any distribution, where half of the values are below the median and half are above
- Data: A single, continuous variable with any distribution
- R Function: median()
- Spreadsheet Function: =MEDIAN()
- Example Question: What is the median household income in a neighborhood?

#### Mean

- The center value of a normal distribution
- Data: A single normally-distributed, continuous variable
- R Function: mean()
- Spreadsheet Function: =AVERAGE()
- Example Question: What is the mean height of women in the USA?

#### Standard Deviation

- The parameter defining how the range of values are distributed in a normal distribution. 68.27% of the values are within one standard deviation of the mean and 95.45% of the values are within two standard deviations of the mean
- Data: A single normally-distributed, continuous variable
- R Function: sd()
- Spreadsheet Function: =STDEV()
- Example Question: What is the range of heights for most women in the USA?

#### Mode

- The most common value in a distribution
- Data: A single dichotomous or categorical variable
- R Function: None
- Spreadsheet Function: =MODE()
- Example Question: What is the most common name for men in the USA?
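Although R has no built-in mode function, the mode of a categorical variable can be computed from a contingency table; "pets" below is a hypothetical example vector:

```r
# The most common value is the largest count in a table()
pets = c("dog", "cat", "dog", "fish", "dog", "cat")
counts = table(pets)
names(counts)[which.max(counts)]
# [1] "dog"
```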

#### Quantiles

- The boundary values separating a set of ordered values into groups with the same numbers of values
- Data: A single, continuous variable with any distribution
- R Function: quantile()
- Spreadsheet Functions: =QUARTILE(), =PERCENTILE()
- Example Question: What boundary values should I use to separate my values into equally-numbered groups for mapping?

### Statistical Tests

Statistical tests are used with **sampled data** to determine whether an observed data
set is so different from what you would expect under a **null hypothesis** that
you should reject the null hypothesis
(McDonald 2014).
Statistical tests consider the sample size when determining how unlikely the
observed data would be if the null hypothesis were true.

#### Correlation

- Tests the significance of a relationship between pairs of values across two different variables
- Data: Two continuous variables
- R Function: cor.test()
- Spreadsheet Function: =CORREL()
- Example Question: Are housing costs in a county directly related to the income of county residents?
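A minimal cor.test() sketch, using hypothetical simulated data in which housing costs depend on income plus random noise:

```r
# Simulated related continuous variables (hypothetical values)
set.seed(42)
income = rnorm(100, mean = 50000, sd = 10000)
housing = income * 0.02 + rnorm(100, sd = 150)
# Reports the correlation coefficient and a p-value for the
# null hypothesis that the true correlation is zero
cor.test(income, housing)
```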

#### One-Sample *t*-Test

- Tests the significance of the difference between the mean of a sample and an expected mean
- Data: One normally-distributed, continuous variable and a single expected mean (μ)
- R Function: t.test()
- Spreadsheet Function: =T.TEST()
- Null Hypothesis: The sample mean is equal to the expected mean μ (reject if p < 0.05)
- Example Question: Does the mean number of hours a sample of people spend watching TV differ from a previously published national average?
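A one-sample sketch with hypothetical simulated TV-watching hours, tested against an assumed expected mean of 28 hours:

```r
# Hypothetical sample of monthly TV hours
set.seed(5)
hours = rnorm(25, mean = 26, sd = 6)
# mu is the expected mean under the null hypothesis
t.test(hours, mu = 28)
```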

#### Two-Sample *t*-Test

- Tests the significance of the difference between the means of two different samples
- Data:
- Two normally-distributed, continuous variables, OR
- A normally-distributed continuous variable and a parallel dichotomous variable indicating what group each of the values in the first variable belong to

- R Function: t.test()
- Spreadsheet Function: =T.TEST()
- Null Hypothesis: The means between the two groups are equal (reject if p < 0.05)
- Example Question: Do men and women differ in the amount of hours they spend watching TV in a given month?
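A two-sample sketch with hypothetical simulated TV-watching hours for two groups:

```r
# Hypothetical samples of monthly TV hours for men and women
set.seed(1)
men = rnorm(30, mean = 28, sd = 6)
women = rnorm(30, mean = 25, sd = 6)
# Welch two-sample t-test comparing the group means
t.test(men, women)
```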

#### Mann-Whitney U-Test

- Tests the significance of the difference between two different samples without assuming normal distributions. Non-parametric alternative to the t-test
- Data: Two continuous variables
- R Function: wilcox.test()
- Spreadsheet Function: None
- Null Hypothesis: The distributions of the two groups are identical (reject if p < 0.05)
- Example Question: Do men and women differ in the amount of hours they spend watching TV in a given month?
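The same comparison with wilcox.test(), here using hypothetical skewed (non-normal) samples where a t-test would be less appropriate:

```r
# Hypothetical skewed samples of monthly TV hours
set.seed(2)
men = rexp(30, rate = 1/25)    # mean around 25 hours
women = rexp(30, rate = 1/20)  # mean around 20 hours
wilcox.test(men, women)
```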

#### Chi-Square Goodness-of-Fit Test

- Tests the significance of the difference between sampled frequencies of different values and expected frequencies of those values
- Data:
- A categorical variable
- A table of expected frequencies for each of the categories

- R Function: chisq.test()
- Spreadsheet Function: =CHISQ.TEST()
- Null Hypothesis: The relative proportions of the categories match the expected proportions (reject if p < 0.05)
- Example Question: Are the voting preferences of voters in my district significantly different from the current national polls?
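A goodness-of-fit sketch with hypothetical district vote counts tested against assumed national poll proportions (the p= argument must sum to one):

```r
# Hypothetical observed counts and assumed expected proportions
observed = c(DEM = 420, GOP = 380, OTHER = 50)
national = c(0.48, 0.46, 0.06)
chisq.test(observed, p = national)
```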

#### Chi Square Test of Independence

- Tests the significance of the difference in frequencies of categorical values between two different groups
- Data: Two categorical variables
- R Function: chisq.test()
- Spreadsheet Function: =CHISQ.TEST()
- Null Hypothesis: The relative proportions of one variable are independent of the second variable (reject if p < 0.05)
- Example Question: Is there a difference between men and women in political party affiliation?
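A test-of-independence sketch using a hypothetical contingency table of gender by party affiliation:

```r
# Hypothetical counts: rows are genders, columns are parties
affiliation = matrix(c(120, 90, 80, 110), nrow = 2,
    dimnames = list(Gender = c("Men", "Women"), Party = c("DEM", "GOP")))
chisq.test(affiliation)
```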

#### Analysis of Variance (ANOVA)

- Tests the significance of differences between two or more groups
- Data: One or more categorical (independent) variables and one continuous (dependent) variable
- R Functions: aov(), anova()
- Spreadsheet Function: None
- Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?
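A one-way ANOVA sketch with hypothetical simulated TV hours for three income groups; aov() fits the model and summary() reports the F-test:

```r
# Hypothetical monthly TV hours by income group
set.seed(7)
hours = c(rnorm(20, 30, 5), rnorm(20, 25, 5), rnorm(20, 20, 5))
group = factor(rep(c("low", "middle", "high"), each = 20))
summary(aov(hours ~ group))
```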

#### Kruskal-Wallis One-Way Analysis of Variance

- Tests the significance of differences between two or more groups. Non-parametric alternative to ANOVA
- Data: One or more categorical (independent) variables and one continuous (dependent) variable
- R Function: kruskal.test()
- Spreadsheet Function: None
- Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?
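The non-parametric equivalent with kruskal.test(), here using hypothetical skewed samples:

```r
# Hypothetical skewed monthly TV hours by income group
set.seed(7)
hours = c(rexp(20, 1/30), rexp(20, 1/25), rexp(20, 1/20))
group = factor(rep(c("low", "middle", "high"), each = 20))
kruskal.test(hours ~ group)
```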

### Models

Models are formulas that represent relationships between variables. They are simplified mathematical representations of reality. As George Box put it, "all models are wrong, but some are useful."

Models are used to:

- Make inferences about a broader population based on sampled data
- Fill in gaps in information or understanding based on sampled data (interpolation)
- Gain understanding into the relationships between forces and influences on phenomena
- Make projections about the future based on events in the past (predictive models)
- Understand where things have been, why they are there, or where they should be (spatial models)

#### Linear Regression

- Assesses the influence that one or more independent variables have on a single dependent variable
- Data: One normally-distributed continuous dependent variable, and one or more normally-distributed continuous independent variables. Dichotomous variables can also be used as independent "dummy" variables.
- R Function: lm()
- Spreadsheet Function: None
- Example Question: What influence do gender, level of education, and ruralness have on median household income in a community?

#### Logistic Regression

- Assesses the influence that one or more independent variables have on a single dichotomous dependent variable
- Data: One dichotomous dependent variable, and one or more normally-distributed continuous independent variables. Dichotomous variables can also be used as independent "dummy" variables.
- R Function: glm()
- Spreadsheet Function: None
- Example Question: What are the odds of suicide at various levels of substance abuse?
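A logistic regression sketch with hypothetical simulated data, where a dichotomous outcome is modeled against a continuous predictor using glm() with a binomial family:

```r
# Hypothetical simulated data: probability of the outcome rises
# with the predictor via the logistic (plogis) function
set.seed(3)
substance = runif(200, 0, 10)
outcome = rbinom(200, 1, plogis(-4 + 0.5 * substance))
model = glm(outcome ~ substance, family = binomial)
exp(coef(model))  # exponentiated coefficients as odds ratios
```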