Comparing Categorical Data in R (Chi-Squared, Kruskal-Wallis)

While categorical data can often be reduced to dichotomous data and used with proportions tests or t-tests, there are situations where you are sampling data that falls into more than two categories and you would like to make hypothesis tests about those categories. This tutorial describes a group of tests that can be used with that type of data.

Comparing Categorical Frequencies with Expected Frequencies (Chi-Squared Goodness of Fit)

Suppose that in a statewide gubernatorial primary, an average of past statewide polls have shown the following results:

Stewart		45%
Macrander	40%
Miller		15%

The Macrander campaign recently rolled out an expensive media campaign and wants to know if there has been any change in voter opinions. In a telephone poll of 200 people in the state, they got the following results:

Stewart		82 people (41%)
Macrander	78 people (39%)
Miller		40 people (20%)

The raw results give the campaign some hope. But since this is a poll, there is uncertainty about whether the results reflect an actual change in the opinions of the broader population. How confident can we be that this poll reflects a real change?

The Chi-Squared (Χ²) goodness-of-fit test compares the observed counts in each category with the counts expected from the hypothesized probabilities:

Χ² = Σ (observed - expected)² / expected, summed over all categories

The R function for the Chi-Squared test is chisq.test(). The x parameter is a vector of observed counts for each category. The p parameter is a vector of expected probabilities for each category.

In the example below, the poll results in the scenario above are compared against the probabilities from past polls. The null hypothesis (as with most tests) is that the sampled probabilities are the same as the population probabilities. The alternative hypothesis is that there is a difference - in this case, that there has been a change in opinions about the candidates.

> poll = c(stewart = 82, macrander = 78, miller = 40)
> past = c(stewart = 0.45, macrander = 0.4, miller = 0.15)

> print(chisq.test(x = poll, p = past))

        Chi-squared test for given probabilities

data:  poll
X-squared = 4.0944, df = 2, p-value = 0.1291

The p-value of 0.1291 means that, if voter opinions had not actually changed, results at least this far from the expected percentages would be seen in about 13% of polls of this size. That is a fairly high value, so we fail to reject the null hypothesis. In other words, with this size of sample, this close of an election, and this small of a change, we cannot be confident that the improved results for Macrander in this poll reflect an actual improvement in the broader electorate.
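For reference, the statistic can also be computed by hand from the counts and probabilities above. This is a minimal sketch (not part of the chisq.test() call) that uses pchisq() to get the p-value from the Χ² distribution with k - 1 degrees of freedom:

# Expected counts under the null hypothesis: 200 * c(0.45, 0.40, 0.15)
> expected = sum(poll) * past
# Goodness-of-fit statistic: about 4.0944, matching X-squared above
> x2 = sum((poll - expected)^2 / expected)
# Upper-tail probability from the Χ² distribution with 2 degrees of freedom: about 0.129
> pchisq(x2, df = length(poll) - 1, lower.tail = FALSE)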

When using categorical data from a vector, such as a raw list of responses, you can use table() to create the counts for x. Keep in mind that table() orders the counts alphabetically by category name, so your probability vector must also be in alphabetical order by category name.

> responses = c(rep("stewart", 82), rep("macrander", 78), rep("miller", 40))
> poll = table(responses)
> poll
responses
macrander    miller   stewart 
       78        40        82 
> past = c(macrander = 0.4, miller = 0.15, stewart = 0.45)
> print(chisq.test(x = poll, p = past))

        Chi-squared test for given probabilities

data:  poll
X-squared = 4.0944, df = 2, p-value = 0.1291

Comparing Two Categorical Variables (Chi-Squared Contingency Analysis)

Suppose that the Macrander campaign would like to know how partisan this election is. If people are largely choosing to vote along party lines, the campaign will seek to get their base voters out to the polls. If people are splitting their ticket, the campaign may focus their efforts more broadly.

The Chi-Squared test can also be used to determine whether two categorical variables are independent. The parameter passed to chisq.test() is a contingency table, created with the table() function, that cross-classifies the number of observations falling into each combination of categories of the two variables.

The null hypothesis with this test is that the two categories are independent. The alternative hypothesis is that there is some dependency between the two categories.

In the example below, the Macrander campaign took a small poll of 50 people, asking whom they wished to vote for AND which party they most strongly affiliate with.

# Simulated data
> cand = c(rep("Stewart", 22), rep("Macrander", 21), rep("Miller", 7))
> party = c(rep("Republican", 20), rep("Democratic", 23), rep("None", 7))

> print(table(cand, party))
           party
cand        Democratic None Republican
  Macrander         21    0          0
  Miller             0    7          0
  Stewart            2    0         20

> print(chisq.test(table(cand, party)))

        Pearson's Chi-squared test

data:  table(cand, party)
X-squared = 91.502, df = 4, p-value < 2.2e-16

Warning message:
In chisq.test(table(cand, party)) :
  Chi-squared approximation may be incorrect

The output of table() shows a fairly strong relationship between party affiliation and candidate: Democrats tend to vote for Macrander, Republicans tend to vote for Stewart, and unaffiliated voters all vote for Miller.

This is reflected in the very low p-value from the Chi-Squared test, which indicates that an association this strong would be extremely unlikely to appear if candidate choice were independent of party affiliation. Therefore we reject the null hypothesis.

In contrast, suppose that the poll results had shown a number of people crossing party lines to vote for candidates outside their party. The simulated data below uses the runif() function (with trunc()) to randomly assign one of three party names to each of the 50 respondents.

The contingency table shows no clear relationship between party affiliation and candidate. This is validated quantitatively by the Chi-Squared test: the fairly high p-value of 0.4018 indicates that differences this large would appear around 40% of the time even if candidate choice were independent of party. Therefore, we fail to reject the null hypothesis, and the campaign should focus its efforts on the broader electorate.

> noparty = c("Republican","Democratic","Green")[trunc(runif(50, 1, 4))]

> print(table(cand, noparty))
           noparty
cand        Democratic Green Republican
  Macrander         10     6          5
  Miller             3     0          4
  Stewart           10     6          6

> print(chisq.test(table(cand, noparty)))

        Pearson's Chi-squared test

data:  table(cand, noparty)
X-squared = 4.0313, df = 4, p-value = 0.4018

Warning message:
In chisq.test(table(cand, noparty)) :
  Chi-squared approximation may be incorrect

The warning message given by the chisq.test() function indicates that some of the expected cell counts are too small (generally below five) for the Chi-Squared approximation to be reliable. The simulate.p.value = T parameter computes the p-value by Monte Carlo simulation instead, which removes the warning. However, the best way to address the underlying problem is to get a larger sample.

> print(chisq.test(table(cand, noparty), simulate.p.value = T))
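To see which cells are causing the problem, the expected counts under independence are stored in the object returned by chisq.test(); cells with expected counts below about five are what trigger the warning:

# Expected counts under independence; small cells trigger the warning
> chisq.test(table(cand, noparty))$expected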

Comparing Categorical and Continuous Variables (ANOVA and Kruskal-Wallis)

Suppose you are performing research into obesity in your city. You take a sample of 30 people in three different neighborhoods (90 people total), collecting information on health and lifestyle. Two variables you collect are height and weight so you can calculate body mass index. Although this index can be misleading for some populations (notably very athletic people), ordinary sedentary people can be classified according to BMI:

< 18.5		Underweight
18.5 - 25	Optimal weight
25 - 30		Overweight
> 30		Obese
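As an aside, if you have numeric BMI values, a categorical variable using the ranges above can be created with R's cut() function. The values in this brief sketch are made up for illustration:

# Hypothetical BMI values classified into the categories above
> bmi.sample = c(17.9, 22.4, 27.1, 31.5)
> cut(bmi.sample, breaks = c(0, 18.5, 25, 30, Inf),
+     labels = c("Underweight", "Optimal weight", "Overweight", "Obese"))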

Average BMI in the US from 2007-2010 was around 28.6 and rising, with a standard deviation of around 5.

You would like to know if there is a difference in BMI between the different neighborhoods so you know whether to target specific neighborhoods or make broader city-wide efforts. Since you have more than two groups, you cannot use a t-test.

Analysis of Variance (ANOVA) is a test that you can use when you have a categorical variable and a continuous variable. It is a test that considers variability between means for different categories as well as the variability of observations within groups. There are a wide variety of different extensions of ANOVA that deal with covariance (ANCOVA), multiple variables (MANOVA), and both of those together (MANCOVA). These techniques can become quite complicated and also assume that the values in the continuous variables have a normal distribution.
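For comparison, a basic one-way ANOVA can be run with R's aov() function. The sketch below uses simulated data of the same form as the Kruskal-Wallis example that follows:

# Simulated data: one-way ANOVA of BMI across three neighborhoods
> bmi = c(rnorm(30, 28.5, 5), rnorm(30, 28.6, 5), rnorm(30, 28.7, 5))
> neighborhood = c(rep("deerpark", 30), rep("evanston", 30), rep("millrun", 30))
> summary(aov(bmi ~ neighborhood))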

A somewhat simpler test is the Kruskal-Wallis test, which is implemented in R in the kruskal.test() function. The x parameter is a continuous (interval/ratio) variable. The g parameter is the categorical variable representing the different groups to which the continuous values belong.

The null hypothesis with a Kruskal-Wallis test is that all of the groups come from the same distribution; the alternative hypothesis is that at least one group tends to have higher or lower values than the others.

In the example below, the simulated data is a set of BMI and a set of neighborhoods. The three calls to the rnorm() function give random, but normally-distributed, values with slightly different means for each of the three neighborhoods.

Although the output will vary based on the random values in the simulated data, the p-value of 0.6976 here indicates that differences this large between the groups would be expected about 70% of the time if the three neighborhoods really had the same distribution of BMI. Therefore, we fail to reject the null hypothesis: we do not have evidence of a wide disparity in obesity levels between the three neighborhoods. This is consistent with the output of the aggregate() command, which calculates the mean BMI for each neighborhood.

# Simulated data
> bmi = c(rnorm(30, 28.5, 5), rnorm(30, 28.6, 5), rnorm(30, 28.7, 5))
> neighborhood = c(rep("deerpark", 30), rep("evanston", 30), rep("millrun", 30))

> kruskal.test(x = bmi, g = as.factor(neighborhood))


        Kruskal-Wallis rank sum test

data:  bmi and as.factor(neighborhood)
Kruskal-Wallis chi-squared = 0.7201, df = 2, p-value = 0.6976

> aggregate(bmi ~ neighborhood, data = data.frame(bmi, neighborhood), FUN=mean, na.rm=T)

  neighborhood      bmi
1     deerpark 30.16947
2     evanston 29.09864
3      millrun 29.14188

However, if we simulate samples where the means in the different neighborhoods actually differ, the p-value of 0.002235 indicates that differences this large would arise only about 0.2% of the time if the neighborhoods had the same distribution of BMI. Therefore, we reject the null hypothesis: there is a difference in BMI among the neighborhoods, and Mill Run should receive priority for anti-obesity efforts.

# Simulated data
> bmi = c(rnorm(30, 27.4, 5), rnorm(30, 28.6, 5), rnorm(30, 30.8, 5))
> neighborhood = c(rep("deerpark", 30), rep("evanston", 30), rep("millrun", 30))

> kruskal.test(x = bmi, g = as.factor(neighborhood))

        Kruskal-Wallis rank sum test

data:  bmi and as.factor(neighborhood)
Kruskal-Wallis chi-squared = 12.207, df = 2, p-value = 0.002235

> aggregate(bmi ~ neighborhood, data = data.frame(bmi, neighborhood), FUN=mean, na.rm=T)

  neighborhood      bmi
1     deerpark 26.35746
2     evanston 29.67376
3      millrun 30.52217

Box Plots

A convenient way of visualizing a comparison between continuous and categorical data is with a box plot, which shows the distribution of a continuous variable across different groups:

Example Box Plot (R code below)

A percentile is the value below which a given percentage of the values in the distribution fall: the 5th percentile is the value below which five percent of the numbers fall.

The quartiles divide the distribution into four parts: 25% of the numbers are below the first quartile, 50% are below the second quartile (making it the median), and 75% are below the third quartile.
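In R, the quantile() function reports these values directly. This brief sketch uses simulated BMI values:

# 5th percentile, quartiles, and 95th percentile of simulated BMI values
> bmi = rnorm(90, 28.6, 5)
> quantile(bmi, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))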

Box plots can be used with both sampled data and population data.

The first parameter to boxplot() is a formula: the continuous variable as a function of (separated by a tilde) the categorical variable. A data = parameter can be added if you are using variables in a data frame.

# Simulated data
> bmi = c(rnorm(30, 27.4, 5), rnorm(30, 28.6, 5), rnorm(30, 30.8, 5))
> neighborhood = c(rep("deerpark", 30), rep("evanston", 30), rep("millrun", 30))

> boxplot(bmi ~ neighborhood)
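If the variables are kept in a data frame, the same plot can be drawn with the data = parameter; the axis labels here are just illustrative additions:

# Same plot drawn from a data frame, with axis labels
> health = data.frame(bmi, neighborhood)
> boxplot(bmi ~ neighborhood, data = health, xlab = "Neighborhood", ylab = "BMI")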