Testing Relationships Between Two Categorical Variables
By the end of this module, you will be able to:
Two variables are independent when knowing the value of one variable tells you nothing about the other.
Example: Coin flip outcome and dice roll
Knowing you flipped heads tells you NOTHING about what number the die will show.
In research: "Treatment response is independent of sex" means males and females improve at the same rate.
Example: Species and habitat preference
Knowing the species DOES tell you about likely habitat (e.g., fish prefer water!).
In research: "Diagnosis is associated with treatment type" means different diagnoses get different treatments.
Chi-square test of independence asks:
"Are these two categorical variables related, or are they independent?"
A contingency table (also called a cross-tabulation or crosstab) displays frequencies for two categorical variables simultaneously.
Research Question: Is treatment response independent of patient sex?
| Sex | Improved | No Improvement | Row Total |
|---|---|---|---|
| Male | 45 | 15 | 60 |
| Female | 25 | 35 | 60 |
| Column Total | 70 | 50 | 120 |

(The Improved and No Improvement columns are the two levels of Treatment Response.)
Reading this table:
Visual pattern: Males seem more likely to improve (75%) than females (42%). Is this difference statistically significant?
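The percentages quoted above (75% and 42%) are row proportions. A minimal sketch of how they could be obtained in R, assuming the counts are typed in by hand; the object names `tab` and `row_props` are chosen here for illustration only:

```r
# Counts from the contingency table (rows: sex, columns: treatment response)
tab <- as.table(matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE,
                       dimnames = list(Sex = c("Male", "Female"),
                                       Response = c("Improved", "No Improvement"))))

# Add row and column totals to reproduce the margins of the printed table
addmargins(tab)

# Row proportions: within each sex, the share that improved or did not
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

Using `margin = 1` conditions on the rows (sex), which matches the question of whether the improvement rate differs between males and females.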
Under the null hypothesis (independence), we calculate what frequencies we'd EXPECT if the variables were truly unrelated.
For each cell in the contingency table: Expected = (Row Total × Column Total) / Grand Total
Using the treatment response example:
| | Improved | No Improvement |
|---|---|---|
| Male (n=60) | Observed: 45; Expected: (60 × 70) / 120 = 35 | Observed: 15; Expected: (60 × 50) / 120 = 25 |
| Female (n=60) | Observed: 25; Expected: (60 × 70) / 120 = 35 | Observed: 35; Expected: (60 × 50) / 120 = 25 |
Key insight: If sex and response were independent, we'd expect equal success rates for males and females. The expected values reflect the OVERALL success rate (70/120 = 58%) applied to each group.
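Step-by-step for Male/Improved cell: the row total (Male) is 60, the column total (Improved) is 70, and the grand total is 120, so Expected = (60 × 70) / 120 = 35.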
Interpretation: If there's no relationship between sex and improvement, we'd expect 35 of the 60 males to improve (the same proportion as in the overall sample).
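The same calculation can be done for every cell at once. The sketch below is one way to do it in base R, assuming the observed counts are stored in a matrix called `observed` (a name used here only for illustration):

```r
# Observed counts (rows: sex, columns: treatment response)
observed <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Male", "Female"),
                                   c("Improved", "No Improvement")))

# Expected = (row total x column total) / grand total, for all cells at once.
# outer() multiplies every row total by every column total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Compare with the expected counts that chisq.test() reports
chisq.test(observed)$expected
```

Both results should show 35 in the Improved column and 25 in the No Improvement column for each sex.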
# Method 1: Enter data as a matrix
data <- matrix(c(45, 15, 25, 35),
               nrow = 2, byrow = TRUE)
rownames(data) <- c("Male", "Female")
colnames(data) <- c("Improved", "No Improvement")
# View the table
data
# Method 2: If you have raw data
# data <- table(sex, response)
# Simple test
result <- chisq.test(data)
result
# To see expected frequencies
result$expected
# To see standardized residuals (more on this soon!)
result$stdres
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 12.377, df = 1, p-value = 0.0004346
What this tells you:
Conclusion: We reject H₀. Treatment response is NOT independent of sex (p < .001). Males show significantly higher improvement rates than females.
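To see where the test statistic comes from, it can be recomputed from the observed and expected counts. This is a sketch rather than the internals of `chisq.test()`; the variable names are illustrative, and the 0.5 adjustment is Yates' continuity correction, which R applies by default to 2 × 2 tables:

```r
observed <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square: sum of (O - E)^2 / E over all cells
chi_sq_plain <- sum((observed - expected)^2 / expected)

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# p-value: upper tail of the chi-square distribution with df = 1
p_yates <- pchisq(chi_sq_yates, df = 1, lower.tail = FALSE)

c(plain = chi_sq_plain, yates = chi_sq_yates, p = p_yates)

# The corrected value should match what chisq.test() prints
chisq.test(observed)$statistic
```

Passing `correct = FALSE` to `chisq.test()` returns the uncorrected statistic instead.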
df = (r - 1) × (c - 1)
r = number of rows
c = number of columns
Examples:
| Table Size | Calculation | df |
|---|---|---|
| 2 × 2 | (2-1) × (2-1) | 1 |
| 2 × 3 | (2-1) × (3-1) | 2 |
| 3 × 3 | (3-1) × (3-1) | 4 |
| 4 × 3 | (4-1) × (3-1) | 6 |
Once you know the frequencies in (r-1) rows and (c-1) columns, plus all the marginal totals, the remaining cells are determined. The marginal totals act as constraints on the data, so only (r-1) × (c-1) cells are free to vary.
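The rule can be checked against R directly. In the sketch below, `chisq_df()` is a hypothetical helper written for this illustration, and the 4 × 3 table is filled with placeholder counts because only its dimensions matter here:

```r
# Hypothetical helper: degrees of freedom for an r x c contingency table
chisq_df <- function(tab) {
  (nrow(tab) - 1) * (ncol(tab) - 1)
}

tab_2x2 <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
tab_4x3 <- matrix(1, nrow = 4, ncol = 3)  # placeholder counts, dimensions only

chisq_df(tab_2x2)
chisq_df(tab_4x3)

# chisq.test() stores the same quantity in $parameter
chisq.test(tab_2x2)$parameter
```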
Research Question: Is habitat preference independent of species?
Data: Three species observed in two habitat types
| Species | Forest | Grassland | Total |
|---|---|---|---|
| Species A | 35 | 15 | 50 |
| Species B | 20 | 30 | 50 |
| Species C | 25 | 25 | 50 |
| Total | 80 | 70 | 150 |

(The Forest and Grassland columns are the two levels of Habitat.)
# Create the table
habitat_data <- matrix(c(35, 15, 20, 30, 25, 25),
                       nrow = 3, byrow = TRUE)
rownames(habitat_data) <- c("Species A", "Species B", "Species C")
colnames(habitat_data) <- c("Forest", "Grassland")
# Run test
result <- chisq.test(habitat_data)
result
# Check expected frequencies
result$expected
Pearson's Chi-squared test
data: habitat_data
X-squared = 9.375, df = 2, p-value = 0.00921
Expected Frequencies:
| | Forest | Grassland |
|---|---|---|
| Species A | 26.67 | 23.33 |
| Species B | 26.67 | 23.33 |
| Species C | 26.67 | 23.33 |
Interpretation: χ²(2) = 9.38, p = .009. We reject H₀. Habitat preference is NOT independent of species. Looking at the data: Species A was observed mostly in forest (70%), Species B mostly in grassland (60%), and Species C equally in both (50/50).
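The row percentages and the test statistic for this example can be reproduced from the counts alone. A minimal base R sketch, with object names (`habitat`, `row_pct`) chosen for illustration:

```r
habitat <- matrix(c(35, 15, 20, 30, 25, 25), nrow = 3, byrow = TRUE,
                  dimnames = list(c("Species A", "Species B", "Species C"),
                                  c("Forest", "Grassland")))

# Row percentages: share of each species observed in each habitat
row_pct <- 100 * prop.table(habitat, margin = 1)
round(row_pct, 1)

# Chi-square statistic recomputed from observed and expected counts
# (no continuity correction, since the table is larger than 2 x 2)
expected <- outer(rowSums(habitat), colSums(habitat)) / sum(habitat)
chi_sq <- sum((habitat - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 2, lower.tail = FALSE)

c(chi_sq = chi_sq, p = p_value)
```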
When Chi-square is significant, the next question is: "WHICH cells are driving the effect?"
Standardized residuals tell you how far each cell deviates from expectation, in standard deviation units.
Interpretation guidelines:
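# Re-run the test on the treatment table, because `result` currently
# holds the habitat test from the previous example
result <- chisq.test(data)
# Get standardized residuals
result$stdres
# Visualize them
library(corrplot)
corrplot(result$stdres, is.corr = FALSE,
         method = "color", addCoef.col = "black")
       Improved No Improvement
Male       3.70          -3.70
Female    -3.70           3.70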
Interpretation:
All |residuals| > 2, confirming that ALL cells contribute to the significant chi-square!
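For readers who want to see what sits behind `$stdres`, the sketch below recomputes the standardized residuals for the treatment table by hand. It assumes the usual adjusted-residual formula, (O - E) / sqrt(E × (1 - row share) × (1 - column share)), and the variable names are illustrative:

```r
observed <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Male", "Female"),
                                   c("Improved", "No Improvement")))
n <- sum(observed)
row_share <- rowSums(observed) / n   # proportion of the sample in each row
col_share <- colSums(observed) / n   # proportion of the sample in each column
expected <- outer(rowSums(observed), colSums(observed)) / n

# Standardized residual for every cell
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))
round(std_res, 2)

# Flag cells whose residual exceeds 2 in absolute value
abs(std_res) > 2

# Should agree with the residuals reported by chisq.test()
chisq.test(observed)$stdres
```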
Chi-square tells you IF variables are related, but not HOW STRONG the relationship is. That's where effect size comes in!
Cramér's V measures the strength of association, ranging from 0 (no association) to 1 (perfect association).
V = √( χ² / (n × (k - 1)) )
χ² = chi-square statistic
k = smaller of (number of rows, number of columns)
n = total sample size
Interpretation guidelines (for k - 1 = 1, i.e., when the smaller table dimension is 2):
| Cramér's V | Interpretation |
|---|---|
| 0.00 - 0.10 | Negligible association |
| 0.10 - 0.30 | Weak association |
| 0.30 - 0.50 | Moderate association |
| > 0.50 | Strong association |
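# Calculate Cramér's V in R
library(rcompanion)
cramerV(data)
# Or calculate manually. Use the statistic for `data` without Yates'
# correction, since V is based on the uncorrected Pearson chi-square:
chi_sq <- unname(chisq.test(data, correct = FALSE)$statistic)
n <- sum(data)
k <- min(nrow(data), ncol(data))
V <- sqrt(chi_sq / (n * (k - 1)))
V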
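The manual calculation can be wrapped in a small function and applied to both examples from this module. `cramers_v()` is a hypothetical helper written for this sketch (it is not part of rcompanion), and it relies only on base R:

```r
# Hypothetical helper: Cramer's V from a table of counts
cramers_v <- function(tab) {
  # Pearson chi-square without continuity correction
  chi_sq <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)                      # total sample size
  k <- min(nrow(tab), ncol(tab))     # smaller table dimension
  sqrt(chi_sq / (n * (k - 1)))
}

treatment <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
habitat   <- matrix(c(35, 15, 20, 30, 25, 25), nrow = 3, byrow = TRUE)

cramers_v(treatment)
cramers_v(habitat)
```

Read against the guideline table, the treatment example (V of about 0.34) would count as a moderate association and the habitat example (V = 0.25) as a weak one.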