When Chi-Square Works (and When It Doesn't)
By the end of this module, you will be able to:
Chi-square tests make several assumptions. Violating these can lead to incorrect p-values!
✓ Required: Variables must be categorical (nominal or ordinal)
✗ Wrong: Don't use chi-square on continuous data
Example error: "Age (years)" should be categorized into age groups first
✓ Required: Each observation can only be counted once
✗ Wrong: Same subject measured multiple times, paired data
Example error: Testing 50 people at Time 1 and Time 2 → 100 observations but only 50 independent subjects (use McNemar's test instead!)
✓ Required: Expected frequencies ≥ 5 in ALL cells
This is the most commonly violated assumption and the main focus of this module!
✓ Required: Data should come from random or representative sampling
Less about the statistical test and more about study design
Critical Rule: ALL expected frequencies must be ≥ 5
The chi-square distribution is an approximation that works well when expected frequencies are large enough. When expected frequencies are too small (< 5), the approximation breaks down and p-values become unreliable.
Result of violation: The actual Type I error rate can differ from the nominal level, often in the direction of more false positives!
# After running chi-square test
result <- chisq.test(data)
# Check expected frequencies
result$expected
# Look for any values < 5 (TRUE means at least one cell is too small)
any(result$expected < 5)
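The expected frequencies that R reports can also be reproduced by hand, which may help make the rule less of a black box. The sketch below assumes an illustrative 2×2 table of counts (the numbers are placeholders, not results from a real study) and uses the standard formula: row total × column total / grand total.

```r
# Illustrative counts (assumed for this sketch)
data <- matrix(c(8, 3, 12, 7), nrow = 2, byrow = TRUE,
               dimnames = list(c("Treatment A", "Treatment B"),
                               c("Improved", "No Change")))

# Expected count for each cell = row total * column total / grand total
expected_manual <- outer(rowSums(data), colSums(data)) / sum(data)

# chisq.test() warns when expected counts are small, so silence it here
expected_r <- suppressWarnings(chisq.test(data)$expected)

# The hand calculation should agree with R's result
all.equal(expected_manual, expected_r, check.attributes = FALSE)

# List the row/column positions of any cells with expected count below 5
which(expected_manual < 5, arr.ind = TRUE)
```

Here one cell (Treatment A / No Change) falls below 5, so the chi-square approximation would be questionable for this table.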
| | Improved | No Change |
|---|---|---|
| Treatment A | 8.5 | 11.5 |
| Treatment B | 3.2 | 6.8 |
Problem: Treatment B / Improved cell has expected frequency of 3.2 < 5
Solution needed: Use Fisher's exact test instead!
WRONG: "All OBSERVED frequencies must be ≥ 5"
CORRECT: "All EXPECTED frequencies must be ≥ 5"
It's okay to have observed frequencies of 0, 1, 2, etc. The rule is about EXPECTED frequencies!
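A small sketch of this distinction, using made-up counts chosen so that one observed cell is 0 while every expected frequency is still at least 5:

```r
# Hypothetical table: one observed cell is 0
observed <- matrix(c(0, 20, 15, 15), nrow = 2, byrow = TRUE)

# Expected counts from the margins: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

min(observed)        # 0: an observed count of zero is present
all(expected >= 5)   # TRUE: the expected-frequency rule is still met
```

The expected counts here are 6, 14, 9, and 21, so the rule is satisfied even though an observed cell is empty.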
When expected frequencies are < 5, use Fisher's exact test instead of chi-square.
# Same syntax as chi-square!
data <- matrix(c(8, 3, 12, 7), nrow = 2, byrow = TRUE)
rownames(data) <- c("Treatment A", "Treatment B")
colnames(data) <- c("Improved", "No Change")
# Run Fisher's exact test
fisher.test(data)
# For large tables where the exact calculation is too slow, simulate the p-value
# (this argument is ignored for 2x2 tables):
fisher.test(data, simulate.p.value = TRUE)
Fisher's Exact Test for Count Data
data: data
p-value = 0.3561
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3664 11.8439
sample estimates:
odds ratio
1.5556
Interpretation:
For 2×2 tables only, R applies Yates' continuity correction by default.
Chi-square is a continuous distribution used to approximate a discrete distribution (your frequency counts). For 2×2 tables with df=1, this approximation can be poor.
Yates' correction: Adjusts the chi-square formula to be more conservative (reduces Type I error).
Effect: Slightly larger p-values (harder to reject H₀)
# With Yates' correction (default for 2×2)
chisq.test(data) # Correction applied automatically
# Without Yates' correction
chisq.test(data, correct = FALSE)
# Compare the results
| Test Version | χ² | p-value |
|---|---|---|
| With Yates' correction | 3.42 | 0.064 |
| Without correction | 4.57 | 0.032 |
Impact: Without correction → significant; With correction → non-significant
Recommendation: Use the default (with correction) for 2×2 tables, especially with smaller samples (n < 100)
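To see what the correction does arithmetically, the sketch below computes both versions of the statistic by hand for a hypothetical 2×2 table and compares them with `chisq.test()`. The counts are invented for illustration; the correction subtracts 0.5 from each |observed − expected| before squaring.

```r
# Hypothetical 2x2 table (assumed counts)
tab <- matrix(c(12, 8, 5, 15), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Uncorrected Pearson statistic: sum of (O - E)^2 / E
chi_plain <- sum((tab - expected)^2 / expected)

# Yates' version: shrink each |O - E| by 0.5 (never by more than |O - E| itself)
adj <- min(0.5, abs(tab - expected))
chi_yates <- sum((abs(tab - expected) - adj)^2 / expected)

c(plain = chi_plain, yates = chi_yates)

# Both should match what chisq.test() reports
all.equal(chi_plain, unname(chisq.test(tab, correct = FALSE)$statistic))
all.equal(chi_yates, unname(chisq.test(tab)$statistic))
```

The corrected statistic is always the smaller of the two, which is why the corrected p-value is larger.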
If you have expected frequencies < 5, you have several options:
Best option: Fisher's exact test stays valid at any sample size, with no minimum expected frequency
Limitation: Computationally intensive for large tables
fisher.test(data)
Combining categories is appropriate when: Categories are theoretically similar
Example: Combine "Slightly Improved" and "Greatly Improved" into "Improved"
# Original: 3 outcome categories
data_3cat <- matrix(c(5, 3, 2, 8, 7, 5), nrow = 2, byrow = TRUE)
colnames(data_3cat) <- c("Worse", "Same", "Better")
# Combine "Worse" and "Same" into "Not Better"
data_2cat <- matrix(c(8, 2, 15, 5), nrow = 2, byrow = TRUE)
colnames(data_2cat) <- c("Not Better", "Better")
# Re-check expected frequencies: combining does not guarantee every cell reaches 5
chisq.test(data_2cat)$expected
chisq.test(data_2cat)
Don't combine categories JUST to get significance! Only combine when theoretically justified.
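Typing the combined counts by hand invites arithmetic slips, so one option is to derive them from the original table. This is a minimal sketch using the same three-category counts as above; it also re-checks the expected frequencies, since merging columns does not automatically fix the problem.

```r
# Original table: 3 outcome categories
data_3cat <- matrix(c(5, 3, 2, 8, 7, 5), nrow = 2, byrow = TRUE)
colnames(data_3cat) <- c("Worse", "Same", "Better")

# Build the 2-category table by adding the columns being merged
data_2cat <- cbind(
  "Not Better" = data_3cat[, "Worse"] + data_3cat[, "Same"],
  "Better"     = data_3cat[, "Better"]
)
data_2cat

# Re-check the assumption on the combined table
expected_2cat <- suppressWarnings(chisq.test(data_2cat)$expected)
expected_2cat
any(expected_2cat < 5)  # TRUE here: two cells are still below 5
```

For these particular counts the merged table still has small expected frequencies, so Fisher's exact test would remain the safer choice.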
The most straightforward solution: increase sample size
When to use: Study is ongoing, pilot data suggests small cells
Removing a category is appropriate when: The category has very few observations and isn't central to the research question
Example: If studying 3 common species and 1 rare species with only 2 observations, might exclude the rare species
Report clearly: "One species with n=2 was excluded from analysis due to small sample size"
Scenario: Testing if habitat choice differs between two bird species
| | Forest | Grassland | Wetland |
|---|---|---|---|
| Species A | 18 | 12 | 2 |
| Species B | 15 | 10 | 3 |
# Create data
habitat_data <- matrix(c(18, 12, 2, 15, 10, 3), nrow = 2, byrow = TRUE)
rownames(habitat_data) <- c("Species A", "Species B")
colnames(habitat_data) <- c("Forest", "Grassland", "Wetland")
# Try chi-square
result <- chisq.test(habitat_data)
result
# Check expected frequencies
result$expected # Look for values < 5
Warning in chisq.test(habitat_data): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: habitat_data
X-squared = 0.390, df = 2, p-value = 0.823
Expected frequencies:
Forest Grassland Wetland
Species A 17.60 11.73 2.67
Species B 15.40 10.27 2.33
Problem: Wetland expected frequencies (2.67 and 2.33) are < 5! R even gives a warning!
Solutions:
Option 1: Fisher's Exact Test
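fisher.test(habitat_data, simulate.p.value = TRUE)
# simulate.p.value is optional: fisher.test() can compute exact p-values for
# tables larger than 2×2, and simulation helps when that calculation is too slow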
Option 2: Combine Categories
# Combine Grassland and Wetland into "Non-Forest"
combined_data <- matrix(c(18, 14, 15, 13), nrow = 2, byrow = TRUE)
rownames(combined_data) <- c("Species A", "Species B")
colnames(combined_data) <- c("Forest", "Non-Forest")
result2 <- chisq.test(combined_data)
result2$expected # All ≥ 5 now!
result2
Recommendation: Use Fisher's exact test (with a simulated p-value if the exact calculation is too slow), OR combine categories if theoretically justified (e.g., if both grassland and wetland are "open habitats")
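For a table this small, the exact and simulated p-values can be compared directly. The sketch below reuses the habitat counts from this example; the seed and the number of replicates (`B = 10000`) are arbitrary choices made only so the run is reproducible.

```r
habitat_data <- matrix(c(18, 12, 2, 15, 10, 3), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Species A", "Species B"),
                                       c("Forest", "Grassland", "Wetland")))

# Exact p-value (feasible here because the table is small)
exact_p <- fisher.test(habitat_data)$p.value

# Monte Carlo p-value; set a seed so the result is reproducible
set.seed(42)
sim_p <- fisher.test(habitat_data, simulate.p.value = TRUE, B = 10000)$p.value

c(exact = exact_p, simulated = sim_p)
```

The two values should be close; the simulated one varies slightly from run to run unless the seed is fixed.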
When chi-square is significant with tables larger than 2×2, you know variables are related, but WHERE is the association?
We covered this in Module 3 - this is your first step!
# Cells with |residual| > 2 or 3 drive the effect
result$stdres
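For readers who want to see where `stdres` comes from, the sketch below recomputes it by hand on an illustrative 3×2 table (assumed counts). Each residual is (observed − expected) divided by its estimated standard error, sqrt(expected × (1 − row proportion) × (1 − column proportion)).

```r
# Illustrative 3x2 table (assumed counts)
tab <- matrix(c(45, 15, 30, 30, 20, 40), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("Success", "Failure")))

n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n
row_prop <- rowSums(tab) / n
col_prop <- colSums(tab) / n

# Standardized residual = (O - E) / sqrt(E * (1 - row share) * (1 - column share))
manual_stdres <- (tab - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))
round(manual_stdres, 2)

# Should agree with the values chisq.test() stores
all.equal(manual_stdres, chisq.test(tab)$stdres, check.attributes = FALSE)
```

Note that `result$residuals` holds the simpler Pearson residuals, (O − E) / sqrt(E), which are smaller in magnitude than the standardized ones.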
Break your large table into smaller 2×2 comparisons
Overall test: χ²(2) = 21.18, p < .001 (significant)
Question: Which treatments differ from each other?
# Original 3×2 table
full_data <- matrix(c(45, 15, 30, 30, 20, 40), nrow = 3, byrow = TRUE)
rownames(full_data) <- c("Treatment A", "Treatment B", "Treatment C")
colnames(full_data) <- c("Success", "Failure")
# Compare Treatment A vs B
AB <- full_data[1:2, ]
chisq.test(AB)
# Compare Treatment A vs C
AC <- full_data[c(1,3), ]
chisq.test(AC)
# Compare Treatment B vs C
BC <- full_data[2:3, ]
chisq.test(BC)
# IMPORTANT: With 3 comparisons, consider Bonferroni correction
# Adjusted alpha = 0.05 / 3 = 0.017
Each additional test increases chance of Type I error (false positive)
Bonferroni correction: Divide alpha by number of comparisons
Only call results significant if p < adjusted alpha
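Instead of lowering alpha, the same decision can be reached by inflating the p-values with `p.adjust()` and comparing them with the usual 0.05. The sketch below runs all pairwise 2×2 tests on the three-treatment table from this section; looping with `combn()` is just one possible way to organize it.

```r
full_data <- matrix(c(45, 15, 30, 30, 20, 40), nrow = 3, byrow = TRUE)
rownames(full_data) <- c("Treatment A", "Treatment B", "Treatment C")
colnames(full_data) <- c("Success", "Failure")

# Every pair of rows: columns of `pairs` are (1,2), (1,3), (2,3)
pairs <- combn(nrow(full_data), 2)

# Raw p-value from a 2x2 chi-square test for each pair
raw_p <- apply(pairs, 2, function(idx) chisq.test(full_data[idx, ])$p.value)
names(raw_p) <- apply(pairs, 2, function(idx) {
  paste(rownames(full_data)[idx], collapse = " vs ")
})

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
adj_p <- p.adjust(raw_p, method = "bonferroni")

data.frame(raw_p, adj_p, significant = adj_p < 0.05)
```

Comparing `adj_p` with 0.05 is equivalent to comparing `raw_p` with 0.05 / 3.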
Better than testing all possible pairs: decide BEFORE seeing data which comparisons matter
Example: Testing 4 treatments including a control
Planned comparisons:
Skip the treatment-to-treatment comparisons unless theoretically important
Cause: Expected frequencies < 5
Solution: Use Fisher's exact test or combine categories
Cause: Negative values or missing data in your table
Solution: Check for data entry errors, remove NAs
# Remove missing values
clean_data <- na.omit(your_data)
# Check for negatives
summary(your_data)
Cause: Trying to create table with unequal vector lengths
Solution: Verify all rows have same number of columns
Cause: Passing a simulation argument that the function does not recognize
Solution: Use the full argument name simulate.p.value = TRUE, which both fisher.test() and chisq.test() accept
When observations are NOT independent (same subjects measured twice)
50 patients tested before and after treatment (Pass/Fail)
| | After: Pass | After: Fail |
|---|---|---|
| Before: Pass | 20 | 5 |
| Before: Fail | 18 | 7 |
# McNemar's test for paired data
paired_data <- matrix(c(20, 5, 18, 7), nrow = 2, byrow = TRUE) # byrow = TRUE matches the table layout above
mcnemar.test(paired_data)
# DO NOT use regular chi-square for paired data!
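McNemar's test only uses the two discordant cells (patients whose result changed). The sketch below recomputes the statistic by hand for the table above, assuming R's default continuity correction, and checks it against `mcnemar.test()`.

```r
paired_data <- matrix(c(20, 5, 18, 7), nrow = 2, byrow = TRUE,
                      dimnames = list(Before = c("Pass", "Fail"),
                                      After  = c("Pass", "Fail")))

# Discordant pairs: subjects whose outcome changed between time points
pass_to_fail <- paired_data["Pass", "Fail"]  # 5
fail_to_pass <- paired_data["Fail", "Pass"]  # 18

# Continuity-corrected McNemar statistic: (|b - c| - 1)^2 / (b + c)
stat <- (abs(pass_to_fail - fail_to_pass) - 1)^2 / (pass_to_fail + fail_to_pass)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p.value = p_value)

# Should match R's built-in test
all.equal(stat, unname(mcnemar.test(paired_data)$statistic))
```

The 20 and 7 patients whose result did not change contribute nothing to the statistic.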
Testing association while controlling for a third variable (stratified analysis)
# Example: Treatment × Outcome, controlling for Sex
# mantelhaen.test() is in the stats package, which R attaches by default
mantelhaen.test(array_data) # 3D array: rows × cols × strata
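The `array_data` object is not defined in the snippet above, so here is one way it might be built. All counts and labels below are invented for illustration; the key point is that `array()` fills column by column within each stratum, so the order of the numbers matters.

```r
# Hypothetical counts: Treatment x Outcome, one 2x2 layer per sex
array_data <- array(
  c(20, 10, 15, 25,   # Female: filled column-wise (Drug/Placebo within each outcome)
    18, 12, 14, 22),  # Male
  dim = c(2, 2, 2),
  dimnames = list(Treatment = c("Drug", "Placebo"),
                  Outcome   = c("Improved", "Not improved"),
                  Sex       = c("Female", "Male"))
)

# Inspect each stratum before testing to confirm the layout
array_data[, , "Female"]

mh <- mantelhaen.test(array_data)
mh
```

Printing one layer first is a cheap guard against having transposed the counts.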
Testing for linear trend when one variable is ordinal
Example: Does disease prevalence increase with age category?
Age categories: Young → Middle → Old (natural ordering)
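One base-R option for this kind of question is `prop.trend.test()`, which tests for a linear trend in proportions across ordered groups. The counts below are hypothetical, and the equally spaced scores 1, 2, 3 are an assumption about how far apart the age categories are.

```r
# Hypothetical data: disease cases out of people sampled in each age group
cases  <- c(Young = 5,  Middle = 12, Old = 20)
totals <- c(Young = 50, Middle = 50, Old = 50)

# Observed prevalence per group
cases / totals

# Chi-squared test for trend in proportions (1 degree of freedom)
trend <- prop.trend.test(x = cases, n = totals, score = 1:3)
trend
```

Because the trend test spends a single degree of freedom on the ordered pattern, it is usually more sensitive to a steady increase than a general chi-square test on the same table.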
For each scenario, decide which test to use and explain why.
Scenario A: 100 patients, 2 treatment groups, 3 outcome categories. All expected frequencies between 8-15.
Scenario B: 30 animals, 2 species, 2 habitat choices. Expected frequencies: 7.5, 7.5, 7.5, 7.5
Scenario C: 25 animals, 2 species, 4 habitat choices. Expected frequencies range from 2.1 to 5.8
Scenario A Answer:
Chi-square test of independence - All expected frequencies ≥ 5 (all between 8-15), sample size adequate (n=100), two categorical variables. This meets all assumptions for chi-square.
Scenario B Answer:
Chi-square test of independence - Although the sample is modest (n=30), all expected frequencies are 7.5, which satisfies the ≥ 5 requirement. This is a 2×2 table, so Yates' correction will be applied automatically.
Scenario C Answer:
Fisher's exact test - Some expected frequencies < 5 (as low as 2.1), which violates the chi-square assumption. Fisher's exact test is appropriate. For a 2×4 table, add simulate.p.value=TRUE if the exact calculation is too slow. Alternative: Consider combining habitat categories if theoretically justified to increase expected frequencies.
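The expected-frequency part of this decision can be sketched as a small helper. `choose_test()` is a hypothetical function written for this illustration, not part of R, and it only checks expected counts; it cannot tell whether observations are independent or paired.

```r
# Hypothetical helper: suggest a test based on expected frequencies only
choose_test <- function(tab) {
  # chisq.test() warns on small expected counts; we only need the $expected matrix
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < 5)) "fisher.test" else "chisq.test"
}

# Tables from the bird habitat example in this module
habitat  <- matrix(c(18, 12, 2, 15, 10, 3), nrow = 2, byrow = TRUE)
combined <- matrix(c(18, 14, 15, 13), nrow = 2, byrow = TRUE)

choose_test(habitat)   # "fisher.test"
choose_test(combined)  # "chisq.test"
```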
result <- chisq.test(data)
result$expected # Check this first!
Scenario: Testing if stress level affects immune response in 120 participants
| | Strong Response | Moderate Response | Weak Response | Total |
|---|---|---|---|---|
| Low Stress | 28 | 18 | 4 | 50 |
| High Stress | 12 | 25 | 33 | 70 |
| Total | 40 | 43 | 37 | 120 |
Complete these tasks:
# Your complete analysis here
stress_data <- matrix(c(28, 18, 4, 12, 25, 33), nrow = 2, byrow = TRUE)
rownames(stress_data) <- c("Low Stress", "High Stress")
colnames(stress_data) <- c("Strong", "Moderate", "Weak")
# Step 1: Run test
result <- chisq.test(stress_data)
result
# Step 2: Check expected frequencies
result$expected
# Step 3: Calculate Cramér's V
chi_sq <- unname(result$statistic) # drop the "X-squared" label so V prints cleanly
n <- sum(stress_data)
k <- min(nrow(stress_data), ncol(stress_data))
V <- sqrt(chi_sq / (n * (k - 1)))
V
# Step 4: Examine residuals
result$stdres
# Step 5: Visualize
barplot(stress_data, beside = TRUE,
        col = c("#c8e6c9", "#ffcdd2"),
        legend = rownames(stress_data),
        xlab = "Immune Response",
        ylab = "Frequency",
        main = "Immune Response by Stress Level")
# Alternative: proportions
prop_data <- prop.table(stress_data, margin = 1)
barplot(prop_data, beside = TRUE,
        col = c("#c8e6c9", "#ffcdd2"),
        legend = rownames(stress_data),
        xlab = "Immune Response",
        ylab = "Proportion",
        main = "Immune Response by Stress Level (Proportions)")
Pearson's Chi-squared test
data: stress_data
X-squared = 27.706, df = 2, p-value = 9.634e-07
Expected frequencies:
Strong Moderate Weak
Low Stress 16.67 17.92 15.42
High Stress 23.33 25.08 21.58
All expected frequencies ≥ 5 ✓
Cramér's V = 0.480
Standardized Residuals:
Strong Moderate Weak
Low Stress 4.45 0.03 -4.58
High Stress -4.45 -0.03 4.58
Complete APA Write-Up:
"A chi-square test of independence was conducted to examine the relationship between stress level and immune response in 120 participants. All expected cell frequencies exceeded 5, meeting the assumptions for chi-square analysis. The association between stress level and immune response was statistically significant, χ²(2) = 27.71, p < .001, V = 0.48, indicating a moderate to strong relationship."
"Examination of standardized residuals revealed the pattern driving this association. Participants with low stress showed stronger immune responses than expected (56% strong response vs. 33% expected; residual = 4.45) and fewer weak responses (8% vs. 31% expected; residual = -4.58). Conversely, participants with high stress showed weaker immune responses than expected (47% weak vs. 31% expected; residual = 4.58) and fewer strong responses (17% vs. 33% expected; residual = -4.45). Moderate responses did not differ from expectation in either group."
"These findings suggest that higher stress levels are associated with compromised immune function, with high-stress individuals showing substantially weaker immune responses compared to their low-stress counterparts. The moderate-to-strong effect size indicates this is a practically meaningful relationship."
Key Interpretation Points:
Question 1: You have a 2×3 table with n=40. One expected frequency is 4.2. What should you do?
Question 2: For a 2×2 table, should you turn off Yates' correction?
Question 3: You test 50 patients before and after treatment (same patients). Which test?
Key Takeaways:
Congratulations! You've completed all four Chi-Square modules!
You now have the skills to analyze categorical data correctly and confidently.
| Situation | Test to Use | R Code |
|---|---|---|
| One variable, test distribution | Goodness-of-fit | chisq.test(obs, p=...) |
| Two variables, all expected ≥ 5 | Test of independence | chisq.test(data) |
| Two variables, any expected < 5 | Fisher's exact | fisher.test(data) |
| Paired/repeated measures | McNemar's test | mcnemar.test(data) |
| Check effect size | Cramér's V | sqrt(χ²/(n*(k-1))) |
| Find pattern | Standardized residuals | result$stdres |
Always remember: