🔗 Module 3: Chi-Square Test of Independence

Testing Relationships Between Two Categorical Variables

📚 Learning Objectives

By the end of this module, you will be able to:

  • Explain what it means for two categorical variables to be independent
  • Build and read a contingency table (cross-tabulation)
  • Calculate expected frequencies under the null hypothesis of independence
  • Run and interpret a chi-square test of independence in R
  • Determine degrees of freedom for an r × c table
  • Use standardized residuals to identify which cells drive a significant result
  • Measure the strength of association with Cramér's V

🔗 What Does "Independence" Mean?

Two variables are independent when knowing the value of one variable tells you nothing about the other.

✅ Independent Variables

Example: Coin flip outcome and dice roll

Knowing you flipped heads tells you NOTHING about what number the die will show.

In research: "Treatment response is independent of sex" means males and females respond equally.

❌ Associated/Dependent Variables

Example: Species and habitat preference

Knowing the species DOES tell you about likely habitat (e.g., fish prefer water!).

In research: "Diagnosis is associated with treatment type" means different diagnoses get different treatments.

Chi-square test of independence asks:
"Are these two categorical variables related, or are they independent?"

📊 Understanding Contingency Tables

A contingency table (also called a cross-tabulation or crosstab) displays frequencies for two categorical variables simultaneously.

Example: Treatment Response by Sex

Research Question: Is treatment response independent of patient sex?

                  Treatment Response
Sex             Improved   No Improvement   Row Total
Male                  45               15          60
Female                25               35          60
Column Total          70               50         120

Reading this table:

  • 45 males improved, 15 males did not improve
  • 25 females improved, 35 females did not improve
  • Total sample size (n) = 120
  • Row totals: Number in each sex category
  • Column totals: Number in each response category

Visual pattern: Males seem more likely to improve (75%) than females (42%). Is this difference statistically significant?
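The percentages quoted above can be checked quickly with `prop.table()` — a minimal sketch (the matrix is re-entered here so the snippet stands on its own):

```r
# Enter the treatment table with row and column labels
data <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE,
               dimnames = list(c("Male", "Female"),
                               c("Improved", "No Improvement")))

# margin = 1 gives proportions within each row (each sex)
prop.table(data, margin = 1)
# Male improvement rate: 45/60 = 0.75; Female: 25/60 ≈ 0.417
```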

🧮 Calculating Expected Frequencies

Under the null hypothesis (independence), we calculate what frequencies we'd EXPECT if the variables were truly unrelated.

Expected Frequency Formula

Expected = (Row Total × Column Total) / Grand Total

For each cell in the contingency table

📐 Calculate Expected Frequencies

Using the treatment response example:

                 Improved                        No Improvement
Male (n=60)      Observed: 45                    Observed: 15
                 Expected: (60 × 70)/120 = 35    Expected: (60 × 50)/120 = 25
Female (n=60)    Observed: 25                    Observed: 35
                 Expected: (60 × 70)/120 = 35    Expected: (60 × 50)/120 = 25

Key insight: If sex and response were independent, we'd expect equal success rates for males and females. The expected values reflect the OVERALL success rate (70/120 = 58%) applied to each group.
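The formula can be applied to every cell at once with an outer product of the margins — a sketch using the same treatment table:

```r
# Expected counts under independence, computed from the margins
# (same arithmetic as the formula above, done for all cells at once)
data <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE,
               dimnames = list(c("Male", "Female"),
                               c("Improved", "No Improvement")))

expected <- outer(rowSums(data), colSums(data)) / sum(data)
expected
#        Improved No Improvement
# Male         35             25
# Female       35             25
```

This is exactly what `chisq.test()` stores in `result$expected`.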

💻 Running Chi-Square Test of Independence in R

1. Create the Contingency Table

```r
# Method 1: Enter data as a matrix
data <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
rownames(data) <- c("Male", "Female")
colnames(data) <- c("Improved", "No Improvement")

# View the table
data

# Method 2: If you have raw data
# data <- table(sex, response)
```

2. Run the Chi-Square Test

```r
# Simple test
result <- chisq.test(data)
result

# To see expected frequencies
result$expected

# To see standardized residuals (more on this soon!)
result$stdres
```

3. Interpret the Output

```
	Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 12.377, df = 1, p-value = 0.0004356
```

What this tells you:

  • X-squared = 12.377: Chi-square statistic
  • df = 1: (rows - 1) × (columns - 1) = (2-1) × (2-1) = 1
  • p-value = 0.0004: Strong evidence against independence
  • Yates' correction: Applied automatically for 2×2 tables (makes the test more conservative)

Conclusion: We reject H₀. Treatment response is NOT independent of sex (p < .001). Males show significantly higher improvement rates than females.

🔢 Degrees of Freedom for Test of Independence

df = (r - 1) × (c - 1)

r = number of rows
c = number of columns

Examples:

Table Size   Calculation       df
2 × 2        (2-1) × (2-1)      1
2 × 3        (2-1) × (3-1)      2
3 × 3        (3-1) × (3-1)      4
4 × 3        (4-1) × (3-1)      6

⚠️ Why (r-1) × (c-1)?

Once you know frequencies in (r-1) rows and (c-1) columns, plus all the marginal totals, the remaining cells are determined. This creates constraints on the data, reducing degrees of freedom.
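You can confirm the formula against R's own output: `chisq.test()` stores the df it used in the `$parameter` element. A sketch with an arbitrary 3 × 4 table of counts:

```r
# df reported by chisq.test() matches (r-1) x (c-1)
tab <- matrix(1:12 * 5, nrow = 3)   # any 3 x 4 table of counts
res <- chisq.test(tab)
unname(res$parameter)               # 6 = (3-1) x (4-1)
```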

🎯 Practice Problem 1: Species and Habitat

Research Question: Is habitat preference independent of species?

Data: Three species observed in two habitat types

                     Habitat
Species      Forest   Grassland   Total
Species A        35          15      50
Species B        20          30      50
Species C        25          25      50
Total            80          70     150
```r
# Create the table
habitat_data <- matrix(c(35, 15, 20, 30, 25, 25), nrow = 3, byrow = TRUE)
rownames(habitat_data) <- c("Species A", "Species B", "Species C")
colnames(habitat_data) <- c("Forest", "Grassland")

# Run test
result <- chisq.test(habitat_data)
result

# Check expected frequencies
result$expected
```

🔍 Standardized Residuals: Finding the Pattern

When Chi-square is significant, the next question is: "WHICH cells are driving the effect?"

Standardized residuals tell you how far each cell deviates from expectation, in standard-deviation units.

Standardized Residual = (Observed - Expected) / √[Expected × (1 - row proportion) × (1 - column proportion)]

(This is the adjusted standardized residual that R reports in result$stdres. The simpler Pearson residual, (Observed - Expected) / √Expected, omits the last two factors; both are interpreted the same way.)

Interpretation guidelines:

  • |residual| < 2: Cell is consistent with independence
  • |residual| > 2: Cell deviates more than expected by chance (roughly p < .05)
  • |residual| > 3: Strong deviation from independence

```r
# Get standardized residuals
result$stdres

# Visualize them
library(corrplot)
corrplot(result$stdres, is.corr = FALSE, method = "color",
         addCoef.col = "black")
```

Example: Treatment Response Residuals

          Improved   No Improvement
Male         +3.70            -3.70
Female       -3.70            +3.70

Interpretation:

  • Male/Improved: residual = +3.70 → Males improved MORE than expected
  • Male/No Improvement: residual = -3.70 → Males had FEWER non-improvements than expected
  • Female cells: Opposite pattern - fewer improvements, more non-improvements than expected

All |residuals| > 2, confirming that ALL cells contribute to the significant chi-square!
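If you want to see where these numbers come from, the adjusted standardized residuals can be reproduced by hand — a sketch using the 2 × 2 treatment table from earlier:

```r
# Reproduce R's adjusted standardized residuals manually
data <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
res  <- chisq.test(data)

n   <- sum(data)
rp  <- rowSums(data) / n    # row proportions
cp  <- colSums(data) / n    # column proportions
adj <- (data - res$expected) /
       sqrt(res$expected * outer(1 - rp, 1 - cp))

all.equal(adj, res$stdres)  # TRUE: matches result$stdres
```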

📏 Effect Size: Cramér's V

Chi-square tells you IF variables are related, but not HOW STRONG the relationship is. That's where effect size comes in!

Cramér's V measures the strength of association, ranging from 0 (no association) to 1 (perfect association).

V = √[χ² / (n × (k - 1))]

k = smaller of (number of rows, number of columns)
n = total sample size

Interpretation guidelines (for df=1):

Cramér's V     Interpretation
0.00 - 0.10    Negligible association
0.10 - 0.30    Weak association
0.30 - 0.50    Moderate association
> 0.50         Strong association
```r
# Calculate Cramér's V in R
library(rcompanion)
cramerV(data)

# Or calculate manually:
chi_sq <- result$statistic
n <- sum(data)
k <- min(nrow(data), ncol(data))
V <- sqrt(chi_sq / (n * (k - 1)))
V
```
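Putting it together for the treatment example — a self-contained sketch of the manual calculation (note that for a 2 × 2 table, `result$statistic` includes the Yates correction, so V here comes out slightly smaller than the uncorrected version):

```r
# Cramér's V for the 2 x 2 treatment table, computed manually
data   <- matrix(c(45, 15, 25, 35), nrow = 2, byrow = TRUE)
result <- chisq.test(data)   # Yates correction applied automatically

V <- sqrt(unname(result$statistic) /
          (sum(data) * (min(dim(data)) - 1)))
round(V, 2)                  # 0.32 -> moderate association
```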