🎲 Module 2: Chi-Square Goodness-of-Fit Test

Testing if Data Matches Expected Distributions

📚 Learning Objectives

By the end of this module, you will be able to:

🎯 What is a Goodness-of-Fit Test?

A goodness-of-fit test answers the question:

"Does my observed data fit an expected pattern?"

Key characteristics:

🐀 Example: Do Rats Prefer Certain Maze Arms?

Research Question: Do rats have equal preference for three maze arms?

Setup: Test 60 rats, each chooses one of three arms (Left, Center, Right)

Null Hypothesis (H₀): Rats choose arms equally (33.3% each)

Expected frequencies: 20 rats per arm (60 ÷ 3 = 20)

Maze Arm | Observed | Expected
Left     | 15       | 20
Center   | 18       | 20
Right    | 27       | 20
Total    | 60       | 60

Chi-square asks: Is the difference between observed (15, 18, 27) and expected (20, 20, 20) statistically significant, or just random variation?

📝 Setting Up Hypotheses

Null Hypothesis (H₀): The observed distribution matches the expected distribution

Alternative Hypothesis (H₁): The observed distribution differs from the expected distribution

⚠️ Important Note About Hypotheses

Unlike a one-tailed t-test, the Chi-square goodness-of-fit test is non-directional (as is the omnibus F-test in ANOVA). We don't predict which category will be higher, only that the pattern differs from expectation.

Common Expected Distributions:

Expected Pattern | When to Use | Example
Equal proportions | Testing if all categories are equally likely | Coin flips (50/50), die rolls (1/6 each)
Specified ratios | Testing Mendelian ratios, known distributions | Genetics (3:1 ratio), survey norms
Population proportions | Comparing sample to known population | Census data, disease prevalence
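Each of these patterns ends up as a vector of proportions passed to `chisq.test()` through its `p` argument. The sketch below shows one way to build that vector for each case; the counts and the population shares are hypothetical values chosen only for illustration.

```r
# Equal proportions: 1/k for each of the k categories (here, a six-sided die)
equal_p <- rep(1 / 6, 6)

# Specified ratio: divide each part of the ratio by the sum of the parts
ratio   <- c(3, 1)
ratio_p <- ratio / sum(ratio)        # 0.75 and 0.25

# Population proportions: use the known shares directly (hypothetical values)
pop_p <- c(0.60, 0.30, 0.10)

# Hypothetical counts tested against the 3:1 ratio
counts <- c(84, 36)
res_a  <- chisq.test(counts, p = ratio_p)

# rescale.p = TRUE lets chisq.test() convert a raw ratio to proportions itself
res_b  <- chisq.test(counts, p = ratio, rescale.p = TRUE)

res_a$expected    # 90 and 30
res_a$statistic   # same statistic as res_b
```

Both calls describe the same null hypothesis, so they should give the same statistic and p-value.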

The Chi-Square Formula

χ² = Σ [(O - E)² / E]

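Where:

  • O = observed frequency in a category
  • E = expected frequency in that category under H₀
  • Σ = sum across all categories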

Interpretation: Larger χ² values = bigger difference between observed and expected = more evidence against H₀
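To see the formula at work, here is a minimal sketch that applies it by hand to the rat maze data from the example above. The variable names are illustrative choices, not part of the module.

```r
# Observed counts from the maze example
observed <- c(Left = 15, Center = 18, Right = 27)

# Under H0 every arm is equally likely, so each expected count is n / k
expected <- rep(sum(observed) / length(observed), length(observed))

# Each category's contribution: (O - E)^2 / E
contributions <- (observed - expected)^2 / expected
contributions     # 1.25, 0.20, 2.45

# The chi-square statistic is the sum of the contributions
chi_sq <- sum(contributions)
chi_sq            # 3.9
```

The right arm contributes most of the total (2.45 of 3.9), which is where the observed count sits furthest from expectation.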


💻 Running Chi-Square Goodness-of-Fit in R

Now let's run this analysis in R! The function is chisq.test().

Step 1: Create Your Observed Data

    # Vector of observed frequencies
    observed <- c(15, 18, 27)

    # Give them meaningful names
    names(observed) <- c("Left", "Center", "Right")

    # Check your data
    observed

Step 2: Run the Chi-Square Test

    # Test against equal proportions (default)
    chisq.test(observed)

    # OR specify custom expected proportions
    # For equal: p = c(1/3, 1/3, 1/3)
    chisq.test(observed, p = c(1/3, 1/3, 1/3))

Step 3: Interpret the Output

        Chi-squared test for given probabilities

    data:  observed
    X-squared = 3.9, df = 2, p-value = 0.1423

What this output tells you:

  • X-squared = 3.9: The Chi-square statistic (χ² = 3.9)
  • df = 2: Degrees of freedom (number of categories - 1)
  • p-value = 0.1423: Probability of seeing a difference at least this large if H₀ is true

Interpretation: p = 0.142 > 0.05, so we fail to reject H₀. The observed distribution doesn't significantly differ from equal proportions, so we have no evidence that rats prefer a particular arm.
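The printed output is only a summary. `chisq.test()` returns an object whose components can be inspected individually, which helps when you want to see which categories drive the result. A short sketch using the maze data:

```r
observed <- c(Left = 15, Center = 18, Right = 27)
result   <- chisq.test(observed)

result$statistic   # the chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under H0 (20 per arm)
result$residuals   # Pearson residuals: (O - E) / sqrt(E)
```

Residuals far from zero mark the categories that depart most from expectation; here the right arm has the largest positive residual.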

🔢 Understanding Degrees of Freedom (df)

df = k - 1

k = number of categories

Why k - 1? Once you know the frequencies in (k-1) categories and the total n, the last category is determined. That's one constraint, so we lose one degree of freedom.

Examples:
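As a rough illustration, the sketch below computes df for the category counts that appear in this module (2 flower colors, 3 maze arms, 4 time periods, 7 weekdays) and looks up the critical χ² value at α = .05 for each. The vector names are illustrative assumptions.

```r
# Number of categories in each of the module's examples
k  <- c(flower_colors = 2, maze_arms = 3, time_periods = 4, weekdays = 7)
df <- k - 1
df                                    # 1, 2, 3, 6

# Critical value: the statistic must exceed this for p < .05
critical <- qchisq(0.95, df = df)
round(critical, 2)                    # 3.84, 5.99, 7.81, 12.59

# The p-value is the upper-tail area beyond the observed statistic
chi_sq  <- sum((c(15, 18, 27) - 20)^2 / 20)   # maze data
p_value <- pchisq(chi_sq, df = df[["maze_arms"]], lower.tail = FALSE)
p_value
```

More categories mean more degrees of freedom, so a larger χ² is needed to reach significance.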

📊 Visualizing Your Results

Always create a bar plot to visualize observed vs. expected frequencies!

    # Create a bar plot comparing observed and expected
    observed <- c(15, 18, 27)
    expected <- c(20, 20, 20)
    categories <- c("Left", "Center", "Right")

    # Combine into a matrix for plotting
    data <- rbind(observed, expected)
    colnames(data) <- categories

    # Create grouped bar plot
    barplot(data,
            beside = TRUE,
            col = c("#f5576c", "#9c27b0"),
            legend.text = c("Observed", "Expected"),
            main = "Rat Maze Arm Preferences",
            xlab = "Maze Arm",
            ylab = "Frequency",
            ylim = c(0, 30))

🎯 Practice Problem 1: Birth Days

Research Question: Are births equally distributed across days of the week?

Data: Hospital records show the following births:

Day Mon Tue Wed Thu Fri Sat Sun
Births 28 32 29 31 35 18 17

Total births: 190

Expected per day (if equal): 190 ÷ 7 = 27.14

    # Your turn! Complete this R code:

    # Step 1: Enter the observed frequencies
    births <- c(28, 32, 29, 31, 35, 18, 17)
    names(births) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

    # Step 2: Run Chi-square test
    result <- chisq.test(births)
    result

    # Step 3: Visualize
    barplot(births,
            col = "#f5576c",
            main = "Births by Day of Week",
            ylab = "Number of Births")

🎯 Practice Problem 2: Mendelian Genetics

Research Question: Do flower colors follow Mendelian 3:1 ratio?

Background: According to Mendelian genetics, we expect 75% purple flowers and 25% white flowers.

Observed data: 152 purple, 48 white (n = 200 total)

    # Test against 3:1 ratio (0.75 and 0.25 proportions)
    flowers <- c(152, 48)
    names(flowers) <- c("Purple", "White")

    # Specify expected proportions
    chisq.test(flowers, p = c(0.75, 0.25))

    # Calculate expected frequencies manually
    expected <- 200 * c(0.75, 0.25)
    expected  # Should be 150 and 50

🤔 Check Your Understanding

Question 1: You test 100 animals for color preference. Observed: Red=35, Blue=40, Green=25. You expect equal preferences. What are the expected frequencies?

A) Red=35, Blue=40, Green=25 (same as observed)
B) Red=33.33, Blue=33.33, Green=33.33 (equal split)
C) Red=50, Blue=30, Green=20

Question 2: You get χ² = 8.5, df = 3, p = 0.037. What do you conclude?

A) Reject H₀: Observed distribution differs from expected (p < 0.05)
B) Fail to reject H₀: No significant difference
C) Cannot determine without seeing the data

Question 3: A study has 4 categories. What are the degrees of freedom?

A) df = 4
B) df = 3 (k - 1 = 4 - 1)
C) df = 5

⚠️ Common Mistakes to Avoid

Mistake 1: Using raw data instead of frequencies

    # WRONG - Don't use individual observations
    responses <- c("A", "B", "A", "C", "B", "A")
    chisq.test(responses)  # Error!

    # CORRECT - Use frequency counts
    freq <- table(responses)  # Converts to frequencies
    chisq.test(freq)

Mistake 2: Forgetting that totals must match

Your expected frequencies must sum to the same total as your observed frequencies!

❌ Observed total = 100, Expected total = 120 → ERROR

✅ Observed total = 100, Expected total = 100 → CORRECT
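In R you normally supply expected proportions rather than expected counts, so the equivalent requirement is that the proportions sum to 1. The sketch below uses hypothetical counts and proportions to show both the safe way to build expected counts and what happens when the proportions are inconsistent.

```r
observed <- c(40, 35, 25)            # hypothetical counts, n = 100

# Expected counts built from proportions always match the observed total
p        <- c(0.5, 0.3, 0.2)
expected <- sum(observed) * p
sum(expected) == sum(observed)       # totals agree

# Proportions that do not sum to 1 are rejected by chisq.test()
bad_p   <- c(0.5, 0.3, 0.4)          # sums to 1.2
outcome <- tryCatch(chisq.test(observed, p = bad_p),
                    error = function(e) conditionMessage(e))
outcome                              # an error message, not a test result
```

Multiplying the observed total by the proportions, rather than typing expected counts by hand, avoids the mismatch entirely.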

Mistake 3: Misinterpreting non-significant results

If p > 0.05, we DON'T conclude "the distributions are the same." We say "we don't have evidence they differ" - absence of evidence isn't evidence of absence!

Mistake 4: Ignoring the assumption about expected frequencies

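The Chi-square p-value is an approximation that is generally considered reliable when expected frequencies are at least 5 in every category (a common rule of thumb). We'll cover this more in Module 4!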
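A quick way to check this before interpreting a result is to compute the expected counts yourself. The counts and proportions in this sketch are hypothetical, chosen so that two categories fall below 5.

```r
observed <- c(12, 5, 3)              # hypothetical counts, n = 20
p        <- c(0.7, 0.2, 0.1)         # hypothetical expected proportions

# Expected counts under H0
expected <- sum(observed) * p
expected

# Flag categories below the rule-of-thumb threshold of 5
too_small <- expected < 5
too_small

# chisq.test() itself issues a warning in this situation
warned <- tryCatch({
  chisq.test(observed, p = p)
  FALSE
}, warning = function(w) TRUE)
warned
```

If R warns that the approximation may be incorrect, treat the p-value with caution.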

✍️ Writing Up Your Results

Here's how to report Chi-square goodness-of-fit results in APA style:

Example Write-Up

Study: Rats choosing among three maze arms (n=60)

Results:

"A chi-square goodness-of-fit test was conducted to determine whether rats showed equal preference for three maze arms. The distribution of choices did not significantly differ from equal proportions, χ²(2, N = 60) = 3.90, p = .142. Rats selected the left arm 15 times (25%), center arm 18 times (30%), and right arm 27 times (45%). Although there was a numerical preference for the right arm, this pattern could be due to chance variation."

Essential elements to include:
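One way to avoid transcription errors is to assemble the statistical part of the sentence directly from the test object. This is a minimal sketch using the maze data; the formatting choices (two decimals for χ², three for p, no leading zero) are assumptions based on common APA practice.

```r
observed <- c(Left = 15, Center = 18, Right = 27)
result   <- chisq.test(observed)

# Format p to three decimals and drop the leading zero
p_txt <- sub("^0", "", sprintf("%.3f", result$p.value))

# Statistic, df, sample size, and p-value in one string
report <- sprintf("chi-square(%d, N = %d) = %.2f, p = %s",
                  result$parameter, sum(observed), result$statistic, p_txt)
report      # "chi-square(2, N = 60) = 3.90, p = .142"

# Percentages for the descriptive part of the write-up
round(100 * prop.table(observed))    # 25, 30, 45
```

For very small p-values you would report "p < .001" instead of the formatted number; that case is not handled here.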

🎯 Comprehensive Practice Problem

Research Question: Do neurons fire at uniform rates across four time periods?

Background: You record neural activity in 4 consecutive time periods and count spikes.

Observed spike counts:

Time Period 0-250ms 250-500ms 500-750ms 750-1000ms
Spikes 45 38 52 45

Your tasks:

  1. Calculate the total spikes and expected frequency per period
  2. Write the R code to run the test
  3. Interpret the results
  4. Write up the findings in a sentence

    # Complete this analysis:

    # Step 1: Enter data
    spikes <- c(45, 38, 52, 45)
    names(spikes) <- c("0-250ms", "250-500ms", "500-750ms", "750-1000ms")

    # Step 2: Calculate totals
    total <- sum(spikes)
    total  # Should be 180

    # Step 3: Expected frequency (if uniform)
    expected_each <- total / 4
    expected_each  # Should be 45

    # Step 4: Run chi-square
    result <- chisq.test(spikes)
    result

    # Step 5: Visualize
    barplot(spikes,
            col = c("#f093fb", "#f5576c", "#9c27b0", "#e91e63"),
            main = "Neural Spike Counts by Time Period",
            ylab = "Number of Spikes",
            las = 2)  # Rotates x-axis labels
    abline(h = expected_each, lty = 2, col = "blue", lwd = 2)
    legend("topright", legend = "Expected", lty = 2, col = "blue", lwd = 2)

📋 Chi-Square Goodness-of-Fit Workflow

Follow these steps every time:

Step 1: State hypotheses

H₀: Distribution matches expected; H₁: Distribution differs from expected

Step 2: Check assumptions

• Categorical variable
• Independent observations
• Expected frequencies ≥ 5

Step 3: Calculate expected frequencies

Based on your null hypothesis (equal proportions, specific ratio, etc.)

Step 4: Run test in R

    chisq.test(observed, p = c(...))

Step 5: Interpret results

Look at χ², df, and p-value. If p < .05, reject H₀

Step 6: Visualize and report

Create bar plot, write up results with all key statistics
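The computational steps of this workflow can be sketched as a single helper. `gof_workflow()` is a hypothetical function written for this illustration, not part of base R, and it leaves the plot and the write-up to you.

```r
# Hypothetical helper that strings the computational steps together
gof_workflow <- function(observed, p = NULL, alpha = 0.05) {
  k <- length(observed)
  if (is.null(p)) p <- rep(1 / k, k)            # H0 defaults to equal proportions
  stopifnot(length(p) == k, isTRUE(all.equal(sum(p), 1)))

  expected <- sum(observed) * p                 # expected frequencies under H0
  if (any(expected < 5)) {                      # rule-of-thumb assumption check
    warning("Some expected frequencies are below 5")
  }

  result <- suppressWarnings(chisq.test(observed, p = p))
  list(expected  = expected,
       statistic = unname(result$statistic),
       df        = unname(result$parameter),
       p_value   = result$p.value,
       reject_h0 = result$p.value < alpha)      # decision at the chosen alpha
}

# Usage with the rat maze data
maze <- c(Left = 15, Center = 18, Right = 27)
out  <- gof_workflow(maze)
str(out)
```

Returning a list keeps every number needed for the write-up in one place, but calling `chisq.test()` directly works just as well.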

📝 Module 2 Summary

Key Takeaways:

Next up: Module 3 covers Chi-square Test of Independence - testing relationships between TWO categorical variables!