Instructor's Guide: Teaching Statistical Normality

Complete answer keys, teaching notes, and facilitation guidance for Modules 1-4

Teaching Overview

Learning Sequence

These four modules are designed to be taught over 2-3 class sessions:

  1. Session 1: Modules 1 & 2 (Why normality matters + Visual detection)
  2. Session 2: Module 3 (Statistical tests and the large sample paradox)
  3. Session 3: Module 4 (Transformations and solutions)
Total Time Commitment:

Key Pedagogical Principles

💡 Pro Tip: Students will want rules ("p < .05 means transform!"). Resist this. The goal is thoughtful judgment combining visual inspection, statistical tests, sample size, and robustness considerations.

Module 1: Why Normality Matters

Learning Objectives

Answer Keys

Question 1: Type I Error Rate with Normal Data

Expected result: ~5% (should be close to nominal α = .05)

Good student answer: "The Type I error rate is approximately 5%, which matches our alpha level. This shows the test is working correctly when assumptions are met."

Question 2: Type I Error Rate with Skewed Data (n=20)

Expected result: 8-12% (inflated above 5%)

Good student answer: "With small sample size and skewed data, the Type I error rate increased to about 10%, which is double the intended rate. This means we're rejecting true null hypotheses too often."

Question 3: Type I Error Rate with Skewed Data (n=100)

Expected result: ~5-7% (much closer to nominal, showing robustness)

Good student answer: "With a larger sample size, even though the data is still skewed, the Type I error rate dropped back down to around 5-6%. This demonstrates that larger samples make the test more robust to violations of normality."
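The expected rates in Questions 1-3 can be reproduced with a short Monte Carlo simulation. The modules themselves use R; the sketch below is an illustrative Python equivalent, where the Exponential(1) population standing in for "skewed data" is my own choice and exact rates vary with the random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_SIMS = 4000
ALPHA = 0.05

def type1_rate(draw_sample, true_mean, n):
    """Share of simulations in which a one-sample t-test rejects
    a TRUE null hypothesis (H0: mu = true_mean)."""
    rejections = 0
    for _ in range(N_SIMS):
        _, p = stats.ttest_1samp(draw_sample(n), popmean=true_mean)
        if p < ALPHA:
            rejections += 1
    return rejections / N_SIMS

# Question 1: normal population, n = 20 -> should sit near .05
normal_rate = type1_rate(lambda n: rng.normal(0.0, 1.0, n), 0.0, 20)

# Question 2: right-skewed Exponential(1) population (true mean = 1), n = 20
skew_n20 = type1_rate(lambda n: rng.exponential(1.0, n), 1.0, 20)

# Question 3: same skewed population, n = 100 -> CLT pulls it back toward .05
skew_n100 = type1_rate(lambda n: rng.exponential(1.0, n), 1.0, 100)

print(normal_rate, skew_n20, skew_n100)
```

Running this in front of the class makes Question 4's CLT discussion concrete: only the sample size changes between the last two calls.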

Question 4: Why does sample size matter?

Strong answer should mention:

Example: "The Central Limit Theorem tells us that as sample size increases, the sampling distribution of the mean approaches normality regardless of the population distribution. With n=100, even our skewed data produces a nearly normal sampling distribution, so the t-test performs well. With n=20, we don't have enough data for the CLT to fully protect us."

Common Student Misconceptions

⚠️ Misconception #1: "Normality violations always invalidate results"
Correction: Emphasize that robustness depends on sample size and severity of violation. Large samples are quite robust.
⚠️ Misconception #2: "If Shapiro-Wilk p > .05, my data is perfectly normal"
Correction: The test only tells us we don't have strong evidence against normality. Visual inspection is still essential.
💡 Discussion Prompt: "Why do you think statisticians care about Type I errors? What's the real-world consequence of rejecting H₀ when it's true?" This connects abstract concepts to research integrity.

Facilitation Notes

Module 2: Visual Detection of Non-Normality

Learning Objectives

Answer Keys for Practice Datasets

| Dataset | Distribution Type | Histogram Clues | Q-Q Plot Clues | Verdict |
|---|---|---|---|---|
| Dataset 1 | Normal | Bell-shaped, symmetric | Points fall on diagonal line | ✓ Normal - Proceed with t-test |
| Dataset 2 | Right-skewed | Long tail to the right, peak on left | Upward curve at right end (heavy right tail) | ⚠️ Skewed - Consider transformation or large n |
| Dataset 3 | Bimodal | Two distinct peaks | S-shaped curve or irregular | ✗ Bimodal - Investigate groups, don't transform |
| Dataset 4 | Heavy-tailed | Looks nearly normal but with outliers | Points deviate at both ends (tails) | ⚠️ Heavy tails - Check for outliers, consider robust methods |
| Dataset 5 | Left-skewed | Long tail to the left, peak on right | Downward curve at left end | ⚠️ Skewed - Consider transformation |
| Dataset 6 | Normal (mild noise) | Roughly bell-shaped with minor irregularities | Points close to line with minor deviations | ✓ Approximately normal - Proceed |
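If you need to regenerate practice data with these six shapes, something like the following NumPy sketch works. These are illustrative parameter choices of mine, not the module's actual datasets (which are in R):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

practice_sets = {
    # Dataset 1: normal - bell-shaped, symmetric
    "normal": rng.normal(50, 10, n),
    # Dataset 2: right-skewed - long right tail (e.g., reaction times)
    "right_skewed": rng.lognormal(3.0, 0.6, n),
    # Dataset 3: bimodal - two groups mixed together
    "bimodal": np.concatenate([rng.normal(40, 5, n // 2),
                               rng.normal(70, 5, n // 2)]),
    # Dataset 4: heavy-tailed - nearly normal center, occasional outliers
    "heavy_tailed": rng.standard_t(df=3, size=n) * 10 + 50,
    # Dataset 5: left-skewed - reflect a right-skewed draw
    "left_skewed": 100 - rng.lognormal(3.0, 0.6, n),
    # Dataset 6: approximately normal with minor irregularities
    "normal_noisy": rng.normal(50, 10, n) + rng.uniform(-2, 2, n),
}
```

A histogram plus a Q-Q plot (e.g., `scipy.stats.probplot`) of each entry reproduces the clues in the table above.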

Field Guide - Expected Entries

Strong Field Guide Characteristics:

Example Strong Entry (Right-skewed):

Right-Skewed Distribution
Histogram: Peak on left, long tail stretching right →
Most values cluster low, few extreme high values
Q-Q plot: Points curve UPWARD on right side
Action: Try log transformation or sqrt
Common in: Reaction times, income data, counts
            

Common Student Struggles

⚠️ Issue #1: Confusing which way the skew goes
Tip: "The skew points in the direction of the TAIL, not the peak. Right-skewed = tail goes right."
Memory aid: "Think of the tail as an arrow pointing in the direction of the skew."
⚠️ Issue #2: Not recognizing bimodal distributions
Tip: Emphasize that bimodality suggests TWO GROUPS. Don't transform - investigate! "Two peaks = two populations mixed together."
⚠️ Issue #3: Over-interpreting minor irregularities
Tip: Real data is messy. Minor wiggles are normal. Look for CLEAR patterns, not perfection.

Facilitation Strategy

💡 Recommended Approach:
  1. Individual exploration (10 min): Have students look at all 6 datasets quietly first
  2. Partner discussion (15 min): Compare observations and build field guides together
  3. Whole class reveal (15 min): Go through each dataset, have students share what they saw
  4. Key moment: When revealing bimodal dataset, ask "What might cause two peaks in real research?" (e.g., male/female, treatment/control, two species)
⏱️ Timing Reality Check:

This module WILL take longer than you expect (45-65 minutes typically). Students need time to:

Don't rush this. Pattern recognition is the most valuable skill they'll learn.

Discussion Questions

  1. "Why do you think Q-Q plots are more sensitive than histograms for detecting problems?"
  2. "What would you do if your histogram and Q-Q plot gave conflicting information?"
  3. "Dataset 3 was bimodal. In real research, what might cause this?" (Great for critical thinking)
  4. "Which distribution type do you think is most common in neuroscience/psychology data? Why?"

Module 3: Statistical Tests for Normality

Learning Objectives

Answer Keys

Question 1: Recording Results

Expected pattern:

Key insight: Same distribution type, different conclusions!

Questions 2-3: Comparing Visuals

Question 2 (Histograms): Should look very similar (both slightly right-skewed)

Question 3 (Q-Q plots): Both should show similar patterns (slight deviation from line)

The paradox: They LOOK the same but get different p-values!

Question 4: Pattern as n Increases

Correct observation: p-value tends to decrease as sample size increases

Strong explanation: "Larger samples give the test more 'power' to detect even tiny deviations from perfect normality. With small samples, the test might miss problems. With large samples, it detects everything—even deviations too small to matter practically."

Advanced insight: Some students might note that this creates a dilemma: when you have enough data to reliably detect violations, you also have enough data that violations don't matter much!
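The power pattern in Question 4 is easy to demonstrate live. The sketch below uses scipy's `shapiro` as a Python stand-in for R's `shapiro.test`, drawing from one fixed, mildly skewed population (an illustrative choice) at increasing sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def mildly_skewed(n):
    """Mostly normal with a gentle right skew (illustrative population)."""
    return rng.normal(0.0, 1.0, n) + 0.6 * rng.exponential(1.0, n)

# Median Shapiro-Wilk p-value over repeated samples at each n, so a
# single lucky or unlucky draw doesn't dominate the picture.
median_p = {}
for n in (20, 50, 200, 1000):
    pvals = [stats.shapiro(mildly_skewed(n)).pvalue for _ in range(200)]
    median_p[n] = float(np.median(pvals))

print(median_p)  # p-values shrink as n grows, for the SAME population
```

The population never changes; only n does. That is the paradox in one screen of output.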

Question 5: The Big Question - What Should You Do?

Ideal answer components:

Example strong answer:

"When I have a large sample (n > 50) and Shapiro-Wilk says p < .05 but the Q-Q plot looks only mildly skewed, I would proceed with the t-test because: (1) the t-test is robust to moderate non-normality with large samples due to the Central Limit Theorem, and (2) the Shapiro-Wilk test is overly sensitive with large samples, detecting trivial deviations that don't affect the validity of the test. I would note the mild skewness in my write-up but wouldn't transform unless the visual inspection showed severe problems."

Question 6: Robustness Demonstration

Expected result: Even with "failed" Shapiro-Wilk (p < .05), 95% CIs should still achieve ~95% coverage with n=100

Good answer: "Even though the Shapiro-Wilk test rejected normality, the confidence intervals still had approximately 95% coverage. This demonstrates that with large samples, the t-test is robust - it works correctly even when the formal assumption test says normality is violated."
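Question 6's coverage claim can also be checked by simulation. A hedged Python sketch (the Exponential(1) population is my illustrative stand-in for "data that fails Shapiro-Wilk"; the modules do this in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N_SIMS = 2000
n = 100
TRUE_MEAN = 1.0  # Exponential(1): clearly right-skewed, mean exactly 1

covered = 0
for _ in range(N_SIMS):
    sample = rng.exponential(TRUE_MEAN, n)
    # Standard 95% t-interval for the mean
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    if sample.mean() - half_width <= TRUE_MEAN <= sample.mean() + half_width:
        covered += 1

coverage = covered / N_SIMS
print(coverage)  # roughly .95, even though Shapiro-Wilk rejects normality here
```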

The Large Sample Paradox - Teaching It Well

💡 The Key Teaching Moment:

This is the most important conceptual hurdle in the entire module series. Students need to understand:

  1. Small samples: Test lacks power (might miss problems) BUT violations matter more
  2. Large samples: Test has high power (detects tiny problems) BUT violations matter less
  3. The paradox: The Shapiro-Wilk test is MOST likely to "fail" when you LEAST need to worry about it!

Effective analogy: "It's like a smoke detector that gets more sensitive the bigger your fire extinguisher is. When you have a small extinguisher (small n), it might not detect smoke (low power). When you have a huge extinguisher (large n), it goes off at the tiniest wisp of smoke (high power), but you don't need to worry because you can handle it."

Common Student Reactions

⚠️ Frustration #1: "So when DO I trust the p-value?!"
Response: "Always trust your EYES first. The p-value is just one piece of evidence. With n > 50, visual inspection matters more."
⚠️ Frustration #2: "This seems wishy-washy. I want a clear rule!"
Response: "Statistics isn't about following rules blindly - it's about informed judgment. That's why we're training your pattern recognition skills AND showing you the tests. Real data analysis requires both."

Decision Framework Table - Expected Understanding

| Scenario | Visual Check | Shapiro-Wilk | Sample Size | Recommendation |
|---|---|---|---|---|
| 1 | Looks normal | p > .05 | Any | ✓ Proceed with confidence |
| 2 | Clearly skewed | p < .05 | Small (n < 30) | Transform or use non-parametric |
| 3 | Mildly skewed | p < .05 | Large (n > 50) | Proceed (robust), note in write-up |
| 4 | Looks fine | p < .05 | Large (n > 100) | Trust your eyes, ignore p-value |
| 5 | Severe problems | Any | Any | Transform or non-parametric |
| 6 | Bimodal | Any | Any | Don't transform - investigate groups! |
💡 Have students add this table to their notes! It's a practical reference they'll use in every future analysis.

Facilitation Notes

Module 4: Data Transformations

Learning Objectives

Answer Keys

Question 1: Why Transformations Work

Strong answer includes:

Example: "A log transformation works on right-skewed data because it compresses large values proportionally more than small values. For instance, log(100) - log(10) = 1, but 100 - 10 = 90. This 'pulls in' the extreme right tail, making the distribution more symmetric."
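This compression effect can be quantified with a skewness statistic. A small Python illustration (the modules use R; the lognormal "reaction times" are simulated, so `log(rt)` is exactly normal by construction):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(11)

# Simulated reaction times in ms: lognormal, so strongly right-skewed
rt = rng.lognormal(mean=6.0, sigma=0.5, size=500)

skew_before = float(skew(rt))          # clearly positive
skew_after = float(skew(np.log(rt)))   # near zero: log undoes the skew

print(round(skew_before, 2), round(skew_after, 2))
```

The before/after pair is a good slide: one number captures what the before/after histograms show visually.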

Practice Dataset Results

Right-skewed dataset:

Left-skewed dataset:

Heavy-tailed dataset:

Question 2: When NOT to Transform

Key scenarios students should identify:

  1. Bimodal distributions - transformation won't fix underlying two-group structure
  2. Large samples with mild skew - robustness makes transformation unnecessary
  3. Data with meaningful zeros - log(0) is undefined
  4. When interpretability matters more than normality - original scale may be more meaningful

Example answer:

"You should NOT transform if: (1) you have a bimodal distribution because this suggests two distinct groups that should be analyzed separately, not squashed together, (2) you have a large sample (n > 100) with only mild skewness because the Central Limit Theorem makes the t-test robust anyway, or (3) your data has meaningful zeros (like reaction times or counts) and you'd lose interpretability with a log transformation."

Question 3: Interpretation Challenge

Scenario: Original data in milliseconds, mean = 450ms. After log transformation, mean = 6.1.

What students need to understand:

Strong interpretation:

"The mean on the log scale is 6.1, which back-transforms to a geometric mean of approximately 446 ms on the original scale (exp(6.1) ≈ 446). This is slightly lower than the arithmetic mean of 450 ms because the geometric mean is always at most the arithmetic mean, and right skew widens the gap. When we report results, we should either back-transform our estimates or clearly state we're working on the log scale."
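The back-transformation logic is worth showing numerically. A Python sketch with made-up reaction times (hypothetical values, chosen only to be right-skewed; the modules use R):

```python
import numpy as np

# Hypothetical right-skewed reaction times in ms (illustrative values)
rt = np.array([300, 320, 350, 380, 400, 420, 450, 500, 600, 900], dtype=float)

arith_mean = rt.mean()              # mean on the original scale
log_mean = np.log(rt).mean()        # mean on the log scale
geo_mean = float(np.exp(log_mean))  # back-transformed = geometric mean

print(round(float(arith_mean), 1), round(float(log_mean), 2), round(geo_mean, 1))
```

Students should notice the geometric mean lands below the arithmetic mean, just as in the scenario, because the single extreme value (900 ms) pulls the arithmetic mean up more than the log-scale mean.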

Transformation Selection Guide - What Students Should Learn

| Distribution Shape | First Try | If That Doesn't Work | R Code |
|---|---|---|---|
| Moderate right skew | Square root | Log | `sqrt(x)` or `log(x)` |
| Severe right skew | Log | Inverse | `log(x)` or `1/x` |
| Left skew | Reflect then sqrt | Reflect then log | `sqrt(max(x) - x)` or `log(max(x) + 1 - x)` |
| Heavy tails (both ends) | Check for outliers first! | Winsorize or robust methods | Don't transform blindly |
| Bimodal | DON'T TRANSFORM | Investigate groups | Split dataset or add grouping variable |

Common Student Mistakes

⚠️ Mistake #1: "I tried log but it made it worse!"
Likely cause: Data was left-skewed, not right-skewed. Need to reflect first.
Teaching moment: "Always look at the DIRECTION of skew before choosing transformation."
⚠️ Mistake #2: Transforming data with zeros or negative values
Problem: log(0) is undefined, log(negative) is complex
Solution: Add small constant: log(x + 1) or log(x + 0.5)
Caution: This changes interpretation! Mention in write-up.
⚠️ Mistake #3: Forgetting which scale they're on
Symptoms: "The mean reaction time was 6.1 milliseconds" (impossible - that's on log scale!)
Prevention: Always label transformed variables clearly: log_rt not just rt
⚠️ Mistake #4: Over-transforming
Example: Trying 5 different transformations to get p > .05
Problem: This is p-hacking! Choose transformation based on distribution shape, not p-value
Teaching point: "The goal isn't to maximize p-value - it's to make the distribution more symmetric."
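Mistake #2's fix takes ten seconds to demo numerically. A Python sketch (the count values are made up; R has a `log1p()` function as well):

```python
import numpy as np

counts = np.array([0, 1, 3, 7, 20, 55], dtype=float)

# np.log(counts) would emit -inf for the zero; shift before logging instead:
shifted = np.log(counts + 1)   # the classic log(x + 1) fix
precise = np.log1p(counts)     # numpy's dedicated version of the same idea

print(shifted[0], bool(np.allclose(shifted, precise)))
```

The zero maps to log(1) = 0, so every value stays finite, and students can see the +1 shift is exactly what `log1p` computes.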

Teaching Tips

💡 Make It Visual:

The "before and after" visual comparison is incredibly powerful. Have students:

  1. Save screenshot of "before" histogram and Q-Q plot
  2. Apply transformation
  3. Compare side-by-side with "after" plots
  4. "Wow, it actually worked!" moments are great for learning
💡 Real-World Context:

Explain why certain data types are naturally skewed:

"When you understand WHY data is skewed, you can predict what you'll need to do!"

Advanced Discussion Questions

  1. "If we transform our data, analyze it, and report results - are we being honest with our readers? How should we report this?"
  2. "Some researchers always analyze data on original scale even if skewed, arguing for interpretability. Others always transform to meet assumptions. Who's right?"
  3. "What would you do if your data required a transformation for normality, but your reader expects results in the original units (like milliseconds)?"
  4. "Can you think of any situations where the log scale is actually MORE meaningful than the original scale?" (Hint: fold-change, ratios, pH)

Facilitation Notes

Assessment & Grading Guidance

Formative Assessment Throughout Modules

Check for Understanding - Key Moments:

Summative Assessment Options

Option 1: Take-Home Analysis Assignment

Prompt: Provide students with 3 datasets (small n with clear skew, large n with mild skew, bimodal). Ask them to:

  1. Create and interpret visual diagnostics
  2. Run and interpret Shapiro-Wilk tests
  3. Make and justify decisions about transformation
  4. If transforming, show before/after comparison
  5. Explain their reasoning using concepts from all 4 modules

Grading rubric elements:

Option 2: In-Class Practical Exam

Format: 50-minute exam, students analyze 2 datasets using RStudio

Dataset 1 (20 points): Small sample, clear violation

Dataset 2 (20 points): Large sample, mild violation

Short answer (10 points):

Option 3: Peer Teaching Exercise

Format: Partners create a 5-minute "mini-lesson" teaching ONE concept to the class

Topics to assign:

Assessment criteria:

Benefit: Best way to cement understanding is teaching others!

Red Flags in Student Work

⚠️ Red Flag #1: "The Shapiro-Wilk test showed p = .03, so the data is not normal."
Issue: Treating p-value as binary truth rather than evidence
Look for: Nuanced discussion of sample size, visual inspection, severity
⚠️ Red Flag #2: Reporting means on transformed scale without back-transformation
Issue: "Mean log reaction time was 6.1 ms" (nonsensical units)
Look for: Either back-transformed results OR clear statement of scale
⚠️ Red Flag #3: Trying multiple transformations to get p > .05
Issue: P-hacking to "pass" normality test
Look for: Transformation choice justified by distribution shape, not p-value
⚠️ Red Flag #4: Transforming bimodal data
Issue: Fundamental misunderstanding - two groups shouldn't be squashed
Look for: Recognition that bimodality means investigate, don't transform

What Success Looks Like

A student who truly "gets it" will:

Troubleshooting & FAQs

Technical Issues

Issue: "The interactive modules won't run on student computers"
Solutions:
Issue: "Simulations are running too slowly"
Solutions:
Issue: "Students are getting different results from each other"
Response:

Pedagogical Questions

Q: "Should I teach parametric vs. non-parametric tests alongside this?"
A: These modules focus on checking assumptions and transformations. If you want to add non-parametric alternatives (Mann-Whitney, Kruskal-Wallis, Wilcoxon), consider creating Module 5 as an extension. Students need to master normality checking first before learning when to abandon parametric tests entirely.
Q: "My students have limited R experience - will they struggle?"
A: The modules include all necessary R code. Students primarily need to:
If very new to R, consider a 15-minute R basics review first.
Q: "What if students ask about other assumption tests (Levene's, etc.)?"
A: Great question! These modules focus on normality specifically. You can mention:
Consider creating supplementary materials if your course requires extensive assumption testing.

Frequently Asked Student Questions

Q: "Why can't we just always use non-parametric tests?"
A: "Non-parametric tests have lower power - they're less likely to detect real effects when they exist. Parametric tests are more efficient when assumptions are reasonably met. Also, parametric tests can handle covariates and complex designs more easily."
Q: "Do real researchers actually check all these assumptions?"
A: "Yes! Good researchers check assumptions. However, with experience, you learn when violations are likely and can sometimes predict what you'll find. But formal checking is important, especially when publishing."
Q: "What if my advisor tells me different rules than what we learned here?"
A: "Statistical practice varies across fields and even researchers. These modules teach you the concepts and reasoning. In practice, discuss with your advisor and justify your choices. What matters is thoughtful decision-making, not rigid rules."
Q: "Can I just use robust standard errors instead?"
A: "That's an advanced technique! Robust SEs help with heteroscedasticity and some violations, but they're not a cure-all. For now, master the basics of checking assumptions and transforming. Robust methods are a good next step."
Q: "This seems like a lot of work for one assumption..."
A: "It does! But normality is one of the most commonly violated assumptions, and mishandling it can invalidate your results. The time invested now will save you from making serious errors later. Plus, this pattern recognition skill transfers to other assumptions too."

Adaptation Suggestions

For Different Course Levels

For 100-level / Intro Courses:

For 300-level / Advanced Courses:

For Graduate Courses:

For Different Disciplines

Psychology/Social Sciences:

Neuroscience/Biology:

Animal Behavior:

For Different Time Constraints

If you only have ONE class session (75 min):

  1. Module 1 brief version (10 min): Why it matters
  2. Module 2 streamlined (30 min): Visual detection with 3 datasets instead of 6
  3. Module 3 core concept (20 min): Large sample paradox only
  4. Module 4 overview (15 min): Show one transformation example, provide handout for others

If you have TWO full sessions (150 min):

  1. Session 1: Modules 1 & 2 in full depth
  2. Session 2: Modules 3 & 4 in full depth
  3. This is the recommended pacing

If you have extended time (3+ sessions):

  1. Session 1: Module 1 + Discussion
  2. Session 2: Module 2 + Field guide creation
  3. Session 3: Module 3 + Paradox deep dive
  4. Session 4: Module 4 + Real data practice
  5. Session 5: Integration activity / assessment

Resources & References

For Further Reading (Instructor)

Student-Friendly Resources

R Package Documentation

Final Thoughts for Instructors

The Goal Is Decision-Making, Not Rule-Following

The most important thing students should learn from these modules is how to think about assumptions, not just how to run tests. Good statistical practice requires:

Students who leave saying:

Common Instructor Concerns

"Won't this complexity confuse students who just want clear rules?"

Short-term: Maybe. Long-term: No. Students who learn nuanced thinking become better researchers. Those who learn rigid rules become frustrated when real data doesn't fit the rules (which it never does).

"This takes a lot of class time for one assumption..."

True. But normality is violated more often than any other assumption, and mishandling it invalidates many analyses. Time invested here prevents major errors later. Plus, the thinking skills transfer to other assumptions.

"What if students get different answers from what I expect?"

Good! If they can justify their decision using concepts from the modules, that's more valuable than getting the "right" answer. Statistical analysis isn't always black and white.

You've Got This!

These modules represent a complete, research-based approach to teaching normality testing. They've been designed to:

Remember: The goal isn't perfection - it's thoughtful, informed decision-making. Help your students become statistical thinkers, not just test-runners.

Good luck, and enjoy teaching!