📚 Instructor's Guide: Regression Teaching Modules

Comprehensive Teaching Guide for Modules 1-4

Complete with Answer Keys, Teaching Tips, Common Errors, and Facilitation Notes

📑 Table of Contents

🎯 Course Overview & Pedagogical Approach

Course Philosophy

These modules emphasize practical application over theoretical derivation. Students learn regression as a research tool for answering scientific questions, not as a mathematical exercise. The focus is on:

Learning Objectives Across All Modules

By the end of the complete module series, students should be able to:

  1. Articulate when regression is the appropriate statistical test
  2. Run simple and multiple regression analyses in R
  3. Interpret regression output including coefficients, p-values, and R²
  4. Check and interpret diagnostic plots for assumption violations
  5. Make informed decisions when assumptions are violated
  6. Distinguish between practical and statistical significance
  7. Interpret and explain interaction effects
  8. Build and compare multiple regression models

Recommended Pacing

⏱️ TOTAL TIME: 6-8 hours of instruction + practice
💡 Teaching Tip

Don't rush Module 1! Students who understand WHY regression matters and WHEN to use it have much better intuition for interpreting results later. The conceptual foundation pays dividends.

Prerequisites

Students should have prior experience with:

Materials Needed

📊 Module 1: Why Regression Matters

⏱️ Estimated Time: 60-90 minutes

Learning Objectives

Key Concepts to Emphasize

  1. Correlation vs. Regression: Correlation measures strength/direction; regression makes predictions
  2. Dependent vs. Independent: Regression assumes directional relationship (unlike correlation)
  3. Practical Utility: Regression quantifies how much Y changes per unit of X
  4. R² Interpretation: Variance explained is NOT the same as causation
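Concepts 1 and 4 can be made concrete with a short R snippet (the data and variable names below are invented for illustration). It also shows a useful fact: in simple regression, R² is exactly the squared correlation, and neither implies causation.

```r
# Hypothetical data: hours studied and exam score (invented for illustration)
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
score <- c(52, 55, 61, 58, 66, 71, 70, 78)

r   <- cor(hours, score)   # correlation: ARE they related, and how strongly?
fit <- lm(score ~ hours)   # regression: HOW MUCH does score change per hour?
coef(fit)["hours"]         # slope, in points per hour

# In simple regression, R-squared is just the correlation squared
all.equal(r^2, summary(fit)$r.squared)  # TRUE
```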

Interactive Element: Drag-and-Drop Activity

✅ Answer Key

Correlation Questions:

  • "Is there a relationship between study time and exam scores?"
  • "Are anxiety and depression correlated?"

Regression Questions:

  • "For every additional hour of sleep, how much does reaction time improve?"
  • "Can we predict weight from height?"
  • "How much do test scores increase per hour of tutoring?"
💡 Teaching Tip

Students often struggle with the distinction between correlation and regression. Use this framing: "Correlation asks IF, regression asks HOW MUCH." Correlation = are they related? Regression = by how much does Y change when X changes?

⚠️ Common Student Error

Confusing R² with effect size: Students often think R² = 0.25 is "small." Emphasize that 25% of variance explained can be huge in behavioral science! Compare to real-world examples: even weather forecasts don't explain all variance.

Check Your Understanding Questions

✅ Question 1: Correlation vs. Regression

Q: "A researcher wants to know if stress and cortisol levels are related. Should they use correlation or regression?"

A: Either is appropriate if the goal is just to determine if they're related. Use regression if you want to predict cortisol from stress (treating stress as the predictor) OR if you want to quantify "how much does cortisol change per unit of stress?"

Discussion point: This is a great question for class discussion because it reveals that the choice depends on the research question, not just the variables.

✅ Question 2: Interpreting R²

Q: "A regression model predicting GPA from hours studied has R² = 0.18. Is this a good model?"

A: Yes, this is quite good for behavioral research! 18% of variance in GPA is explained by study time alone. Many other factors affect GPA (prior knowledge, test anxiety, course difficulty, etc.), so explaining nearly 1/5 of variance with one predictor is meaningful.

Important clarification: "Good" depends on context. In physics, R² = 0.18 might be disappointing. In social sciences, it's respectable. Emphasize that perfect prediction (R² = 1.0) is neither expected nor realistic in complex human behavior.

Real-World Example Discussion

💡 Facilitation Notes

The module uses body mass and metabolic rate as an example. Here are discussion prompts:

  • "Why is it useful to know that metabolic rate increases 30 kcal per kg of body mass?" (Answer: Dietary recommendations, medical dosing, energy needs)
  • "Could we reverse this and predict body mass from metabolic rate?" (Yes, but interpretation changes - we're usually more interested in predicting energy needs)
  • "What other factors might affect metabolic rate?" (Age, muscle mass, genetics, activity level - sets up multiple regression in Module 4)

Extension Activity for Advanced Students

Have students find a research article in their field that uses regression and identify:

  1. The dependent variable
  2. The independent variable(s)
  3. The reported R² value
  4. How the authors interpreted the practical significance

📈 Module 2: Fitting & Interpreting Simple Regression

⏱️ Estimated Time: 120-150 minutes

Learning Objectives

Key R Commands Students Must Master

# Basic regression model
model <- lm(dependent_var ~ independent_var, data = dataset)

# View output
summary(model)

# Predictions
predict(model, newdata = data.frame(independent_var = c(new_value)))

# Plotting
plot(independent_var, dependent_var)
abline(model, col = "blue", lwd = 2)
            

Practice Dataset: Sleep and Reaction Time

✅ Expected Output

When students run the sleep/reaction time analysis, they should get:

  • Intercept: ~250 ms (reaction time with 0 hours sleep - not interpretable!)
  • Slope: ~-15 ms per hour of sleep
  • R²: ~0.65 (65% of variance explained)
  • p-value: < 0.001 (highly significant)
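If you need a stand-in while preparing, data with roughly these properties can be simulated in R. Everything below (variable names, sample size, noise level) is assumed for illustration; the module's actual dataset will differ.

```r
# Simulated stand-in for the sleep/reaction-time data (parameters assumed)
set.seed(42)
sleep_hours <- runif(60, min = 4, max = 9)                   # hours slept
reaction_ms <- 250 - 15 * sleep_hours + rnorm(60, sd = 20)   # true slope -15

fit <- lm(reaction_ms ~ sleep_hours)
summary(fit)   # intercept near 250 ms, slope near -15 ms per hour
```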
💡 Teaching Tip: Interpreting the Intercept

Students WILL ask: "What does the intercept mean if someone can't sleep 0 hours and be alive?"

Your response: "Great question! The intercept is often uninterpretable because it's an extrapolation beyond your data range. You probably didn't measure anyone with 0 hours of sleep. The intercept is mathematically necessary but not always practically meaningful. Focus on the slope - that's the actionable part."

Check Your Understanding Questions

✅ Question 1: Interpreting Coefficients

Q: "A regression predicting depression scores from social support hours per week gives: Intercept = 45, Slope = -2.5. What does this mean?"

A:

  • Intercept (45): Predicted depression score when social support = 0 hours/week
  • Slope (-2.5): For each additional hour of social support per week, depression scores decrease by 2.5 points
  • Example: Someone with 10 hrs/week social support would have predicted depression = 45 + (-2.5)(10) = 20

Key point to emphasize: The slope tells us the RATE OF CHANGE. It's the "bang for your buck" - how much benefit per unit of the intervention.

✅ Question 2: Statistical vs. Practical Significance

Q: "A study with 10,000 participants finds that daily chocolate consumption (grams) predicts weight gain (kg/year): β = 0.002, p < 0.001. Is this meaningful?"

A: Statistically significant but practically trivial.

  • The p-value is tiny because of the huge sample size
  • But the effect is tiny: 0.002 kg per gram of chocolate
  • You'd need to eat 500g of chocolate daily to gain 1 kg/year
  • Lesson: Always consider effect size alongside p-values!
⚠️ Common Student Error: Confusing Correlation and Slope

Error: "The correlation is 0.7, so the slope must be 0.7 too."

Correction: Correlation (r) and slope (β) are related but NOT the same:

  • r is standardized (-1 to +1), independent of units
  • β depends on units (can be any value)
  • Relationship: β = r × (SD_y / SD_x)

Example: If predicting weight (kg) from height (cm), a strong correlation (r = 0.9) might give β = 0.5 kg per cm. The numbers are different because they measure different things!
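The identity β = r × (SD_y / SD_x) is easy to verify in R (the numbers below are invented for illustration):

```r
# Slope equals the correlation scaled by the ratio of standard deviations
height_cm <- c(160, 165, 170, 172, 178, 180, 185, 190)
weight_kg <- c(55, 60, 63, 66, 72, 74, 80, 86)

r    <- cor(height_cm, weight_kg)
beta <- unname(coef(lm(weight_kg ~ height_cm))["height_cm"])

all.equal(beta, r * sd(weight_kg) / sd(height_cm))  # TRUE: beta = r * (SD_y / SD_x)
```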

Practice Problems with Answer Keys

✅ Practice Problem 1: Running the Analysis

Dataset: Temperature (°F) and Ice Cream Sales ($1000s)

Task: Run regression and interpret results

R Code:

# Given data
temperature <- c(65, 70, 75, 80, 85, 90, 95)
sales <- c(12, 15, 18, 22, 28, 32, 38)

# Run regression
model <- lm(sales ~ temperature)
summary(model)

# Plot
plot(temperature, sales, pch = 19, col = "blue",
     xlab = "Temperature (°F)", ylab = "Sales ($1000s)")
abline(model, col = "red", lwd = 2)
                

Expected Results:

  • Intercept ≈ -46 (not meaningful - can't have negative sales)
  • Slope ≈ 0.87 ($870 more in sales per degree)
  • R² ≈ 0.98 (temperature explains 98% of variance)
  • p-value < 0.001

Interpretation: "For each 1°F increase in temperature, ice cream sales increase by approximately $870. Temperature is an excellent predictor of sales, explaining 98% of the variance."

✅ Practice Problem 2: Making Predictions

Q: Using the ice cream model above, predict sales when temperature is 88°F.

R Code:

predict(model, newdata = data.frame(temperature = 88))
                

A: Sales ≈ $30,540

By hand calculation: Sales = -46.14 + 0.8714(88) ≈ 30.54 thousand dollars

💡 Teaching Tip: Live Coding Demonstration

For this module, consider doing a live coding demonstration where you:

  1. Load a simple dataset
  2. Create a scatter plot first (always visualize!)
  3. Run lm() and show the output
  4. Add regression line to plot
  5. Make a prediction for a new value

Narrate your thought process: "Before I run regression, I always look at the data to check for outliers and non-linearity..."

Common Questions & How to Answer Them

❓ "Why is my p-value showing as '<2e-16'?"

A: That's R's way of saying the p-value is extremely small (less than 0.0000000000000002). It means "definitely statistically significant." R can't display the exact value because it's beyond the precision limit. Just report it as p < 0.001.

❓ "Can I predict values outside my data range?"

A: Technically yes, but be very cautious! This is called extrapolation and assumes the linear relationship continues beyond your data range. Often, relationships change outside the range you measured. Example: The relationship between fertilizer and crop yield might be linear from 0-100 lbs/acre, but at 500 lbs/acre, you might kill the plants! Stick to interpolation (within your data range) when possible.

Troubleshooting Student Issues

Student Issue → Likely Cause → Solution

  • "Error: object not found" → Likely cause: didn't load the data, or a typo in the variable name. Solution: check spelling; confirm the data loaded with ls()
  • "All my p-values are 1.0" → Likely cause: variables backwards in the formula. Solution: check that it's lm(Y ~ X), not lm(X ~ Y)
  • "My R² is negative" → Likely cause: this shouldn't happen with lm(). Solution: you are likely misreading adjusted R² (which can be slightly negative)
  • "Regression line doesn't appear on plot" → Likely cause: variables in the plot don't match the model. Solution: ensure the plot axes match the lm() formula exactly

🔍 Module 3: Checking Assumptions & Diagnostics

⏱️ Estimated Time: 90-120 minutes

Learning Objectives

The Four Key Assumptions

  1. Linearity: Relationship between X and Y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Variance of residuals is constant
  4. Normality: Residuals are normally distributed
💡 Teaching Philosophy

Emphasize pragmatism over perfectionism. Students often panic when they see any deviation from perfect assumptions. Your message should be: "Regression is robust to minor violations. We check assumptions to identify SERIOUS problems, not to achieve perfection."

Essential R Commands

# Generate all four diagnostic plots at once
par(mfrow = c(2, 2))
plot(model)

# Or individual plots:
plot(model, which = 1)  # Residuals vs Fitted (linearity & homoscedasticity)
plot(model, which = 2)  # Q-Q plot (normality)
plot(model, which = 3)  # Scale-Location (homoscedasticity)
plot(model, which = 5)  # Residuals vs Leverage (influential points)
            

Diagnostic Plot Interpretation Guide

Plot 1: Residuals vs. Fitted

✅ What to Look For
Pattern → Meaning → Action

  • Random scatter around 0 → ✅ Good! Assumptions met → Proceed with interpretation
  • Curved pattern (U-shape or inverted U) → ⚠️ Non-linearity → Consider transformation or polynomial terms
  • Funnel shape (widening or narrowing) → ⚠️ Heteroscedasticity → Consider transformation or weighted regression

Plot 2: Q-Q Plot

✅ What to Look For
Pattern → Meaning → Action

  • Points fall along the diagonal line → ✅ Residuals are normal → Proceed with interpretation
  • Points deviate at the ends but middle is good → ⚠️ Heavy-tailed distribution → Usually OK with large n; consider robust methods
  • S-shaped curve → ⚠️ Skewed residuals → Consider transformation (log, sqrt)
⚠️ Common Student Misconception

"My Q-Q plot isn't perfectly straight, so I can't use regression!"

Correction: Minor deviations are fine! With large samples (n > 30), regression is robust to non-normality. Focus on extreme deviations, especially systematic patterns. A few points off the line at the extremes is usually not a problem.

Plot 3: Scale-Location

✅ What to Look For

Ideal: Random scatter with horizontal red line (constant variance)

Problem: Upward or downward trend (variance changes with fitted values)

This is another check for homoscedasticity - if both Plot 1 and Plot 3 look good, you're safe!

Plot 4: Residuals vs. Leverage

✅ What to Look For

Cook's Distance contours: Points outside dotted lines are influential

High leverage + large residual = TROUBLE: These points are pulling the regression line

Decision guide:

  • Cook's D > 1: Definitely investigate
  • Cook's D > 0.5: Worth checking
  • Cook's D < 0.5: Probably fine
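In R, Cook's distances come straight from the fitted model with cooks.distance(). A minimal sketch, using invented data with one deliberately extreme point:

```r
# Invented data: the 20th point has high leverage AND a large residual
set.seed(1)
x <- c(1:19, 40)
y <- c(2 * (1:19) + rnorm(19), 10)   # last y breaks the otherwise linear trend
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
which(cd > 0.5)   # flags observation 20
max(cd)           # well above the "definitely investigate" cutoff of 1
```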

Practice Dataset Analysis

✅ Good Regression Example

Dataset: Study time vs. exam scores (provided in module)

Expected findings:

  • Residuals vs Fitted: Random scatter, no pattern
  • Q-Q Plot: Points follow diagonal line closely
  • Scale-Location: Horizontal red line, equal variance
  • Leverage: No points beyond Cook's distance contours

Conclusion: All assumptions met; results are trustworthy

✅ Problematic Regression Example

Dataset: Age vs. income (provided in module)

Expected findings:

  • Residuals vs Fitted: Funnel shape (heteroscedasticity)
  • Reason: Income variance increases with age (early career = similar salaries; mid-career = huge range)
  • Solution: Log-transform income OR use robust standard errors

R code for transformation:

# Log transform the dependent variable
model_log <- lm(log(income) ~ age, data = dataset)
par(mfrow = c(2, 2))
plot(model_log)  # Check if diagnostics improve
                

Check Your Understanding Questions

✅ Question 1: Interpreting Diagnostic Plots

Q: "My residuals vs. fitted plot shows a clear U-shape. What does this mean and what should I do?"

A: Non-linearity detected! The relationship between X and Y is not linear.

Options:

  1. Add a quadratic term: lm(Y ~ X + I(X^2))
  2. Transform variables: Try log, sqrt, or reciprocal transformations
  3. Use non-linear regression: If relationship is clearly non-linear

Example interpretation: "The curved pattern suggests the relationship between study time and test scores is non-linear. Perhaps there are diminishing returns - the first few hours of study help a lot, but after 10 hours, additional study helps less."
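Option 1 can be sketched with simulated diminishing-returns data (all numbers invented for illustration):

```r
# Simulated curved relationship: scores rise with study time, then level off
set.seed(7)
study <- runif(80, min = 0, max = 12)
score <- 40 + 8 * study - 0.4 * study^2 + rnorm(80, sd = 4)

fit_lin  <- lm(score ~ study)
fit_quad <- lm(score ~ study + I(study^2))   # add the quadratic term

anova(fit_lin, fit_quad)   # significant F-test: the quadratic term is needed
```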

✅ Question 2: When Violations Matter

Q: "My Q-Q plot shows slight deviation at the tails, but my sample size is 200. Should I be concerned?"

A: No, probably not. With n = 200:

  • Central Limit Theorem provides protection
  • Minor normality violations have minimal impact on inference
  • Focus on major deviations, not minor imperfections

When to worry about normality violations:

  • Small samples (n < 30) with severe skew
  • Extreme outliers pulling results
  • When making predictions at extreme values
✅ Question 3: Dealing with Outliers

Q: "One data point has very high leverage. Should I delete it?"

A: Not automatically! Follow this decision tree:

  1. Is it a data entry error? (e.g., age = 250) → Fix or remove
  2. Is it a valid but extreme observation? → Keep it, but report sensitivity analysis
  3. Does removing it change conclusions?
    • If YES → Report both analyses and discuss why
    • If NO → Keep it and mention robustness

Best practice: "We identified one participant with unusually high [X value]. Analyses with and without this participant yielded similar results (β = 0.45 vs. 0.43), suggesting findings are robust."
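A sensitivity analysis of this kind takes only a few lines in R (data invented; in practice you would use the flagged case's actual row index):

```r
# Compare slopes with and without one extreme case (row 31 here)
set.seed(3)
x <- c(rnorm(30, mean = 10, sd = 2), 25)   # case 31 is far out on X
y <- c(2 * x[1:30] + rnorm(30), 70)        # and off the trend on Y

fit_all  <- lm(y ~ x)
fit_drop <- lm(y ~ x, subset = -31)        # refit excluding case 31

c(with = coef(fit_all)["x"], without = coef(fit_drop)["x"])  # report both slopes
```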

Decision Flowchart for Assumption Violations

💡 Teaching Aid: Print This Flowchart

Give students this decision tree to reference:

LINEARITY VIOLATION (curved residuals vs fitted)
└─→ Try transformations (log, sqrt, polynomial)
└─→ If still curved: non-linear regression or GAM

HETEROSCEDASTICITY (funnel shape)
└─→ Transform DV (log, sqrt)
└─→ OR use robust standard errors
└─→ OR use weighted least squares

NORMALITY VIOLATION (Q-Q plot deviation)
└─→ Sample size > 30? → Usually OK, proceed
└─→ Sample size < 30? → Consider:
    ├─→ Bootstrap confidence intervals
    ├─→ Non-parametric alternatives
    └─→ Transformation

INFLUENTIAL POINTS (high Cook's D)
└─→ Check for data entry errors
└─→ Run sensitivity analysis
└─→ Report with and without influential cases
                

Common Student Questions & Answers

❓ "Do I need to check assumptions BEFORE running regression?"

A: No! Run the regression first, THEN check assumptions using diagnostic plots. You need the model to generate residuals, which you then examine. The workflow is: (1) Fit model, (2) Check diagnostics, (3) Revise if needed, (4) Interpret final model.

❓ "My diagnostics look perfect. Does that mean my model is correct?"

A: Not necessarily! Good diagnostics mean your model is appropriate for the data you have, but don't prove causation or guarantee you've included all relevant variables. You could still be missing important predictors or have reverse causation issues.

❓ "Which assumption is most important?"

A: Linearity is most critical because if the relationship isn't linear, all your coefficients are wrong. Normality is least critical (especially with large n). Independence violations are serious but can't be fixed with transformations - they require different analysis methods (e.g., mixed models for clustered data).

Group Activity: Diagnostic Detective

💡 Active Learning Exercise (20-30 minutes)

Setup: Provide 4 different diagnostic plot sets (good, non-linear, heteroscedastic, influential point)

Task: In pairs, students identify:

  1. Which assumptions are violated (if any)
  2. What they'd do to fix the problem
  3. Whether the violation is serious enough to worry about

Debrief: Discuss as class, emphasizing that often multiple approaches are acceptable

🎯 Module 4: Multiple Regression

⏱️ Estimated Time: 120-150 minutes

Learning Objectives

Key Conceptual Shift from Simple to Multiple Regression

💡 Critical Concept

This is THE BIG IDEA students must grasp:

In simple regression: β tells you how much Y changes per unit of X

In multiple regression: β tells you how much Y changes per unit of X holding all other predictors constant

Use this analogy: "It's like asking 'What's the unique effect of sleep on reaction time, after accounting for caffeine intake and stress level?'"

Essential R Commands

# Multiple regression with several predictors
model <- lm(Y ~ X1 + X2 + X3, data = dataset)
summary(model)

# With interaction term
model_int <- lm(Y ~ X1 * X2, data = dataset)  # Includes X1, X2, and X1:X2

# Check for multicollinearity
library(car)
vif(model)  # VIF > 10 is problematic, > 5 warrants attention

# Compare models
anova(model1, model2)  # Are they significantly different?
AIC(model1, model2)    # Lower AIC = better fit
            

Practice Dataset: Predicting Academic Performance

✅ Example Analysis

Research Question: What predicts final exam scores?

Predictors: Study hours, sleep hours, previous GPA

R Code:

# Load data (provided in module)
# Run multiple regression
model <- lm(exam_score ~ study_hours + sleep_hours + previous_gpa, 
            data = student_data)
summary(model)
                

Expected Output:

  • Intercept: ~30 (baseline score with all predictors = 0)
  • Study hours β: ~2.5 (each hour increases score by 2.5 points, holding sleep and GPA constant)
  • Sleep hours β: ~1.8 (each hour improves score by 1.8 points, holding study and GPA constant)
  • Previous GPA β: ~15 (each GPA point predicts 15-point higher exam score, holding study and sleep constant)
  • R²: ~0.72 (72% of variance explained by all three predictors together)
  • Adjusted R²: ~0.70 (accounts for number of predictors)
💡 Language Matters

Teach students to say: "Controlling for..." or "Holding constant..." or "Adjusting for..."

Example: "Study time predicts exam scores even after controlling for previous academic performance and sleep."

This language is crucial for showing they understand multiple regression!

Check Your Understanding Questions

✅ Question 1: Interpreting Coefficients

Q: "In a model predicting salary from years of experience, education level, and hours worked per week, the coefficient for experience is β = 2500. In a simple regression with only experience, β = 3200. Why did it change?"

A: This is EXPECTED and important!

  • In simple regression: β = 3200 captures ALL effects of experience (including indirect effects through education and hours)
  • In multiple regression: β = 2500 is the UNIQUE effect of experience after removing effects that overlap with education and hours
  • The difference (700) represents effects that experience shares with other variables (confounding)

Key insight: Multiple regression gives you the "pure" effect of each predictor, which is usually smaller than the simple regression coefficient.

✅ Question 2: R² vs. Adjusted R²

Q: "My R² = 0.45 but Adjusted R² = 0.41. Which should I report?"

A: Report both, but emphasize Adjusted R² for multiple regression.

Why they differ:

  • R² ALWAYS increases when you add predictors (even useless ones!)
  • Adjusted R² penalizes you for adding weak predictors
  • The gap tells you if you're overfitting

Interpretation: "The model explains 45% of variance in exam scores. After adjusting for the number of predictors (preventing overfitting), the model accounts for 41% of variance."
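The gap can be demonstrated directly by adding a pure-noise predictor (simulated data, names invented):

```r
# R-squared always rises when a predictor is added, even pure noise;
# adjusted R-squared applies a penalty and will often fall instead
set.seed(11)
n <- 50
x <- rnorm(n)
y <- 3 * x + rnorm(n)
noise <- rnorm(n)          # a predictor with no real relationship to y

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + noise)

summary(m2)$r.squared - summary(m1)$r.squared          # small but positive
summary(m2)$adj.r.squared - summary(m1)$adj.r.squared  # may well be negative
```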

✅ Question 3: Multicollinearity

Q: "I'm predicting weight from height (inches), height (cm), and age. My VIF values are huge. What's wrong?"

A: Height in inches and height in cm are perfectly correlated! They're the same variable in different units.

Multicollinearity problem: When predictors are highly correlated, the model can't separate their unique effects.

Solution: Remove one of the redundant height measures. Keep only height in one unit.

General VIF guidelines:

  • VIF < 5: No problem
  • VIF 5-10: Moderate concern, investigate
  • VIF > 10: Serious problem, must address
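The intuition behind VIF can be shown without the car package: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A sketch with simulated data (names and helper function invented):

```r
# VIF by hand: how well is each predictor explained by the others?
set.seed(5)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a duplicate of x1
x3 <- rnorm(100)                  # unrelated to the others

vif_hand <- function(target, others) {
  1 / (1 - summary(lm(target ~ others))$r.squared)
}

vif_hand(x1, cbind(x2, x3))   # huge: x2 is almost the same variable
vif_hand(x3, cbind(x1, x2))   # near 1: x3 carries unique information
```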

Interaction Effects: The Trickiest Part

💡 Teaching Strategy for Interactions

Use this framing: "An interaction means the effect of X1 DEPENDS ON the level of X2."

Example students relate to: "The effect of studying depends on sleep. When well-rested, studying helps a lot. When exhausted, studying doesn't help much."

✅ Interaction Example with Answer Key

Research Question: Does caffeine improve cognitive performance differently for morning vs. evening people?

R Code:

# Model WITHOUT interaction
model1 <- lm(performance ~ caffeine + chronotype, data = dataset)

# Model WITH interaction
model2 <- lm(performance ~ caffeine * chronotype, data = dataset)

# Compare models
anova(model1, model2)  # Is interaction significant?
summary(model2)  # Interpret coefficients
                

Expected Results:

  • Caffeine main effect: β = 5 (effect for morning people, baseline)
  • Chronotype main effect: β = -10 (evening people start 10 points lower)
  • Interaction: β = -3 (caffeine helps evening people 3 points LESS than morning people)

Interpretation: "Caffeine improves performance by 5 points for morning people, but only 2 points for evening people (5 - 3 = 2). The benefit of caffeine depends on chronotype."
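Students can check this arithmetic from the coefficient table alone. In the sketch below, the intercept b0 = 50 is assumed purely for illustration (it was not given above); the group effects fall out of the coefficients regardless of b0:

```r
# Reconstruct predicted means from the interaction model's coefficients
b0 <- 50           # assumed baseline (morning person, no caffeine)
b_caffeine <- 5
b_evening  <- -10
b_interact <- -3

# Caffeine effect for each group = difference in predicted means
morning_effect <- (b0 + b_caffeine) - b0
evening_effect <- (b0 + b_caffeine + b_evening + b_interact) - (b0 + b_evening)
c(morning = morning_effect, evening = evening_effect)   # 5 and 2
```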

⚠️ Common Student Error: Ignoring Main Effects

Error: "The interaction is significant, so I can ignore the main effects!"

Correction: NO! When you have a significant interaction, you MUST interpret it in context of the main effects. The interaction MODIFIES the main effects; it doesn't replace them.

Proper interpretation includes:

  1. Main effect of X1 (effect when X2 = 0 or at reference level)
  2. Main effect of X2 (effect when X1 = 0 or at reference level)
  3. Interaction (how the effect of X1 changes across levels of X2)

Model Comparison & Selection

✅ When to Add Predictors

Good reasons to add a predictor:

  • Theory suggests it's important
  • It improves Adjusted R² (not just R²!)
  • It's a confounder you need to control for
  • AIC/BIC decreases (better fit accounting for complexity)

Bad reasons to add predictors:

  • "I have the data, might as well throw it in"
  • "R² went from 0.40 to 0.401" (trivial increase)
  • Fishing for significant results
✅ Comparing Nested Models

Q: "Should I include age as a predictor?"

A: Test it statistically:

model_without_age <- lm(Y ~ X1 + X2, data = dataset)
model_with_age <- lm(Y ~ X1 + X2 + age, data = dataset)

# F-test for nested models
anova(model_without_age, model_with_age)
                

Decision:

  • If p < 0.05: Age significantly improves the model → Keep it
  • If p > 0.05: Age doesn't add meaningful information → Drop it (parsimony!)

Advanced Practice Problem

✅ Comprehensive Analysis Exercise

Scenario: Predicting depression scores from social support, exercise minutes/week, and sleep quality (1-10 scale)

Tasks for students:

  1. Run multiple regression with all three predictors
  2. Check VIF for multicollinearity
  3. Test if social support × exercise interaction improves model
  4. Check diagnostic plots
  5. Write up results in APA style

Expected findings:

  • All three predictors significant (p < 0.05)
  • VIF values < 3 (no multicollinearity)
  • Interaction NOT significant (p = 0.32)
  • Final model R² = 0.58

Model write-up:

"A multiple regression was conducted to predict depression scores from social support, exercise, and sleep quality. The overall model was significant, F(3, 96) = 44.3, p < .001, R² = .58. All three predictors contributed uniquely to the model. Social support (β = -2.1, p < .001) and sleep quality (β = -3.4, p < .001) were the strongest predictors, followed by exercise (β = -0.08, p = .02). Together, these variables explained 58% of the variance in depression scores."

Common Questions About Multiple Regression

❓ "How many predictors can I include?"

A: Rule of thumb: At least 10-15 observations per predictor

  • With n = 100: Safely include up to 6-10 predictors
  • With n = 30: Maximum 2-3 predictors
  • Fewer predictors = more statistical power and less overfitting

Prioritize theoretical importance over just adding every variable you have!

❓ "What if my predictors are correlated?"

A: Some correlation is expected and OK!

  • r < 0.7 between predictors: Usually fine
  • r = 0.7-0.9: Watch for multicollinearity (check VIF)
  • r > 0.9: Serious problem - consider dropping one or combining them

In real research, many variables ARE correlated. That's partly why we use multiple regression - to separate their unique effects!

❓ "Should I center my variables before creating interactions?"

A: Yes, for continuous variables!

Why: Centering (subtracting the mean) makes main effects easier to interpret and reduces multicollinearity between main effects and interaction terms.

# Center variables (scale() returns a 1-column matrix; as.numeric() keeps a plain vector)
dataset$X1_centered <- as.numeric(scale(dataset$X1, scale = FALSE))
dataset$X2_centered <- as.numeric(scale(dataset$X2, scale = FALSE))

# Then create interaction
model <- lm(Y ~ X1_centered * X2_centered, data = dataset)
                

📋 Assessment Strategies & Rubrics

Formative Assessment Throughout Modules

Each module includes "Check Your Understanding" questions. Use these to gauge comprehension before moving forward.

Summative Assessment: Comprehensive Regression Project

💡 Recommended Final Project

Task: Students analyze a provided dataset (or collect their own) and complete a full regression analysis with write-up.

Components:

  1. Research question and hypotheses (10%)
  2. Data exploration and visualization (15%)
  3. Regression analysis (simple or multiple) (20%)
  4. Diagnostic checking and interpretation (25%)
  5. Written interpretation of results (20%)
  6. Discussion of limitations and assumptions (10%)

Detailed Grading Rubric

Each component is graded on four levels:

Research Question
  • Exemplary (A): Clear, specific, appropriate for regression; well-justified
  • Proficient (B): Clear and appropriate but could be more specific
  • Developing (C): Vague or marginally appropriate for regression
  • Needs Work (D/F): Unclear or inappropriate for regression

Code Quality
  • Exemplary (A): Correct syntax, well-commented, efficient, reproducible
  • Proficient (B): Correct with minor errors, adequate comments
  • Developing (C): Works but has errors or is poorly organized
  • Needs Work (D/F): Multiple errors or doesn't run

Diagnostic Checking
  • Exemplary (A): All plots generated, correctly interpreted, appropriate actions taken for violations
  • Proficient (B): Plots generated, mostly correct interpretation
  • Developing (C): Some plots missing or misinterpreted
  • Needs Work (D/F): Diagnostics not checked or major misinterpretations

Interpretation
  • Exemplary (A): Coefficients, p-values, R² all correctly interpreted in plain language with context
  • Proficient (B): Mostly correct interpretation, minor errors or lacks context
  • Developing (C): Some correct elements but major misunderstandings
  • Needs Work (D/F): Fundamental misinterpretation of results

Statistical vs. Practical Significance
  • Exemplary (A): Discusses both, provides context for effect sizes
  • Proficient (B): Mentions both but limited discussion
  • Developing (C): Focuses only on p-values
  • Needs Work (D/F): Doesn't distinguish between types of significance

Limitations
  • Exemplary (A): Identifies multiple relevant limitations, discusses implications
  • Proficient (B): Identifies some limitations appropriately
  • Developing (C): Generic limitations without specificity
  • Needs Work (D/F): No discussion of limitations

Quick Checks for Understanding

✅ Minute Paper Prompts

Use these at the end of each module (3-5 minutes):

  • Module 1: "In one sentence, explain when you'd use regression instead of correlation."
  • Module 2: "If β = -5.2, what does this tell you about the relationship?"
  • Module 3: "Name one diagnostic plot and what it checks for."
  • Module 4: "What does it mean to say a predictor is significant 'controlling for' other variables?"

Concept Application Questions (No R Required)

✅ Conceptual Quiz Items

Question 1: A researcher finds that hours of TV watching predicts lower GPA (β = -0.15, p = 0.003). What does β = -0.15 mean?

Answer: For each additional hour of TV per day, GPA decreases by 0.15 points on average.

Question 2: You run a regression and get R² = 0.09. Your colleague says "That's terrible, only 9% explained!" How do you respond?

Answer: It depends on the field and complexity of the outcome. In behavioral science with many unmeasured influences, 9% can be meaningful. Consider effect size in context, not just the percentage.

Question 3: Your residuals vs. fitted plot shows a clear funnel shape. What assumption is violated and what should you do?

Answer: Homoscedasticity (constant variance) is violated. Consider log-transforming the DV, using robust standard errors, or weighted least squares.

Question 4: In multiple regression predicting salary from education and experience, the coefficient for education is β = 5000. Does this mean education causes higher salary?

Answer: No! Regression shows association, not causation. Many confounders could explain this relationship (e.g., family background, ability, motivation).

Practical Skills Checklist

Use this to verify students can perform essential tasks:

For each skill, mark either "Can Do" or "Needs Practice":

  • Load data into R
  • Run simple regression with lm()
  • Interpret summary() output
  • Generate diagnostic plots
  • Identify assumption violations from plots
  • Run multiple regression
  • Check VIF for multicollinearity
  • Test interaction effects
  • Compare nested models
  • Write results in scientific format

🔧 Technical Troubleshooting

Common R Error Messages & Solutions

  • Error in lm.fit: 0 (non-NA) cases. Cause: all rows are missing (NA). Solution: check the data import; look for NAs with summary(dataset).
  • Error: could not find function "lm". Cause: base R's stats package not loaded (rare). Solution: restart the R session.
  • Error: object 'variable_name' not found. Cause: a typo, or the data were never loaded. Solution: check spelling; verify with ls() and names(dataset).
  • Warning: essentially perfect fit. Cause: R² = 1.0, a deterministic relationship. Solution: check for duplicate variables or data entry errors.
  • Error in plot.window: need finite 'xlim' values. Cause: plotting with infinite or NA values. Solution: remove NAs with na.omit(dataset) or use na.rm = TRUE.
  • Error: variable lengths differ. Cause: X and Y have different numbers of observations. Solution: check lengths with length(); ensure complete cases.
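Two of these failure modes are easy to reproduce deliberately; a small sketch with made-up vectors that students can run and break themselves:

```r
# Made-up vectors that reproduce two common errors from the table above
x <- c(1, 2, 3, NA, 5)
y <- c(2.1, 3.9, 6.2, 8.0)   # one observation shorter than x

length(x) == length(y)       # FALSE: lm(y ~ x) here would stop with
                             # "variable lengths differ"

# Align the vectors, then keep only complete cases before modeling
d <- data.frame(x = x[1:4], y = y)
d <- d[complete.cases(d), ]
nrow(d)                      # the row with NA is gone; lm(y ~ x, data = d) now runs
```

Having students trigger an error on purpose, then fix it, takes away much of its sting.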

Module-Specific Technical Issues

Module 1 (Interactive Visualizations)

⚠️ Issue: Interactive elements not working

Causes:

  • JavaScript disabled in browser
  • Old browser version
  • Pop-up blockers interfering

Solutions:

  • Test in Chrome or Firefox (most compatible)
  • Disable pop-up blockers for the site
  • Hard refresh: Ctrl+Shift+R (Windows) or Cmd+Shift+R (Mac)
  • Have backup static images ready to project

Module 2 (Running Regression in R)

⚠️ Issue: Students can't create scatter plot with regression line

Most common error: Wrong variable order

# WRONG
plot(dependent_var, independent_var)  # X and Y reversed: DV ends up on the x-axis!

# CORRECT
plot(independent_var, dependent_var)  # X (predictor) first, Y (outcome) second
abline(model)                         # overlay the fitted regression line

Teaching tip: Remind students: "X goes on horizontal axis (first), Y on vertical (second)"

Module 3 (Diagnostic Plots)

⚠️ Issue: Plots appear too small or cramped

Solution:

# Before plotting, set up a 2x2 grid so all four diagnostic plots fit
par(mfrow = c(2, 2))
plot(model)

# Reset to a single plot afterwards
par(mfrow = c(1, 1))

Or use RStudio's "Zoom" button in the Plots pane

Module 4 (Multiple Regression)

⚠️ Issue: could not find function "vif"

Cause: Package 'car' not installed or not loaded (also note the function is lowercase: vif(), not VIF())

Solution:

# Install once per computer
install.packages("car")

# Load in every R session
library(car)
vif(model)  # values above roughly 5-10 suggest problematic multicollinearity

Data Import Issues

💡 Preventing Data Import Problems

Provide students with clean datasets:

  • Save as .csv (most universal)
  • No special characters in variable names
  • No spaces in variable names (use underscores)
  • First row = variable names
  • Missing data coded as NA (not blanks, "missing", or 999)
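If a dataset does arrive with sentinel codes anyway, a short recoding sketch students can adapt (the column names and values here are hypothetical):

```r
# Hypothetical dataset where missing values were entered as 999 or blank strings
data <- data.frame(
  score = c(85, 999, 72, 90),
  group = c("A", "", "B", "A"),
  stringsAsFactors = FALSE
)

data$score[data$score == 999] <- NA  # recode the numeric sentinel to NA
data$group[data$group == ""] <- NA   # recode blank strings to NA

summary(data$score)  # the NA now shows up in the summary, where lm() expects it
```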

Standard import code to give students:

# Set working directory (modify the path; use forward slashes even on Windows)
setwd("C:/Users/YourName/Documents/Stats")

# Import CSV
data <- read.csv("filename.csv", header = TRUE)

# Check import success
str(data)      # view structure
head(data)     # view first 6 rows
summary(data)  # check ranges and NA counts

Computer Lab Setup Recommendations

Before First Class Session

During Class

Alternative Approaches If Technology Fails

💡 Backup Plans

If R crashes or lab computers fail:

  1. Use RStudio Cloud: Browser-based R (requires internet)
  2. Demonstrate live: Project your screen, have students follow conceptually
  3. Switch to interpretation: Provide pre-run output, focus on interpretation rather than coding
  4. Paper-based practice: Use printed output for interpretation exercises

Student Support Resources

✅ Recommended Resources to Share

R Help:

  • Built-in help: ?lm or help(lm)
  • Quick-R: https://www.statmethods.net/
  • R for Data Science (free online): https://r4ds.had.co.nz/

Regression Tutorials:

  • UCLA Stats Consulting: https://stats.oarc.ucla.edu/r/
  • Regression assumptions: http://www.sthda.com/english/articles/39-regression-model-diagnostics/

Stack Overflow:

  • Teach students to search: "R regression [their specific problem]"
  • Most common errors have been answered!

🎓 Final Teaching Tips & Philosophy

Creating a Supportive Learning Environment

💡 Normalize Struggle

Say things like:

  • "Statistics is hard. If this feels challenging, you're doing it right."
  • "I still Google R error messages all the time. That's normal."
  • "There's rarely one 'right' answer in real data analysis. Multiple approaches can be valid."

Students fear looking "stupid" in stats classes. Your openness about difficulty and normalizing errors reduces anxiety.

Balancing Rigor with Practicality

These modules intentionally prioritize application over derivation. You may have students who want more mathematical detail. Here's how to handle that:

💡 For mathematically curious students

If students ask about formulas/derivations:

  • "That's a great question! The mathematical details are beyond our scope, but if you're interested, I can point you to resources."
  • Offer supplementary readings for theory
  • Frame it as: "We're focusing on using the tool correctly. Understanding the engine under the hood is great, but you don't need to be a mechanic to drive a car."

Common Pedagogical Challenges

Challenge 1: Students Who Just Want Cookbook Steps

⚠️ "Just tell me what buttons to click!"

Why this happens: Statistics anxiety makes students crave certainty and procedures

How to respond: Balance structure with thinking

  • Provide checklists and workflows (students appreciate structure)
  • BUT require them to explain WHY they're making each decision
  • Use case studies where the "right" answer depends on context

Example: "Here's the workflow for running regression [give checklist]. Now apply it to this dataset and explain why you made each choice."

Challenge 2: "My p-value is 0.06, what do I do?!"

💡 Addressing p-value worship

Students obsess over p = 0.05 cutoff. Teach nuance:

  • "p = 0.06 and p = 0.04 are not fundamentally different"
  • "Focus on effect size, confidence intervals, and practical significance"
  • "The 0.05 threshold is a convention, not a law of nature"

This is a perfect opportunity to discuss the replication crisis and how blind p-value chasing has harmed science!
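One concrete habit that shifts attention away from the 0.05 cutoff: always report the confidence interval next to the coefficient. A minimal sketch with simulated data (the true slope of 0.3 is an arbitrary choice for illustration):

```r
# Simulate a modest true effect (slope = 0.3) and fit a simple regression
set.seed(42)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)
model <- lm(y ~ x)

coef(summary(model))["x", ]  # estimate, std. error, t value, p value
confint(model)["x", ]        # 95% confidence interval for the slope
```

The interval communicates both the effect size and its uncertainty, which a lone p-value cannot.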

Challenge 3: Students Analyze Without Looking at Data

⚠️ Running analyses blind

Students jump straight to regression without exploring data first

Prevention strategy: Make visualization MANDATORY before any analysis

  • Require students to submit scatter plot before running regression
  • Have "what do you notice about the data?" discussion before analysis
  • Show examples of how visualization catches data entry errors

Mantra: "Always look at your data before analyzing it. Always."
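A look-first routine students can run before fitting anything; the variable names and the deliberately implausible value below are made up for illustration:

```r
# Quick exploratory checks before any model (hypothetical data)
data <- data.frame(
  hours = c(1, 2, 3, 4, 50),   # the 50 is a plausible data-entry error
  score = c(55, 60, 64, 70, 72)
)

summary(data)                 # the ranges flag impossible values immediately
plot(data$hours, data$score,  # the scatter plot makes the outlier obvious
     xlab = "Hours studied", ylab = "Exam score")
```

Requiring this two-line habit before every regression catches most data-entry errors for free.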

Adapting for Different Student Levels

For struggling students:

For advanced students:

Connecting to Students' Research Interests

💡 Making it relevant

First day activity: Have students share their research interests

Throughout modules, connect examples to their fields:

  • Neuroscience: Predicting reaction time from brain activation
  • Animal behavior: Territory size predicting mating success
  • Sensation/Perception: Contrast sensitivity as a function of spatial frequency
  • Clinical: Treatment duration predicting symptom improvement

When students see regression as a tool for THEIR questions, engagement skyrockets!

Building Statistical Intuition

Beyond procedural knowledge, aim to develop students' statistical intuition:

✅ Questions that build intuition
  • "Before we run the analysis, what do you predict we'll find?"
  • "Does this result make sense? Why or why not?"
  • "If you were peer-reviewing this study, what would concern you?"
  • "How would you explain this finding to someone without statistics training?"
  • "What additional data would strengthen these conclusions?"

Managing Your Time

  • Creating custom datasets for your field (2-3 hours): ✓ YES, massively increases engagement.
  • Detailed individualized feedback on code (high, 10+ min/student): ⚠️ MAYBE, use group feedback for common errors instead.
  • Live coding demonstrations (10-15 min/module): ✓ YES, students learn by watching you think through problems.
  • Creating answer keys for all practice problems (3-4 hours): ✓ YES, saves time answering the same questions repeatedly.
  • Office hours before assignments are due (variable): ✓ YES, prevents frustration and late submissions.

Assessment Time-Savers

💡 Efficient grading strategies
  • Use rubrics religiously: Speeds grading and ensures consistency
  • Grade holistically: Don't mark every tiny error; focus on major concepts
  • Provide group feedback: Make a document of "common errors on Assignment X" instead of repeating the same comment
  • Use completion grades for practice: Full credit if they attempted all parts; save detailed grading for major assessments
  • Peer review: Have students review each other's work (with a rubric) before submitting

Measuring Success

How do you know if your teaching is working? Look for these signs:

✅ Success indicators
  • Students ask "why" not just "how": Shows conceptual engagement
  • Students catch their own errors: Developing self-monitoring
  • Students connect material to their research: Transfer of learning
  • Students debate interpretation: Shows they're thinking critically, not just following recipes
  • Decreased anxiety over the semester: Building confidence

Continuous Improvement

Keep notes on what works and what doesn't:

Revise modules based on this feedback. Teaching statistics is iterative!

🎉 You've Got This!

Teaching regression can feel daunting, but remember: your enthusiasm and support matter more than perfect explanations. Students will struggle - that's part of learning. Your role is to guide them through the struggle with patience and clarity.

Key takeaways for successful teaching:

Good luck with your regression modules! 📊✨

Questions or feedback on this instructor's guide?

Keep notes on what works in your specific teaching context and adapt these materials accordingly!