Comprehensive Teaching Guide for Modules 1-4
Complete with Answer Keys, Teaching Tips, Common Errors, and Facilitation Notes
These modules emphasize practical application over theoretical derivation. Students learn regression as a research tool for answering scientific questions, not as a mathematical exercise. The focus is on:
By the end of the complete module series, students should be able to:
Don't rush Module 1! Students who understand WHY regression matters and WHEN to use it have much better intuition for interpreting results later. The conceptual foundation pays dividends.
Students should have prior experience with:
Correlation Questions:
Regression Questions:
Students often struggle with the distinction between correlation and regression. Use this framing: "Correlation asks IF, regression asks HOW MUCH." Correlation = are they related? Regression = by how much does Y change when X changes?
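To make the "IF vs. HOW MUCH" framing concrete, a quick live demo can help. This is a minimal sketch with simulated stress/cortisol-style data; the variable names and numbers are illustrative, not from the module datasets:

```r
# Illustrative data: the "IF" question vs. the "HOW MUCH" question
set.seed(42)
stress   <- rnorm(50, mean = 20, sd = 5)
cortisol <- 10 + 0.8 * stress + rnorm(50, sd = 3)

# Correlation: ARE they related? (one unitless number)
cor(stress, cortisol)

# Regression: by HOW MUCH does cortisol change per unit of stress?
coef(lm(cortisol ~ stress))["stress"]  # slope, in cortisol units per stress unit
```

Running both on the same data makes the contrast vivid: correlation returns a single unitless index, while the slope carries units students can interpret.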
Confusing R² with effect size: Students often think R² = 0.25 is "small." Emphasize that 25% of variance explained can be huge in behavioral science! Compare to real-world examples: even weather forecasts don't explain all variance.
Q: "A researcher wants to know if stress and cortisol levels are related. Should they use correlation or regression?"
A: Either is appropriate if the goal is just to determine if they're related. Use regression if you want to predict cortisol from stress (treating stress as the predictor) OR if you want to quantify "how much does cortisol change per unit of stress?"
Discussion point: This is a great question for class discussion because it reveals that the choice depends on the research question, not just the variables.
Q: "A regression model predicting GPA from hours studied has R² = 0.18. Is this a good model?"
A: Yes, this is quite good for behavioral research! 18% of variance in GPA is explained by study time alone. Many other factors affect GPA (prior knowledge, test anxiety, course difficulty, etc.), so explaining nearly 1/5 of variance with one predictor is meaningful.
Important clarification: "Good" depends on context. In physics, R² = 0.18 might be disappointing. In social sciences, it's respectable. Emphasize that perfect prediction (R² = 1.0) is neither expected nor realistic in complex human behavior.
The module uses body mass and metabolic rate as an example. Here are discussion prompts:
Have students find a research article in their field that uses regression and identify:
lm()
# Basic regression model
model <- lm(dependent_var ~ independent_var, data = dataset)
# View output
summary(model)
# Predictions
predict(model, newdata = data.frame(independent_var = c(new_value)))
# Plotting
plot(independent_var, dependent_var)
abline(model, col = "blue", lwd = 2)
When students run the sleep/reaction time analysis, they should get:
Students WILL ask: "What does the intercept mean if someone can't sleep 0 hours and be alive?"
Your response: "Great question! The intercept is often uninterpretable because it's an extrapolation beyond your data range. You probably didn't measure anyone with 0 hours of sleep. The intercept is mathematically necessary but not always practically meaningful. Focus on the slope - that's the actionable part."
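One way to defuse the intercept question in a live demo is to center the predictor, so the intercept becomes the predicted value at the *average* of X rather than at an impossible zero. This is a sketch with simulated data; the variable names and numbers are illustrative:

```r
# Illustrative data: centering makes the intercept interpretable
set.seed(1)
sleep_hours   <- runif(40, 4, 9)
reaction_time <- 400 - 15 * sleep_hours + rnorm(40, sd = 10)

# Raw model: intercept = predicted reaction time at 0 hours of sleep (extrapolation!)
coef(lm(reaction_time ~ sleep_hours))

# Centered model: intercept = predicted reaction time at AVERAGE sleep
sleep_c <- sleep_hours - mean(sleep_hours)
coef(lm(reaction_time ~ sleep_c))  # slope is unchanged; only the intercept moves
```

The slope is identical in both models; only the intercept's meaning changes, which reinforces that the slope is the actionable part.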
Q: "A regression predicting depression scores from social support hours per week gives: Intercept = 45, Slope = -2.5. What does this mean?"
A:
Key point to emphasize: The slope tells us the RATE OF CHANGE. It's the "bang for your buck" - how much benefit per unit of the intervention.
Q: "A study with 10,000 participants finds that daily chocolate consumption (grams) predicts weight gain (kg/year): β = 0.002, p < 0.001. Is this meaningful?"
A: Statistically significant but practically trivial.
Error: "The correlation is 0.7, so the slope must be 0.7 too."
Correction: Correlation (r) and slope (β) are related but NOT the same:
Example: If predicting weight (kg) from height (cm), a strong correlation (r = 0.9) might give β = 0.5 kg per cm. The numbers are different because they measure different things!
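The link between r and the slope can be shown numerically: the slope equals r scaled by the ratio of standard deviations, slope = r × (sd of Y / sd of X). A quick base-R check with simulated height/weight-style data (names and values are illustrative):

```r
# Demonstrate the identity: slope = r * sd(Y) / sd(X)
set.seed(7)
height_cm <- rnorm(60, 170, 10)
weight_kg <- 0.5 * height_cm - 30 + rnorm(60, sd = 4)

r     <- cor(height_cm, weight_kg)
slope <- coef(lm(weight_kg ~ height_cm))["height_cm"]

all.equal(unname(slope), r * sd(weight_kg) / sd(height_cm))  # TRUE
```

This makes the "related but not the same" point concrete: r is unitless, while the slope is r rescaled into the units of Y per unit of X.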
Dataset: Temperature (°F) and Ice Cream Sales ($1000s)
Task: Run regression and interpret results
R Code:
# Given data
temperature <- c(65, 70, 75, 80, 85, 90, 95)
sales <- c(12, 15, 18, 22, 28, 32, 38)
# Run regression
model <- lm(sales ~ temperature)
summary(model)
# Plot
plot(temperature, sales, pch = 19, col = "blue",
     xlab = "Temperature (°F)", ylab = "Sales ($1000s)")
abline(model, col = "red", lwd = 2)
Expected Results:
Interpretation: "For each 1°F increase in temperature, ice cream sales increase by approximately $870. Temperature is an excellent predictor of sales, explaining 98% of the variance."
Q: Using the ice cream model above, predict sales when temperature is 88°F.
R Code:
predict(model, newdata = data.frame(temperature = 88))
A: Sales ≈ $30,540
By hand calculation: Sales = -46.14 + 0.8714(88) ≈ 30.54 thousand dollars
For this module, consider doing a live coding demonstration where you:
Run lm() and show the output
Narrate your thought process: "Before I run regression, I always look at the data to check for outliers and non-linearity..."
A: That's R's way of saying the p-value is extremely small (less than 0.0000000000000002). It means "definitely statistically significant." R can't display the exact value because it's beyond the precision limit. Just report it as p < 0.001.
A: Technically yes, but be very cautious! This is called extrapolation and assumes the linear relationship continues beyond your data range. Often, relationships change outside the range you measured. Example: The relationship between fertilizer and crop yield might be linear from 0-100 lbs/acre, but at 500 lbs/acre, you might kill the plants! Stick to interpolation (within your data range) when possible.
| Student Issue | Likely Cause | Solution |
|---|---|---|
| "Error: object not found" | Didn't load data or typo in variable name | Check spelling, ensure data loaded with ls() |
| "All my p-values are 1.0" | Variables backwards in formula | Check: lm(Y ~ X) not lm(X ~ Y) |
| "My R² is negative" | Misreading adjusted R², which can be slightly negative (plain R² from lm() cannot be) | Check the Multiple R-squared line in summary(model) |
| "Regression line doesn't appear on plot" | Variables in plot don't match model | Ensure plot axes match lm() formula exactly |
Emphasize pragmatism over perfectionism. Students often panic when they see any deviation from perfect assumptions. Your message should be: "Regression is robust to minor violations. We check assumptions to identify SERIOUS problems, not to achieve perfection."
# Generate all four diagnostic plots at once
par(mfrow = c(2, 2))
plot(model)
# Or individual plots:
plot(model, which = 1) # Residuals vs Fitted (linearity & homoscedasticity)
plot(model, which = 2) # Q-Q plot (normality)
plot(model, which = 3) # Scale-Location (homoscedasticity)
plot(model, which = 5) # Residuals vs Leverage (influential points)
| Pattern | Meaning | Action |
|---|---|---|
| Random scatter around 0 | ✅ Good! Assumptions met | Proceed with interpretation |
| Curved pattern (U-shape or inverted U) | ⚠️ Non-linearity | Consider transformation or polynomial terms |
| Funnel shape (widening or narrowing) | ⚠️ Heteroscedasticity | Consider transformation or weighted regression |
| Pattern | Meaning | Action |
|---|---|---|
| Points fall along diagonal line | ✅ Residuals are normal | Proceed with interpretation |
| Points deviate at the ends but middle is good | ⚠️ Heavy-tailed distribution | Usually OK with large n; consider robust methods |
| S-shaped curve | ⚠️ Skewed residuals | Consider transformation (log, sqrt) |
"My Q-Q plot isn't perfectly straight, so I can't use regression!"
Correction: Minor deviations are fine! With large samples (n > 30), regression is robust to non-normality. Focus on extreme deviations, especially systematic patterns. A few points off the line at the extremes is usually not a problem.
Ideal: Random scatter with horizontal red line (constant variance)
Problem: Upward or downward trend (variance changes with fitted values)
This is another check for homoscedasticity - if both Plot 1 and Plot 3 look good, you're safe!
Cook's Distance contours: Points outside dotted lines are influential
High leverage + large residual = TROUBLE: These points are pulling the regression line
Decision guide:
Dataset: Study time vs. exam scores (provided in module)
Expected findings:
Conclusion: All assumptions met; results are trustworthy
Dataset: Age vs. income (provided in module)
Expected findings:
R code for transformation:
# Log transform the dependent variable
model_log <- lm(log(income) ~ age, data = dataset)
par(mfrow = c(2, 2))
plot(model_log) # Check if diagnostics improve
Q: "My residuals vs. fitted plot shows a clear U-shape. What does this mean and what should I do?"
A: Non-linearity detected! The relationship between X and Y is not linear.
Options:
Add a polynomial term: lm(Y ~ X + I(X^2))
Example interpretation: "The curved pattern suggests the relationship between study time and test scores is non-linear. Perhaps there are diminishing returns - the first few hours of study help a lot, but after 10 hours, additional study helps less."
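The polynomial-term fix can be demonstrated live with simulated diminishing-returns data (the variable names and coefficients here are illustrative, not from the module dataset):

```r
# Simulated diminishing-returns data: scores rise, then level off
set.seed(3)
study_time <- runif(80, 0, 15)
test_score <- 40 + 8 * study_time - 0.3 * study_time^2 + rnorm(80, sd = 5)

linear_fit    <- lm(test_score ~ study_time)
quadratic_fit <- lm(test_score ~ study_time + I(study_time^2))

# Does the squared term significantly improve fit?
anova(linear_fit, quadratic_fit)

# Re-check diagnostics: the U-shape in residuals vs. fitted should disappear
par(mfrow = c(1, 2))
plot(linear_fit, which = 1)
plot(quadratic_fit, which = 1)
```

Comparing the two residual plots side by side shows students exactly what "fixing non-linearity" looks like.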
Q: "My Q-Q plot shows slight deviation at the tails, but my sample size is 200. Should I be concerned?"
A: No, probably not. With n = 200:
When to worry about normality violations:
Q: "One data point has very high leverage. Should I delete it?"
A: Not automatically! Follow this decision tree:
Best practice: "We identified one participant with unusually high [X value]. Analyses with and without this participant yielded similar results (β = 0.45 vs. 0.43), suggesting findings are robust."
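The sensitivity analysis described above can be scripted directly. Here is a sketch using simulated data with one deliberately planted high-leverage point; the data, cutoffs, and variable names are illustrative:

```r
# Identify the most influential case and refit without it
set.seed(9)
x <- c(rnorm(30, 10, 2), 25)                   # one high-leverage point at x = 25
y <- c(2 * x[1:30] + rnorm(30, sd = 2), 60)

model   <- lm(y ~ x)
cooks_d <- cooks.distance(model)
worst   <- which.max(cooks_d)                  # index of the most influential case

model_trim <- lm(y[-worst] ~ x[-worst])

# Report both slopes: similar values suggest the findings are robust
c(with_point = coef(model)["x"], without_point = coef(model_trim)[2])
```

Reporting both coefficients, as in the write-up template above, is the habit to instill: the point is transparency, not deletion.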
Give students this decision tree to reference:
LINEARITY VIOLATION (curved residuals vs fitted)
└─→ Try transformations (log, sqrt, polynomial)
└─→ If still curved: non-linear regression or GAM
HETEROSCEDASTICITY (funnel shape)
└─→ Transform DV (log, sqrt)
└─→ OR use robust standard errors
└─→ OR use weighted least squares
NORMALITY VIOLATION (Q-Q plot deviation)
└─→ Sample size > 30? → Usually OK, proceed
└─→ Sample size < 30? → Consider:
├─→ Bootstrap confidence intervals
├─→ Non-parametric alternatives
└─→ Transformation
INFLUENTIAL POINTS (high Cook's D)
└─→ Check for data entry errors
└─→ Run sensitivity analysis
└─→ Report with and without influential cases
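Of the heteroscedasticity fixes in the tree above, weighted least squares can be run in base R via the weights argument of lm(). A sketch with simulated funnel-shaped data; the choice of weights (1/x²) assumes the error standard deviation grows proportionally with x:

```r
# Simulated heteroscedastic data: spread of Y grows with X
set.seed(5)
x <- runif(100, 1, 10)
y <- 3 + 2 * x + rnorm(100, sd = 0.8 * x)   # error SD increases with x

ols <- lm(y ~ x)                      # ordinary least squares
wls <- lm(y ~ x, weights = 1 / x^2)   # downweight the high-variance observations

# Slopes are similar, but the WLS standard errors better reflect the data
summary(ols)$coefficients["x", ]
summary(wls)$coefficients["x", ]
```

In practice the right weights are rarely known exactly; this is why robust standard errors (which require no variance model) are often the more defensible classroom recommendation.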
A: No! Run the regression first, THEN check assumptions using diagnostic plots. You need the model to generate residuals, which you then examine. The workflow is: (1) Fit model, (2) Check diagnostics, (3) Revise if needed, (4) Interpret final model.
A: Not necessarily! Good diagnostics mean your model is appropriate for the data you have, but don't prove causation or guarantee you've included all relevant variables. You could still be missing important predictors or have reverse causation issues.
A: Linearity is most critical because if the relationship isn't linear, all your coefficients are wrong. Normality is least critical (especially with large n). Independence violations are serious but can't be fixed with transformations - they require different analysis methods (e.g., mixed models for clustered data).
Setup: Provide 4 different diagnostic plot sets (good, non-linear, heteroscedastic, influential point)
Task: In pairs, students identify:
Debrief: Discuss as class, emphasizing that often multiple approaches are acceptable
This is THE BIG IDEA students must grasp:
In simple regression: β tells you how much Y changes per unit of X
In multiple regression: β tells you how much Y changes per unit of X holding all other predictors constant
Use this analogy: "It's like asking 'What's the unique effect of sleep on reaction time, after accounting for caffeine intake and stress level?'"
# Multiple regression with several predictors
model <- lm(Y ~ X1 + X2 + X3, data = dataset)
summary(model)
# With interaction term
model_int <- lm(Y ~ X1 * X2, data = dataset) # Includes X1, X2, and X1:X2
# Check for multicollinearity
library(car)
vif(model) # VIF > 10 is problematic, > 5 warrants attention
# Compare models
anova(model1, model2) # Are they significantly different?
AIC(model1, model2) # Lower AIC = better fit
Research Question: What predicts final exam scores?
Predictors: Study hours, sleep hours, previous GPA
R Code:
# Load data (provided in module)
# Run multiple regression
model <- lm(exam_score ~ study_hours + sleep_hours + previous_gpa,
            data = student_data)
summary(model)
Expected Output:
Teach students to say: "Controlling for..." or "Holding constant..." or "Adjusting for..."
Example: "Study time predicts exam scores even after controlling for previous academic performance and sleep."
This language is crucial for showing they understand multiple regression!
Q: "In a model predicting salary from years of experience, education level, and hours worked per week, the coefficient for experience is β = 2500. In a simple regression with only experience, β = 3200. Why did it change?"
A: This is EXPECTED and important!
Key insight: Multiple regression gives you the "pure" effect of each predictor, which is usually smaller than the simple regression coefficient.
Q: "My R² = 0.45 but Adjusted R² = 0.41. Which should I report?"
A: Report both, but emphasize Adjusted R² for multiple regression.
Why they differ:
Interpretation: "The model explains 45% of variance in exam scores. After adjusting for the number of predictors (preventing overfitting), the model accounts for 41% of variance."
Q: "I'm predicting weight from height (inches), height (cm), and age. My VIF values are huge. What's wrong?"
A: Height in inches and height in cm are perfectly correlated! They're the same variable in different units.
Multicollinearity problem: When predictors are highly correlated, the model can't separate their unique effects.
Solution: Remove one of the redundant height measures. Keep only height in one unit.
General VIF guidelines:
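If the car package is unavailable in the lab, VIF can be computed by hand: regress each predictor on the remaining predictors and apply VIF = 1 / (1 - R²). A base-R sketch with simulated data (variable names and the induced correlation are illustrative):

```r
# Manual VIF: regress each predictor on the others
set.seed(11)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(100)

vif_manual <- function(j, predictors) {
  r2 <- summary(lm(predictors[[j]] ~ ., data = predictors[-j]))$r.squared
  1 / (1 - r2)
}

predictors <- data.frame(x1, x2, x3)
sapply(seq_along(predictors), vif_manual, predictors = predictors)
# x1 and x2 should show inflated VIFs; x3 should be near 1
```

Working through this by hand also demystifies what VIF measures: how well each predictor is explained by the others.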
Use this framing: "An interaction means the effect of X1 DEPENDS ON the level of X2."
Example students relate to: "The effect of studying depends on sleep. When well-rested, studying helps a lot. When exhausted, studying doesn't help much."
Research Question: Does caffeine improve cognitive performance differently for morning vs. evening people?
R Code:
# Model WITHOUT interaction
model1 <- lm(performance ~ caffeine + chronotype, data = dataset)
# Model WITH interaction
model2 <- lm(performance ~ caffeine * chronotype, data = dataset)
# Compare models
anova(model1, model2) # Is interaction significant?
summary(model2) # Interpret coefficients
Expected Results:
Interpretation: "Caffeine improves performance by 5 points for morning people, but only 2 points for evening people (5 - 3 = 2). The benefit of caffeine depends on chronotype."
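The "effect within each group" arithmetic can be shown as simple slopes computed directly from the coefficients. This sketch simulates data matching the scenario above (a 0/1 chronotype dummy; all names and numbers are illustrative):

```r
# Simulated interaction: caffeine helps morning types more than evening types
set.seed(8)
n <- 120
caffeine    <- runif(n, 0, 4)                  # cups per day
chronotype  <- rbinom(n, 1, 0.5)               # 0 = morning, 1 = evening
performance <- 50 + 5 * caffeine - 3 * caffeine * chronotype + rnorm(n, sd = 4)

m <- lm(performance ~ caffeine * chronotype)
b <- coef(m)

# Simple slopes: effect of caffeine within each chronotype group
c(morning = b[["caffeine"]],
  evening = b[["caffeine"]] + b[["caffeine:chronotype"]])
```

Having students derive the evening-type slope by adding the interaction coefficient to the main effect reinforces that the interaction MODIFIES, not replaces, the main effect.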
Error: "The interaction is significant, so I can ignore the main effects!"
Correction: NO! When you have a significant interaction, you MUST interpret it in context of the main effects. The interaction MODIFIES the main effects; it doesn't replace them.
Proper interpretation includes:
Good reasons to add a predictor:
Bad reasons to add predictors:
Q: "Should I include age as a predictor?"
A: Test it statistically:
model_without_age <- lm(Y ~ X1 + X2, data = dataset)
model_with_age <- lm(Y ~ X1 + X2 + age, data = dataset)
# F-test for nested models
anova(model_without_age, model_with_age)
Decision:
Scenario: Predicting depression scores from social support, exercise minutes/week, and sleep quality (1-10 scale)
Tasks for students:
Expected findings:
Model write-up:
"A multiple regression was conducted to predict depression scores from social support, exercise, and sleep quality. The overall model was significant, F(3, 96) = 44.3, p < .001, R² = .58. All three predictors contributed uniquely to the model. Social support (β = -2.1, p < .001) and sleep quality (β = -3.4, p < .001) were the strongest predictors, followed by exercise (β = -0.08, p = .02). Together, these variables explained 58% of the variance in depression scores."
A: Rule of thumb: At least 10-15 observations per predictor
Prioritize theoretical importance over just adding every variable you have!
A: Some correlation is expected and OK!
In real research, many variables ARE correlated. That's partly why we use multiple regression - to separate their unique effects!
A: Yes, for continuous variables!
Why: Centering (subtracting the mean) makes main effects easier to interpret and reduces multicollinearity between main effects and interaction terms.
# Center variables
dataset$X1_centered <- scale(dataset$X1, scale = FALSE)
dataset$X2_centered <- scale(dataset$X2, scale = FALSE)
# Then create interaction
model <- lm(Y ~ X1_centered * X2_centered, data = dataset)
Each module includes "Check Your Understanding" questions. Use these to gauge comprehension before moving forward.
Task: Students analyze a provided dataset (or collect their own) and complete a full regression analysis with write-up.
Components:
| Component | Exemplary (A) | Proficient (B) | Developing (C) | Needs Work (D/F) |
|---|---|---|---|---|
| Research Question | Clear, specific, appropriate for regression; well-justified | Clear and appropriate but could be more specific | Vague or marginally appropriate for regression | Unclear or inappropriate for regression |
| Code Quality | Correct syntax, well-commented, efficient, reproducible | Correct with minor errors, adequate comments | Works but has errors or is poorly organized | Multiple errors or doesn't run |
| Diagnostic Checking | All plots generated, correctly interpreted, appropriate actions taken for violations | Plots generated, mostly correct interpretation | Some plots missing or misinterpreted | Diagnostics not checked or major misinterpretations |
| Interpretation | Coefficients, p-values, R² all correctly interpreted in plain language with context | Mostly correct interpretation, minor errors or lacks context | Some correct elements but major misunderstandings | Fundamental misinterpretation of results |
| Statistical vs. Practical Significance | Discusses both, provides context for effect sizes | Mentions both but limited discussion | Focuses only on p-values | Doesn't distinguish between types of significance |
| Limitations | Identifies multiple relevant limitations, discusses implications | Identifies some limitations appropriately | Generic limitations without specificity | No discussion of limitations |
Use these at the end of each module (3-5 minutes):
Question 1: A researcher finds that hours of TV watching predicts lower GPA (β = -0.15, p = 0.003). What does β = -0.15 mean?
Answer: For each additional hour of TV per day, GPA decreases by 0.15 points on average.
Question 2: You run a regression and get R² = 0.09. Your colleague says "That's terrible, only 9% explained!" How do you respond?
Answer: It depends on the field and complexity of the outcome. In behavioral science with many unmeasured influences, 9% can be meaningful. Consider effect size in context, not just the percentage.
Question 3: Your residuals vs. fitted plot shows a clear funnel shape. What assumption is violated and what should you do?
Answer: Homoscedasticity (constant variance) is violated. Consider log-transforming the DV, using robust standard errors, or weighted least squares.
Question 4: In multiple regression predicting salary from education and experience, the coefficient for education is β = 5000. Does this mean education causes higher salary?
Answer: No! Regression shows association, not causation. Many confounders could explain this relationship (e.g., family background, ability, motivation).
Use this to verify students can perform essential tasks:
| Skill | Can Do | Needs Practice |
|---|---|---|
| Load data into R | ☐ | ☐ |
| Run simple regression with lm() | ☐ | ☐ |
| Interpret summary() output | ☐ | ☐ |
| Generate diagnostic plots | ☐ | ☐ |
| Identify assumption violations from plots | ☐ | ☐ |
| Run multiple regression | ☐ | ☐ |
| Check VIF for multicollinearity | ☐ | ☐ |
| Test interaction effects | ☐ | ☐ |
| Compare nested models | ☐ | ☐ |
| Write results in scientific format | ☐ | ☐ |
| Error Message | Cause | Solution |
|---|---|---|
| Error in lm.fit: 0 (non-NA) cases | All data is missing (NA) | Check data import; look for NAs with summary(dataset) |
| Error: could not find function "lm" | Base R not loaded (rare) | Restart R session |
| Error: object 'variable_name' not found | Typo or data not loaded | Check spelling; verify data with ls() and names(dataset) |
| Warning: essentially perfect fit | R² = 1.0, deterministic relationship | Check for duplicate variables or data entry errors |
| Error in plot.window: need finite 'xlim' values | Trying to plot with infinite or NA values | Remove NAs: na.omit(dataset) or use na.rm = TRUE |
| Error: variable lengths differ | X and Y have different numbers of observations | Check data dimensions with length(); ensure complete cases |
Causes:
Solutions:
Most common error: Wrong variable order
# WRONG
plot(dependent_var, independent_var) # X and Y reversed!
# CORRECT
plot(independent_var, dependent_var) # X first, Y second
abline(model)
Teaching tip: Remind students: "X goes on horizontal axis (first), Y on vertical (second)"
Solution:
# Before plotting, expand the plot window
par(mfrow = c(2, 2)) # 2x2 grid
plot(model)
# Reset to single plot
par(mfrow = c(1, 1))
Or use RStudio's "Zoom" button in the Plots pane
Cause: Package 'car' not installed
Solution:
# Install once per computer
install.packages("car")
# Load every session
library(car)
vif(model)
Provide students with clean datasets:
Standard import code to give students:
# Set working directory (modify path)
setwd("C:/Users/YourName/Documents/Stats")
# Import CSV
data <- read.csv("filename.csv", header = TRUE)
# Check import success
str(data) # View structure
head(data) # View first 6 rows
summary(data) # Check for issues
If R crashes or lab computers fail:
R Help:
?lm or help(lm)
Regression Tutorials:
Stack Overflow:
Say things like:
Students fear looking "stupid" in stats classes. Your openness about difficulty and normalizing errors reduces anxiety.
These modules intentionally prioritize application over derivation. You may have students who want more mathematical detail. Here's how to handle that:
If students ask about formulas/derivations:
Why this happens: Statistics anxiety makes students crave certainty and procedures
How to respond: Balance structure with thinking
Example: "Here's the workflow for running regression [give checklist]. Now apply it to this dataset and explain why you made each choice."
Students obsess over p = 0.05 cutoff. Teach nuance:
This is a perfect opportunity to discuss the replication crisis and how blind p-value chasing has harmed science!
Students jump straight to regression without exploring data first
Prevention strategy: Make visualization MANDATORY before any analysis
Mantra: "Always look at your data before analyzing it. Always."
First day activity: Have students share their research interests
Throughout modules, connect examples to their fields:
When students see regression as a tool for THEIR questions, engagement skyrockets!
Beyond procedural knowledge, aim to develop students' statistical intuition:
| Activity | Time Investment | Worth it? |
|---|---|---|
| Creating custom datasets for your field | 2-3 hours | ✓ YES - massively increases engagement |
| Detailed individualized feedback on code | High (10+ min/student) | ⚠️ MAYBE - use group feedback for common errors instead |
| Live coding demonstrations | 10-15 min/module | ✓ YES - students learn by watching you think through problems |
| Creating answer keys for all practice problems | 3-4 hours | ✓ YES - saves time answering same questions repeatedly |
| Office hours before assignments due | Variable | ✓ YES - prevents frustration and late submissions |
How do you know if your teaching is working? Look for these signs:
Keep notes on what works and what doesn't:
Revise modules based on this feedback. Teaching statistics is iterative!
Teaching regression can feel daunting, but remember: your enthusiasm and support matter more than perfect explanations. Students will struggle - that's part of learning. Your role is to guide them through the struggle with patience and clarity.
Key takeaways for successful teaching:
Good luck with your regression modules! 📊✨
Questions or feedback on this instructor's guide?
Keep notes on what works in your specific teaching context and adapt these materials accordingly!