Comprehensive Teaching Guide for Modules 1-4
Complete with Answer Keys, Teaching Tips, Common Errors, and Facilitation Notes
These modules emphasize practical application over theoretical derivation. Students learn regression as a research tool for answering scientific questions, not as a mathematical exercise. The focus is on:
By the end of the complete module series, students should be able to:
Don't rush Module 1! Students who understand WHY regression matters and WHEN to use it have much better intuition for interpreting results later. The conceptual foundation pays dividends.
Students should have prior experience with:
Correlation Questions:
Regression Questions:
Students often struggle with the distinction between correlation and regression. Use this framing: "Correlation asks IF, regression asks HOW MUCH." Correlation = are they related? Regression = by how much does Y change when X changes?
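To make the "IF vs. HOW MUCH" framing concrete, a quick live demo can help. This is a minimal sketch with simulated stress/cortisol-style data; the variable names and numbers are illustrative, not from the module datasets:

```r
# Illustrative data: the "IF" question vs. the "HOW MUCH" question
set.seed(42)
stress   <- rnorm(50, mean = 20, sd = 5)
cortisol <- 10 + 0.8 * stress + rnorm(50, sd = 3)

# Correlation: ARE they related? (one unitless number)
cor(stress, cortisol)

# Regression: by HOW MUCH does cortisol change per unit of stress?
coef(lm(cortisol ~ stress))["stress"]  # slope, in cortisol units per stress unit
```

Running both on the same data makes the contrast vivid: correlation returns a single unitless index, while the slope carries units students can interpret.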
Confusing R² with effect size: Students often think R² = 0.25 is "small." Emphasize that 25% of variance explained can be huge in behavioral science! Compare to real-world examples: even weather forecasts don't explain all variance.
Q: "A researcher wants to know if stress and cortisol levels are related. Should they use correlation or regression?"
A: Either is appropriate if the goal is just to determine if they're related. Use regression if you want to predict cortisol from stress (treating stress as the predictor) OR if you want to quantify "how much does cortisol change per unit of stress?"
Discussion point: This is a great question for class discussion because it reveals that the choice depends on the research question, not just the variables.
Q: "A regression model predicting GPA from hours studied has R² = 0.18. Is this a good model?"
A: Yes, this is quite good for behavioral research! 18% of variance in GPA is explained by study time alone. Many other factors affect GPA (prior knowledge, test anxiety, course difficulty, etc.), so explaining nearly 1/5 of variance with one predictor is meaningful.
Important clarification: "Good" depends on context. In physics, R² = 0.18 might be disappointing. In social sciences, it's respectable. Emphasize that perfect prediction (R² = 1.0) is neither expected nor realistic in complex human behavior.
The module uses body mass and metabolic rate as an example. Here are discussion prompts:
Have students find a research article in their field that uses regression and identify:
lm()
# Basic regression model
model <- lm(dependent_var ~ independent_var, data = dataset)
# View output
summary(model)
# Predictions
predict(model, newdata = data.frame(independent_var = c(new_value)))
# Plotting
plot(independent_var, dependent_var)
abline(model, col = "blue", lwd = 2)
When students run the sleep/reaction time analysis, they should get:
Students WILL ask: "What does the intercept mean if someone can't sleep 0 hours and be alive?"
Your response: "Great question! The intercept is often uninterpretable because it's an extrapolation beyond your data range. You probably didn't measure anyone with 0 hours of sleep. The intercept is mathematically necessary but not always practically meaningful. Focus on the slope - that's the actionable part."
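One way to defuse the intercept question in a live demo is to center the predictor, so the intercept becomes the predicted value at the *average* of X rather than at an impossible zero. This is a sketch with simulated data; the variable names and numbers are illustrative:

```r
# Illustrative data: centering makes the intercept interpretable
set.seed(1)
sleep_hours   <- runif(40, 4, 9)
reaction_time <- 400 - 15 * sleep_hours + rnorm(40, sd = 10)

# Raw model: intercept = predicted reaction time at 0 hours of sleep (extrapolation!)
coef(lm(reaction_time ~ sleep_hours))

# Centered model: intercept = predicted reaction time at AVERAGE sleep
sleep_c <- sleep_hours - mean(sleep_hours)
coef(lm(reaction_time ~ sleep_c))  # slope is unchanged; only the intercept moves
```

The slope is identical in both models; only the intercept's meaning changes, which reinforces that the slope is the actionable part.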
Q: "A regression predicting depression scores from social support hours per week gives: Intercept = 45, Slope = -2.5. What does this mean?"
A:
Key point to emphasize: The slope tells us the RATE OF CHANGE. It's the "bang for your buck" - how much benefit per unit of the intervention.
Q: "A study with 10,000 participants finds that daily chocolate consumption (grams) predicts weight gain (kg/year): β = 0.002, p < 0.001. Is this meaningful?"
A: Statistically significant but practically trivial.
Error: "The correlation is 0.7, so the slope must be 0.7 too."
Correction: Correlation (r) and slope (β) are related but NOT the same:
Example: If predicting weight (kg) from height (cm), a strong correlation (r = 0.9) might give β = 0.5 kg per cm. The numbers are different because they measure different things!
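The link between r and the slope can be shown numerically: the slope equals r scaled by the ratio of standard deviations, slope = r × (sd of Y / sd of X). A quick base-R check with simulated height/weight-style data (names and values are illustrative):

```r
# Demonstrate the identity: slope = r * sd(Y) / sd(X)
set.seed(7)
height_cm <- rnorm(60, 170, 10)
weight_kg <- 0.5 * height_cm - 30 + rnorm(60, sd = 4)

r     <- cor(height_cm, weight_kg)
slope <- coef(lm(weight_kg ~ height_cm))["height_cm"]

all.equal(unname(slope), r * sd(weight_kg) / sd(height_cm))  # TRUE
```

This makes the "related but not the same" point concrete: r is unitless, while the slope is r rescaled into the units of Y per unit of X.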
Dataset: Temperature (°F) and Ice Cream Sales ($1000s)
Task: Run regression and interpret results
R Code:
# Given data
temperature <- c(65, 70, 75, 80, 85, 90, 95)
sales <- c(12, 15, 18, 22, 28, 32, 38)
# Run regression
model <- lm(sales ~ temperature)
summary(model)
# Plot
plot(temperature, sales, pch = 19, col = "blue",
     xlab = "Temperature (°F)", ylab = "Sales ($1000s)")
abline(model, col = "red", lwd = 2)
Expected Results:
Interpretation: "For each 1°F increase in temperature, ice cream sales increase by approximately $870. Temperature is an excellent predictor of sales, explaining 98% of the variance."
Q: Using the ice cream model above, predict sales when temperature is 88°F.
R Code:
predict(model, newdata = data.frame(temperature = 88))
A: Sales ≈ $30,540
By hand calculation: Sales = -46.14 + 0.8714(88) ≈ 30.54 thousand dollars
For this module, consider doing a live coding demonstration where you:
Run lm() and show the output
Narrate your thought process: "Before I run regression, I always look at the data to check for outliers and non-linearity..."
A: That's R's way of saying the p-value is extremely small (less than 0.0000000000000002). It means "definitely statistically significant." R can't display the exact value because it's beyond the precision limit. Just report it as p < 0.001.
A: Technically yes, but be very cautious! This is called extrapolation and assumes the linear relationship continues beyond your data range. Often, relationships change outside the range you measured. Example: The relationship between fertilizer and crop yield might be linear from 0-100 lbs/acre, but at 500 lbs/acre, you might kill the plants! Stick to interpolation (within your data range) when possible.
| Student Issue | Likely Cause | Solution |
|---|---|---|
| "Error: object not found" | Didn't load data or typo in variable name | Check spelling, ensure data loaded with ls() |
| "All my p-values are 1.0" | Variables backwards in formula | Check: lm(Y ~ X) not lm(X ~ Y) |
| "My R² is negative" | Misreading adjusted R², which can be slightly negative (plain R² from lm() cannot be) | Check the Multiple R-squared line in summary(model) |
| "Regression line doesn't appear on plot" | Variables in plot don't match model | Ensure plot axes match lm() formula exactly |
Emphasize pragmatism over perfectionism. Students often panic when they see any deviation from perfect assumptions. Your message should be: "Regression is robust to minor violations. We check assumptions to identify SERIOUS problems, not to achieve perfection."
# Generate all four diagnostic plots at once
par(mfrow = c(2, 2))
plot(model)
# Or individual plots:
plot(model, which = 1) # Residuals vs Fitted (linearity & homoscedasticity)
plot(model, which = 2) # Q-Q plot (normality)
plot(model, which = 3) # Scale-Location (homoscedasticity)
plot(model, which = 5) # Residuals vs Leverage (influential points)
| Pattern | Meaning | Action |
|---|---|---|
| Random scatter around 0 | ✅ Good! Assumptions met | Proceed with interpretation |
| Curved pattern (U-shape or inverted U) | ⚠️ Non-linearity | Consider transformation or polynomial terms |
| Funnel shape (widening or narrowing) | ⚠️ Heteroscedasticity | Consider transformation or weighted regression |
| Pattern | Meaning | Action |
|---|---|---|
| Points fall along diagonal line | ✅ Residuals are normal | Proceed with interpretation |
| Points deviate at the ends but middle is good | ⚠️ Heavy-tailed distribution | Usually OK with large n; consider robust methods |
| S-shaped curve | ⚠️ Skewed residuals | Consider transformation (log, sqrt) |
"My Q-Q plot isn't perfectly straight, so I can't use regression!"
Correction: Minor deviations are fine! With large samples (n > 30), regression is robust to non-normality. Focus on extreme deviations, especially systematic patterns. A few points off the line at the extremes is usually not a problem.
Ideal: Random scatter with horizontal red line (constant variance)
Problem: Upward or downward trend (variance changes with fitted values)
This is another check for homoscedasticity - if both Plot 1 and Plot 3 look good, you're safe!
Cook's Distance contours: Points outside dotted lines are influential
High leverage + large residual = TROUBLE: These points are pulling the regression line
Decision guide:
Dataset: Study time vs. exam scores (provided in module)
Expected findings:
Conclusion: All assumptions met; results are trustworthy
Dataset: Age vs. income (provided in module)
Expected findings:
R code for transformation:
# Log transform the dependent variable
model_log <- lm(log(income) ~ age, data = dataset)
par(mfrow = c(2, 2))
plot(model_log) # Check if diagnostics improve
Q: "My residuals vs. fitted plot shows a clear U-shape. What does this mean and what should I do?"
A: Non-linearity detected! The relationship between X and Y is not linear.
Options:
Add a polynomial term: lm(Y ~ X + I(X^2))
Example interpretation: "The curved pattern suggests the relationship between study time and test scores is non-linear. Perhaps there are diminishing returns - the first few hours of study help a lot, but after 10 hours, additional study helps less."
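The polynomial-term fix can be demonstrated live with simulated diminishing-returns data (the variable names and coefficients here are illustrative, not from the module dataset):

```r
# Simulated diminishing-returns data: scores rise, then level off
set.seed(3)
study_time <- runif(80, 0, 15)
test_score <- 40 + 8 * study_time - 0.3 * study_time^2 + rnorm(80, sd = 5)

linear_fit    <- lm(test_score ~ study_time)
quadratic_fit <- lm(test_score ~ study_time + I(study_time^2))

# Does the squared term significantly improve fit?
anova(linear_fit, quadratic_fit)

# Re-check diagnostics: the U-shape in residuals vs. fitted should disappear
par(mfrow = c(1, 2))
plot(linear_fit, which = 1)
plot(quadratic_fit, which = 1)
```

Comparing the two residual plots side by side shows students exactly what "fixing non-linearity" looks like.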
Q: "My Q-Q plot shows slight deviation at the tails, but my sample size is 200. Should I be concerned?"
A: No, probably not. With n = 200:
When to worry about normality violations:
Q: "One data point has very high leverage. Should I delete it?"
A: Not automatically! Follow this decision tree:
Best practice: "We identified one participant with unusually high [X value]. Analyses with and without this participant yielded similar results (β = 0.45 vs. 0.43), suggesting findings are robust."
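The sensitivity analysis described above can be scripted directly. Here is a sketch using simulated data with one deliberately planted high-leverage point; the data, cutoffs, and variable names are illustrative:

```r
# Identify the most influential case and refit without it
set.seed(9)
x <- c(rnorm(30, 10, 2), 25)                   # one high-leverage point at x = 25
y <- c(2 * x[1:30] + rnorm(30, sd = 2), 60)

model   <- lm(y ~ x)
cooks_d <- cooks.distance(model)
worst   <- which.max(cooks_d)                  # index of the most influential case

model_trim <- lm(y[-worst] ~ x[-worst])

# Report both slopes: similar values suggest the findings are robust
c(with_point = coef(model)["x"], without_point = coef(model_trim)[2])
```

Reporting both coefficients, as in the write-up template above, is the habit to instill: the point is transparency, not deletion.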
Give students this decision tree to reference:
LINEARITY VIOLATION (curved residuals vs fitted)
└─→ Try transformations (log, sqrt, polynomial)
└─→ If still curved: non-linear regression or GAM
HETEROSCEDASTICITY (funnel shape)
└─→ Transform DV (log, sqrt)
└─→ OR use robust standard errors
└─→ OR use weighted least squares
NORMALITY VIOLATION (Q-Q plot deviation)
└─→ Sample size > 30? → Usually OK, proceed
└─→ Sample size < 30? → Consider:
├─→ Bootstrap confidence intervals
├─→ Non-parametric alternatives
└─→ Transformation
INFLUENTIAL POINTS (high Cook's D)
└─→ Check for data entry errors
└─→ Run sensitivity analysis
└─→ Report with and without influential cases
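Of the heteroscedasticity fixes in the tree above, weighted least squares can be run in base R via the weights argument of lm(). A sketch with simulated funnel-shaped data; the choice of weights (1/x²) assumes the error standard deviation grows proportionally with x:

```r
# Simulated heteroscedastic data: spread of Y grows with X
set.seed(5)
x <- runif(100, 1, 10)
y <- 3 + 2 * x + rnorm(100, sd = 0.8 * x)   # error SD increases with x

ols <- lm(y ~ x)                      # ordinary least squares
wls <- lm(y ~ x, weights = 1 / x^2)   # downweight the high-variance observations

# Slopes are similar, but the WLS standard errors better reflect the data
summary(ols)$coefficients["x", ]
summary(wls)$coefficients["x", ]
```

In practice the right weights are rarely known exactly; this is why robust standard errors (which require no variance model) are often the more defensible classroom recommendation.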
A: No! Run the regression first, THEN check assumptions using diagnostic plots. You need the model to generate residuals, which you then examine. The workflow is: (1) Fit model, (2) Check diagnostics, (3) Revise if needed, (4) Interpret final model.
A: Not necessarily! Good diagnostics mean your model is appropriate for the data you have, but don't prove causation or guarantee you've included all relevant variables. You could still be missing important predictors or have reverse causation issues.
A: Linearity is most critical because if the relationship isn't linear, all your coefficients are wrong. Normality is least critical (especially with large n). Independence violations are serious but can't be fixed with transformations - they require different analysis methods (e.g., mixed models for clustered data).
Setup: Provide 4 different diagnostic plot sets (good, non-linear, heteroscedastic, influential point)
Task: In pairs, students identify:
Debrief: Discuss as class, emphasizing that often multiple approaches are acceptable
This is THE BIG IDEA students must grasp:
In simple regression: β tells you how much Y changes per unit of X
In multiple regression: β tells you how much Y changes per unit of X holding all other predictors constant
Use this analogy: "It's like asking 'What's the unique effect of sleep on reaction time, after accounting for caffeine intake and stress level?'"
# Multiple regression with several predictors
model <- lm(Y ~ X1 + X2 + X3, data = dataset)
summary(model)
# With interaction term
model_int <- lm(Y ~ X1 * X2, data = dataset) # Includes X1, X2, and X1:X2
# Check for multicollinearity
library(car)
vif(model) # VIF > 10 is problematic, > 5 warrants attention
# Compare models
anova(model1, model2) # Are they significantly different?
AIC(model1, model2) # Lower AIC = better fit
Research Question: What predicts final exam scores?
Predictors: Study hours, sleep hours, previous GPA
R Code:
# Load data (provided in module)
# Run multiple regression
model <- lm(exam_score ~ study_hours + sleep_hours + previous_gpa,
            data = student_data)
summary(model)
Expected Output:
Teach students to say: "Controlling for..." or "Holding constant..." or "Adjusting for..."
Example: "Study time predicts exam scores even after controlling for previous academic performance and sleep."
This language is crucial for showing they understand multiple regression!
Q: "In a model predicting salary from years of experience, education level, and hours worked per week, the coefficient for experience is β = 2500. In a simple regression with only experience, β = 3200. Why did it change?"
A: This is EXPECTED and important!
Key insight: Multiple regression gives you the "pure" effect of each predictor, which is usually smaller than the simple regression coefficient.
Q: "My R² = 0.45 but Adjusted R² = 0.41. Which should I report?"
A: Report both, but emphasize Adjusted R² for multiple regression.
Why they differ:
Interpretation: "The model explains 45% of variance in exam scores. After adjusting for the number of predictors (preventing overfitting), the model accounts for 41% of variance."
Q: "I'm predicting weight from height (inches), height (cm), and age. My VIF values are huge. What's wrong?"
A: Height in inches and height in cm are perfectly correlated! They're the same variable in different units.
Multicollinearity problem: When predictors are highly correlated, the model can't separate their unique effects.
Solution: Remove one of the redundant height measures. Keep only height in one unit.
General VIF guidelines:
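If the car package is unavailable in the lab, VIF can be computed by hand: regress each predictor on the remaining predictors and apply VIF = 1 / (1 - R²). A base-R sketch with simulated data (variable names and the induced correlation are illustrative):

```r
# Manual VIF: regress each predictor on the others
set.seed(11)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(100)

vif_manual <- function(j, predictors) {
  r2 <- summary(lm(predictors[[j]] ~ ., data = predictors[-j]))$r.squared
  1 / (1 - r2)
}

predictors <- data.frame(x1, x2, x3)
sapply(seq_along(predictors), vif_manual, predictors = predictors)
# x1 and x2 should show inflated VIFs; x3 should be near 1
```

Working through this by hand also demystifies what VIF measures: how well each predictor is explained by the others.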
Use this framing: "An interaction means the effect of X1 DEPENDS ON the level of X2."
Example students relate to: "The effect of studying depends on sleep. When well-rested, studying helps a lot. When exhausted, studying doesn't help much."
Research Question: Does caffeine improve cognitive performance differently for morning vs. evening people?
R Code:
# Model WITHOUT interaction
model1 <- lm(performance ~ caffeine + chronotype, data = dataset)
# Model WITH interaction
model2 <- lm(performance ~ caffeine * chronotype, data = dataset)
# Compare models
anova(model1, model2) # Is interaction significant?
summary(model2) # Interpret coefficients
Expected Results:
Interpretation: "Caffeine improves performance by 5 points for morning people, but only 2 points for evening people (5 - 3 = 2). The benefit of caffeine depends on chronotype."
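The "effect within each group" arithmetic can be shown as simple slopes computed directly from the coefficients. This sketch simulates data matching the scenario above (a 0/1 chronotype dummy; all names and numbers are illustrative):

```r
# Simulated interaction: caffeine helps morning types more than evening types
set.seed(8)
n <- 120
caffeine    <- runif(n, 0, 4)                  # cups per day
chronotype  <- rbinom(n, 1, 0.5)               # 0 = morning, 1 = evening
performance <- 50 + 5 * caffeine - 3 * caffeine * chronotype + rnorm(n, sd = 4)

m <- lm(performance ~ caffeine * chronotype)
b <- coef(m)

# Simple slopes: effect of caffeine within each chronotype group
c(morning = b[["caffeine"]],
  evening = b[["caffeine"]] + b[["caffeine:chronotype"]])
```

Having students derive the evening-type slope by adding the interaction coefficient to the main effect reinforces that the interaction MODIFIES, not replaces, the main effect.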
Error: "The interaction is significant, so I can ignore the main effects!"
Correction: NO! When you have a significant interaction, you MUST interpret it in context of the main effects. The interaction MODIFIES the main effects; it doesn't replace them.
Proper interpretation includes:
Good reasons to add a predictor:
Bad reasons to add predictors:
Q: "Should I include age as a predictor?"
A: Test it statistically:
model_without_age <- lm(Y ~ X1 + X2, data = dataset)
model_with_age <- lm(Y ~ X1 + X2 + age, data = dataset)
# F-test for nested models
anova(model_without_age, model_with_age)
Decision:
Scenario: Predicting depression scores from social support, exercise minutes/week, and sleep quality (1-10 scale)
Tasks for students:
Expected findings:
Model write-up:
"A multiple regression was conducted to predict depression scores from social support, exercise, and sleep quality. The overall model was significant, F(3, 96) = 44.3, p < .001, R² = .58. All three predictors contributed uniquely to the model. Social support (β = -2.1, p < .001) and sleep quality (β = -3.4, p < .001) were the strongest predictors, followed by exercise (β = -0.08, p = .02). Together, these variables explained 58% of the variance in depression scores."
A: Rule of thumb: At least 10-15 observations per predictor
Prioritize theoretical importance over just adding every variable you have!
A: Some correlation is expected and OK!
In real research, many variables ARE correlated. That's partly why we use multiple regression - to separate their unique effects!
A: Yes, for continuous variables!
Why: Centering (subtracting the mean) makes main effects easier to interpret and reduces multicollinearity between main effects and interaction terms.
# Center variables
dataset$X1_centered <- scale(dataset$X1, scale = FALSE)
dataset$X2_centered <- scale(dataset$X2, scale = FALSE)
# Then create interaction
model <- lm(Y ~ X1_centered * X2_centered, data = dataset)
Each module includes "Check Your Understanding" questions. Use these to gauge comprehension before moving forward.
Task: Students analyze a provided dataset (or collect their own) and complete a full regression analysis with write-up.
Components:
| Component | Exemplary (A) | Proficient (B) | Developing (C) | Needs Work (D/F) |
|---|---|---|---|---|
| Research Question | Clear, specific, appropriate for regression; well-justified | Clear and appropriate but could be more specific | Vague or marginally appropriate for regression | Unclear or inappropriate for regression |
| Code Quality | Correct syntax, well-commented, efficient, reproducible | Correct with minor errors, adequate comments | Works but has errors or is poorly organized | Multiple errors or doesn't run |
| Diagnostic Checking | All plots generated, correctly interpreted, appropriate actions taken for violations | Plots generated, mostly correct interpretation | Some plots missing or misinterpreted | Diagnostics not checked or major misinterpretations |
| Interpretation | Coefficients, p-values, R² all correctly interpreted in plain language with context | Mostly correct interpretation, minor errors or lacks context | Some correct elements but major misunderstandings | Fundamental misinterpretation of results |
| Statistical vs. Practical Significance | Discusses both, provides context for effect sizes | Mentions both but limited discussion | Focuses only on p-values | Doesn't distinguish between types of significance |
| Limitations | Identifies multiple relevant limitations, discusses implications | Identifies some limitations appropriately | Generic limitations without specificity | No discussion of limitations |
Use these at the end of each module (3-5 minutes):
Question 1: A researcher finds that hours of TV watching predicts lower GPA (β = -0.15, p = 0.003). What does β = -0.15 mean?
Answer: For each additional hour of TV per day, GPA decreases by 0.15 points on average.
Question 2: You run a regression and get R² = 0.09. Your colleague says "That's terrible, only 9% explained!" How do you respond?
Answer: It depends on the field and complexity of the outcome. In behavioral science with many unmeasured influences, 9% can be meaningful. Consider effect size in context, not just the percentage.
Question 3: Your residuals vs. fitted plot shows a clear funnel shape. What assumption is violated and what should you do?
Answer: Homoscedasticity (constant variance) is violated. Consider log-transforming the DV, using robust standard errors, or weighted least squares.
Question 4: In multiple regression predicting salary from education and experience, the coefficient for education is β = 5000. Does this mean education causes higher salary?
Answer: No! Regression shows association, not causation. Many confounders could explain this relationship (e.g., family background, ability, motivation).
Use this to verify students can perform essential tasks:
| Skill | Can Do | Needs Practice |
|---|---|---|
| Load data into R | ☐ | ☐ |
| Run simple regression with lm() | ☐ | ☐ |
| Interpret summary() output | ☐ | ☐ |
| Generate diagnostic plots | ☐ | ☐ |
| Identify assumption violations from plots | ☐ | ☐ |
| Run multiple regression | ☐ | ☐ |
| Check VIF for multicollinearity | ☐ | ☐ |
| Test interaction effects | ☐ | ☐ |
| Compare nested models | ☐ | ☐ |
| Write results in scientific format | ☐ | ☐ |
| Error Message | Cause | Solution |
|---|---|---|
| Error in lm.fit: 0 (non-NA) cases | All data is missing (NA) | Check data import; look for NAs with summary(dataset) |
| Error: could not find function "lm" | Base R not loaded (rare) | Restart R session |
| Error: object 'variable_name' not found | Typo or data not loaded | Check spelling; verify data with ls() and names(dataset) |
| Warning: essentially perfect fit | R² = 1.0, deterministic relationship | Check for duplicate variables or data entry errors |
| Error in plot.window: need finite 'xlim' values | Trying to plot with infinite or NA values | Remove NAs: na.omit(dataset) or use na.rm = TRUE |
| Error: variable lengths differ | X and Y have different numbers of observations | Check data dimensions with length(); ensure complete cases |
Causes:
Solutions:
Most common error: Wrong variable order
# WRONG
plot(dependent_var, independent_var) # X and Y reversed!
# CORRECT
plot(independent_var, dependent_var) # X first, Y second
abline(model)
Teaching tip: Remind students: "X goes on horizontal axis (first), Y on vertical (second)"
Solution:
# Before plotting, expand the plot window
par(mfrow = c(2, 2)) # 2x2 grid
plot(model)
# Reset to single plot
par(mfrow = c(1, 1))
Or use RStudio's "Zoom" button in the Plots pane
Cause: Package 'car' not installed
Solution:
# Install once per computer
install.packages("car")
# Load every session
library(car)
vif(model)
Provide students with clean datasets:
Standard import code to give students:
# Set working directory (modify path)
setwd("C:/Users/YourName/Documents/Stats")
# Import CSV
data <- read.csv("filename.csv", header = TRUE)
# Check import success
str(data) # View structure
head(data) # View first 6 rows
summary(data) # Check for issues
If R crashes or lab computers fail:
R Help:
?lm or help(lm)
Regression Tutorials:
Stack Overflow:
Say things like:
Students fear looking "stupid" in stats classes. Your openness about difficulty and normalizing errors reduces anxiety.
These modules intentionally prioritize application over derivation. You may have students who want more mathematical detail. Here's how to handle that:
If students ask about formulas/derivations:
Why this happens: Statistics anxiety makes students crave certainty and procedures
How to respond: Balance structure with thinking
Example: "Here's the workflow for running regression [give checklist]. Now apply it to this dataset and explain why you made each choice."
Students obsess over p = 0.05 cutoff. Teach nuance:
This is a perfect opportunity to discuss the replication crisis and how blind p-value chasing has harmed science!
Students jump straight to regression without exploring data first
Prevention strategy: Make visualization MANDATORY before any analysis
Mantra: "Always look at your data before analyzing it. Always."
First day activity: Have students share their research interests
Throughout modules, connect examples to their fields:
When students see regression as a tool for THEIR questions, engagement skyrockets!
Beyond procedural knowledge, aim to develop students' statistical intuition:
| Activity | Time Investment | Worth it? |
|---|---|---|
| Creating custom datasets for your field | 2-3 hours | ✓ YES - massively increases engagement |
| Detailed individualized feedback on code | High (10+ min/student) | ⚠️ MAYBE - use group feedback for common errors instead |
| Live coding demonstrations | 10-15 min/module | ✓ YES - students learn by watching you think through problems |
| Creating answer keys for all practice problems | 3-4 hours | ✓ YES - saves time answering same questions repeatedly |
| Office hours before assignments due | Variable | ✓ YES - prevents frustration and late submissions |
How do you know if your teaching is working? Look for these signs:
Keep notes on what works and what doesn't:
Revise modules based on this feedback. Teaching statistics is iterative!
Teaching regression can feel daunting, but remember: your enthusiasm and support matter more than perfect explanations. Students will struggle - that's part of learning. Your role is to guide them through the struggle with patience and clarity.
Key takeaways for successful teaching:
Good luck with your regression modules! 📊✨
Questions or feedback on this instructor's guide?
Keep notes on what works in your specific teaching context and adapt these materials accordingly!