🔢 Module 4: Multiple Regression

Using multiple predictors to explain your outcome

Learning Objectives
  • Extend simple regression to models with multiple predictors
  • Interpret each coefficient "holding other variables constant"
  • Compare models using Adjusted R² rather than R²
  • Recognize and handle multicollinearity
  • Understand interaction effects

From Simple to Multiple Regression

Simple Regression (one predictor):

Y = a + b₁X₁

Example: Exam Score = 50 + 5 × Study Hours

Multiple Regression (multiple predictors):

Y = a + b₁X₁ + b₂X₂ + b₃X₃ + ...

Example: Exam Score = 40 + 4 × Study Hours + 2 × Sleep Hours - 3 × Anxiety

Why use multiple predictors?
  • Real phenomena have multiple causes
  • Control for confounding variables
  • Improve prediction accuracy
  • Understand unique contribution of each predictor
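The example equation above can be sketched as a small prediction function in Python. The intercept (40) and slopes (4, 2, -3) are the illustrative values from the example, not fitted estimates:

```python
def predict_score(study_hours, sleep_hours, anxiety):
    """Predicted exam score from the example equation above."""
    # Intercept 40 and slopes 4, 2, -3 are illustrative, not fitted values.
    return 40 + 4 * study_hours + 2 * sleep_hours - 3 * anxiety

# A student who studies 5 hours, sleeps 8 hours, with anxiety rating 2:
print(predict_score(5, 8, 2))  # → 70
```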

Demo 1: Adding a Second Predictor

See how adding a predictor improves the model.

Question 1: Why did R² increase when we added the second predictor?

Interpreting Coefficients in Multiple Regression

KEY CONCEPT: Each coefficient represents the effect of that predictor while holding all other predictors constant.

Score = 40 + 4(Hours) + 2(Sleep)
  Coefficient   Interpretation
  b₁ = 4        For each additional study hour, score increases by 4 points, holding sleep constant
  b₂ = 2        For each additional sleep hour, score increases by 2 points, holding study time constant

Common mistake: Forgetting "holding other variables constant"
  • Wrong: "Each study hour adds 4 points"
  • Right: "Each study hour adds 4 points, when sleep is held constant"
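The "holding constant" idea can be checked directly: increase one predictor by one unit while fixing the other, and the prediction changes by exactly that coefficient. A short Python sketch using the example equation:

```python
def predict(hours, sleep):
    # Score = 40 + 4(Hours) + 2(Sleep), the example model above
    return 40 + 4 * hours + 2 * sleep

# Hold sleep fixed at 7 and add one study hour:
effect = predict(3, 7) - predict(2, 7)
print(effect)  # → 4, exactly b₁
```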

R² vs. Adjusted R²

Problem with R²: It NEVER decreases when you add predictors, so even a useless predictor nudges it up (or leaves it unchanged)!

  Measure       What it does                       When to use
  R²            % of variance explained            Describing current model fit
  Adjusted R²   R² penalized for # of predictors   Comparing models with different # of predictors

Key insight: Adjusted R² can actually DECREASE when you add unhelpful predictors. This helps prevent overfitting!
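The standard formula is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of predictors. A Python sketch (the R² values and n = 30 are made-up illustrations) showing how a nearly useless extra predictor can lower it:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30  # assumed sample size (illustrative)
before = adjusted_r2(0.620, n, k=2)  # two useful predictors
after = adjusted_r2(0.621, n, k=3)   # plus one nearly useless predictor
print(round(before, 3), round(after, 3))  # → 0.592 0.577: adjusted R² went DOWN
```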

Demo 2: R² vs. Adjusted R²

See what happens when we add useful vs. useless predictors.

Multicollinearity

What it is: When predictors are highly correlated with each other.

Example of Multicollinearity:

Predicting weight from both height in inches AND height in centimeters

→ These are perfectly correlated! They measure the same thing.

Problems caused by multicollinearity:
  • Unstable coefficient estimates
  • Large standard errors
  • Coefficients may have wrong signs
  • Hard to determine individual predictor importance
How to detect:
  • Check correlations between predictors (r > 0.80 is concerning)
  • Variance Inflation Factor (VIF > 10 is problematic)
  • Large changes in coefficients when adding/removing predictors
Solutions:
  • Remove one of the correlated predictors
  • Combine correlated predictors (e.g., create an average)
  • Use different analysis method (e.g., principal components)
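In the two-predictor case, VIF has a simple closed form: VIF = 1/(1 - r²), where r is the correlation between the two predictors. A quick Python sketch (the r values are illustrative):

```python
def vif_two_predictors(r):
    """VIF for either predictor when the model has exactly two predictors."""
    # With two predictors, R²_j (predictor j regressed on the other) is just r².
    return 1 / (1 - r ** 2)

print(round(vif_two_predictors(0.80), 2))  # → 2.78 (at the "concerning" correlation threshold)
print(round(vif_two_predictors(0.95), 2))  # → 10.26, past the VIF > 10 danger line
```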

Demo 3: Effects of Multicollinearity

See what happens when predictors are highly correlated.

Question 2: Why is multicollinearity a problem for interpretation?

Interaction Effects

What it is: When the effect of one predictor depends on the level of another predictor.

Example:

Without interaction: Caffeine always improves performance by the same amount

With interaction: Caffeine helps when you're tired but doesn't help when you're already alert

Y = a + b₁X₁ + b₂X₂ + b₃(X₁ × X₂)

The b₃(X₁ × X₂) term captures the interaction

  Scenario                                   Model Needed
  Caffeine and sleep have separate effects   Main effects only (no interaction)
  Caffeine's effect depends on sleep level   Include interaction term
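With an interaction term, the effect of X₁ is b₁ + b₃X₂, so it changes with the level of X₂. A Python sketch of the caffeine-and-sleep example, with coefficients invented purely for illustration:

```python
# Hypothetical coefficients, invented for illustration:
a, b1, b2, b3 = 50, 6, 3, -1

def performance(caffeine, sleep):
    # Y = a + b1*X1 + b2*X2 + b3*(X1 * X2)
    return a + b1 * caffeine + b2 * sleep + b3 * caffeine * sleep

# Effect of one unit of caffeine at two different sleep levels:
effect_tired = performance(1, 2) - performance(0, 2)   # little sleep
effect_rested = performance(1, 6) - performance(0, 6)  # well rested
print(effect_tired, effect_rested)  # → 4 0: caffeine helps only when tired
```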

Demo 4: Understanding Interactions

See the difference between main effects and interaction effects.

Model Selection: Which Predictors to Include?

Guidelines:
  • Theory first: Include predictors based on theory/prior research
  • Parsimony: Simpler models are better (don't overfit)
  • Statistical significance: Consider p-values for predictors
  • Adjusted R²: Compare models using this, not regular R²
  • Practical significance: Does it matter in the real world?
Avoid:
  • Including too many predictors (overfitting)
  • Automated stepwise selection without thought
  • Adding predictors just to increase R²
  • Ignoring multicollinearity

Model Comparison Example:

  Model   Predictors                  R²     Adj R²   Assessment
  1       Study hours                 0.50   0.48     Good baseline
  2       Study + Sleep               0.62   0.59     Improvement!
  3       Study + Sleep + Breakfast   0.63   0.58     Adj R² dropped - not helpful
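The comparison logic above can be sketched in Python: compute adjusted R² for each candidate model and keep the winner. The R² values echo the table, but the sample size n = 30 is an assumption, so the exact adjusted values are illustrative:

```python
def adjusted_r2(r2, n, k):
    # Standard formula: penalize R² for the number of predictors k
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30  # assumed sample size (illustrative)
candidates = [
    ("study", 0.50, 1),
    ("study+sleep", 0.62, 2),
    ("study+sleep+breakfast", 0.63, 3),
]
best = max(candidates, key=lambda m: adjusted_r2(m[1], n, m[2]))
print(best[0])  # → study+sleep
```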

Demo 5: Model Comparison

Practice comparing models with different predictors.

Reading Multiple Regression Output in R

  Call:
  lm(formula = score ~ hours + sleep + anxiety)

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   35.234      4.123   8.545  < 2e-16 ***
  hours          4.128      0.421   9.805  < 2e-16 ***
  sleep          2.341      0.532   4.401  2.3e-05 ***
  anxiety       -2.876      0.612  -4.699  8.1e-06 ***

  Residual standard error: 6.234 on 96 degrees of freedom
  Multiple R-squared: 0.6789, Adjusted R-squared: 0.6689
  F-statistic: 67.89 on 3 and 96 DF, p-value: < 2.2e-16

How to interpret:
  • Each Estimate is that predictor's slope, holding the others constant (e.g., each study hour adds about 4.13 points)
  • Pr(>|t|) tests whether each coefficient differs from zero; here all three predictors are statistically significant
  • Adjusted R-squared (0.6689) is the value to use when comparing this model against alternatives
  • The F-statistic tests whether the model as a whole explains significant variance (here p < 2.2e-16)

Key Takeaways

Remember:
  • Each coefficient is interpreted "holding other variables constant"
  • Use Adjusted R² to compare models with different predictors
  • Check for multicollinearity - highly correlated predictors cause problems
  • Interactions mean one predictor's effect depends on another
  • Simpler models are often better (avoid overfitting)
  • Theory and logic should guide predictor selection
  • Multiple regression helps control for confounds

Congratulations!

You have completed all four modules on linear regression!

You now understand:

  1. ✓ When and why to use regression
  2. ✓ How to fit and interpret simple regression
  3. ✓ How to check assumptions and diagnose problems
  4. ✓ How to use multiple predictors effectively