🔍 Module 3: Checking Assumptions and Diagnostics

Making sure your regression is valid

The Four Key Assumptions of Linear Regression

Assumption        | What It Means                                        | How to Check
1. Linearity      | The relationship between X and Y is linear           | Residual plot (should show no pattern)
2. Independence   | Observations are independent of each other           | Study design (can't test with plots)
3. Normality      | Residuals are normally distributed                   | Q-Q plot, histogram of residuals
4. Equal Variance | Variance of residuals is constant (homoscedasticity) | Residual plot (spread should be even)

Remember LINE:
  • Linearity
  • Independence
  • Normality (of residuals)
  • Equal variance

What Are Residuals?

Residual = Observed Y - Predicted Y

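Residuals are the prediction errors of the fitted line. A positive residual means the line predicted too low; a negative residual means it predicted too high. A well-fitting regression leaves residuals that are small relative to the spread of Y and scatter randomly with no pattern.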

Why residuals matter:
  • They tell us how wrong our predictions are
  • Patterns in residuals reveal assumption violations
  • We check assumptions using residuals, not raw data
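
To make the definition concrete, here is a minimal sketch in plain Python that fits a least-squares line and computes each residual as observed Y minus predicted Y. The data (hours studied and exam scores) and the helper names `fit_line` and `residuals` are made up for illustration; they are not part of this module's demos.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y):
    """Residual = observed Y - predicted Y, one value per observation."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

res = residuals(hours, score)
for h, s, r in zip(hours, score, res):
    print(f"X={h}  observed={s}  predicted={s - r:.1f}  residual={r:+.1f}")
```

For a least-squares line with an intercept, the residuals always sum to zero, so it is the pattern of the residuals, not their average, that carries the diagnostic information.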

Demo 1: Understanding Residuals

See what residuals look like and how they're calculated.

Question 1: In your own words, what is a residual?

Assumption 1: Linearity

What it means: The relationship between X and Y is actually linear (a straight line fits well).

How to check: Plot residuals vs. fitted values. You should see random scatter around zero with no pattern.

Good (linear): Residuals randomly scattered around zero
Bad (non-linear): Residuals show a curve or U-shape pattern
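
The good/bad contrast can be sketched without a plot by printing the sign of each residual in order of X. This is only an illustration on constructed data (one roughly linear set, one set following Y = X²), with hypothetical helper names; a real check would still look at the residual plot.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Observed Y minus the Y predicted by the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def signs(values):
    """Compact text version of a residual plot: '+' above zero, '-' below."""
    return "".join("+" if v > 0 else "-" for v in values)


x = list(range(11))
# Constructed data: a straight line with a small alternating wiggle...
y_linear = [3 + 2 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
# ...and a curved relationship that a straight line cannot capture.
y_curved = [xi ** 2 for xi in x]

print("linear:", signs(residuals(x, y_linear)))
print("curved:", signs(residuals(x, y_curved)))
```

For the curved data, the residuals are positive at both ends of the X range and negative in the middle, which is the U-shape described above. For the linear data, the signs switch back and forth with no long runs.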

Demo 2: Checking Linearity

Compare residual plots for linear vs. non-linear relationships.

Question 2: What does it mean if your residual plot shows a curved pattern?

Assumption 2: Independence

What it means: Each observation is independent of the others: knowing one observation tells you nothing about another.

Common violations:
  • Time series: Observations collected over time (today's value related to yesterday's)
  • Clustered data: Students within classrooms, patients within hospitals
  • Repeated measures: Multiple observations from the same person

How to check: Think about your study design
  • Are observations collected over time?
  • Are observations nested within groups?
  • Do you have repeated measurements?

If the answer to any of these is yes, you may need special methods (time series analysis, multilevel models, repeated measures ANOVA).
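
Study design remains the main check. For data collected over time, one common numeric supplement (not covered elsewhere in this module) is the Durbin-Watson statistic, which compares each residual with the one before it. The sketch below uses made-up residual sequences and only addresses time ordering; it says nothing about clustering or repeated measures.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals.

    Values near 2 suggest no lag-1 autocorrelation, values toward 0 suggest
    positive autocorrelation, and values toward 4 suggest negative autocorrelation.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator


# Hypothetical residuals, listed in the order the data were collected.
drifting = [-3, -2, -2, -1, 0, 1, 1, 2, 2, 3]        # neighbours resemble each other
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]    # neighbours have opposite signs

print(f"drifting:    {durbin_watson(drifting):.2f}")
print(f"alternating: {durbin_watson(alternating):.2f}")
```

A value far from 2 in either direction is a hint that today's residual is related to yesterday's, which is the time-series violation listed above.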

Assumption 3: Normality of Residuals

What it means: The residuals follow an approximately normal (bell-shaped) distribution centered on zero.

How to check: Q-Q plot or histogram of residuals.

Important: We check if RESIDUALS are normal, not if X or Y are normal!
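
A Q-Q plot pairs each sorted residual with the value a normal distribution would put at the same rank; normal residuals fall close to a straight line. The sketch below builds those pairs with Python's standard library and summarizes straightness with a correlation. The plotting positions (i - 0.5) / n are one common convention and an assumption here, the function names are hypothetical, and both residual sets are constructed rather than real.

```python
from math import log, sqrt
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each theoretical normal quantile with a standardized, sorted residual."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    ordered = sorted((r - m) / s for r in residuals)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(residuals):
    """Pearson correlation of the Q-Q points; close to 1 means close to a line."""
    pts = qq_points(residuals)
    tx = [p[0] for p in pts]
    ty = [p[1] for p in pts]
    mx, my = mean(tx), mean(ty)
    sxy = sum((a - mx) * (b - my) for a, b in zip(tx, ty))
    sxx = sum((a - mx) ** 2 for a in tx)
    syy = sum((b - my) ** 2 for b in ty)
    return sxy / sqrt(sxx * syy)


n = 50
positions = [(i - 0.5) / n for i in range(1, n + 1)]
# Constructed residuals: one set shaped like a normal distribution,
# one right-skewed set shaped like an exponential distribution.
bell_shaped = [NormalDist(0, 2).inv_cdf(p) for p in positions]
right_skewed = [-log(1 - p) for p in positions]

print(f"bell-shaped:  {qq_correlation(bell_shaped):.3f}")
print(f"right-skewed: {qq_correlation(right_skewed):.3f}")
```

The correlation is only a rough summary. Looking at where the points leave the line (one tail, both tails, or the middle) usually tells you more about how the residuals depart from normality.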

Demo 3: Checking Normality of Residuals

See what normal vs. non-normal residuals look like.

Question 3: Why do we check if residuals are normal rather than checking if Y is normal?

Assumption 4: Equal Variance (Homoscedasticity)

What it means: The spread of residuals is the same across all values of X.

How to check: In a plot of residuals vs. fitted values (or vs. X), the vertical spread should stay roughly constant, neither widening nor narrowing.

Good (homoscedastic): Equal spread of residuals across all X values
Bad (heteroscedastic): Residuals fan out (wider spread at high X) or fan in (narrower at high X)
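
One rough way to put a number on "fanning out" is to compare the typical residual size in the upper half of X with the lower half. This is an informal sketch on constructed data, not a formal test; the helper name `spread_ratio` and the half-and-half split are assumptions made for illustration.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Observed Y minus the Y predicted by the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def rms(values):
    """Root mean square: the typical size of the values, ignoring sign."""
    return sqrt(sum(v * v for v in values) / len(values))


def spread_ratio(x, y):
    """RMS residual in the upper half of X divided by RMS in the lower half."""
    pairs = sorted(zip(x, residuals(x, y)))
    half = len(pairs) // 2
    lower = rms([r for _, r in pairs[:half]])
    upper = rms([r for _, r in pairs[half:]])
    return upper / lower


x = list(range(1, 21))
sign = [1 if xi % 2 else -1 for xi in x]
# Constructed data: same line, but the noise is either constant or grows with X.
y_even = [2 * xi + s * 1.0 for xi, s in zip(x, sign)]
y_fan = [2 * xi + s * 0.5 * xi for xi, s in zip(x, sign)]

print(f"equal variance: ratio = {spread_ratio(x, y_even):.2f}")
print(f"fanning out:    ratio = {spread_ratio(x, y_fan):.2f}")
```

A ratio near 1 matches the "good" picture above; a ratio well above 1 matches residuals that fan out, and a ratio well below 1 matches residuals that fan in.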

Demo 4: Checking Equal Variance

Compare residual plots with equal vs. unequal variance.

Outliers and Influential Points

Outlier: An observation that doesn't fit the pattern (large residual)

Influential point: An observation that strongly affects the regression line, so that removing it would noticeably change the slope, the intercept, or both

Type              | Description                            | Impact
Outlier in Y      | Far from regression line vertically    | Large residual, but may not affect slope much
Outlier in X      | Extreme X value (leverage point)       | Can strongly pull the regression line
Influential point | Both extreme X AND doesn't fit pattern | Changes slope and R² substantially
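
The three rows can be compared by adding one unusual point at a time to the same data and refitting. In this sketch the base data are constructed to lie exactly on Y = 1 + 2X, so any change from a slope of 2 is due to the added point. The specific points and the helper `slope_with_extra_point` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Constructed base data lying exactly on Y = 1 + 2X (baseline slope is 2).
base_x = list(range(1, 11))
base_y = [1 + 2 * xi for xi in base_x]


def slope_with_extra_point(px, py):
    """Refit the line after adding a single extra observation (px, py)."""
    _, slope = fit_line(base_x + [px], base_y + [py])
    return slope


y_outlier = slope_with_extra_point(5.5, 30)    # typical X, Y far above the line
x_outlier = slope_with_extra_point(25, 51)     # extreme X, but exactly on the line
influential = slope_with_extra_point(25, 10)   # extreme X and far below the line

print(f"outlier in Y:      slope = {y_outlier:.2f}")
print(f"outlier in X:      slope = {x_outlier:.2f}")
print(f"influential point: slope = {influential:.2f}")
```

Here the slope stays at 2 for the first two points and drops to about 0.3 for the third. The Y outlier leaves the slope completely unchanged only because it sits exactly at the mean of X; a Y outlier elsewhere would shift the slope a little, and it still shifts the intercept.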

Demo 5: Outliers vs. Influential Points

See how different types of unusual points affect regression.

What to Do When Assumptions Are Violated

Violation            | Solutions
Non-linearity        | - Transform X or Y (log, square root)
                     | - Add polynomial terms (X²)
                     | - Use non-linear regression
Non-independence     | - Use multilevel/mixed models
                     | - Time series methods
                     | - Repeated measures ANOVA
Non-normal residuals | - Transform Y
                     | - Use robust regression
                     | - Note: With large n, less critical
Unequal variance     | - Transform Y (often log)
                     | - Use weighted least squares
                     | - Use robust standard errors
Outliers/Influential | - Investigate: data error or real?
                     | - Report with and without
                     | - Use robust regression
                     | - NEVER just delete without justification
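
As one illustration of the transformation remedy, the sketch below fits a line to constructed data that grow exponentially, then fits again after taking the log of Y. The data are made up so that log(Y) is exactly linear in X; real data will not straighten out this cleanly, and a log transform is only one of the options listed above.

```python
from math import exp, log


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Observed Y minus the Y predicted by the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


x = list(range(1, 11))
# Constructed data: Y = 2 * exp(0.3 * X), so log(Y) = log(2) + 0.3 * X.
y = [2 * exp(0.3 * xi) for xi in x]

raw_res = residuals(x, y)
log_res = residuals(x, [log(yi) for yi in y])

print("largest |residual| with raw Y: ", round(max(abs(r) for r in raw_res), 3))
print("largest |residual| with log(Y):", round(max(abs(r) for r in log_res), 3))
```

After a transformation, the slope describes the change in log(Y) per unit of X, so the interpretation of the coefficients changes along with the fit. Re-check the residual plots on the transformed scale before trusting the new model.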

Key Takeaways

Remember:
  • ALWAYS check assumptions before trusting your regression
  • Use residual plots - they reveal problems
  • Check RESIDUALS for normality, not raw Y values
  • Patterns in residuals = assumption violations
  • Investigate outliers - don't just delete them
  • Many violations can be fixed with transformations
  • With large samples, normality is less critical

Ready to Continue?

Make sure you can:

  1. Name the four key assumptions of regression
  2. Explain what residuals are
  3. Interpret residual plots for linearity and equal variance
  4. Check normality of residuals
  5. Distinguish outliers from influential points