# When Assumptions Fail

What to do when your data violate statistical assumptions.
## 🎯 General Strategy

```mermaid
graph TD
    A[Assumption Violated] --> B{How severe?}
    B --> C[Minor/Moderate]
    B --> D[Severe]
    C --> E{Sample size?}
    E --> F[Large n>30]
    E --> G[Small n<30]
    F --> H[Probably OK<br/>Report violation]
    G --> I[Consider alternatives]
    D --> J[Use alternative method]
    J --> K[Transform data]
    J --> L[Non-parametric test]
    J --> M[Robust method]
```
## 📊 Normality Violated

### Option 1: Check Sample Size
**Large samples (n > 30 per group):**

- Most tests are robust to normality violations
- The Central Limit Theorem protects you
- Proceed with caution and report the violation

**Small samples (n < 30 per group):**

- Violations matter more
- Consider the alternatives below
### Option 2: Transform Data

Common transformations:

```r
# Log transformation (for right skew; requires strictly positive values)
data$log_var <- log(data$variable)

# Square root (for moderate right skew)
data$sqrt_var <- sqrt(data$variable)

# Reciprocal (for strong right skew)
data$recip_var <- 1 / data$variable

# After transforming, check normality again
shapiro.test(data$log_var)
```
> **Important:** After transformation, interpret results in the transformed units, or back-transform for reporting.
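For instance, on the log scale a mean back-transforms to a geometric mean, not an arithmetic mean. A minimal sketch, assuming a right-skewed `variable` column:

```r
# Sketch: back-transforming a log-scale summary (simulated right-skewed data)
set.seed(1)
data <- data.frame(variable = rexp(40, rate = 0.5))
data$log_var <- log(data$variable)

log_mean <- mean(data$log_var)
exp(log_mean)  # back-transformed estimate: the geometric mean of the original variable
```

The same logic applies to model coefficients: a difference on the log scale back-transforms to a ratio on the original scale.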
### Option 3: Use a Non-Parametric Alternative
| Parametric Test | Non-Parametric Alternative |
|---|---|
| One-sample t-test | Wilcoxon Signed-Rank Test |
| Independent t-test | Mann-Whitney U Test |
| Paired t-test | Wilcoxon Signed-Rank Test |
| One-Way ANOVA | Kruskal-Wallis Test |
| Repeated Measures ANOVA | Friedman Test |
| Pearson Correlation | Spearman Correlation |
See: Non-Parametric Tests
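For instance, swapping an independent t-test for its non-parametric counterpart, sketched here with simulated skewed data:

```r
# Sketch: Mann-Whitney U test in place of an independent t-test (simulated data)
set.seed(42)
data <- data.frame(
  outcome = c(rexp(15, rate = 1), rexp(15, rate = 0.5)),  # skewed outcomes
  group   = rep(c("A", "B"), each = 15)
)

# wilcox.test() with two groups performs the Mann-Whitney U test
wilcox.test(outcome ~ group, data = data)
```

For 3+ groups, `kruskal.test()` follows the same formula interface; Spearman correlation is available via `cor.test(..., method = "spearman")`.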
## 📏 Unequal Variances

### For t-tests

Easy fix: R's `t.test()` defaults to Welch's t-test (`var.equal = FALSE`), which does not assume equal variances.
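No extra arguments are needed. A sketch, assuming `outcome` and `group` columns:

```r
# Welch's t-test is the default (var.equal = FALSE)
t.test(outcome ~ group, data = data)

# Only set var.equal = TRUE if you specifically want Student's t-test
```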
### For ANOVA

Use Welch's ANOVA, which does not assume equal variances across groups.
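Base R provides this through `oneway.test()`; a sketch assuming `outcome` and `group` columns:

```r
# Welch's ANOVA via oneway.test() (var.equal = FALSE is the default)
oneway.test(outcome ~ group, data = data, var.equal = FALSE)
```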
Or transform the data:

```r
# A log transformation often stabilizes variances
data$log_outcome <- log(data$outcome)
model <- aov(log_outcome ~ group, data = data)
```
## 📈 Non-Linearity (Regression)

### Option 1: Add Polynomial Terms

```r
# Add a squared term
model <- lm(outcome ~ predictor + I(predictor^2), data = data)

# Add a cubic term if needed
model <- lm(outcome ~ predictor + I(predictor^2) + I(predictor^3), data = data)
```
### Option 2: Transform Variables
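Log-transforming one or both variables can straighten a curved relationship; which transformation helps depends on the shape of the curve. A sketch, assuming positive `outcome` and `predictor` columns:

```r
# Log-transform the predictor (for logarithmic-looking curves)
model_logx <- lm(outcome ~ log(predictor), data = data)

# Log-transform the outcome (for exponential-looking curves)
model_logy <- lm(log(outcome) ~ predictor, data = data)

# Log-log (for power-law relationships)
model_loglog <- lm(log(outcome) ~ log(predictor), data = data)
```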
### Option 3: Use Non-Linear Models

```r
# Example: fitting an exponential relationship with non-linear least squares
nls_model <- nls(outcome ~ a * exp(b * predictor),
                 data = data,
                 start = list(a = 1, b = 0.1))
```
## 🔗 Multicollinearity (Regression)

### Solution 1: Remove Redundant Predictors

```r
# Check variance inflation factors (VIF)
library(car)
vif(model)

# Remove the predictor with the highest VIF (a common rule of thumb: VIF > 10),
# then refit the model without it
```
### Solution 2: Combine Correlated Predictors

```r
# Average highly correlated predictors (standardize them first if their scales differ)
data$combined_predictor <- (data$pred1 + data$pred2) / 2
model <- lm(outcome ~ combined_predictor + other_preds, data = data)
```
### Solution 3: Use Ridge Regression

```r
library(glmnet)

# Prepare data: glmnet needs a numeric matrix (drop the intercept column)
x <- model.matrix(outcome ~ pred1 + pred2 + pred3, data = data)[, -1]
y <- data$outcome

# Fit ridge regression (alpha = 0 selects the ridge penalty)
ridge_model <- glmnet(x, y, alpha = 0)
```
## 🎲 Independence Violated

### Clustered Data

Use mixed models:

```r
library(lme4)

# Random intercept for clusters (e.g., classrooms)
mixed_model <- lmer(outcome ~ predictor + (1 | cluster), data = data)
```
### Repeated Measures

Use repeated measures ANOVA or mixed models:

```r
# Repeated measures ANOVA
library(ez)
ezANOVA(data = data,
        dv = outcome,
        wid = subject_id,
        within = time)

# Or a mixed model
library(lme4)
lmer(outcome ~ time + (1 | subject_id), data = data)
```
## 📉 Small Expected Frequencies (Chi-Square)

### Solution 1: Combine Categories

```r
# Combine rare categories (as.character() avoids factor-level surprises in ifelse)
data$var_collapsed <- ifelse(data$var %in% c("Rare1", "Rare2"),
                             "Combined",
                             as.character(data$var))
chisq.test(table(data$var_collapsed, data$other_var))
```
### Solution 2: Use Fisher's Exact Test

Best known for 2×2 tables, though R's `fisher.test()` also handles larger tables.
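A minimal sketch, reusing the `var_collapsed` and `other_var` variables from Solution 1:

```r
# Fisher's exact test on a contingency table
tab <- table(data$var_collapsed, data$other_var)
fisher.test(tab)
```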
### Solution 3: Collect More Data

Sometimes this is the only real solution!
## 🎯 Decision Flowchart

```
Assumption violated
        │
Is the sample large (n > 30)?
   ├─ Yes → Probably OK: report the violation and continue the analysis
   └─ No  → Consider alternatives
               │
          Try a transformation
               │
          Still violated?
               │
          Use a non-parametric test
```
## 📝 Reporting Violations

Always report:

1. Which assumption was checked
2. The method used to check it (test + visual)
3. The result of the check
4. What you did about it
Example:

> "Normality was assessed using the Shapiro-Wilk test and Q-Q plots. Residuals showed significant deviation from normality (W = 0.89, p = .03), with evidence of right skew. Given the moderate sample size (n = 25) and the severity of the violation, we conducted a Mann-Whitney U test instead of an independent t-test."
## 💡 General Advice

### Best Practices
- **Check assumptions before finalizing your analysis**
- **Don't cherry-pick:** decide on your method before seeing the results
- **Be transparent:** report all assumption checks
- **When in doubt,** use conservative alternatives
- **Consider consulting a statistician** for complex violations
### What NOT to Do
❌ Ignore violated assumptions
❌ Try multiple tests until one "works"
❌ Transform data repeatedly until it's "normal"
❌ Remove outliers without justification
❌ Proceed knowing assumptions are severely violated
## 📚 Resources

- **Interactive learning:** Normality Modules
- **Non-parametric alternatives:** Non-Parametric Tests
- **Understanding assumptions:** Understanding Normality