Multivariate Modeling Cheat Sheet

A Comprehensive Guide for Corpus Linguistics Research

Overview

🎯 Purpose

Analyze relationships between multiple linguistic variables simultaneously to understand complex patterns in language use

📈 Key Advantage

Controls for confounding variables and gives a clearer picture of how predictors relate to outcomes

🔍 Applications

Sociolinguistic variation, register analysis, diachronic change, cross-linguistic comparison

Common Multivariate Methods

Method | Type | Purpose | Data Requirements | R Function / Package
Multiple Linear Regression | Regression | Predict continuous outcomes (frequency, duration) | Continuous DV, mixed predictors | lm(), glm()
Logistic Regression | Classification | Predict binary outcomes (variant choice) | Binary DV, mixed predictors | glm(family = binomial)
Mixed-Effects Models | Regression | Account for speaker/text-level variation | Hierarchical data structure | lme4, nlme
Correspondence Analysis | Dimensionality reduction | Visualize associations in contingency tables | Categorical variables | ca, FactoMineR
Principal Component Analysis | Dimensionality reduction | Reduce dimensionality, find patterns | Continuous variables | prcomp(), princomp()
Cluster Analysis | Clustering | Group similar observations | Distance matrix | cluster, stats
Random Forest | Classification | Variable importance, non-linear relationships | Large datasets, mixed predictors | randomForest
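
As a minimal sketch of the PCA row, assuming feature_matrix is a hypothetical texts-by-features frequency matrix:

# Standardize features so high-frequency features do not dominate the components
pca_result <- prcomp(feature_matrix, center = TRUE, scale. = TRUE)
summary(pca_result)   # proportion of variance explained per component
biplot(pca_result)    # texts and features in the first two dimensions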

Model Selection Guide

Research Question | Dependent Variable | Data Structure | Recommended Method
What predicts word frequency? | Counts (token frequencies) | Independent observations | Poisson / Negative Binomial Regression
Which variant will speakers choose? | Binary (variant A vs. B) | Multiple tokens per speaker | Mixed-Effects Logistic Regression
How do registers cluster together? | N/A (exploratory) | Feature vectors per text | Hierarchical Clustering + PCA
What linguistic features co-occur? | N/A (exploratory) | Feature frequencies | Correspondence Analysis
Which factors predict grammaticalization? | Ordinal (stages) | Historical data | Ordinal Logistic Regression
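
A hedged sketch for the first row, assuming word_counts is a hypothetical data frame with a token count per item and predictors such as register and word_length; glm.nb() from MASS fits a negative binomial model, which handles the overdispersion typical of corpus counts:

library(MASS)

# Negative binomial regression for overdispersed frequency counts
nb_model <- glm.nb(count ~ register + word_length, data = word_counts)
summary(nb_model)
exp(coef(nb_model))   # rate ratios: multiplicative change in expected count per unit increase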

Key Assumptions & Diagnostics

Method | Key Assumptions | Diagnostic Tests | Solutions if Violated
Linear Regression | Linearity, independence, homoscedasticity, normality of residuals | Residual plots, Shapiro-Wilk test | Transform variables, use GLM
Logistic Regression | Independence, linearity of the logit | Deviance residuals, ROC curves | Add interaction terms, polynomial terms
Mixed-Effects Models | Normality of random effects, independence of residuals | Q-Q plots of random effects | Transform data, different random-effects structure
PCA | Linear relationships, adequate sample size | KMO test, Bartlett's test | Use factor analysis or non-linear methods
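
A sketch of these checks for the linear-regression row, assuming lm_model is a hypothetical fitted lm() object; the multicollinearity check uses the car package:

plot(lm_model)                      # residuals vs. fitted, Q-Q, scale-location, leverage plots
shapiro.test(residuals(lm_model))   # normality of residuals (very sensitive with large corpora)

library(car)
vif(lm_model)                       # variance inflation factors; values above ~5-10 signal multicollinearity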

Step-by-Step Workflow

  1. Data Preparation
    Clean data, handle missing values, and code categorical predictors as factors (R applies dummy coding automatically in model formulas)
  2. Exploratory Data Analysis
    Examine distributions, correlations, and potential outliers
  3. Feature Selection
    Use domain knowledge, correlation analysis, or stepwise selection
  4. Model Fitting
    Start with simple models, gradually add complexity
  5. Assumption Checking
    Verify model assumptions using diagnostic plots and tests
  6. Model Comparison
    Use AIC, BIC, likelihood ratio tests, or cross-validation (see the sketch after this list)
  7. Interpretation
    Calculate effect sizes, confidence intervals, and practical significance
  8. Validation
    Test on holdout data or use resampling methods
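
A minimal sketch of step 6, assuming model_simple and model_full are hypothetical nested models (e.g. glmer() fits differing only in their fixed effects):

AIC(model_simple, model_full)     # lower AIC = better trade-off between fit and complexity
BIC(model_simple, model_full)     # BIC penalizes extra parameters more heavily
anova(model_simple, model_full)   # likelihood ratio test for the added predictor(s)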

Essential R Code Snippets

Mixed-Effects Logistic Regression

library(lme4)

# Random intercept for speaker accounts for repeated tokens from the same speaker
model <- glmer(variant ~ age + gender + frequency + (1 | speaker),
               data = corpus_data, family = binomial)
summary(model)   # fixed effects on the log-odds scale
confint(model)   # profile confidence intervals (can be slow for large models)

Correspondence Analysis

# Input is a contingency table, e.g. feature counts cross-tabulated by register
library(ca)          # FactoMineR's CA() is an alternative implementation
ca_result <- ca(contingency_table)
plot(ca_result)      # symmetric biplot of row and column categories
summary(ca_result)   # principal inertias (variance explained per dimension)

Random Forest Variable Importance

library(randomForest)

# Outcome should be a factor for classification; "." uses all remaining columns as predictors
rf_model <- randomForest(outcome ~ ., data = train_data, importance = TRUE)
importance(rf_model)   # permutation (mean decrease in accuracy) and Gini importance
varImpPlot(rf_model)   # dotchart of variable importance

Interpretation Guidelines

Statistic | Interpretation | Corpus Linguistics Context
Odds Ratio (OR) | Change in odds for a 1-unit increase in the predictor | OR = 2.5 means the odds of variant A are 2.5× higher per unit increase in frequency
Coefficient (β) | Change in log-odds or outcome per unit change | β = 0.5 means a 0.5 increase in log-odds per unit increase
R² | Proportion of variance explained | R² = 0.3 means the model explains 30% of the linguistic variation
C-index | Concordance / discrimination (ranges 0.5-1.0) | C = 0.8 means the model correctly ranks a randomly chosen pair of tokens (one of each variant) 80% of the time
Eigenvalue | Variance explained by each dimension | First 2 dimensions explain 65% of register variation
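
For example, odds ratios and intervals can be pulled from the mixed-effects model fitted in the snippets above (a sketch, assuming that model object):

exp(fixef(model))                      # odds ratios for the fixed effects
exp(confint(model, method = "Wald"))   # approximate confidence intervals on the odds-ratio scale
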
⚠️ Common Pitfalls:
  • Not checking for multicollinearity between predictors
  • Ignoring hierarchical structure in corpus data
  • Over-interpreting small effect sizes
  • Not validating models on new data
💡 Pro Tips:
  • Always center and scale continuous predictors (see the sketch below)
  • Use cross-validation for robust model evaluation
  • Consider effect sizes, not just p-values
  • Visualize results with confidence intervals
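
A minimal centering-and-scaling sketch, assuming corpus_data has a continuous frequency predictor:

# z-score: subtract the mean, divide by the standard deviation
corpus_data$frequency_z <- as.numeric(scale(corpus_data$frequency))
# centering only, keeping the original unit of measurement
corpus_data$frequency_c <- as.numeric(scale(corpus_data$frequency, scale = FALSE))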