Overview
🎯 Purpose
Analyze relationships between multiple linguistic variables simultaneously to understand complex patterns in language use
📈 Key Advantage
Controls for confounding variables and isolates the independent contribution of each predictor to the outcome
🔍 Applications
Sociolinguistic variation, register analysis, diachronic change, cross-linguistic comparison
Common Multivariate Methods
Method | Type | Purpose | Data Requirements | R Package / Function |
---|---|---|---|---|
Multiple Linear Regression | Regression | Predict continuous outcomes (frequency, duration) | Continuous DV, mixed predictors | lm(), glm() |
Logistic Regression | Classification | Predict binary outcomes (variant choice) | Binary DV, mixed predictors | glm(family="binomial") |
Mixed-Effects Models | Regression | Account for speaker/text-level variation | Hierarchical data structure | lme4, nlme |
Correspondence Analysis | Dimensionality | Visualize associations in contingency tables | Categorical variables | ca, FactoMineR |
Principal Component Analysis | Dimensionality | Reduce dimensionality, find patterns | Continuous variables | prcomp(), princomp() |
Cluster Analysis | Clustering | Group similar observations | Distance matrix | cluster, stats |
Random Forest | Classification | Variable importance, non-linear relationships | Large datasets, mixed predictors | randomForest |
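To make the first row of the table concrete, here is a minimal sketch of a multiple linear regression fit with lm(); the data frame corpus_data and the columns duration, frequency, word_length, and register are hypothetical placeholders.
lm_model <- lm(duration ~ frequency + word_length + register,
               data = corpus_data)   # lm() is in base R (stats); no extra package needed
summary(lm_model)    # coefficients, standard errors, R-squared
confint(lm_model)    # 95% confidence intervals for the coefficients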
Model Selection Guide
Research Question | Dependent Variable | Data Structure | Recommended Method |
---|---|---|---|
What predicts word frequency? | Continuous (counts) | Independent observations | Poisson/Negative Binomial Regression |
Which variant will speakers choose? | Binary (variant A vs B) | Multiple tokens per speaker | Mixed-Effects Logistic Regression |
How do registers cluster together? | N/A (exploratory) | Feature vectors per text | Hierarchical Clustering + PCA |
What linguistic features co-occur? | N/A (exploratory) | Feature frequencies | Correspondence Analysis |
Which factors predict grammaticalization? | Ordinal (stages) | Historical data | Ordinal Logistic Regression |
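As a sketch of the first scenario above (predicting word frequency from count data), a negative binomial model handles over-dispersed counts; glm.nb() comes from the MASS package, and the columns word_count, register, speaker_age, and text_length are hypothetical.
library(MASS)    # provides glm.nb() for over-dispersed count data
nb_model <- glm.nb(word_count ~ register + speaker_age + offset(log(text_length)),
                   data = corpus_data)   # offset adjusts for differing text lengths
summary(nb_model)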
Key Assumptions & Diagnostics
Method | Key Assumptions | Diagnostic Tests | Solutions if Violated |
---|---|---|---|
Linear Regression | Linearity, independence, homoscedasticity, normality | Residual plots, Shapiro-Wilk test | Transform variables, use GLM |
Logistic Regression | Independence, linearity of logit | Deviance residuals, ROC curves | Add interaction terms, polynomial terms |
Mixed-Effects | Random effects normality, independence | Q-Q plots of random effects | Transform data, or specify a different random-effects/correlation structure |
PCA | Linear relationships, adequate sample size | KMO test, Bartlett's test | Use factor analysis or non-linear methods |
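A brief sketch of the diagnostics named in the table, assuming a fitted linear model lm_model and a numeric feature matrix feature_matrix (both placeholders); vif() is from the car package, KMO() and cortest.bartlett() from psych.
plot(lm_model)                      # residuals vs. fitted, Q-Q, scale-location, leverage plots
shapiro.test(residuals(lm_model))   # Shapiro-Wilk test of residual normality
library(car)
vif(lm_model)                       # variance inflation factors flag multicollinearity
library(psych)
KMO(feature_matrix)                 # sampling adequacy before PCA
cortest.bartlett(cor(feature_matrix), n = nrow(feature_matrix))   # Bartlett's test of sphericity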
Step-by-Step Workflow
- Data Preparation: Clean data, handle missing values, and create dummy variables for categorical predictors
- Exploratory Data Analysis: Examine distributions, correlations, and potential outliers
- Feature Selection: Use domain knowledge, correlation analysis, or stepwise selection
- Model Fitting: Start with simple models and gradually add complexity
- Assumption Checking: Verify model assumptions using diagnostic plots and tests
- Model Comparison: Use AIC, BIC, likelihood ratio tests, or cross-validation (see the snippet after this list)
- Interpretation: Calculate effect sizes, confidence intervals, and practical significance
- Validation: Test on holdout data or use resampling methods
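For the Model Comparison step, a minimal sketch with two nested logistic models; the formulas and data frame are hypothetical.
m1 <- glm(variant ~ frequency, data = corpus_data, family = binomial)
m2 <- glm(variant ~ frequency + age + gender, data = corpus_data, family = binomial)
AIC(m1, m2)                      # lower values indicate a better fit-complexity trade-off
BIC(m1, m2)
anova(m1, m2, test = "Chisq")    # likelihood ratio test for nested models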
Essential R Code Snippets
Mixed-Effects Logistic Regression
library(lme4)    # glmer() fits generalized linear mixed-effects models
model <- glmer(variant ~ age + gender + frequency + (1 | speaker),
               data = corpus_data, family = binomial)   # random intercept for each speaker
summary(model)   # fixed effects reported on the log-odds scale
confint(model)   # confidence intervals (profile likelihood by default)
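Because glmer() reports fixed effects on the log-odds scale, a common follow-up (assuming the model above) is to exponentiate them into odds ratios:
exp(fixef(model))                                      # fixed-effect odds ratios
exp(confint(model, parm = "beta_", method = "Wald"))   # Wald intervals on the odds-ratio scale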
Correspondence Analysis
library(ca)             # ca() for simple correspondence analysis
library(FactoMineR)     # optional: CA() offers an alternative implementation with richer plots
ca_result <- ca(contingency_table)   # contingency table of, e.g., features by register
plot(ca_result)         # biplot of the first two dimensions
summary(ca_result)      # inertia (variance) explained by each dimension
Random Forest Variable Importance
library(randomForest)    # randomForest(), importance(), varImpPlot()
rf_model <- randomForest(outcome ~ ., data = train_data)   # all remaining columns as predictors
importance(rf_model)     # variable importance scores
varImpPlot(rf_model)     # dot plot of the importance ranking
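PCA and Hierarchical Clustering
A sketch for the register-clustering scenario from the Model Selection Guide; feature_matrix is a hypothetical texts-by-features frequency matrix, and everything used here is base R.
pca_result <- prcomp(feature_matrix, scale. = TRUE)   # standardize features before PCA
summary(pca_result)                                   # variance explained per component
biplot(pca_result)
hc <- hclust(dist(scale(feature_matrix)), method = "ward.D2")   # Ward clustering on Euclidean distances
plot(hc)                                              # dendrogram of text/register groupings
cutree(hc, k = 4)                                     # cut the tree into, e.g., four clusters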
Interpretation Guidelines
Statistic | Interpretation | Corpus Linguistics Context |
---|---|---|
Odds Ratio (OR) | Change in odds for a 1-unit increase in the predictor | OR = 2.5 means the odds of variant A increase by a factor of 2.5 per unit increase in frequency |
Coefficient (β) | Change in log-odds or outcome per unit change | β = 0.5 means 0.5 increase in log-odds per unit increase |
R² | Proportion of variance explained | R² = 0.3 means model explains 30% of linguistic variation |
C-index | Concordance/discrimination (0.5-1.0) | C = 0.8 means the model correctly ranks 80% of pairs of tokens with different outcomes |
Eigenvalue | Variance explained by each dimension | First 2 dimensions explain 65% of register variation |
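These quantities can be extracted directly from a fitted model; here is a sketch for a logistic regression glm_model (a placeholder), with the C-index computed via the pROC package (an assumption, not required by the methods above).
exp(coef(glm_model))       # odds ratios from log-odds coefficients
exp(confint(glm_model))    # 95% confidence intervals on the odds-ratio scale
library(pROC)
auc(roc(corpus_data$variant, fitted(glm_model)))   # C-index / AUC: concordance of predictions and outcomes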
⚠️ Common Pitfalls:
- Not checking for multicollinearity between predictors
- Ignoring hierarchical structure in corpus data
- Over-interpreting small effect sizes
- Not validating models on new data (see the holdout sketch after this list)
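A minimal holdout-validation sketch for the last pitfall, assuming a corpus_data frame with a 0/1-coded variant column (hypothetical names throughout):
set.seed(42)
train_idx <- sample(nrow(corpus_data), size = floor(0.8 * nrow(corpus_data)))
fit <- glm(variant ~ age + gender + frequency,
           data = corpus_data[train_idx, ], family = binomial)
pred_prob <- predict(fit, newdata = corpus_data[-train_idx, ], type = "response")
mean((pred_prob > 0.5) == corpus_data$variant[-train_idx])   # holdout classification accuracy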
💡 Pro Tips:
- Always center and scale continuous predictors (illustrated in the sketch after this list)
- Use cross-validation for robust model evaluation
- Consider effect sizes, not just p-values
- Visualize results with confidence intervals
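A sketch of the first two tips, with hypothetical column names; cv.glm() comes from the boot package.
corpus_data$frequency_z <- as.numeric(scale(corpus_data$frequency))      # z-scored predictor
corpus_data$age_c <- as.numeric(scale(corpus_data$age, scale = FALSE))   # centered predictor
glm_fit <- glm(variant ~ age_c + gender + frequency_z,
               data = corpus_data, family = binomial)
library(boot)
cv.glm(corpus_data, glm_fit, K = 10)$delta   # 10-fold cross-validated prediction error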