Method | Dependent Variable | Purpose & Use Cases | Mathematical Foundation | Key Assumptions | R Code Example | Interpretation | Pros & Cons |
---|---|---|---|---|---|---|---|
Linear Regression |
Continuous
Examples: • Word frequency • Sentence length • Reading time • Pitch values |
Predicts continuous outcomes
• Understand relationships between variables
• Predict numerical values
• Control for confounding variables
Corpus Example: Predicting average sentence length based on text complexity, author age, and genre |
Y = β₀ + β₁X₁ + β₂X₂ + ... + ε
Link function: Identity
Distribution: Normal
Estimation: Ordinary Least Squares (OLS) |
|
# Basic linear regression
model <- lm(sentence_length ~ complexity + age + genre, data = corpus)
summary(model)
confint(model)
plot(model) # diagnostics |
Coefficients (β): Change in Y per unit change in X, holding the other predictors constant
R²: Proportion of variance explained
Example: β = 2.5 means sentence length increases by 2.5 words per unit increase in complexity |
Pros
Cons
|
Logistic Regression |
Binary
Examples: • Variant choice (A vs B) • Presence/absence • Success/failure • Yes/no responses |
Predicts binary outcomes
• Model probability of an event
• Classification problems
• Sociolinguistic variation
Corpus Example: Predicting whether speakers use "going to" vs "gonna" based on formality, age, and region |
log(p/(1-p)) = β₀ + β₁X₁ + β₂X₂ + ...
Link function: Logit
Distribution: Binomial
Estimation: Maximum Likelihood |
|
# Logistic regression
model <- glm(variant ~ formality + age + region, data = corpus, family = binomial)
summary(model)
exp(coef(model)) # odds ratios |
Odds Ratio: exp(β) = multiplicative change in the odds per unit change in X
Probability: Convert the log-odds using the logistic function
Example: OR = 2.0 means the odds of "gonna" double with each unit increase in informality (the probability itself does not necessarily double) |
Pros
Cons
|
Multinomial Regression |
Categorical (3+ categories)
Examples: • Multiple variants • Text categories • Rating scales (only if their order is ignored) • Language choices |
Predicts multiple categories
• When outcome has 3+ unordered categories
• Multi-class classification
• Linguistic variation studies
Corpus Example: Predicting choice among "soda", "pop", "soft drink", "coke" based on geography and demographics |
log(P(Y=k)/P(Y=ref)) = β₀ₖ + β₁ₖX₁ + ...
Link function: Generalized logit
Distribution: Multinomial
Estimation: Maximum Likelihood |
|
library(nnet)
model <- multinom(drink_term ~ region + age + education, data = corpus)
summary(model)
# Relative risk ratios
exp(coef(model)) |
Relative Risk Ratio: exp(β) compares each category to the reference category
Predicted Probabilities: One for each category
Example: RRR = 3.0 for Midwesterners means their odds of saying "pop" rather than the reference term "soda" are 3 times those of speakers from the reference region |
Pros
Cons
|
Poisson Regression |
Count Data
Examples: • Word counts • Frequency per text • Number of errors • Occurrences per time |
Predicts count data
• Frequency analysis
• Rate modeling
• When outcomes are non-negative integers
Corpus Example: Predicting number of discourse markers per 1000 words based on speaker characteristics and context |
log(λ) = β₀ + β₁X₁ + β₂X₂ + ...
Link function: Log
Distribution: Poisson
Estimation: Maximum Likelihood |
|
# Poisson regression
model <- glm(word_count ~ complexity + genre + length, data = corpus, family = poisson)
# Check for overdispersion
library(AER)
dispersiontest(model) |
Rate Ratio: exp(β) = multiplicative change in expected count
Expected Count: exp(linear predictor)
Example: RR = 1.5 means 50% increase in discourse marker frequency per unit increase in complexity |
Pros
Cons
|
Cox Regression |
Time-to-Event
Examples: • Language change timing • First occurrence age • Response latency • Duration until event |
Analyzes time until event occurs
• Survival analysis
• Handles censored data
• Historical linguistics
Corpus Example: Analyzing time until language change spreads through communities, accounting for social network factors |
h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)
Link function: Log (for hazard ratio)
Distribution: Non-parametric baseline
Estimation: Partial Likelihood |
|
library(survival)
model <- coxph(Surv(time, event) ~ social_network + education + community_size, data = corpus)
summary(model)
# Test proportional hazards
cox.zph(model) |
Hazard Ratio: exp(β) = multiplicative change in the hazard (the instantaneous event rate) per unit change in X
Survival Curves: Probability of "surviving" to time t
Example: HR = 2.0 means the hazard of adopting the language change doubles per unit increase in network connectivity |
Pros
Cons
|
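The logistic regression row says that probabilities are obtained through the logistic function. The sketch below, in base R only, shows one way that conversion might look and why an odds ratio multiplies odds rather than probabilities. `apply_odds_ratio()` and the baseline probabilities are hypothetical and chosen only for illustration; they do not come from any corpus.

```r
# Hypothetical helper (not part of any package): apply an odds ratio to a
# baseline probability and return the resulting probability.
apply_odds_ratio <- function(p_baseline, odds_ratio) {
  log_odds <- qlogis(p_baseline) + log(odds_ratio)  # shift on the logit scale
  plogis(log_odds)                                  # logistic function: back to a probability
}

# Illustrative baseline probabilities of choosing "gonna"
baseline <- c(0.10, 0.50, 0.90)
shifted  <- apply_odds_ratio(baseline, odds_ratio = 2)

round(shifted, 3)
# 0.182 0.667 0.947
```

With an odds ratio of 2, a baseline probability of 0.50 rises to about 0.67, not to 1.00: the odds double in every case, while the change in probability depends on where the baseline sits.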
🎯 Quick Selection Guide
Linear: Continuous numbers (word frequency, sentence length)
Logistic: Two choices (variant A vs B, yes/no)
Multinomial: Multiple choices (dialect variants, text types)
Poisson: Count data (occurrences per text, errors per speaker)
Cox: Time until something happens (age of language change, response time)
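The selection guide above can be read as a decision rule on the type of the outcome variable. The following is a rough sketch of that rule in base R, not a definitive procedure. `suggest_model()` is a hypothetical helper invented here, and its checks are heuristics: it cannot tell an integer-valued continuous measure from a true count, and it treats ordered factors as unordered.

```r
# Hypothetical helper written for this illustration; it only inspects the
# outcome vector and returns the name of a candidate model.
suggest_model <- function(y) {
  # Time-to-event outcomes built with survival::Surv() carry the class "Surv"
  if (inherits(y, "Surv")) return("Cox: survival::coxph()")
  if (is.logical(y)) return("Logistic: glm(family = binomial)")
  if (is.character(y)) y <- factor(y)
  if (is.factor(y)) {
    k <- nlevels(droplevels(y))  # number of categories actually observed
    if (k < 2) stop("the outcome needs at least two observed categories")
    if (k == 2) return("Logistic: glm(family = binomial)")
    return("Multinomial: nnet::multinom()")
  }
  if (is.numeric(y)) {
    y <- y[!is.na(y)]
    # 0/1 coding is treated as a binary outcome
    if (all(y %in% c(0, 1))) return("Logistic: glm(family = binomial)")
    # Non-negative whole numbers are treated as counts
    if (all(y >= 0) && all(y == round(y))) return("Poisson: glm(family = poisson)")
    return("Linear: lm()")
  }
  stop("unsupported outcome type")
}

suggest_model(c(14.2, 21.7, 9.5))               # sentence lengths
suggest_model(c("soda", "pop", "coke", "pop"))  # dialect variants
suggest_model(c(0L, 3L, 7L, 2L))                # discourse marker counts
```

A helper like this only narrows the choice. The model assumptions still need checking after fitting, for example overdispersion for Poisson or proportional hazards for Cox.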