📊 Regression Methods Comparison

Linear • Logistic • Multinomial • Poisson • Cox Regression

Each method below is summarized by its dependent variable, purpose & use cases, mathematical foundation, key assumptions, an R code example, interpretation, and pros & cons.
Linear Regression

Dependent variable: Continuous
Examples:
• Word frequency
• Sentence length
• Reading time
• Pitch values

Purpose & use cases: Predicts continuous outcomes
• Understand relationships between variables
• Predict numerical values
• Control for confounding variables

Corpus example: Predicting average sentence length based on text complexity, author age, and genre

Mathematical foundation:
Y = β₀ + β₁X₁ + β₂X₂ + ... + ε
Link function: Identity
Distribution: Normal
Estimation: Ordinary Least Squares (OLS)

Key assumptions:
• Linearity of relationships
• Independence of observations
• Homoscedasticity (constant variance)
• Normality of residuals
• No multicollinearity

R code example:
# Basic linear regression
model <- lm(sentence_length ~ complexity + age + genre,
  data = corpus)

summary(model)
confint(model)
plot(model)  # diagnostic plots

Interpretation:
Coefficients (β): change in Y per unit change in X
R²: proportion of variance explained
Example: β = 2.5 means sentence length increases by 2.5 words per unit increase in complexity
(A prediction sketch follows this method's entry.)

Pros:
• Simple to interpret
• Fast computation
• No distributional assumptions on predictors
• Well-established theory

Cons:
• Strong assumptions
• Sensitive to outliers
• Limited to continuous outcomes
• Assumes linear relationships
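Prediction sketch: a minimal example of obtaining predicted sentence lengths from the fitted model above, assuming the predictor names from that model; the new data frame and its values are purely illustrative, and any genre levels must already exist in corpus.

# Hypothetical new texts (illustrative values only)
new_texts <- data.frame(
  complexity = c(3.2, 5.8),
  age = c(25, 60),
  genre = c("fiction", "news")  # levels must match those in corpus
)

# Point predictions with 95% prediction intervals
predict(model, newdata = new_texts, interval = "prediction", level = 0.95)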
Logistic Regression

Dependent variable: Binary
Examples:
• Variant choice (A vs B)
• Presence/absence
• Success/failure
• Yes/no responses

Purpose & use cases: Predicts binary outcomes
• Model the probability of an event
• Classification problems
• Sociolinguistic variation

Corpus example: Predicting whether speakers use "going to" vs "gonna" based on formality, age, and region

Mathematical foundation:
log(p/(1-p)) = β₀ + β₁X₁ + β₂X₂ + ...
Link function: Logit
Distribution: Binomial
Estimation: Maximum Likelihood

Key assumptions:
• Independence of observations
• Linear relationship between predictors and the log-odds
• No extreme outliers
• Large sample size
• No perfect multicollinearity

R code example:
# Logistic regression
model <- glm(variant ~ formality + age + region,
  data = corpus,
  family = binomial)

summary(model)
exp(coef(model))  # odds ratios

Interpretation:
Odds ratio: exp(β) = multiplicative change in the odds per unit change in X
Probability: convert from log-odds using the logistic function (a conversion sketch follows this method's entry)
Example: OR = 2.0 means the odds of "gonna" double with each unit increase in informality

Pros:
• Handles binary outcomes
• Provides probabilities
• No normality or homoscedasticity assumptions
• Coefficients convert directly to odds ratios

Cons:
• Requires a large sample
• Sensitive to outliers in X
• Interpretation is on the log-odds scale, so it is less direct
• Assumes a linear logit
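Conversion sketch: a minimal example of turning the model's log-odds into probabilities, assuming the fit above; the speaker values and the region level are made up for illustration.

# Hypothetical speaker (region level must exist in corpus)
new_speaker <- data.frame(formality = 2, age = 30, region = "south")

log_odds <- predict(model, newdata = new_speaker, type = "link")  # linear predictor (log-odds)
plogis(log_odds)                                                  # logistic function: probability of the modelled outcome

predict(model, newdata = new_speaker, type = "response")          # same probability, computed directly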
Multinomial Regression

Dependent variable: Categorical (3+ categories)
Examples:
• Multiple variants
• Text categories
• Rating scales
• Language choices

Purpose & use cases: Predicts multiple categories
• When the outcome has 3+ unordered categories
• Multi-class classification
• Linguistic variation studies

Corpus example: Predicting choice among "soda", "pop", "soft drink", and "coke" based on geography and demographics

Mathematical foundation:
log(P(Y=k)/P(Y=ref)) = β₀ₖ + β₁ₖX₁ + ...
Link function: Generalized logit
Distribution: Multinomial
Estimation: Maximum Likelihood

Key assumptions:
• Independence of observations
• Independence of Irrelevant Alternatives (IIA)
• Linear relationship between predictors and the log-odds
• No multicollinearity
• Large sample size

R code example:
library(nnet)
model <- multinom(drink_term ~ region + age + education,
  data = corpus)

summary(model)
exp(coef(model))  # relative risk ratios

Interpretation:
Relative risk ratio: exp(β), comparing each category to the reference category
Predicted probabilities: one per category for each observation (a sketch follows this method's entry)
Example: RRR = 3.0 means the ratio P("pop")/P("soda") is three times higher for Midwesterners than for speakers in the reference region

Pros:
• Handles multiple categories
• Flexible framework
• Natural fit for linguistic variation
• Provides category probabilities

Cons:
• IIA assumption is restrictive
• Requires large samples
• Complex interpretation
• Many parameters to estimate
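Probability sketch: a minimal example of obtaining per-category probabilities from the multinom() fit above; the respondent values and factor levels are hypothetical and must exist in corpus.

# Hypothetical respondents (illustrative values only)
new_speakers <- data.frame(
  region = c("Midwest", "South"),
  age = c(35, 35),
  education = c("college", "college")
)

# One column of probabilities per drink term; each row sums to 1
predict(model, newdata = new_speakers, type = "probs")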
Poisson Regression

Dependent variable: Count data
Examples:
• Word counts
• Frequency per text
• Number of errors
• Occurrences per time unit

Purpose & use cases: Predicts count data
• Frequency analysis
• Rate modeling
• When outcomes are non-negative integers

Corpus example: Predicting the number of discourse markers per 1,000 words based on speaker characteristics and context

Mathematical foundation:
log(λ) = β₀ + β₁X₁ + β₂X₂ + ...
Link function: Log
Distribution: Poisson
Estimation: Maximum Likelihood

Key assumptions:
• Independence of observations
• Mean equals variance (equidispersion)
• Linear relationship between predictors and log(mean)
• No excess zeros
• Large sample size

R code example:
# Poisson regression
model <- glm(word_count ~ complexity + genre + length,
  data = corpus,
  family = poisson)

summary(model)
exp(coef(model))  # rate ratios

# Check for overdispersion (a remedy sketch follows this method's entry)
library(AER)
dispersiontest(model)

Interpretation:
Rate ratio: exp(β) = multiplicative change in the expected count per unit change in X
Expected count: exp(linear predictor)
Example: RR = 1.5 means a 50% increase in discourse marker frequency per unit increase in complexity

Pros:
• Natural for count data
• Handles skewed distributions
• Interpretable rate ratios
• No upper bound on counts

Cons:
• Assumes mean = variance
• Problems with overdispersion
• Sensitive to outliers
• Zero-inflation issues
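Overdispersion sketch: if dispersiontest() suggests the variance exceeds the mean, one common remedy (among others, such as a quasi-Poisson model) is a negative binomial model; a minimal sketch using MASS with the same illustrative variables as above.

library(MASS)

# Negative binomial model: adds a dispersion parameter, relaxing the mean = variance constraint
nb_model <- glm.nb(word_count ~ complexity + genre + length, data = corpus)

summary(nb_model)
exp(coef(nb_model))  # rate ratios, read the same way as in the Poisson model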
Cox Regression

Dependent variable: Time-to-event
Examples:
• Timing of language change
• Age at first occurrence
• Response latency
• Duration until an event

Purpose & use cases: Analyzes the time until an event occurs
• Survival analysis
• Handles censored data
• Historical linguistics

Corpus example: Analyzing the time until a language change spreads through communities, accounting for social network factors

Mathematical foundation:
h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)
Link function: Log (for the hazard ratio)
Distribution: Non-parametric baseline hazard
Estimation: Partial Likelihood

Key assumptions:
• Proportional hazards
• Independence of observations
• Linear relationship between predictors and the log-hazard
• No time-dependent confounding
• Censoring is non-informative

R code example:
library(survival)
model <- coxph(Surv(time, event) ~ social_network + education + community_size,
  data = corpus)

summary(model)
# Test the proportional hazards assumption
cox.zph(model)

Interpretation:
Hazard ratio: exp(β) = relative instantaneous risk of the event
Survival curves: probability of "surviving" (not yet experiencing the event) up to time t (a plotting sketch follows this method's entry)
Example: HR = 2.0 means twice the hazard of adopting the language change per unit increase in network connectivity

Pros:
• Handles censored data
• Semi-parametric approach
• Flexible baseline hazard
• Well-established theory

Cons:
• Proportional hazards assumption
• Complex interpretation
• Requires a survival data structure (time plus an event indicator)
• Limited to time-to-event outcomes
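Survival curve sketch: survfit() can be applied to the fitted Cox model to obtain a predicted survival curve; by default the curve is evaluated at the mean values of the covariates, which is a convenient but simplified summary.

# Predicted survival curve from the Cox model (covariates held at their means by default)
fit <- survfit(model)
summary(fit)

plot(fit, xlab = "Time", ylab = "Proportion not yet adopting the change")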

🎯 Quick Selection Guide

Linear: Continuous numbers (word frequency, sentence length)

Logistic: Two choices (variant A vs B, yes/no)

Multinomial: Multiple choices (dialect variants, text types)

Poisson: Count data (occurrences per text, errors per speaker)

Cox: Time until something happens (age of language change, response time)