nlme for TSCS Data

We’re going to be analyzing data on voting rates (the percentage of the adult population that voted) in US midterm elections from 1978 to 2014. Load the data files and combine them in preparation for the analysis.

library(tidyverse)
library(reshape2)

## load population data
pop <- read.csv('pop.csv') %>%
  rename(year = V1, state = V2, population = pop) %>%
  mutate(population = scale(population))

## load voting rate data
vote <- read.csv('vote.csv') %>% melt() %>%
  mutate(variable = gsub('X', '', variable), variable = as.numeric(variable)) %>%
  rename(year = variable, vote = value)
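reshape2 is no longer under active development, so if you prefer to stay within the tidyverse, the same reshape can be done with tidyr's pivot_longer(). This sketch assumes vote.csv contains a state column plus one X-prefixed column per election year, as implied above:

## tidyverse alternative to melt()
vote <- read.csv('vote.csv') %>%
  pivot_longer(-state, names_to = 'year', names_prefix = 'X',
               names_transform = list(year = as.numeric), values_to = 'vote')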

## load region data
regions <- read.csv('regions.csv') %>%
  select(state = State, State.Code, region = Region)

## merge data
dat <- vote %>% left_join(regions) %>% select(-state) %>% rename(state = State.Code) %>% left_join(pop)

nlme uses a slightly different syntax for mixed effects models than lme4 does. We specify the fixed part of the model using a regular formula, just like we would for lm(), but the random component has to be supplied through the random argument, e.g. random = ~ 1 | group produces a random intercept by group just like (1 | group) does in lme4.
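To make the mapping between the two packages concrete, here is a quick sketch of equivalent specifications (y, x, group, and df are hypothetical placeholders):

## random intercept by group
lme(y ~ x, random = ~ 1 | group, data = df)  # nlme
lmer(y ~ x + (1 | group), data = df)         # lme4

## correlated random intercept and slope by group
lme(y ~ x, random = ~ x | group, data = df)  # nlme
lmer(y ~ x + (x | group), data = df)         # lme4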

library(nlme)

## fit model
mod1 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year)
## Error in na.fail.default(structure(list(vote = c(40.9, 49, 35.7, 36.5, : missing values in object

Uh oh, it looks like we have some missing values in our data, and lme() defaults to na.action = na.fail rather than performing listwise deletion like lmer() does. Unfortunately, the error doesn’t tell us where the problem in our data is. Use apply() and anyNA() to figure out where we should start looking.

apply(dat, 2, anyNA)
##       year       vote      state     region population 
##      FALSE      FALSE       TRUE       TRUE       TRUE

It looks like there are missing values for state, region, and population. This means that there aren’t any issues with our voting data, but something is going wrong with our region and population data. Since we join the region data before joining the population data, let’s start there.

dat <- vote %>% left_join(regions)

## list missing states
dat$state[is.na(dat$State.Code)]
##  [1] "Rhode Island " "Utah "         "Rhode Island " "Utah "        
##  [5] "Rhode Island " "Utah "         "Rhode Island " "Utah "        
##  [9] "Rhode Island " "Utah "         "Rhode Island " "Utah "        
## [13] "Rhode Island " "Utah "         "Rhode Island " "Utah "        
## [17] "Rhode Island " "Utah "         "Rhode Island " "Utah "

Our voting data have an extra trailing space for Rhode Island and Utah. left_join() uses exact string matching, so this extra space prevents those states from matching. Look up the sub() function by typing ?sub. How can we use it to fix our data and ensure that everything gets properly merged?

## drop trailing spaces in RI and UT
vote <- vote %>% mutate(state = sub(' $', '', state))
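Note that this pattern only removes a single trailing space; base R’s trimws() is a more general alternative that strips all trailing whitespace:

## equivalent fix with trimws()
vote <- vote %>% mutate(state = trimws(state, which = 'right'))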

## merge data
dat <- vote %>% left_join(regions) %>% select(-state) %>% rename(state = State.Code) %>% left_join(pop)

## check for NAs again
apply(dat, 2, anyNA)
##       year       vote      state     region population 
##      FALSE      FALSE      FALSE      FALSE      FALSE

Now we’re good and can go ahead and estimate our model.

library(texreg)

mod1 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year)
htmlreg(mod1, stars = .05)
Statistical models
---------------------------
                    Model 1
---------------------------
(Intercept)          50.85*
                     (0.90)
population           -2.36*
                     (0.27)
regionNortheast      -3.26*
                     (0.86)
regionSouth          -8.67*
                     (0.73)
regionWest           -3.00*
                     (0.78)
---------------------------
AIC                 3327.07
BIC                 3356.65
Log Likelihood     -1656.54
Num. obs.               510
Num. groups              10
---------------------------
* p < 0.05

While nlme isn’t as advanced as lme4 in a lot of ways, it has certain functionalities that offer much more control over your model. What if we’re worried that we not only need a random intercept by year, but that the residual variance might also differ by year? We can account for this possibility with the varIdent() function and the weights argument to lme(). One thing to watch out for: the stratification formula has to be passed through varIdent()’s form argument; if you pass it as the first (value) argument instead, it is silently ignored and you get a constant-variance fit.

mod2 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year,
            weights = varIdent(form = ~ 1 | year))
htmlreg(mod2, stars = .05)
Statistical models
---------------------------
                    Model 1
---------------------------
(Intercept)          50.85*
                     (0.90)
population           -2.36*
                     (0.27)
regionNortheast      -3.26*
                     (0.86)
regionSouth          -8.67*
                     (0.73)
regionWest           -3.00*
                     (0.78)
---------------------------
AIC                 3327.07
BIC                 3356.65
Log Likelihood     -1656.54
Num. obs.               510
Num. groups              10
---------------------------
* p < 0.05
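One way to see the estimated variance function, which reports each year’s residual standard deviation relative to the reference year, is to print the varStruct component of the fitted model:

## estimated relative residual standard deviations by year
mod2$modelStruct$varStruct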

Correlation Structures

The previous model assumes that errors are correlated within years, but not between them. Let’s take things a step further and fit an explicit time series model to our data. Take a look at what the correlation argument to lme() does and use the corAR1() function to fit a model with a first-order autoregressive error structure, with states as the grouping variable.

mod3 <- lme(vote ~ population + region, data = dat, random = ~ 1 | state,
           correlation = corAR1(form = ~ year | state))
htmlreg(mod3, stars = .05)
Statistical models
---------------------------
                    Model 1
---------------------------
(Intercept)          50.86*
                     (1.31)
population           -3.17*
                     (0.55)
regionNortheast      -3.18
                     (2.01)
regionSouth          -8.61*
                     (1.72)
regionWest           -3.09
                     (1.82)
---------------------------
AIC                 3219.59
BIC                 3253.39
Log Likelihood     -1601.80
Num. obs.               510
Num. groups              51
---------------------------
* p < 0.05
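The regression table doesn’t display the estimated autocorrelation, but we can extract it from the fitted model; Phi is nlme’s name for the AR(1) parameter:

## estimated AR(1) autocorrelation parameter
coef(mod3$modelStruct$corStruct, unconstrained = FALSE)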

The corCompSymm() structure imposes compound symmetry, which constrains all off-diagonal entries of the within-group error correlation matrix to be equal rather than freely varying. This is a restrictive assumption, but it is exactly the correlation structure that a random intercept implies, so it matches the assumption built into lmer(). Fit a model with random intercepts by state using both lmer() and lme() and compare the results.

library(lme4)

mod_nlme <- lme(vote ~ population + region, data = dat, random = ~ 1 | state,
                correlation = corCompSymm())
mod_lme4 <- lmer(vote ~ population + region + (1 | state), data = dat)

htmlreg(list(mod_nlme, mod_lme4), stars = .05,
        custom.model.names = c('<code>nlme</code>', '<code>lme4</code>'))
Statistical models
------------------------------------------
                          nlme       lme4
------------------------------------------
(Intercept)             50.86*     50.86*
                        (1.31)     (1.31)
population              -3.17*     -3.17*
                        (0.55)     (0.55)
regionNortheast          -3.18      -3.18
                        (2.01)     (2.01)
regionSouth             -8.61*     -8.61*
                        (1.72)     (1.72)
regionWest               -3.09      -3.09
                        (1.82)     (1.82)
------------------------------------------
AIC                    3219.59    3217.59
BIC                    3253.39    3247.23
Log Likelihood        -1601.80   -1601.80
Num. obs.                  510        510
Num. groups                 51
Num. groups: state                     51
Var: state (Intercept)              18.10
Var: Residual                       26.22
------------------------------------------
* p < 0.05

They’re identical! Well, at least the coefficient estimates, standard errors, variances of the random effects, and log-likelihoods are identical. AIC and BIC are different because the two fits don’t count the same number of parameters: mod_nlme carries the compound-symmetry correlation parameter as an extra degree of freedom, so its AIC is exactly 2 higher. We can verify this by running AIC(mod_nlme, mod_lme4) and comparing the degrees of freedom.
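As a quick check (output omitted), the df column should show one more parameter for the nlme fit, with AIC values matching the table above:

AIC(mod_nlme, mod_lme4)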

Model Selection

Since there are so many different ways we can specify a mixed effects model (different random intercepts and slopes, different correlation structures, different levels of groups, etc.), it’s important to think about how we decide between different specifications. The simplest way is to perform a likelihood ratio test with the anova() function. Fit a regular linear model with the same fixed effects as one of your multilevel models and compare the two; remember to list the unrestricted model first!

mod_lm <- lm(vote ~ population + region, data = dat)

anova(mod_lme4, mod_lm)
## Data: dat
## Models:
## mod_lm: vote ~ population + region
## mod_lme4: vote ~ population + region + (1 | year)
##          Df  AIC  BIC logLik deviance Chisq Chi Df Pr(>Chisq)    
## mod_lm    6 3365 3390  -1676     3353                            
## mod_lme4  7 3331 3360  -1658     3317  35.9      1      2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our unrestricted model is more likely to have generated the data than our restricted one, so we should definitely be fitting some type of multilevel model to these data. However, the chi-squared approximation to the likelihood ratio test is unreliable when testing a variance component, because the null value of 0 sits on the boundary of the parameter space. A parametric bootstrap gives us a better approximation to the null distribution of the test statistic.

## parametric bootstrap of likelihood ratio test
lme.boot <- function(model_r, model_ur){
  
  ## simulate an outcome from the restricted (null) model; assign it globally
  ## with <<- so update() can find new_y when it re-evaluates each formula
  new_y <<- simulate(model_r, 1)[[1]]
  
  ## refit both models to the simulated outcome and compute their deviances
  restricted_dev <- -2 * logLik(update(model_r, new_y ~ .))
  unrestricted_dev <- -2 * logLik(update(model_ur, new_y ~ .))
  
  ## likelihood ratio test statistic under the null
  return(restricted_dev - unrestricted_dev)
  
}

## 100 bootstraps
boot_dev_nre <- replicate(100, expr = lme.boot(mod_lm, mod_lme4))

## the bootstrapped approximation (preferred to the chi-squared one)
test_stat <- (-2 * logLik(mod_lm)) - (-2 * logLik(mod_lme4))
mean(boot_dev_nre > test_stat) # approximate p-value
## [1] 0

Not a single bootstrapped likelihood ratio test statistic is greater than our observed test statistic, so once again our unrestricted model fits the data better than the restricted one. Let’s try comparing the first-order autoregressive model with the heteroskedastic model.

## 100 bootstraps
boot_dev_nre <- replicate(100, expr = lme.boot(mod2, mod3))
## Error in simulate.lme(model_r, 1): models with "corStruct" and/or "varFunc" objects not allowed

Unfortunately, simulate.lme() doesn’t allow us to simulate outcomes from models with corStruct correlation structures or varFunc variance functions, so we have to try another option.

anova(mod3, mod2)
##      Model df  AIC  BIC logLik   Test L.Ratio p-value
## mod3     1  8 3325 3359  -1654                       
## mod2     2  7 3327 3357  -1657 1 vs 2     4.2   0.041

The likelihood ratio test favors our unrestricted AR(1) model over the heteroskedastic one (p = 0.04), and the AR(1) model has the lower AIC as well.