The glmnet() function solves the following minimization problem over a grid of \(\lambda\) values.1

\[ \begin{align} \min_{\beta_0,\beta} \frac{1}{N} \sum_{i=1}^{N} w_i \mathcal{L}(y_i|\beta_0+ x_i\beta) + \lambda\left[{(1-\alpha)}/{2}||\beta||_2^2 + \alpha ||\beta||_1\right] \end{align} \]

The elastic net mixing parameter is \(\alpha\). When \(\alpha=1\) the result is a lasso regression, i.e. the penalty is \(\lambda ||\beta||_1\), and when \(\alpha=0\) the result is a ridge regression, i.e. the penalty is \(\lambda \frac{1}{2}||\beta||_2^2\). \(\mathcal{L}(y_i|\beta_0+x_i\beta)\) is the negative log-likelihood contribution of \(y_i\) given a linear combination of the coefficients and predictors.

Since we aren’t choosing our predictors based on theoretical considerations, we need a way to determine which model is best. We can accomplish this using a variety of methods, but one of the most common is mean squared error (MSE), which is given by:

\[ \begin{align} \frac{1}{n}\sum_{i=1}^{n} \left(\hat{Y_i} - Y_i \right)^2 \end{align} \]
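To make this concrete, MSE is easy to compute by hand once we have predicted and observed values. The helper below is our own minimal sketch (the function name mean_sq_error is not used elsewhere in this lab):

# minimal MSE helper (our addition; not part of the original lab code)
mean_sq_error <- function(y_hat, y) {
  mean((y_hat - y)^2) # average squared deviation of predictions from observations
}

# example: errors of 1 and 2 give an MSE of (1 + 4) / 2 = 2.5
mean_sq_error(c(2, 4), c(1, 2))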

Let’s make some fake data to test out glmnet. Create a \(1000 \times 5000\) matrix \(\mathbf{X}\) whose entries are drawn from \(\mathcal{N}(0,1)\). This means we have a dataset with \(p \gg n\) if we want to include all of the possible predictors, so we need to figure out the best fitting model that includes \(1 \leq k \leq 1000\) of them. To make life easier in this example, create a response variable \(y\) that is a linear combination of the first 15 variables plus an error \(\epsilon \sim \mathcal{N}(0,1)\) (this implicitly sets \(\beta_1 = \cdots = \beta_{15} = 1\), since we’re not multiplying our predictors by anything).

set.seed(8125)
n <- 1000 # observations
p <- 5000 # predictors
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:15], 1, sum) + rnorm(n) # first 15 predictors are 'true'

MSE increasingly penalizes larger deviations from the true value due to the square. Since MSE needs both true and predicted values, we need training and test data. Randomly select 1/3 of x and y to leave out as test data.

train_rows <- sample(1:n, (2/3) * n)
x_train <- x[train_rows, ]
x_test <- x[-train_rows, ]
y_train <- y[train_rows]
y_test <- y[-train_rows]

We can easily estimate models with a variety of \(\alpha\) values, so run glmnet() on our training data with \(\alpha \in \{0, 0.5, 1\}\).

library(glmnet)
fit_lasso <- glmnet(x_train, y_train, family = 'gaussian', alpha=1) # LASSO
fit_ridge <- glmnet(x_train, y_train, family = 'gaussian', alpha=0) # ridge
fit_elnet <- glmnet(x_train, y_train, family = 'gaussian', alpha=.5) # e-net

Let’s assess which of these is the best approach. And since we’re choosing models based on predictive power, let’s do so for a range of \(\alpha\)s between 0 and 1. Use the cv.glmnet() function to carry out k-fold cross validation on the training set of our fake data.

library(parallel)

# cross-validate over alpha values from 0 to 1 in steps of .1
fits <- mclapply(0:10, function(a) cv.glmnet(x_train, y_train, type.measure = 'mse',
                                             alpha = a / 10, family = 'gaussian'))

The cv.glmnet() function will automatically identify the value of \(\lambda\) that minimizes the MSE for the selected \(\alpha\). Use plot() on the lasso, ridge, and elastic net models we ran above. Plot them next to their respective cv.glmnet() objects to see how their MSE changes with respect to different log(\(\lambda\)) values.
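As a quick reference (this snippet is our addition, not part of the original walkthrough), both the MSE-minimizing \(\lambda\) and the more conservative one-standard-error value are stored in each cv.glmnet object:

# lambda that minimizes cross-validated MSE for the alpha = 1 fit
fits[[11]]$lambda.min

# largest lambda within one standard error of that minimum
fits[[11]]$lambda.1se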

par(mfrow=c(3,2))
plot(fit_lasso, xvar = 'lambda')
plot(fits[[11]], main = 'LASSO')

plot(fit_elnet, xvar = 'lambda')
plot(fits[[6]], main = expression(paste('Elastic Net (', alpha, '= .5)')))

plot(fit_ridge, xvar = 'lambda')
plot(fits[[1]], main = 'Ridge')

For the lasso and elastic net models, we can see that MSE doesn’t substantially increase until the coefficient values for the first 15 predictors start shrinking towards 0. This tells us that we’re doing a good job of selecting relevant features without overfitting to our training data.

We can extract the \(\lambda\) value which produces the model with the lowest MSE from the cv.glmnet object, but first, let’s make sure we chose the best \(\alpha\) value for our data. Use the predict() function and our test x and y to generate \(\hat{y}\)s for each cross-validated fit. Be sure to set s = 'lambda.min' in predict(), because the default for a cv.glmnet object is the more conservative 'lambda.1se', not the \(\lambda\) value that produced the lowest MSE.

# out of sample predictive accuracy
mse <- sapply(0:10, function(a) mean((predict(fits[[a + 1]], newx = x_test,
                                              s = 'lambda.min') - y_test)^2))

# report OOS MSE for each alpha value
library(knitr) # for kable()
options(digits = 4)
kable(data.frame(alpha = (0:10)/10, mse))
 alpha     mse
   0.0  13.983
   0.1   2.385
   0.2   1.600
   0.3   1.388
   0.4   1.292
   0.5   1.247
   0.6   1.215
   0.7   1.199
   0.8   1.188
   0.9   1.175
   1.0   1.165
options(digits = 2)

In this case, the model that best predicts our data is the lasso, with \(\alpha = 1\) producing the lowest out-of-sample MSE. We can extract the \(\lambda\) value that produced the best fitting model from our cv.glmnet object, and then get the coefficients from that best fitting model. When using coef() on the cross-validated model, don’t forget to set s = 'lambda.min', since the default is the more conservative 'lambda.1se'. This will return a sparse matrix with the coefficient values for the coefficients included in the best fitting model (as assessed by MSE). Extract just these coefficients and their names to another object so we can see how the lasso did. str() might be helpful here…

lambda_best <- fits[[11]]$lambda.min # lambda for the best fitting (alpha = 1) model
best_coefs <- coef(fits[[11]], s = 'lambda.min')
data.frame(name = (best_coefs@Dimnames[[1]][best_coefs@i + 1]), coefficient = best_coefs@x)

Our lasso picked all 15 of the predictors we used to create our response variable – nice! It also picked 68 other predictors that weren’t in our model, but notice that the estimated coefficients for them (mean = 0) are much smaller than for the ‘true’ variables (mean = 0.89). This suggests that our lasso did a good job of identifying important features: the implicit coefficients on the first 15 predictors equal 1, the remaining predictors are absent from the data generating process so their implicit coefficients equal 0, and the lasso’s estimates track both.
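To see where those means come from, here is a quick check we added (a sketch; it assumes x has no column names, so glmnet labels the predictors V1 through V5000):

# our addition: mean estimated coefficient for the 'true' predictors (V1-V15)
# vs. the extra predictors the lasso also picked up
picked <- data.frame(name = best_coefs@Dimnames[[1]][best_coefs@i + 1],
                     coefficient = best_coefs@x)
true_vars <- paste0('V', 1:15)
mean(picked$coefficient[picked$name %in% true_vars])   # close to the true value of 1
mean(picked$coefficient[!(picked$name %in% c('(Intercept)', true_vars))]) # close to 0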

If we want to assess how much explanatory power these variables are providing, we can inspect a number of plots of the model.

# fit a glmnet with our best alpha value of 1 (equivalent to fit_lasso above)
fit_best <- glmnet(x_train, y_train, family = 'gaussian', alpha = 1)

# coefficients vs. L1 norm of coefficients
plot(fit_lasso, xvar = 'norm', label = T)

# coefficients vs. log lambda values
plot(fit_lasso, xvar = 'lambda', label = T)

# coefficients vs. % deviance explained
plot(fit_lasso, xvar = 'dev', label = T)

Notice that the coefficient values for the first 15 predictors are consistently separated from the coefficients for the other included predictors. The first two plots tell us that our model is doing a good job of selecting relevant features, while the last one shows that as \(\beta \rightarrow 1\) for the first 15 variables, we explain more of the variation in y.

Now let’s try this again with some coefficients \(\neq1\).

# nonzero coefficients for first 5 variables
fake_coef <- rgamma(5, 2.25, .67)

# DGP
y <- x[ ,1:5] %*% fake_coef + rnorm(n) # first 5 predictors are 'true'

# training and test sets
y_train <- y[train_rows]
y_test <- y[-train_rows]

# cross validation for alpha from 0 to 1
fits <- mclapply(0:10, function(a) cv.glmnet(x_train, y_train, type.measure = 'mse',
                                             alpha = a / 10, family = 'gaussian'))

# inspect the minimum MSE for each value of alpha
lapply(fits, function(x) min(x$cvm))
## [[1]]
## [1] 156
## 
## [[2]]
## [1] 4.2
## 
## [[3]]
## [1] 1.7
## 
## [[4]]
## [1] 1.4
## 
## [[5]]
## [1] 1.3
## 
## [[6]]
## [1] 1.3
## 
## [[7]]
## [1] 1.3
## 
## [[8]]
## [1] 1.2
## 
## [[9]]
## [1] 1.2
## 
## [[10]]
## [1] 1.2
## 
## [[11]]
## [1] 1.2
# select the alpha value with the lowest minimum cross-validated MSE
alpha_best <- (which.min(sapply(fits, function(x) min(x$cvm))) - 1) / 10

# fit model using best alpha value
fit_best <- glmnet(x_train, y_train, family = 'gaussian', alpha = alpha_best)

# coefficients for comparison to plots
fake_coef
## [1] 12.37  3.27  1.74  1.05  0.63
# coefficients vs. L1 norm of coefficients
plot(fit_best, xvar = 'norm', label = T)

# coefficients vs. log lambda values
plot(fit_best, xvar = 'lambda', label = T)

# coefficients vs. % deviance explained
plot(fit_best, xvar = 'dev', label = T)
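As an optional numeric check (our addition, not part of the original lab), we can compare the cross-validated coefficient estimates for the first five predictors directly to the fake_coef values printed above:

# our addition: estimated coefficients for V1-V5 from the fit with the lowest
# minimum cross-validated MSE; rows 2-6 skip the intercept row
best_idx <- which.min(sapply(fits, function(x) min(x$cvm)))
coef(fits[[best_idx]], s = 'lambda.min')[2:6, ]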

Advanced Features

We don’t always start from a place of complete ignorance when running elastic nets. Sometimes we have significant prior knowledge that certain predictors are important in explaining variation in our outcome. We can incorporate this information via the penalty.factor argument. Generate a fake dataset similar to the one above, but this time fit a model where we are very confident that the first predictor is important and somewhat confident that the next two predictors are important.

set.seed(81457)
n <- 25  # observations
p <- 50  # predictors
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:6], 1, sum) + rnorm(n)

train_rows <- sample(1:n, (2/3) * n)
x_train <- x[train_rows, ]
x_test <- x[-train_rows, ]
y_train <- y[train_rows]
y_test <- y[-train_rows]

# lasso with no prior info on feature importance
lasso_ign <- glmnet(x = x_train, y = y_train, alpha = 1)

# lasso that has to include first feature, and only 1/2 penalty on next two
lasso_prior <- glmnet(x = x_train, y = y_train, alpha = 1,
                      penalty.factor = c(0, .5, .5, rep(1, ncol(x_train) - 3)))
par(mfrow = c(2, 2))
plot(lasso_ign, label = T, main = 'Ignorance')
plot(lasso_prior, label = T, main = 'Prior Knowledge')
plot(lasso_ign, xvar = 'dev', label = T, main = 'Ignorance')
plot(lasso_prior, xvar = 'dev', label = T, main = 'Prior Knowledge')

Our less penalized predictors have the three largest coefficients in the prior knowledge model. In the ignorance model, the same predictors have much lower coefficient estimates and explain much less of the variation in the dependent variable.
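To make that comparison concrete, here is an added check (a sketch; the \(\lambda\) value of 0.5 is an arbitrary choice for illustration):

# our addition: estimated coefficients for the first three predictors at a
# common, arbitrarily chosen lambda value in both models
coef(lasso_ign, s = 0.5)[2:4, ]    # all predictors penalized equally
coef(lasso_prior, s = 0.5)[2:4, ]  # penalty factors of 0, .5, and .5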

Complex Data Generating Processes

The lasso also works with more complicated data generating processes. If we assign randomly drawn coefficient values to the last five predictors in our data and allow them to influence our outcome, the lasso is able to pick up on their importance.

set.seed(8125)
n <- 1000 # observations
p <- 5000 # predictors
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:15], 1, sum) + x[, 4996:5000] %*% c(rnorm(3), 2.5, -3) + rnorm(n) # first 15 and last 5 predictors are 'true'
train_rows <- sample(1:n, (2/3) * n)
x_train <- x[train_rows, ]
x_test <- x[-train_rows, ]
y_train <- y[train_rows]
y_test <- y[-train_rows]

fit_lasso <- glmnet(x_train, y_train, family = 'gaussian', alpha=1) # LASSO
plot(fit_lasso, xvar = 'norm', label = T)

The model correctly estimates values of approximately 1 for \(x_1,\ldots,x_{15}\), approximately 2.5 for \(x_{4999}\), and approximately \(-3\) for \(x_{5000}\).
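If we want to check those values numerically rather than reading them off the plot, one option (our addition; it uses cross-validation to choose \(\lambda\)) is:

# our addition: pick lambda by cross-validation, then inspect the estimated
# coefficients for the first 15 and last 5 predictors (+1 offsets skip the intercept row)
cv_lasso <- cv.glmnet(x_train, y_train, family = 'gaussian', alpha = 1)
coef(cv_lasso, s = 'lambda.min')[c(2:16, 4997:5001), ]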

Individual Exercise

Use fl2003.RData, which is a cleaned up version of the data from Fearon and Laitin (2003). Use cv.glmnet() to fit elastic net models where onset is explained by all variables for a variety of \(\alpha\) values, using a loss function that is appropriate for the binomial nature of the data. Present plots of the model’s predictive accuracy for different \(\alpha\) values. Fit a model with glmnet() using the \(\alpha\) value you found that minimizes predictive error. Report coefficient estimates for all variables, and plot the changes in coefficient values vs. the L1 norm, log-lambda value, and deviance explained.

Randomly sample five cases where onset = 0 and five where onset = 1. Fit an elastic net with the optimal \(\alpha\) value you found for the whole dataset. Are the most important coefficients the same?

References

Fearon, James D., and David D. Laitin. 2003. “Ethnicity, Insurgency, and Civil War.” The American Political Science Review 97 (1): 75–90.


1. This lab is adapted from a tutorial by Josh Day.