I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation,
glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.
As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learned it is pronounced glimnet.
coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.
First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.
# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)
# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
The versions can then be displayed in a table.
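# look up the installed version of every package, loaded or not
versions <- c(packages, packagesColon) %>%
    purrr::map(packageVersion) %>%
    purrr::map_chr(as.character)
# tibble::tibble replaces the deprecated tibble::data_frame
packageDF <- tibble::tibble(Package=c(packages, packagesColon), Version=versions) %>%
    dplyr::arrange(Package)
knitr::kable(packageDF)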
Next, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.
manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))
The data are about New York City land value and have many columns. A sample of the data follows.
datatable(manTrain %>% dplyr::sample_n(size=100),
          elementId='DataSampled',
          rownames=FALSE,
          extensions=c('FixedHeader', 'Scroller'),
          options=list(
              scroller=TRUE,
              scrollY=300
          ))
In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicollinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.
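To make "not dropping the baselines" concrete, here is a minimal base R sketch on a made-up data.frame. The names toy, grp and type are purely illustrative, and this uses model.matrix directly rather than the helper used in this post, so treat it as an approximation of the idea rather than what happens under the hood.

```r
# toy data with two factors; all names here are hypothetical
toy <- data.frame(
    y = c(1, 2, 3, 4, 5, 6),
    grp = factor(c('a', 'b', 'c', 'a', 'b', 'c')),
    type = factor(c('u', 'v', 'u', 'v', 'u', 'v'))
)

# default treatment contrasts drop one baseline level per factor
dropped <- model.matrix(y ~ grp + type, data = toy)

# identity contrasts keep an indicator column for every level
fullContrasts <- lapply(toy[, c('grp', 'type')], contrasts, contrasts = FALSE)
full <- model.matrix(y ~ grp + type, data = toy, contrasts.arg = fullContrasts)

colnames(dropped)
colnames(full)
```

The second matrix has a column for every level of both factors, which would be a problem for ordinary least squares but is fine for a penalized fit.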
In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.
valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea +
    ResArea + OfficeArea + RetailArea + NumBldgs + NumFloors +
    UnitsRes + UnitsTotal + LotDepth + LotFront + BldgFront +
    LotType + HistoricDistrict + Built + Landmark - 1
The - 1 means do not include an intercept since glmnet will do that for us.
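For anyone who would rather build the formula programmatically, one option is reformulate from base R. This is just a sketch with a shortened, illustrative list of predictors; the object names responseName, predictorNames and sketchFormula are made up for this example.

```r
# name of the response and a few of the predictors (shortened for illustration)
responseName <- 'TotalValue'
predictorNames <- c('FireService', 'ZoneDist1', 'LotArea', 'BldgArea')

# intercept=FALSE plays the same role as writing - 1 by hand
sketchFormula <- reformulate(termlabels = predictorNames,
                             response = responseName,
                             intercept = FALSE)
sketchFormula
```

This is handy when the predictors come from something like the column names of the data, though typing the formula out keeps it explicit about what goes into the model.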
manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)
manY <- useful::build.y(valueFormula, data=manTrain)
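To get a feel for why the sparse representation helps, here is a small sketch on simulated data that compares a dense indicator matrix with its sparse counterpart from the Matrix package. The data and the names toy, Zone, denseX and sparseX are invented for this comparison, and the exact sizes will vary by machine and R version.

```r
library(Matrix)

set.seed(42)
n <- 2000
# one factor with 50 levels, so each row has a single non-zero indicator
toy <- data.frame(
    Zone = factor(sample(paste0('Z', 1:50), size = n, replace = TRUE))
)

# dense matrix stores every zero explicitly
denseX <- model.matrix(~ Zone - 1, data = toy)
# sparse matrix stores only the non-zero entries
sparseX <- Matrix::sparse.model.matrix(~ Zone - 1, data = toy)

denseBytes <- as.numeric(object.size(denseX))
sparseBytes <- as.numeric(object.size(sparseX))
c(dense = denseBytes, sparse = sparseBytes)
```

The more levels the factors have, the more zeros there are and the bigger the savings.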
We are now ready to fit a model.
mod1 <- glmnet(x=manX, y=manY, family='gaussian')
We can view a coefficient plot for a given value of lambda like this.
coefplot(mod1, lambda=330500, sort='magnitude')
A common plot that is built into the glmnet package is the coefficient path.
plot(mod1, xvar='lambda', label=TRUE)
This plot shows the path the coefficients take as lambda increases. The greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.
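The shrinkage itself can be sketched in a few lines of base R. The function below is the soft-thresholding update for a single standardized, uncorrelated predictor under the Elastic Net penalty. It is a simplification meant to show the shape of a path, not glmnet's full algorithm, and softThreshold, betas and lambdas are names and values I made up for this illustration.

```r
# shrink a coefficient toward zero for a given lambda and alpha
# alpha=1 is the lasso, alpha=0 is ridge
softThreshold <- function(beta, lambda, alpha = 1) {
    sign(beta) * pmax(abs(beta) - lambda * alpha, 0) / (1 + lambda * (1 - alpha))
}

# hypothetical unpenalized coefficients
betas <- c(big = 3, medium = -1.5, small = 0.4)
# increasing penalties
lambdas <- c(0, 0.5, 1, 2, 4)

# one column per lambda, one row per coefficient
path <- sapply(lambdas, function(l) softThreshold(betas, l))
colnames(path) <- paste0('lambda=', lambdas)
path
```

Reading across a row shows a coefficient shrinking and eventually hitting exactly zero, which is what each line in the path plot traces out.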
coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.
While still busy, this function provides so much more functionality. We can hover over lines, zoom in, then pan around.
These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.
mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)
We plot coefficient plots for both optimal lambdas.
# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')
The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.
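As a reminder of where the two optimal lambdas come from, lambda.min is the value with the lowest cross-validated error, and lambda.1se is the largest lambda whose error is within one standard error of that minimum. Here is a sketch of that rule in base R using made-up error values; the vectors lambda, cvm and cvsd are hypothetical stand-ins for what a cross-validated fit would hold.

```r
# hypothetical cross-validation results, ordered from largest lambda to smallest
lambda <- c(100, 50, 25, 10, 5, 1)
cvm <- c(9.0, 6.0, 4.6, 4.2, 4.0, 4.1)   # mean cross-validated error
cvsd <- c(0.5, 0.4, 0.4, 0.3, 0.3, 0.3)  # standard error of that mean

# lambda with the minimum error
idxMin <- which.min(cvm)
lambdaMin <- lambda[idxMin]

# largest lambda whose error is within one standard error of the minimum
threshold <- cvm[idxMin] + cvsd[idxMin]
lambda1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambdaMin, lambda.1se = lambda1se)
```

Since lambda.1se is never smaller than lambda.min, it gives a more heavily shrunk, more parsimonious model.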
While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda increases.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York R Conference, and author of R for Everyone.