I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation,
glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.
As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.
coefplot has long supported
glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the
htmlwidgets in iframes.
First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.
# list the packages that we load # alphabetically for reproducibility packages <- c('coefplot', 'DT', 'glmnet') # call library on each package purrr::walk(packages, library, character.only=TRUE) # some packages we will reference without actually loading # they are listed here for complete documentation packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
The versions can then be displayed in a table.
versions <- c(packages, packagesColon) %>% purrr::map(packageVersion) %>% purrr::map_chr(as.character) packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% dplyr::arrange(Package) knitr::kable(packageDF)
First, we read some data. The data are available at https://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.
manTrain <- readRDS(url('https://www.jaredlander.com/data/manhattan_Train.rds'))
The data are about New York City land value and have many columns. A sample of the data follows.
datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled', rownames=FALSE, extensions=c('FixedHeader', 'Scroller'), options=list( scroller=TRUE, scrollY=300 ))
In order to use
glmnet we need to convert our
tbl into an X (predictor)
matrix and a Y (response)
vector. Since we don’t have to worry about multicolinearity with
glmnet we do not want to drop the baselines of
factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.
In order to build the
vector we need a
formula. This could be built programmatically, but we can just build it ourselves. The response is
valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 + Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea + OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + Landmark - 1
- 1 means do not include an intercept since
glmnet will do that for us.
manX <- useful::build.x(valueFormula, data=manTrain, # do not drop the baselines of factors contrasts=FALSE, # use a sparse matrix sparse=TRUE) manY <- useful::build.y(valueFormula, data=manTrain)
We are now ready to fit a model.
mod1 <- glmnet(x=manX, y=manY, family='gaussian')
We can view a coefficient plot for a given value of
lambda like this.
coefplot(mod1, lambda=330500, sort='magnitude')
A common plot that is built into the
glmnet package it the coefficient path.
plot(mod1, xvar='lambda', label=TRUE)
This plot shows the path the coefficients take as
lambda increases. They greater
lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.
coefplot has a new function in Version 1.2.5 called
coefpath for making this into an interactive plot using dygraphs.
While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.
These functions also work with any value for
alpha and for cross-validated models fit with
mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)
We plot coefficient plots for both optimal
# coefplot for the 1se error lambda coefplot(mod2, lambda='lambda.1se', sort='magnitude')
# coefplot for the min error lambda coefplot(mod2, lambda='lambda.min', sort='magnitude')
The coefficient path is the same as before though the optimal
lambdas are noted as dashed vertical lines.
coefplot has long been able to plot coefficients from
glmnet models, the new
coefpath function goes a long way in helping visualize the paths the coefficients take as
Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.
Hi Jared — I’m the guy who was in the front row at both your workshops at ODSC London last week. I’ve just bookmarked this to my employer’s corporate social media thang so my colleagues will understand what I’m talking about (and to help me find my way back to it, until I’ve bought your book). This will definitely be a big help in my work (TL;DR: emulators for complex physcis codes). Hope all continues well for you.
I’m glad it has been so helpful! If you want any more help or onsite training we’re happy to help.
You may wish to check out my Shiny app, > Honey, I Shrunk the Coefficients! <. The app fits a 10-fold CV ElasticNet model and plots RSME vs Alpha, coefficient path and variable importance & magnitude. It does this interactively so you can vary Alpha and Lambda with the sliders, and view variables and value via Plotly reactive elements. Finally, the app allows, indeed requires that you upload your own .csv dataset. A brief YouTube walk-through of the app is here:
The app is on Shiny.io at:
Just looked quickly and that’s pretty cool. What did you build it for?