I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, `glmnet`

. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.

`coefplot`

has long supported `glmnet`

via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the `htmlwidgets`

in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

```
# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)
# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
```

The versions can then be displayed in a table.

```
versions <- c(packages, packagesColon) %>%
purrr::map(packageVersion) %>%
purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>%
dplyr::arrange(Package)
knitr::kable(packageDF)
```

Package | Version |
---|---|

coefplot | 1.2.5.1 |

dplyr | 0.7.4 |

DT | 0.2 |

glmnet | 2.0.13 |

knitr | 1.18 |

magrittr | 1.5 |

purrr | 0.2.4 |

tibble | 1.4.1 |

useful | 1.2.3 |

First, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

`manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))`

The data are about New York City land value and have many columns. A sample of the data follows.

```
datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
rownames=FALSE,
extensions=c('FixedHeader', 'Scroller'),
options=list(
scroller=TRUE,
scrollY=300
))
```

In order to use `glmnet`

we need to convert our `tbl`

into an X (predictor) `matrix`

and a Y (response) `vector`

. Since we don’t have to worry about multicolinearity with `glmnet`

we do not want to drop the baselines of `factor`

s. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the `matrix`

ad `vector`

we need a `formula`

. This could be built programmatically, but we can just build it ourselves. The response is `TotalValue`

.

```
valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal +
LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built +
Landmark - 1
```

Notice the `- 1`

means do not include an intercept since `glmnet`

will do that for us.

```
manX <- useful::build.x(valueFormula, data=manTrain,
# do not drop the baselines of factors
contrasts=FALSE,
# use a sparse matrix
sparse=TRUE)
manY <- useful::build.y(valueFormula, data=manTrain)
```

We are now ready to fit a model.

`mod1 <- glmnet(x=manX, y=manY, family='gaussian')`

We can view a coefficient plot for a given value of `lambda`

like this.

`coefplot(mod1, lambda=330500, sort='magnitude')`

A common plot that is built into the `glmnet`

package it the coefficient path.

`plot(mod1, xvar='lambda', label=TRUE)`

This plot shows the path the coefficients take as `lambda`

increases. They greater `lambda`

is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, `coefplot`

has a new function in Version 1.2.5 called `coefpath`

for making this into an interactive plot using dygraphs.

`coefpath(mod1)`

While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.

These functions also work with any value for `alpha`

and for cross-validated models fit with `cv.glmnet`

.

`mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)`

We plot coefficient plots for both optimal `lambda`

s.

```
# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')
```

```
# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')
```

The coefficient path is the same as before though the optimal `lambda`

s are noted as dashed vertical lines.

`coefpath(mod2)`

While `coefplot`

has long been able to plot coefficients from `glmnet`

models, the new `coefpath`

function goes a long way in helping visualize the paths the coefficients take as `lambda`

changes.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York R Conference and author of R for Everyone.