I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

The versions can then be displayed in a table.

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
Package Version
dplyr 0.7.4
DT 0.2
glmnet 2.0.13
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.1
useful 1.2.3

First, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))

The data are about New York City land value and have many columns. A sample of the data follows.

datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
              extensions=c('FixedHeader', 'Scroller'),

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix ad vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark - 1

Notice the - 1 means do not include an intercept since glmnet will do that for us.

manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        # use a sparse matrix

manY <- useful::build.y(valueFormula, data=manTrain)

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')

A common plot that is built into the glmnet package it the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. They greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.


While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)

We plot coefficient plots for both optimal lambdas.

# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')

The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.


While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda changes.

The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking and a few conferences and meetups. In all, we visited, London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin Destination Airline
JFK LGW Norwegian Air
LHR SIN Singapore Airlines
SIN HKG Singapore Airlines
HKG AKL Cathay Pacific
AKL ZQN Air New Zealand
ZQN AKL Air New Zealand
AKL PPT Air New Zealand
PPT BOB Air Tahiti
BOB MOZ Air Tahiti
MOZ PPT Terevau
PPT LAX Air Tahiti Nui
LAX SJC Alaska Airlines

Converting these to latitude and longitude is easy thanks to Open Flights.

# read in the data
airports <- readr::read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports-extended.dat',
                            # give it good column names since the data are headerless
                            col_names=c('ID', 'Name', 'City', 'Country', 
                                        'IATA', 'ICAO', 'Latitude', 'Longitude', 
                                        'Altitude', 'Timezone', 'DST', 'Tz', 
                                        'Type', 'Source'))

We then use filter to get just the ports we visited. Notice how we use a second tbl inside filter.

visited <- airports %>% 
    select(Name, City, Country, IATA, Latitude, Longitude) %>% 
    filter(IATA %in% (codes %>% select(Origin, Destination) %>% unlist))
DT::datatable(visited, elementId='AirportsTable',
              extensions=c('FixedHeader', 'Scroller'),
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

visitedOrdered <- visited %>% 
    slice(c(12, 1, 2, 8, 7, 5, 6, 5, 13, 3, 4, 13, 10, 11, 12))

For the first visualization let’s create a map using leaflet.

# initialize the widget
leaflet(data=visitedOrdered) %>% 
    # overlay map tiles
    addTiles() %>% 
    # plot lines from one point to another
    addPolylines(lng=~Longitude, lat=~Latitude) %>% 
    # add markers with city names
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~City)

Unfortunately, this doesn’t quite capture the directions of the flights as it makes it look like we flew back west to get to Papeete. So let’s try a globe instead using threejs.

We augment the edgelist of airports so that it has the latitude and longitude of the origin and destination airports for each flight.

flightPaths <- codes %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Origin'='IATA')) %>% 
    rename(oLong=Longitude, oLat=Latitude) %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Destination'='IATA')) %>% 
    rename(dLong=Longitude, dLat=Latitude)
DT::datatable(visited, elementId='FlightPathLatLong',
              extensions=c('FixedHeader', 'Scroller'),
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

    # the image to overlay on the globe
    # lat/long of visited airports
    lat=visited$Latitude, long=visited$Longitude,
    # lat/long of origin and destination
    arcs=flightPaths %>% select(oLat, oLong, dLat, dLong),
    # cosmetic adjustments
    arcsHeight=.4, arcsLwd=7, arcsColor="red", arcsOpacity=.95,
    atmosphere=FALSE, fov=30, rotationlat=0.3, rotationlong=.8*pi)

We now calculate the total distance traveled (not including car trips) using Haversine Distance to account for the curvature of the Earth.

distHaversine(visitedOrdered %>% select(Longitude, Latitude), r=3959) %>% sum
## [1] 28660.52

So we traveled 3,760 more miles than the circumference of the Earth.

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

Posted in R.

For a while I’ve been trying to find the best slideshow format to create using RMarkdown. My go-to format is ioslides because it has a presenter mode that works in self-contained mode, meaning the entire presentation is in one file. The drawback is that fullscreen images is not a built in feature.

Another popular format is reveal.js. This allows for fullscreen images, but presenter mode is not possible while keeping the file self-contained.

A very promising format is from the xaringan package which is based on remarkjs. This allows for beautiful, fullscreen images and it has an excellent presenter mode. This can all be done with a self-contained file. The problem is, due to the nature of the format, having all the images URI encoded for self-contained mode causes the file to open very slowly. I gave a talk about web scraping at the MIT Sports Analytics Conference which takes over five minutes to open.

Given all this, I decided to hack my way to a solution using ioslides. After some prep it works quite well.

This is a long post, so if you want it all in one place, skip ahead.

The Problem

RMarkdown slides are initiated using two hash tags (##). Metadata can be encoded in curly braces ({}) after the hash tags and any heading text.

## Heading Text {.SlideClass #SlideID0 attr=value0}

Slide Content

The thing is, SlideClass, SlideID and the attr value are applied to an <article> tag inside a <slide>, not to the <slide>.

<slide class=''>
        <h2>Heading Text</h2>

    <article  attr="value0" class="SlideClass" id="SlideID0">
        <p>Slide Content</p>

This precludes us from setting style="background-url(filename.jpg)" in the header metadata. Further, doing so would still leave just a link to the file and now URI encode it.

The Solution

The solution is to specify the image in CSS and associate it with a specific ID. However, at this point, it will only be associated with the <article>.

#SlideID1 {
    background-image: url(Cowboy-Lasso.jpg);

To keep things looking nice, in a separate CSS block we specify some background image properties that apply to all slide background images.

slide {
    background-position: center;
    background-repeat: no-repeat;
    background-size: contain;
## Background One {.SlideClass #SlideID1}

Some Text Here
knitr::include_graphics(file.path(imageDir, 'Background-Article.png'))

The trick is to associate the image, in CSS, with the ID of the <slide>. Even though we assign an ID to the <article> this does not help us know the ID of the <slide>. To get around this we use a bit of jQuery. We write some code that take a name attribute from every <article> tag and assigns it as an ID for the corresponding <slide> tags.

    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID

In order for this to work, we also need to load jQuery. The quickest way I have found to do this is to use htmltools to

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
# attach the dependency to an empty span
htmltools::attachDependencies(htmltools::tags$span(''), dep)

The order matters, and jQuery must be loaded first. So rather than deal with this code every time we write a presentation let’s be good coders and save this to an Rmd file called ‘LoadJQuery.Rmd’.

title: "Load JQuery"
author: "Jared P. Lander"
date: "July 1, 2017"
output: ioslides_presentation

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
htmltools::attachDependencies(htmltools::tags$span(''), dep)

```{js move-id-background-color}
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID

Then in the presentation Rmd file, toward the top, this file is loaded as a chunk child.

```{r load-jquery,child='LoadJQuery.Rmd'}

Now, any time a name attribute is included in heading metadata, that name will become the id for the enclosing <slide>.

## Background Two {.SlideClass #SlideID2 name=Slide2}

Some more text here


<slide class="current" data-slide-num="4" data-total-slides="4" id="Slide2">
        <h2>Background Two</h2>
    <article id="SlideID2" class="SlideClass">
        <p>Some more text here</p>

Then there is the matter of writing the CSS so that the image file is URI encoded. This can become laborious so we write a function to do it.

```{r background-function,include=FALSE}
makeBG <- function(id, file)
        sprintf('<style type="text/css">\n#%s {\nbackground-image: url(%s);\n}\n</style>',
                id, knitr::image_uri(file))

Anywhere in the document, though most logically in the intended slide, that function is called, specifying the slide ID and the image.

## Background Three {.SlideClass #SlideID3}

Full screen background

```{r results='asis',echo=FALSE}
makeBG(id='Slide3', 'Cowboy-Lasso.jpg')

This results in the following slide.