2018 New York R Conference

The 2018 New York R Conference was the biggest and best yet, both in terms of crowd size and content. The speakers included some of the R community’s best, such as Hadley Wickham, David Robinson, Jennifer Hill, Max Kuhn, Andreas Mueller (ok, a little Python), Evelina Gabasova, Sean Taylor and Jeff Ryan. I am proud to say we were almost at gender parity for both attendees and speakers, which is amazing for a tech conference. Brooke Watson even excitedly noted that we had a line for the women’s room.

Particularly gratifying for me was seeing so many of my students speak. Eurry Kim, Dan Chen and Alex Boghosian all gave excellent talks.

Some highlights that stuck out to me are:

Emily Robinson Shows There is More to the Tidyverse than Hadley

The Expanded Tidyverse

Emily Robinson, otherwise known as ERob, gave an excellent talk showing how the Tidyverse is so much more than just Hadley and that there are many people inspired by him to contribute in the Tidy way.

Sean Taylor Forecasted the Future with Prophet

Sean Taylor

Sean Taylor, former New Yorker and unrepentant Eagles fan, demonstrated his powerful R and Python package, Prophet, for forecasting time series data. Facebook open sourced his work so we could all benefit.

OG Data Mafia Founder Drew Conway Popped In

Giving away a data mafia shirt

A lucky fan got an autographed NYC Data Mafia t-shirt from Drew Conway.

David Smith Playing Minecraft Through R

Minecraft in R

David Smith played Minecraft through R, including building objects and moving through the world.

Evelina Gabasova Used Social Network Analysis to Break Down Star Wars

It's a Trap

Evelina Gabasova wowed the audience with her fun talk and detailed analysis of character interaction in Star Wars.

Dusty Turner Represented West Point

Dusty Talking Army Sports

Dusty Turner taught us how the United States Military Academy uses R for both student instruction and evaluation.

Hadley Wickham Delved into the Nitty Gritty of R

Hadley shows off how objects are stored in memory

Hadley Wickham showed us how to get into the internals of R and figure out how to examine objects from a memory perspective.

Jennifer Hill Demonstrated Awesome Machine Learning Techniques for Causal Inference

Jennifer Hill Explaining Causal Inference

Following her sold-out meetup appearance in March, Jennifer continued to push the boundaries of causal inference.

I Made the Authors of Caret and scikit-learn Show That R and Python Can Get Along

Caret and Scikit-learn in one place

While both Andreas and Max gave great individual talks, I made them pose for this peace-making photo.

David Robinson Got the Upper Hand in a Sibling Twitter Duel

DRob Teaching

Given only about 30 minutes notice, David put together an entire slideshow on how to livetweet and how to compete with your sibling.

In the End Emily Robinson Beat Her Brother For Best Tweeting

Emily won the prize for best tweeting

Despite David’s head start, Emily was the best tweeter (as calculated by Max Kuhn and Mara Averick), so she won the WASD Code mechanical keyboard with Cherry MX Clear switches.

Silent Auction of Data Paintings

The Robinson Family bought the Pizza Data painting for me

Thomas Levine made paintings of famous datasets that we auctioned off with the proceeds supporting the R Foundation and the Free Software Foundation. The Robinson family very graciously chipped in and bought the painting of the Pizza Poll data for me! I’m still floored by this and in love with the painting.

Ice Cream Sandwiches

Ice Cream Sandwiches

In addition to bagel and egg sandwiches from Murray’s Bagels, Israeli food from Hummus and Pita Company, avocado toast and coffee from Bluestone Lane Coffee and pizza from Fiore’s, we also had ice cream sandwiches from World’s Best Cookie Dough.

All the Material

To catch up on all the presentations, check out Mara Averick’s excellent notes:

Or check out all of Brooke’s drawings, collated by Dan Chen.

Videos and Upcoming Events

The videos will be posted at rstats.nyc in a few weeks for all to enjoy.

There are a number of other events coming up including:

We are already beginning plans for next year’s conference and are working on bringing it to DC as well! Stay tuned for all that and more.

Dan loves his mug

In my last post I discussed using coefplot on glmnet models and, in particular, a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)

| Package  | Version |
|----------|---------|
| caret    | 6.0.78  |
| coefplot | 1.2.6   |
| dplyr    | 0.7.4   |
| DT       | 0.2     |
| dygraphs | 1.1.1.4 |
| knitr    | 1.18    |
| magrittr | 1.5     |
| purrr    | 0.2.4   |
| tibble   | 1.4.2   |
| useful   | 1.2.3   |
| xgboost  | 0.6.4   |

Then, we read the data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data, which is helpful when fitting xgboost models.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('http://www.jaredlander.com/data/manhattan_Validate.rds'))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the table to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE
              ))

While glmnet automatically standardizes the input data, xgboost does not, so we handle that manually. We use preProcess from caret to compute the mean and standard deviation of each numeric column, then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))
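
The preProc object only stores those statistics; they get applied with predict. Here is a minimal sketch, just for illustration (manTrainScaled and manValScaled are hypothetical names; below we simply call predict inline inside build.x):

# apply the stored centering and scaling to the numeric columns
manTrainScaled <- predict(preProc, manTrain)
manValScaled <- predict(preProc, manVal)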

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicollinearity with xgboost, we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark
manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY_val <- useful::build.y(valueFormula, data=manVal)
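
As a quick, purely illustrative sanity check, we can confirm the shapes line up before fitting.

# rows and columns of the sparse training predictor matrix
dim(manX)
# the number of responses should match the number of rows
length(manY)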

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    data=manXG,
    # use a linear model
    booster='gblinear',
    # minimize a regression criterion
    objective='reg:linear',
    # use MAE as a measure of quality
    eval_metric=c('mae'),
    # boost for up to 500 rounds
    nrounds=500,
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    print_every_n=10,
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    early_stopping_rounds=25,
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    alpha=900000000,
    # randomly sample rows
    subsample=0.8,
    # randomly sample columns
    colsample_bytree=0.7,
    # set the learning rate for gradient descent
    eta=0.1
)
## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## 
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)
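
The fitted booster also records where early stopping kicked in; a small sketch of pulling those numbers out directly (best_iteration and best_score are set by xgb.train when early_stopping_rounds is used):

# the round with the best validation MAE and its value
mod1$best_iteration
mod1$best_score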

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify them with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty, so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net, but after having lunch with one of its creators, Trevor Hastie, I learned it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

The versions can then be displayed in a table.

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)

| Package  | Version |
|----------|---------|
| coefplot | 1.2.5.1 |
| dplyr    | 0.7.4   |
| DT       | 0.2     |
| glmnet   | 2.0.13  |
| knitr    | 1.18    |
| magrittr | 1.5     |
| purrr    | 0.2.4   |
| tibble   | 1.4.1   |
| useful   | 1.2.3   |

First, we read some data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))

The data are about New York City land value and have many columns. A sample of the data follows.

datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE,
                  scrollY=300
              ))

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicollinearity with glmnet, we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark - 1

Notice the - 1 means do not include an intercept since glmnet will do that for us.

manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')
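
The value 330500 is just one of the lambdas along the path that glmnet fit; a quick sketch of peeking at that sequence to choose a value (nothing special about this particular choice):

# the sequence of lambda values glmnet fit, from largest to smallest
head(mod1$lambda)
length(mod1$lambda)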

A common plot that is built into the glmnet package is the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. The greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.

coefpath(mod1)

While still busy, this plot provides much more functionality. We can hover over lines, zoom in and pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)
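
The cross-validated object stores the two lambdas used below; a quick sketch of inspecting them (lambda.min minimizes the cross-validated error, while lambda.1se is the largest lambda within one standard error of that minimum):

# the two optimal lambdas chosen by cross-validation
mod2$lambda.min
mod2$lambda.1se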

We plot coefficient plots for both optimal lambdas.

# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')

The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.

coefpath(mod2)

While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda changes.

For a while I’ve been trying to find the best slideshow format to create using RMarkdown. My go-to format is ioslides because it has a presenter mode that works in self-contained mode, meaning the entire presentation is in one file. The drawback is that fullscreen images are not a built-in feature.

Another popular format is reveal.js. This allows for fullscreen images, but presenter mode is not possible while keeping the file self-contained.

A very promising format is from the xaringan package which is based on remarkjs. This allows for beautiful, fullscreen images and it has an excellent presenter mode. This can all be done with a self-contained file. The problem is, due to the nature of the format, having all the images URI encoded for self-contained mode causes the file to open very slowly. I gave a talk about web scraping at the MIT Sports Analytics Conference which takes over five minutes to open.

Given all this, I decided to hack my way to a solution using ioslides. After some prep it works quite well.

This is a long post, so if you want it all in one place, skip ahead.

The Problem

RMarkdown slides are initiated using two hash marks (##). Metadata can be encoded in curly braces ({}) after the hash marks and any heading text.

## Heading Text {.SlideClass #SlideID0 attr=value0}

Slide Content

The thing is, SlideClass, SlideID and the attr value are applied to an <article> tag inside a <slide>, not to the <slide>.

<slide class=''>
    <hgroup>
        <h2>Heading Text</h2>
    </hgroup>

    <article  attr="value0" class="SlideClass" id="SlideID0">
        <p>Slide Content</p>
    </article>
</slide>

This precludes us from setting style="background-image: url(filename.jpg)" in the header metadata. Further, doing so would still leave just a link to the file and not URI encode it.

The Solution

The solution is to specify the image in CSS and associate it with a specific ID. However, at this point, it will only be associated with the <article>.

#SlideID1 {
    background-image: url(Cowboy-Lasso.jpg);
}

To keep things looking nice, in a separate CSS block we specify some background image properties that apply to all slide background images.

slide {
    background-position: center;
    background-repeat: no-repeat;
    background-size: contain;
}

## Background One {.SlideClass #SlideID1}

Some Text Here

With the CSS above, the background image shows up only behind the <article>, not the full slide.

The trick is to associate the image, in CSS, with the ID of the <slide>. Even though we assign an ID to the <article>, this does not help us know the ID of the <slide>. To get around this we use a bit of jQuery. We write some code that takes the name attribute from every <article> tag and assigns it as the ID of the corresponding <slide> tag.

$(document).ready(function(){
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID
        this.parentNode.id=$(this).attr('name');
    });
});

In order for this to work, we also need to load jQuery. The quickest way I have found to do this is to use htmltools to attach the jQuery dependency that ships with rmarkdown.

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
# attach the dependency to an empty span
htmltools::attachDependencies(htmltools::tags$span(''), dep)
```

The order matters, and jQuery must be loaded first. So rather than deal with this code every time we write a presentation, let’s be good coders and save it to an Rmd file called ‘LoadJQuery.Rmd’.

---
title: "Load JQuery"
author: "Jared P. Lander"
date: "July 1, 2017"
output: ioslides_presentation
---

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
htmltools::attachDependencies(htmltools::tags$span(''), dep)
```

```{js move-id-background-color}
$(document).ready(function(){
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID
        this.parentNode.id=$(this).attr('name');
    });
});
```

Then in the presentation Rmd file, toward the top, this file is loaded as a chunk child.

```{r load-jquery,child='LoadJQuery.Rmd'}
```

Now, any time a name attribute is included in heading metadata, that name will become the id for the enclosing <slide>.

## Background Two {.SlideClass #SlideID2 name=Slide2}

Some more text here

becomes

<slide class="current" data-slide-num="4" data-total-slides="4" id="Slide2">
    <hgroup>
        <h2>Background Two</h2>
    </hgroup>
    
    <article id="SlideID2" class="SlideClass">
        <p>Some more text here</p>
    </article>
</slide>

Then there is the matter of writing the CSS so that the image file is URI encoded. This can become laborious so we write a function to do it.

```{r background-function,include=FALSE}
makeBG <- function(id, file)
{
    cat(
        sprintf('<style type="text/css">\n#%s {\nbackground-image: url(%s);\n}\n</style>',
                id, knitr::image_uri(file))
    )
}
```

Anywhere in the document, though most logically in the intended slide, that function is called, specifying the slide ID and the image.

## Background Three {.SlideClass #SlideID3 name=Slide3}

Full screen background

```{r results='asis',echo=FALSE}
makeBG(id='Slide3', 'Cowboy-Lasso.jpg')
```

This results in the following slide.