In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)

Package	Version
caret	6.0.78
coefplot	1.2.6
dplyr	0.7.4
DT	0.2
dygraphs	1.1.1.4
knitr	1.18
magrittr	1.5
purrr	0.2.4
tibble	1.4.2
useful	1.2.3
xgboost	0.6.4

Then, we read the data. The data are available at https://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data which is helpful when fitting xgboost mdoels.

manTrain <- readRDS(url('https://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('https://www.jaredlander.com/data/manhattan_Validate.rds'))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE
              ))

<br />

While glmnet automatically standardizes the input data, xgboost does not, so we calculate that manually. We use preprocess from caret to compute the mean and standard deviation of each numeric column then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with xgboost we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark

manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY_val <- useful::build.y(valueFormula, data=manVal)

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    data=manXG,
    # use a linear model
    booster='gblinear',
    # minimize the a regression criterion 
    objective='reg:linear',
    # use MAE as a measure of quality
    eval_metric=c('mae'),
    # boost for up to 500 rounds
    nrounds=500,
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    print_every_n=10,
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    early_stopping_rounds=25,
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    alpha=900000000,
    # randomly sample rows
    subsample=0.8,
    # randomly sample columns
    col_subsample=0.7,
    # set the learning rate for gradient descent
    eta=0.1
)

## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## 
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)

<br />

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify it with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science and AI firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Government Data Science and AI Conferences and author of R for Everyone.

I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numerous talks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('coefplot', 'DT', 'glmnet')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')

The versions can then be displayed in a table.

versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)

Package	Version
coefplot	1.2.5.1
dplyr	0.7.4
DT	0.2
glmnet	2.0.13
knitr	1.18
magrittr	1.5
purrr	0.2.4
tibble	1.4.1
useful	1.2.3

First, we read some data. The data are available at https://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world.

manTrain <- readRDS(url('https://www.jaredlander.com/data/manhattan_Train.rds'))

The data are about New York City land value and have many columns. A sample of the data follows.

datatable(manTrain %>% dplyr::sample_n(size=100), elementId='DataSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE,
                  scrollY=300
              ))

<br />

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix ad vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark - 1

Notice the - 1 means do not include an intercept since glmnet will do that for us.

manX <- useful::build.x(valueFormula, data=manTrain,
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')

A common plot that is built into the glmnet package it the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. They greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.

coefpath(mod1)

<br />

While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

mod2 <- cv.glmnet(x=manX, y=manY, family='gaussian', alpha=0.7, nfolds=5)

We plot coefficient plots for both optimal lambdas.

# coefplot for the 1se error lambda
coefplot(mod2, lambda='lambda.1se', sort='magnitude')

# coefplot for the min error lambda
coefplot(mod2, lambda='lambda.min', sort='magnitude')

The coefficient path is the same as before though the optimal lambdas are noted as dashed vertical lines.

coefpath(mod2)

<br />

While coefplot has long been able to plot coefficients from glmnet models, the new coefpath function goes a long way in helping visualize the paths the coefficients take as lambda changes.

The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

<br />

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking and a few conferences and meetups. In all, we visited, London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin	Destination	Airline
JFK	LGW	Norwegian Air
LHR	SIN	Singapore Airlines
SIN	HKG	Singapore Airlines
HKG	AKL	Cathay Pacific
AKL	ZQN	Air New Zealand
ZQN	AKL	Air New Zealand
AKL	PPT	Air New Zealand
PPT	BOB	Air Tahiti
BOB	MOZ	Air Tahiti
MOZ	PPT	Terevau
PPT	LAX	Air Tahiti Nui
LAX	SJC	Alaska Airlines
SFO	JFK	JetBlue

Converting these to latitude and longitude is easy thanks to Open Flights.

# read in the data
airports <- readr::read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports-extended.dat',
                            # give it good column names since the data are headerless
                            col_names=c('ID', 'Name', 'City', 'Country', 
                                        'IATA', 'ICAO', 'Latitude', 'Longitude', 
                                        'Altitude', 'Timezone', 'DST', 'Tz', 
                                        'Type', 'Source'))

We then use filter to get just the ports we visited. Notice how we use a second tbl inside filter.

visited <- airports %>% 
    select(Name, City, Country, IATA, Latitude, Longitude) %>% 
    filter(IATA %in% (codes %>% select(Origin, Destination) %>% unlist))

DT::datatable(visited, elementId='AirportsTable',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">'
                  ,
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

<br />

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

visitedOrdered <- visited %>% 
    slice(c(12, 1, 2, 8, 7, 5, 6, 5, 13, 3, 4, 13, 10, 11, 12))

For the first visualization let’s create a map using leaflet.

# initialize the widget
leaflet(data=visitedOrdered) %>% 
    # overlay map tiles
    addTiles() %>% 
    # plot lines from one point to another
    addPolylines(lng=~Longitude, lat=~Latitude) %>% 
    # add markers with city names
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~City)

<br />

Unfortunately, this doesn’t quite capture the directions of the flights as it makes it look like we flew back west to get to Papeete. So let’s try a globe instead using threejs.

We augment the edgelist of airports so that it has the latitude and longitude of the origin and destination airports for each flight.

flightPaths <- codes %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Origin'='IATA')) %>% 
    rename(oLong=Longitude, oLat=Latitude) %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Destination'='IATA')) %>% 
    rename(dLong=Longitude, dLat=Latitude)

DT::datatable(visited, elementId='FlightPathLatLong',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  dom='<"top"f>rt<"bottom"i><"clear">'
                  ,
                  scrollY=200,
                  scroller=TRUE
              )
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

<br />

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

globejs(
    # the image to overlay on the globe
    img="http://eoimages.gsfc.nasa.gov/images/imagerecords/73000/73909/world.topo.bathy.200412.3x5400x2700.jpg",
    # lat/long of visited airports
    lat=visited$Latitude, long=visited$Longitude,
    # lat/long of origin and destination
    arcs=flightPaths %>% select(oLat, oLong, dLat, dLong),
    # cosmetic adjustments
    arcsHeight=.4, arcsLwd=7, arcsColor="red", arcsOpacity=.95,
    atmosphere=FALSE, fov=30, rotationlat=0.3, rotationlong=.8*pi)

<br />

We now calculate the total distance traveled (not including car trips) using Haversine Distance to account for the curvature of the Earth.

distHaversine(visitedOrdered %>% select(Longitude, Latitude), r=3959) %>% sum

## [1] 28660.52

So we traveled 3,760 more miles than the circumference of the Earth.

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

For a while I’ve been trying to find the best slideshow format to create using RMarkdown. My go-to format is ioslides because it has a presenter mode that works in self-contained mode, meaning the entire presentation is in one file. The drawback is that fullscreen images is not a built in feature.

Another popular format is reveal.js. This allows for fullscreen images, but presenter mode is not possible while keeping the file self-contained.

A very promising format is from the xaringan package which is based on remarkjs. This allows for beautiful, fullscreen images and it has an excellent presenter mode. This can all be done with a self-contained file. The problem is, due to the nature of the format, having all the images URI encoded for self-contained mode causes the file to open very slowly. I gave a talk about web scraping at the MIT Sports Analytics Conference which takes over five minutes to open.

Given all this, I decided to hack my way to a solution using ioslides. After some prep it works quite well.

This is a long post, so if you want it all in one place, skip ahead.

The Problem

RMarkdown slides are initiated using two hash tags (##). Metadata can be encoded in curly braces ({}) after the hash tags and any heading text.

## Heading Text {.SlideClass #SlideID0 attr=value0}

Slide Content

The thing is, SlideClass, SlideID and the attr value are applied to an <article> tag inside a <slide>, not to the <slide>.

<slide class=''>
    <hgroup>
        <h2>Heading Text</h2>
    </hgroup>

    <article  attr="value0" class="SlideClass" id="SlideID0">
        <p>Slide Content</p>
    </article>
</slide>

This precludes us from setting style="background-url(filename.jpg)" in the header metadata. Further, doing so would still leave just a link to the file and now URI encode it.

The Solution

The solution is to specify the image in CSS and associate it with a specific ID. However, at this point, it will only be associated with the <article>.

#SlideID1 {
    background-image: url(Cowboy-Lasso.jpg);
}

To keep things looking nice, in a separate CSS block we specify some background image properties that apply to all slide background images.

slide {
    background-position: center;
    background-repeat: no-repeat;
    background-size: contain;
}

## Background One {.SlideClass #SlideID1}

Some Text Here

knitr::include_graphics(file.path(imageDir, 'Background-Article.png'))

The trick is to associate the image, in CSS, with the ID of the <slide>. Even though we assign an ID to the <article> this does not help us know the ID of the <slide>. To get around this we use a bit of jQuery. We write some code that take a name attribute from every <article> tag and assigns it as an ID for the corresponding <slide> tags.

$(document).ready(function(){
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID
        this.parentNode.id=$(this).attr('name');
    });
});

In order for this to work, we also need to load jQuery. The quickest way I have found to do this is to use htmltools to

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
# attach the dependency to an empty span
htmltools::attachDependencies(htmltools::tags$span(''), dep)
```

The order matters, and jQuery must be loaded first. So rather than deal with this code every time we write a presentation let’s be good coders and save this to an Rmd file called ‘LoadJQuery.Rmd’.

---
title: "Load JQuery"
author: "Jared P. Lander"
date: "July 1, 2017"
output: ioslides_presentation
---

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
htmltools::attachDependencies(htmltools::tags$span(''), dep)
```

```{js move-id-background-color}
$(document).ready(function(){
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID
        this.parentNode.id=$(this).attr('name');
    });
});
```

Then in the presentation Rmd file, toward the top, this file is loaded as a chunk child.

```{r load-jquery,child='LoadJQuery.Rmd'}
```

Now, any time a name attribute is included in heading metadata, that name will become the id for the enclosing <slide>.

## Background Two {.SlideClass #SlideID2 name=Slide2}

Some more text here

becomes

<slide class="current" data-slide-num="4" data-total-slides="4" id="Slide2">
    <hgroup>
        <h2>Background Two</h2>
    </hgroup>
    
    <article id="SlideID2" class="SlideClass">
        <p>Some more text here</p>
    </article>
</slide>

Then there is the matter of writing the CSS so that the image file is URI encoded. This can become laborious so we write a function to do it.

```{r background-function,include=FALSE}
makeBG <- function(id, file)
{
    cat(
        sprintf('<style type="text/css">\n#%s {\nbackground-image: url(%s);\n}\n</style>',
                id, knitr::image_uri(file))
    )
}
```

Anywhere in the document, though most logically in the intended slide, that function is called, specifying the slide ID and the image.

## Background Three {.SlideClass #SlideID3}

Full screen background

```{r results='asis',echo=FALSE}
makeBG(id='Slide3', 'Cowboy-Lasso.jpg')
```

This results in the following slide.

Of course, we must use images with the proper resolution.

Putting it All Together

Putting it all together yields a presentation with fullscreen background images.

---
title: "Fullscreen Backgrounds"
author: Jared P. Lander
output:
    ioslides_presentation:
      self_contained: yes
      widescreen: yes
---

```{css}
slide {
    background-position: center;
    background-repeat: no-repeat;
    background-size: contain;
}
```

```{r deps,include=FALSE}
# this ensures jquery is loaded
dep <- htmltools::htmlDependency(name = "jquery", version = "1.11.3", src = system.file("rmd/h/jquery-1.11.3", package='rmarkdown'), script = "jquery.min.js")
htmltools::attachDependencies(htmltools::tags$span(''), dep)
```

```{js move-id-background-color}
$(document).ready(function(){
    // for every article tag inside a slide tag
    $("slide > article").each(function(){
        // copy the article name to the parentNode's (the slide) ID
        this.parentNode.id=$(this).attr('name');
    });
});
```

```{r background-function,include=FALSE}
makeBG <- function(id, file)
{
    cat(
        sprintf('<style type="text/css">\n#%s {\nbackground-image: url(%s);\n}\n</style>',
                id, knitr::image_uri(file))
    )
}
```

## Slide with Background {.SlideClass #SlideID name=ThisSlide}

```{r results='asis',echo=FALSE}
makeBG(id='ThisSlide', 'Cowboy-Lasso.jpg')
```

Slide with a nice background.

The full code and final presentation are posted so we can see the whole thing.

In case typing in the block of code to make the background becomes too tedious we can make an RStudio snippet.

snippet back
    ```{r results='asis'}
    makeBG('${1}', file.path(imageDir, '${2}'))
    ```
    ${0}

Since this is RMarkdown, in order to complete a snippet we have to use shift+tab rather than just tab as in R scripts.

I posted the snippets I use on GitHub.

Other RStudio tips and tricks can be seen in Sean Lopp’s talk at this year’s New York R Conference.

This is the code for a webinar I gave with Dan Mbanga for Amazon’s AWS Webinar Series about deep learning using MXNet in R. The video is available on YouTube.

I am experimentally saving htmlwidget objects to HTML files then loading them with iframes so please excuse the scroll bars.

Packages

These are the packages we are using either by loading the entire package or using individual functions.

<br />

Data

This dataset is about property lots in Manhattan and includes descriptive information as well as value. The original data are available from NYC Planning and the prepared files seen here at the Lander Analytics data.world repo.

data_train <- readr::read_csv(file.path(dataDir, 'manhattan_Train.csv'))
data_validate <- readr::read_csv(file.path(dataDir, 'manhattan_Validate.csv'))
data_test <- readr::read_csv(file.path(dataDir, 'manhattan_Test.csv'))

# Remove some variables
data_train <- data_train %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_validate <- data_validate %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_test <- data_test %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)

This is a glimpse of the data.

<br />

Here is a visualization of the class balance.

dataList <- list(data_train, data_validate, data_test)
dataList %>% 
    purrr::map(function(x) figure(width=600/NROW(dataList), height=500, legend_location=NULL) %>% 
                   ly_bar(x=High, color=factor(High), data=x, legend='code')
    ) %>% 
    grid_plot(nrow=1, ncol=NROW(dataList), same_axes=TRUE)

<br />

We use vtreat to do some automated feature engineering.

# The column name for the response
responseName <- 'High'
# The target value for the response
responseTarget <- TRUE
# The remaining columns are predictors
varNames <- setdiff(names(data_train), responseName)

# build the treatment design
treatmentDesign <- designTreatmentsC(dframe=data_train, varlist=varNames, 
                                     outcomename=responseName, 
                                     outcometarget=responseTarget, 
                                     verbose=TRUE)

## [1] "desigining treatments Mon Jun 26 01:23:38 2017"
## [1] "designing treatments Mon Jun 26 01:23:38 2017"
## [1] " have level statistics Mon Jun 26 01:23:39 2017"
## [1] "design var FireService Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist1 Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist2 Mon Jun 26 01:23:40 2017"
## [1] "design var ZoneDist3 Mon Jun 26 01:23:40 2017"
## [1] "design var Class Mon Jun 26 01:23:41 2017"
## [1] "design var LandUse Mon Jun 26 01:23:42 2017"
## [1] "design var Easements Mon Jun 26 01:23:43 2017"
## [1] "design var OwnerType Mon Jun 26 01:23:43 2017"
## [1] "design var LotArea Mon Jun 26 01:23:43 2017"
## [1] "design var BldgArea Mon Jun 26 01:23:43 2017"
## [1] "design var ComArea Mon Jun 26 01:23:43 2017"
## [1] "design var ResArea Mon Jun 26 01:23:44 2017"
## [1] "design var OfficeArea Mon Jun 26 01:23:44 2017"
## [1] "design var RetailArea Mon Jun 26 01:23:44 2017"
## [1] "design var GarageArea Mon Jun 26 01:23:44 2017"
## [1] "design var StrgeArea Mon Jun 26 01:23:44 2017"
## [1] "design var FactryArea Mon Jun 26 01:23:44 2017"
## [1] "design var OtherArea Mon Jun 26 01:23:44 2017"
## [1] "design var NumBldgs Mon Jun 26 01:23:44 2017"
## [1] "design var NumFloors Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsRes Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsTotal Mon Jun 26 01:23:44 2017"
## [1] "design var LotFront Mon Jun 26 01:23:44 2017"
## [1] "design var LotDepth Mon Jun 26 01:23:44 2017"
## [1] "design var BldgFront Mon Jun 26 01:23:44 2017"
## [1] "design var BldgDepth Mon Jun 26 01:23:45 2017"
## [1] "design var Extension Mon Jun 26 01:23:45 2017"
## [1] "design var Proximity Mon Jun 26 01:23:45 2017"
## [1] "design var IrregularLot Mon Jun 26 01:23:45 2017"
## [1] "design var LotType Mon Jun 26 01:23:46 2017"
## [1] "design var BasementType Mon Jun 26 01:23:46 2017"
## [1] "design var Landmark Mon Jun 26 01:23:47 2017"
## [1] "design var BuiltFAR Mon Jun 26 01:23:47 2017"
## [1] "design var ResidFAR Mon Jun 26 01:23:47 2017"
## [1] "design var CommFAR Mon Jun 26 01:23:47 2017"
## [1] "design var FacilFAR Mon Jun 26 01:23:47 2017"
## [1] "design var Built Mon Jun 26 01:23:47 2017"
## [1] "design var HistoricDistrict Mon Jun 26 01:23:48 2017"
## [1] " scoring treatments Mon Jun 26 01:23:48 2017"
## [1] "have treatment plan Mon Jun 26 01:24:19 2017"
## [1] "rescoring complex variables Mon Jun 26 01:24:19 2017"
## [1] "done rescoring complex variables Mon Jun 26 01:24:32 2017"

Then we create train, validate and test matrices.

# build design data.frames
dataTrain <- prepare(treatmentplan=treatmentDesign, dframe=data_train)
dataValidate <- prepare(treatmentplan=treatmentDesign, dframe=data_validate)
dataTest <- prepare(treatmentplan=treatmentDesign, dframe=data_test)

# use all the level names as predictors
predictorNames <- setdiff(names(dataTrain), responseName)

# training matrices
trainX <- data.matrix(dataTrain[, predictorNames])
trainY <- dataTrain[, responseName]

# validation matrices
validateX <- data.matrix(dataValidate[, predictorNames])
validateY <- dataValidate[, responseName]

# test matrices
testX <- data.matrix(dataTest[, predictorNames])
testY <- dataTest[, responseName]

# Sparse versions for some models
trainX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTrain)
validateX_sparse <- sparse.model.matrix(object=High ~ ., data=dataValidate)
testX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTest)

Feedforward Network

Helper Functions

This is a function that allows mxnet to calculate log-loss based on the logloss function from the Metrics package.

# log-loss
mx.metric.mlogloss <- mx.metric.custom("mlogloss", function(label, pred){
    return(Metrics::logLoss(label, pred))
})

Network Formulation

We build the model symbolically. We use a feedforward network with two hidden layers. The first hidden layer has 256 units and the second has 128 units. We also use dropout and batch normalization for regularization. The last step is to use a logistic sigmoid (inverse logit) for the logistic regression output.

net <- mx.symbol.Variable('data') %>%
    # drop out 20% of predictors
    mx.symbol.Dropout(p=0.2, name='Predictor_Dropout') %>%
    # a fully connected layer with 256 units
    mx.symbol.FullyConnected(num_hidden=256, name='fc_1') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_1') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_1') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_1') %>%
    # a fully connected layer with 128 units
    mx.symbol.FullyConnected(num_hidden=128, name='fc_2') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_2') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_2') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_2') %>%
    # fully connect to the output layer which has just the 1 unit
    mx.symbol.FullyConnected(num_hidden=1, name='out') %>%
    # use the sigmoid output
    mx.symbol.LogisticRegressionOutput(name='output')

Inspect the Network

By inspecting the symbolic network we see that it is actually just a C++ pointer. We also see its arguments and a visualization.

net

## C++ object <0000000018ed9aa0> of class 'MXSymbol' <0000000018c30d00>

arguments(net)

##  [1] "data"         "fc_1_weight"  "fc_1_bias"    "bn_1_gamma"  
##  [5] "bn_1_beta"    "fc_2_weight"  "fc_2_bias"    "bn_2_gamma"  
##  [9] "bn_2_beta"    "out_weight"   "out_bias"     "output_label"

graph.viz(net)

<br />

Network Training

With the data prepared and the network specified we now train the model. First we set the envinronment variable MXNET_CPU_WORKER_NTHREADS=4 since this demo is on a laptop with four threads. Using a GPU will speed up the computations. We also set the random seed with mx.set.seed for reproducibility.

We use the Adam optimization algorithm which has an adaptive learning rate which incorporates momentum.

# use four CPU threads
Sys.setenv('MXNET_CPU_WORKER_NTHREADS'=4)

# set the random seed
mx.set.seed(1234)

# train the model
mod_net <- mx.model.FeedForward.create(
    symbol            = net,    # the symbolic network
    X                 = trainX, # the predictors
    y                 = trainY, # the response
    optimizer         = "adam", # using the Adam optimization method
    eval.data         = list(data=validateX, label=validateY), # validation data
    ctx               = mx.cpu(), # use the cpu for training
    eval.metric       = mx.metric.mlogloss, # evaluate with log-loss
    num.round         = 50,     # 50 epochs
    learning.rate     = 0.001,   # learning rate
    array.batch.size  = 256,    # batch size
    array.layout      = "rowmajor"  # the data is stored in row major format
)

Predictions

Statisticians call this step prediction while the deep learning field calls it inference which has an entirely different meaning in statistics.

preds_net <- predict(mod_net, testX, array.layout="rowmajor") %>% t

Elastic Net

Model Training

We fit an Elastic Net model with glmnet.

registerDoParallel(cl=4)

set.seed(1234)
mod_glmnet <- cv.glmnet(x=trainX_sparse, y=trainY, 
                        alpha=.5, family='binomial', 
                        type.measure='auc',
                        nfolds=5, parallel=TRUE)

Predictions

preds_glmnet <- predict(mod_glmnet, newx=testX_sparse, s='lambda.1se', type='response')

XGBoost

Model Training

We fit a random forest with xgboost.

set.seed(1234)

trainXG <- xgb.DMatrix(data=trainX_sparse, label=trainY)
validateXG <- xgb.DMatrix(data=validateX_sparse, label=validateY)

watchlist <- list(train=trainXG, validate=validateXG)

mod_xgboost <- xgb.train(data=trainXG, 
                nrounds=1, nthread=4, 
                num_parallel_tree=500, subsample=0.5, colsample_bytree=0.5,
                objective='binary:logistic',
                eval_metric = "error", eval_metric = "logloss",
                print_every_n=1, watchlist=watchlist)

## [1]  train-error:0.104713    train-logloss:0.525032  validate-error:0.108254 validate-logloss:0.527535

Predictions

preds_xgboost <- predict(mod_xgboost, newdata=testX_sparse)

SVM

Model Training

set.seed(1234)
mod_svm <- e1071::svm(x=trainX_sparse, y=trainY, probability=TRUE, type='C')

This model did not train in a reasonable time.

Results

ROC

rocData <- dplyr::bind_rows(
    cbind(data.frame(roc(testY, preds_glmnet, direction="<")[c('specificities', 'sensitivities')]), Model='glmnet'),
    cbind(data.frame(roc(testY, preds_xgboost, direction="<")[c('specificities', 'sensitivities')]), Model='xgboost'),
    cbind(data.frame(roc(testY, preds_net, direction="<")[c('specificities', 'sensitivities')]), Model='Net')
)

ggplotly(ggplot(rocData, aes(x=specificities, y=sensitivities)) + geom_line(aes(color=Model, group=Model)) + scale_x_reverse(), width=800, height=600)

<br />

Shortly after I learned LaTeX I used it to write my resume (or CV if you will), freeing me from the headache of using Microsoft Word and the associated formatting troubles. Even that wasn’t enough though because different audiences needed different information and job listings. I could have stored all the information in the file and commented out bullet points I did not want to use, but that seemed sloppy. So instead I wrote an R package called resumer.

The trick is to store all of the data in a CSV, one row per bullet point.¹

JobName	Company	Location	Title	Start	End	Bullet	BulletName	Type	Description
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Set up company’s computing platform	1	Job	NA
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Designed data strategy overseeing many datasources	2	Job	NA
Tech Startup	Pied Piper	New York, NY	CTO	2013	Present	Constructed statistical models for predictive analytics of big data	3	Job	NA
Large Bank	Goliath National Bank	New York, NY	Quant	2011	2013	Built quantitative models for derivatives trades	1	Job	NA
Large Bank	Goliath National Bank	New York, NY	Quant	2011	2013	Wrote algorithms using the R statistical programming language	2	Job	NA
Bank Intern	Goliath National Bank	New York, NY	Intern	2010	NA	Got coffee for senior staff	1	Job	NA

Each row represents a detail about a job. So a job may take multiple rows.

The columns are:

JobName: Name identifying this job. This is identifying information used when selecting which jobs to display.
Company: Name of company.
Location: Physical location of job.
Title: Title held at job.
Start: Start date of job, usually represented by a year.
End: End date of job. This would ordinarily by a year, ‘Present’ or blank.
Bullet: The detail about the job.
BulletName: Identifier for this detail, used when selecting which details to display.
Type: Should be either Job or Research.
Description: Used for a quick blurb about research roles.

There are many parts to using this package which are all explained in the README and mostly reproduced here.

The yaml header holds your name, address, the location of the jobs CSV file, education information and any highlights. Remember, proper indenting is required for yaml.

The name and address fields are self explanatory. output takes the form of package::function which for this package is resumer::resumer.

The location of the jobs CSV is specified in the JobFile slot of the params entry. This should be the absolute path to the CSV.

These would look like this.

---
name: "Generic Name"
address: "New York"
output: resumer::resumer
params:
    JobFile: "examples/jobs.csv"
---

Supplying education information is done as a list in the education entry, with each school containing slots for school, dates and optionally notes. Each slot of the list is started with a -. The notes slot starts with a | and each line (except the last line) must end with two spaces.

For example:

---
education:
-   school: "Hudson University"
    dates: "2007--2009"
    notes: |
        GPA 3.955  
        Master of Arts in Statistics
-   school: "Smallville College"
    dates: "2000--2004"
    notes: |
        Cumulative GPA 3.838 Summa Cum Laude, Honors in Mathematics  
        Bachelor of Science in Mathematics, Journalism Minor  
        The Wayne Award for Excellence in Mathematics  
        Member of Pi Mu Epsilon, a national honorary mathematics society
---

To provide a highlights section set doHighlights: yes and create a highlights tag.

Each bullet in the highlights entry should be a list slot started by -. For example.

---
doHighlights: yes
highlights:
-   bullet: Author of \emph{Pulitzer Prize} winning article
-   bullet: Organizer of \textbf{Glasses and Cowl} Meetup
-   bullet: Analyzed global survey by the \textbf{Surveyors Inc}
-   bullet: Professor of Journalism at \textbf{Hudson University}
-   bullet: Thesis on \textbf{Facial Recognition Errors}
-   bullet: Served as reporter in \textbf{Vientiane, Laos}
---

Jobs and details are selected for display by building a list of lists named jobList. Each inner list represents a job and should have three unnamed elements: – CompanyName – JobName – Vector of BulletNames

An example is:

jobList <- list(
    list("Pied Piper", "Tech Startup", c(1, 3)),
    list("Goliath National Bank", "Large Bank", 1:2),
    list("Goliath National Bank", "Bank Intern", 1:3),
    list("Surveyors Inc", "Survery Stats", 1:2),
    list("Daily Planet", "Reporting", 2:4),
    list("Hudson University", "Professor", c(1, 3:4)),
    list("Hooli", "Coding Intern", c(1:3))
)

Research is specified similarly in researchList.

# generate a list of lists of research that list the company name, job name and bullet
researchList <- list(
    list("Hudson University", "Oddie Research", 4:5),
    list("Daily Planet", "Winning Article", 2)
)

The job file is read into the jobs variable using read.csv2.

library(resumer)
jobs <- read.csv2(params$JobFile, header=TRUE, sep=',', stringsAsFactors=FALSE)

The jobs and details are written to LaTeX using a code chunk with results='asis'.

Same with research details.

Regular LaTeX code can be used, such as in specifying an athletics section. Note that this uses a special rSection environment.

\begin{rSection}{Athletics}
\textbf{Ice Hockey} \emph{Goaltender} | \textbf{Hudson University} | 2000--2004 \\
\textbf{Curling} \emph{Vice Skip} | \textbf{Hudson University} | 2000--2004
\end{rSection}

A complete template is available when creating a new file in RStudio.

Any suggestions or, even better, pull requests are welcome at the GitHub page.

A helper function, createJobFile, creates a CSV with the correct headers.↩

Snowstorm Stella impacted both our numbers and our location, but last night a smaller crew braved the cold weather and messy streets to celebrate Pi Day with pizza and Pi Cake at Ribalta.

We naturally ate a lot of round pies and even a rectangular pie to honor Hippocrates’ squaring the lune.

This year’s Pi Cake came from Empire Cakes for the third year in a row. It was their Brooklyn Blackout cake with Chocolate frosting, a blue Pi symbol on top and blue circles with red radii around the sides.

Some pictures from last night:

And all the years’ Pi Cakes:

[Show slideshow]

I’m speaking in a few places over the next few weeks, so rather than just giving people a day’s notice I figured I should lay it out a bit. Right now I have three public talks lined up with a few more about to solidify. Soon I will update this map to have past talks too.

Talk	Event	City	Date
Modeling and Machine Learning in R	ODSC	San Francisco	2017-03-01
Scraping and Analyzing NFL Data	Sloan Sports Analytics Conference	Boston	2017-03-03
Fun with R	New York R Conference	New York	2017-04-21

Highlights from the 2016 New York R Conference

Originally posted on www.work-bench.com.

You might be asking yourself, “How was the 2016 New York R Conference?”

Well, if we had to sum it up in one picture, it would look a lot like this (thank you to Drew Conway for the slide & delivering the battle cry for data science in NYC):

Our 2nd annual, sold-out New York R Conference was back this year on April 8th & 9th at Work-Bench. Co-hosted with our friends at Lander Analytics, this year’s conference was bigger and better than ever, with over 250 attendees, and speakers from Airbnb, AT&T, Columbia University, eBay, Etsy, RStudio, Socure, and Tamr. In case you missed the conference or want to relive the excitement, all of the talks and slides are now live on the R Conference website.

With 30 talks, each 20 minutes long and two forty-minute keynotes, the topics of the presentations were just as diverse as the speakers. Vivian Peng gave an emotional talk on data visualization using non-visual senses and “The Feels.” Bryan Lewis measured the shadows of audience members to demonstrate the pros and cons of projection methods, and Daniel Lee talked about life, love, Stan, and March Madness. But, even with 32 presentations from a diverse selection of speakers, two dominant themes emerged: 1) Community and 2) Writing better code.

Given the amazing caliber of speakers and attendees, community was on everyone’s mind from the start. Drew Conway emoted the past, present, and future of data science in NYC, and spoke to the dangers of tearing down the tent we built. Joe Rickert from Microsoft discussed the R Consortium and how to become involved. Wes McKinney talked about community efforts in improving interoperability between data science languages with the new Feather data frame file format under the Apache Arrow project. Elena Grewal discussed how Airbnb’s data science team made changes to the hiring process to increase the number of female hires, and Andrew Gelman even talked about how your political opinions are shaped by those around you in his talk about Social Penumbras.

Writing better code also proved to be a dominant theme throughout the two day conference. Dan Chen of Lander Analytics talked about implementing tests in R. Similarly, Neal Richardson and Mike Malecki of Crunch.io talked about how they learned to stop munging and love tests, and Ben Lerner discussed how to optimize Python code using profilers and Cython. The perfect intersection of themes came from Bas van Schaik of Semmle who discussed how to use data science to write better code by treating code as data. While everyone had some amazing insights, these were our top five highlights:

JJ Allaire Releases a New Preview of RStudio

JJ Allaire, the second speaker of the conference, got the crowd fired up by announcing new features of RStudio and new packages. Particularly exciting was bookdown for authoring large documents, R Notebooks for interactive Markdown files and shared sessions so multiple people can code together from separate computers.

As always, Dr. Andrew Gelman wowed the crowd with his breakdown of how political opinions are shaped by those around us. He utilized his trademark visualizations and wit to convey the findings of complex models.

Vivian Peng Helps Kick off the Second Day with a Punch to the Gut

On the morning of the second day of the conference, Vivian Peng gave a heartfelt talk on using data visualization and non-visual senses to drive emotional reaction and shape public opinion on everything from the Syrian civil war to drug resistance statistics.

Ivor Cribben Studies Brain Activity with Time Varying Networks

University of Alberta Professor Ivor Cribben demonstrated his techniques for analyzing fMRI data. His use of network graphs, time series and extremograms brought an academic rigor to the conference.

Elena Grewal Talks About Scaling Data Science at Airbnb

After a jam-packed 2 full days, Elena Grewal helped wind down the conference with a thoughtful introspection on how Airbnb has grown their data science team from 5 to 70 people, with a focus on increasing diversity and eliminating bias in the hiring process.

See the full conference videos & presentations below, and sign up for updates for the 2017 New York R Conference on www.rstats.nyc. To get your R fix in the meantime, follow @nyhackr, @Work_Bench, and @rstatsnyc on Twitter, and check out the New York Open Programming Statistical Meetup or one of Work-Bench’s upcoming events!

Ohio State Buckeyes defensive lineman Joey Bosa

Been a busy few weeks with the New York R Conference, speaking engagements, writing the second edition of R for Everyone and coding open source packages. The most exciting news involves the news as the Wall Street Journal wrote an article about my NFL Draft work.

It is a great piece with some nice quotes from the Vikings General Manager Rick Spielman and ESPN’s legendary John Clayton that succinctly sums up the work I did and runs the numbers on a few select players.

So now I’ve been in the news for pizza, the lottery and football. Fun mix.

Related Posts

Related Posts

Related Posts

The Problem

The Solution

Putting it All Together

Related Posts

Packages

Data

Feedforward Network

Helper Functions

Network Formulation

Inspect the Network

Network Training

Predictions

Elastic Net

Model Training

Predictions

XGBoost

Model Training

Predictions

SVM

Model Training

Results

ROC

Related Posts

Related Posts

Related Posts

Related Posts

Highlights from the 2016 New York R Conference

JJ Allaire Releases a New Preview of RStudio

Andrew Gelman Discusses the Political Impact of the Social Penumbra

Vivian Peng Helps Kick off the Second Day with a Punch to the Gut

Ivor Cribben Studies Brain Activity with Time Varying Networks

Elena Grewal Talks About Scaling Data Science at Airbnb

Related Posts

Related Posts