Data scientists and R enthusiasts gathered for the 5th annual New York R Conference held on May 9th-11th. In front of a crowd of more than 300 attendees, 24 speakers gave presentations on topics ranging from deep learning and building packages in R to football and hockey analytics.

Speakers included: Andrew Gelman, Emily Robinson, Max Kuhn, Dan Chen, Jared Lander, Namita Nandakumar, Mike Band, Soumya Kalra, Brooke Watson, David Madigan, Jacqueline Nolis, Heather Nolis, Gabriela Hempfling, Wes McKinney, Noam Ross, Jim Savage, Ludmila Janda, Emily Dodwell, Michelle Gill, Krista Watts, Elizabeth Sweeney, Adam Chekroud, Amanda Dobbyn and Letisha Smith.

This year marked the ten-year anniversary of the New York Open Statistical Programming Meetup. It has been incredible to see the growth of meetup over the years. We now have over 10,000 members around the world!

Members Over Time

Let’s take a look at some of the highlights from the conference:

Jonah Gabry Kicked Off “R” Week at the New York Open Statistical Programming Meetup with a Talk on Using Stan in R

Jonah Gabry kicking off R Week with a talk about the Stan ecosystem

Jonah Gabry from the Stan Development Team kicked off “R” week with a talk on making Bayes easier in the R ecosystem. Jonah went over the packages (rstanarm, rstantools, bayesplot and loo) which emulate other R model-fitting functions, unify function naming across Stan-based R packages, and develop plotting functions using ggplot objects.

50 Conference Attendees Participated in Pre-Conference Workshops on Thursday before the Conference

A group of people in a room

Description automatically generated
People learning about machine learning from Max Kuhn during the pre-conference workshops

On the Thursday before the two-day conference, more than 50 conference attendees arrived at Work-Bench a day early for a full day of workshops. This was the first year of the R Conference Workshop Series. Max Kuhn, Dan Chen, Elizabeth Sweeney and Kaz Sakamoto each led a workshop which covered the following topics:

  • Machine Learning with Caret (Max Kuhn)
  • Git for Data Science (Dan Chen)
  • Introduction to Survival Analysis (Elizabeth Sweeney)
  • Geospatial Statistics and Mapping in R (Kaz Sakamoto)

The Growth of R-Ladies Summed Up in Three Pictures…

A group of people posing for a photo

Description automatically generated

We are so excited to see the growth of the R-Ladies community and we appreciate their support for the NY R Conference over the years. Congratulations ladies!

Dr. Andrew Gelman Delivers Keynote Speech on the Fallacy of P-Values and Thinking like a Statistician—All Without Slides

A person standing in front of a curtain

Description automatically generated
Andrew Gelman wowing the crowd as usual

There wasn’t a soul in the crowd who wasn’t hanging on every word from Columbia professor Dr. Andrew Gelman. The only speaker with a 40-minute time slot, and the only speaker to not use slides, Dr. Gelman talked about life as a statistician, warned of the perils of p-values and stressed the importance of simulation—before data collection—to improve our understanding of possible real-life scenarios. “Only through simulating fake data, can you really have statistical confidence about whatever performance metric you’re aiming for,” Gelman noted.

While we try not to pick a favorite speaker, Dr. Gelman runs away with that title every time he comes to speak at the New York R Conference.

Jacqueline and Heather Nolis Taught Us to Not to Be Afraid of Deep Learning and Model Deployment in Production

A group of people standing in a room

Description automatically generated
Jacqueline Nolis showing how fun and easy deep learning can be

The final talk on day one was perhaps the most entertaining and insightful from the weekend. Jacqueline Nolis taught us how developing a deep learning model is easier than we thought and how humor can help us understand a complex idea in a simple form. Our top five favorite neural network-generate pet names: Dia, Spok, Jori, Lule, and Timuse!

On Saturday morning, Heather Nolis showed us how we can deploy the model into production. Heather walked through the steps involved in preparing an R model for production using containers (Docker) and container orchestration (Kubernetes) to share models throughout an organization or for the public. How can we put a model into production without your laptop running 24/7? By running the code safely on a server in the cloud!

Emily Robinson and Honey Berk Win Headphones for Most Tweets During the Conference

Two people standing in front of a curtain

Description automatically generated
Emily Robinson winning the tweeting grand prize

If you’re not following Emily Robinson (@robinson_es) and Honey Berk (@honeyberk), you’re missing out! Emily and Honey led all conference attendees in Twitter mentions according to our Twitter scorekeeper Malorie Hughes (@data_all_day). Because of Emily and Honey’s presence on Twitter, those who were unable to attend the conference were able to follow along with all of our incredible speakers throughout the two-day event.
Tweet stats dashboard created by Malorie Hughes

Jared Lander Debuts New-Born R Package Hex Sticker T-Shirts: Congratulations to Jared and Rebecca on the Birth of their Son, Lev

Baby themed hex sticker shirts designed by Vivian Peng

During my talk I debuted a custom R package hex sticker t-shirt with my wife Rebecca and son Lev. We R a very nerdy family.

Looking Forward to 2020

If you attended the 2019 New York R Conference, we hope you had an incredible experience. If you did not attend the conference, we hope to see you next year!

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Baby Hex Stickers

Members Over Time

Ten years ago Josh Reich held the first ever New York Open Statistical Programming Meetup at Union Square Ventures. Back then it was called the New York R Meetup and 21 people RSVP’d. I discovered the Meetup three months later thanks to Andrew Gelman’s blog and by then the RSVP count had doubled and Drew Conway became a co-organizer. The experience was so much fun and I learned a lot of good stuff. I remember learning about the head() and tail() functions, wishing I learned them in grad school.

Early Meetup
The crowd at an early meetup fit in a small room at Columbia with room to spare.

I started attending regularly and pretty soon Drew decided to serve pizza which later led to years of pizza data. He also designed a logo for the NYC Data Mafia, which made for a great t-shirt that we still sell. One time, a number of us were talking and realized we were all answering each other’s questions on StackOverflow. Our community was growing both in person and online. I fell in love with the group because it was a great place to learn and hang out with smart, welcoming people.

During the first two years our hosts included NYU, Columbia, AOL and a handful of others. At this time there were about 1,800 members with Drew as the sole organizer who was ready to focus on other parts of his life, so he asked Wes McKinney and me to take over as organizers. This was after Drew renamed the group the Open Statistical Programming Meetup as to include other open source languages like Python, Julia, Go and SQL. I was incredibly thrilled to organize this group which meant so much to me.

Over the next eight years our numbers swelled to almost 11,000 with members and speakers coming from all over the world. We have held the Meetup at places such as eBay, AT&T, iHeartRadio, Work-Bench, Knewton, Twitter, New York Presbyterian, Rise New York and Google. The most popular event welcomed nearly 400 people when Hadley Wickham spoke in September of 2015, the only time the Meetup met on a Friday.

Hadley, Becky and Jared
The Meetup’s most popular night featured Hadley Wickham and coincided with my fifth date with Rebecca Martin, my future wife.

That night was also my fifth date with Rebecca Martin. We originally met during Michael Kane’s talk about PubMed then reconnected about a year later. We went on to get married and have a kid together. The New York Times used the nerdiest closing line ever for our wedding announcement: “The couple met in New York in May 2014 at a meet-up about statistical programming organized by the groom.”

Baby Hex Stickers
Rebecca and I recently added a new R programmer to our family and celebrated with baby-themed hex sticker shirts.

The Meetup has grown not only in numbers but in reach as well. There’s a website hosting all of the presentations, we livestream the Meetups and people from all over the world chat in our Slack team. Our live events include an ongoing workshop series and conferences in New York and Washington DC, which just hit their fifth anniversary, all for building and supporting the community and open source software.

NY R Conference
The crowd at the New York R Conference.

These past ten years have been a collection of amazing experiences for me where I got to learn from some of the world’s best experts and develop lasting relationships with great people. This community means so much to me and I very much look forward to its continued growth over the next decade.

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

After four sold-out years in New York City, the R Conference made its debut in Washington DC to a sold-out crowd of data scientists at the Ronald Reagan Building on November 8th & 9th. Our speakers shared presentations on a variety of R-related topics.

A big thank you to our speakers Max Kuhn, Emily Robinson, Mike Powell, Mara Averick, Max Richman, Stephanie Hicks, Michael Garris, Kelly O’Briant, David Smith, Anna Kirchner, Roger Peng, Marck Vaisman, Soumya Kalra, Jonathan Hersh, Vivian Peng, Dan Chen, Catherine Zhou, Jim Klucar, Lizzy Huang, Refael Lav, Ami Gates, Abhijit Dasgupta, Angela Li  and Tommy Jones.

Some of the amazing speakers

Some highlights from the conference:

R Superstars Mara Averick, Roger Peng and Emily Robinson

Mara Averick, Roger Peng and Emily Robinson

A hallmark of our R conferences is that the speakers hang out with all the attendees and these three were crowd favorites.

Michael Powell Brings R to the aRmy

Major Michael Powell

Major Michael Powell describes how R has brought efficiency to the Army Intelligence and Security Command by getting analysts out of Excel and into the Tidyverse. “Let me turn those 8 hours into 8 seconds for you,” says Powell.

Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions

Max Kuhn

After autographing his book, Applied Predictive Modeling, for a lucky attendee, Max Kuhn explains how Equivocals can be applied to individual predictions in order to avoid reporting predictions when there is significant uncertainty.

NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting

Emily Robinson

Emily Robinson tweeted the most at the 2018 NYR conference, winning her a WASD mechanical keyboard and at DCR she came in second so we gave her a limited edition NYR hoodie.

Max Richman Shows How SQL and R can Co-Exist

Max Richman

Max Richman, wearing the same shirt he wore when he spoke at the first NYR, shows parallels between dplyr and SQL.

Michael Garris Tells the Story of the MNIST Dataset

Michael Garris

Michael Garris was a member of the team that built the original MNIST dataset, which set the standard for handwriting image classification in the early 1990s. This talk may have been the first time the origin story was ever told.

R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health

Roger Peng

Roger Peng shows us how air pollution levels has fallen over the past 50 years resulting in dramatic improvements in air quality and health (with help from R).

Kelly O’Briant Combining R with Serverless Computing

Kelly O'Briant

Kelly O’Briant demonstrates how to easily deploy R projects on Google Compute Engine and promoted the new #radmins hashtag.

Hot Dog vs Not Hot Dog by David Smith (Inspired by Jian-Yang from HBO’s Silicon Valley)

David Smith

David Smith, one of the original R users, shows how to recreate HBO’s Silicon Valley’s Not Hot Dog app using R and Azure

Jon Hersh Describes How to Push for Data Science Within Your Organization

Jon Hersh

Jon Hersh discusses the challenges, and solutions, of getting organizations to embrace data science.

Vivian Peng and the Importance of Data Storytelling

Vivian Peng

Vivian Peng asks the question, how do we protect the integrity of our data analysis when it’s published for the world to see?

Dan Chen Signs His Book for David Smith

Dan Chen and David Smith

Dan Chen autographing a copy of his book, Pandas for Everyone, for David Smith. Now David Smith has to sign his book, An Introduction to R, for Dan.

Malorie Hughes Analyzing Tweets

Malorie Hughes

On the first day I challenged the audience to analyze the tweets from the conference and Malorie Hughes, a data scientist with NPR, designed a Twitter analytics dashboard to track the attendee with the most tweets with the hashtag #rstatsdc. Seth Wenchel won a WASD keyboard for the best tweeting. And we presented Malorie wit a DCR speaker mug.

Strong Showing from the #RLadies!

The R-Ladies

The #rladies group is growing year after year and it is great seeing them in force at NYR and DCR!


Matthew Hendrickson, a DCR attendee, posted on twitter every package mentioned during the two-day conference: tidyverse, tidycensus, leaflet, leaflet.extras, funneljoin, glmnet, xgboost, rstan, rstanarm, LowRankQP, dplyr, coefplot, bayesplot, keras, tensorflow, lars, magrittr, purrr, rsample, useful, knitr, rmarkdown, ggplot2, ggiraph, ggrepel, ggraph, ggthemes, gganimate, ggmap, plotROC, ggridges, gtrendsr, tlnise, tm, Bioconductor, plyranges, sf, tmap, textmineR, tidytext, gmailr, rtweet, shiny, httr, parsnip, probably, plumber, reprex, crosstalk, arules and arulesviz.

Data Community DC

Data Community DC

A special thanks to the Data Community DC for helping us make the DC R Conference an incredible experience.


The videos for the conference will be posted in the coming weeks to

See You Next Year

Looking forward to more great conferences at next year’s NYR and DCR!

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Getting People Started

A large part of my work is teaching R–for private clients, at Columbia Business School, at conferences and facilitating public workshops for others.

A common theme is that getting everyone setup on their individual computers is very difficult. No matter how many instructions I provide, there are always a good number of people without a proper environment. This can mean not using RStudio projects, not having the right packages installed, not downloading the data and sometimes not even installing R.


After many experiments I finally came upon a solution. For every class I teach I now create a skeleton project hosted on GitHub with instructions for setup.

The instructions (in the README) consist of three blocks of code.

  1. Package installation
  2. Copying the project structure from the repo (no git required)
  3. Downloading data

All the user has to do is copy and paste these three blocks of code into the R console and they have the exact same environment as the instructor and other students.

packages <- c( 
newProject <- usethis::use_course('')

Using this process, 95% of my students are prepared for class.

The inspiration for this idea came from a fun coffee with Hadley Wickham and Jenny Bryan during a conference in New Zealand and the implementation is made possible thanks to the usethis package.

Automating the Setup

Now that I found a good way to get students started, I wanted to make it easier for me to setup the repo. So I created an R package called RepoGenerator and put it on CRAN.

The first step to using the package is to create a GitHub Personal Access Token (instructions are in the README). Then you build a data.frame listing datasets you want the students to download. The data.frame needs at least the following three columns.

  • Local: The name, not path, the file should have on disk
  • Remote: The URL where the data files are stored online
  • Mode: The mode needed to write the file to disk, ‘w’ for regular text files, ‘wb’ for binary files such as Excel or rds files

An example data.frame is available in the RepoGenerator package.

data(datafiles, package='RepoGenerator')
datafiles[1:6, c('Local', 'Remote', 'Mode')]

After that you define the packages you want your students to use. There can be as few or as many as you want. In addition to any packages you list, rprojroot and usethis are added so that the instructions in the new repo will be certain to work.

packages <- c('caret', 'coefplot','DBI', 'dbplyr', 'doParallel', 'dygraphs', 
              'foreach', 'ggthemes', 'glmnet', 'jsonlite', 'leaflet', 'odbc', 
              'recipes', 'rmarkdown', 'rprojroot', 'RSQLite', 'rvest', 
              'tidyverse', 'threejs', 'usethis', 'UsingR', 'xgboost', 'XML', 

Now all you need to do is call the createRepo() function.

    # the name to use for the repo and project
    # the location on disk to build the project
    # the data.frame listing data files for the user to download
    # vector of packages the user should install
    # the GitHub username to create the repo for
    # the new repo's README has the name of who is organizing the class
    organizer='Lander Analytics',
    # the name of the environment variable storing the GitHub Personal Access Token

After this you will have a new repo setup for your users to copy, including instructions.

That’s All

Reducing setup issues at the start of a training can really improve the experience for everyone and allow you to get straight into teaching.

Please check it out and let me know how it works for you.


Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

2018 New York R Conference

The 2018 New York R Conference was the biggest and best yet. This is both in terms of the crowd size and content.  The speakers included some of the R community’s best such as Hadley Wickham, David Robinson, Jennifer Hill, Max Kuhn, Andreas Mueller (ok, a little Python), Evelina Gabasova, Sean Taylor and Jeff Ryan. I am proud to say we were almost at gender parity for both attendees and speakers which is amazing for a tech conference. Brooke Watson even excitedly noted that we had a line for the women’s room.

Particularly gratifying for me was seeing so many of my students speak. Eurry Kim, Dan Chen and Alex Boghosian all gave excellent talks.

Some highlights that stuck out to me are:

Emily Robinson Shows There is More to the Tidyverse than Hadley

The Expanded Tidyverse

Emily Robinson, otherwise known as ERob, gave an excellent talk showing how the Tidyverse is so much more than just Hadley and that there are many people inspired by him to contribute in the Tidy way.

Sean Taylor Forecasted the Future with Prophet

Sean Taylor

Sean Taylor, former New Yorker and unrepentant Eagles fan, demonstrate his powerful R and Python, package Prophet, for forecasting time series data. Facebook open sourced his work so we could all benefit.

OG Data Mafia Founder Drew Conway Popped In

Giving away a data mafia shirt

A lucky fan got an autographed NYC Data Mafia t-shirt from Drew Conway.

David Smith Playing Minecraft Through R

Minecraft in R

David Smith played Minecraft through R, including building objects and moving through the world.

Evelina Gabasova Used Social Network Analysis to Break Down Star Wars

It's a Trap

Evelina Gabasova wowed the audience with her fun talk and detailed analysis of character interaction in Star Wars.

Dusty Turner Represented West Point

Dusty Talking Army Sports

Dusty Turner taught us how the United States Military Academy uses R for both student instruction and evaluation.

Hadley Wickham Delved into the Nitty Gritty of R

Hadley shows off objects are stored in memory

Hadley Wickham showed us how to get into the internals of R and figure out how to examine objects from a memory perspective.

Jennifer Hill Demonstrated Awesome Machine Learning Techniques for Causal Inference

Jennifer Hill Explaining Causal Inference

Following her sold-out meetup appearance in March, Jennifer continued to push the boundaries of causal inference.

I Made the Authors of Caret and scitkit-learn Show That R and Python Can Get Along

Caret and Scikit-learn in one place

While both Andreas and Max gave great individual talks, I made them pose for this peace-making photo.

David Robinson Got the Upper Hand in a Sibling Twitter Duel

DRob Teaching

Given only about 30 minutes notice, David put together an entire slideshow on how to livetweet and how to compete with your sibling.

In the End Emily Robinson Beat Her Brother For Best Tweeting

Emily won the prize for best tweeting

Despite David’s headstart Emily was the best tweeter (as calculated by Max Kuhn and Mara Averick) so she won the WASD Code mechanical keyboard with MX Cherry Clear switches.

Silent Auction of Data Paintings

The Robinson Family bought the Pizza Data painting for me

Thomas Levine made paintings of famous datasets that we auctioned off with the proceeds supporting the R Foundation and the Free Software Foundation. The Robinson family very graciously chipped in and bought the painting of the Pizza Poll data for me! I’m still floored by this and in love with the painting.

Ice Cream Sandwiches

Ice Cream Sandwiches

In addition to bagels and eggs sandwiches from Murray’s Bagels, Israeli food from Hummus and Pita Company, avocado toast and coffee from Bluestone Lane Coffee and pizza from Fiore’s, we also had ice creams sandwiches from World’s Best Cookie Dough.

All the Material

To catch up on all the presentations check out Mara Averick’s excellent notes:

Or check out all of Brooke’s drawings, collated by Dan Chen.

Videos and Upcoming Events

The videos will be posted at in a few weeks for all to enjoy.

There are a number of other events coming up including:

We are already beginning plans for next year’s conference and are working on bringing it to DC as well! Stay tuned for all that and more.

Dan loves his mug

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
packageDF <- tibble::data_frame(Package=c(packages, packagesColon), Version=versions) %>% 
Package Version
caret 6.0.78
coefplot 1.2.6
dplyr 0.7.4
DT 0.2
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.2
useful 1.2.3
xgboost 0.6.4

Then, we read the data. The data are available at with the CSV version at We also get validation data which is helpful when fitting xgboost mdoels.

manTrain <- readRDS(url(''))
manVal <- readRDS(url(''))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              extensions=c('FixedHeader', 'Scroller'),

While glmnet automatically standardizes the input data, xgboost does not, so we calculate that manually. We use preprocess from caret to compute the mean and standard deviation of each numeric column then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with xgboost we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        # use a sparse matrix

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        # use a sparse matrix

manY_val <- useful::build.y(valueFormula, data=manVal)

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    # use a linear model
    # minimize the a regression criterion 
    # use MAE as a measure of quality
    # boost for up to 500 rounds
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    # randomly sample rows
    # randomly sample columns
    # set the learning rate for gradient descent
## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.


We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify it with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking and a few conferences and meetups. In all, we visited, London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin Destination Airline
JFK LGW Norwegian Air
LHR SIN Singapore Airlines
SIN HKG Singapore Airlines
HKG AKL Cathay Pacific
AKL ZQN Air New Zealand
ZQN AKL Air New Zealand
AKL PPT Air New Zealand
PPT BOB Air Tahiti
BOB MOZ Air Tahiti
MOZ PPT Terevau
PPT LAX Air Tahiti Nui
LAX SJC Alaska Airlines

Converting these to latitude and longitude is easy thanks to Open Flights.

# read in the data
airports <- readr::read_csv('',
                            # give it good column names since the data are headerless
                            col_names=c('ID', 'Name', 'City', 'Country', 
                                        'IATA', 'ICAO', 'Latitude', 'Longitude', 
                                        'Altitude', 'Timezone', 'DST', 'Tz', 
                                        'Type', 'Source'))

We then use filter to get just the ports we visited. Notice how we use a second tbl inside filter.

visited <- airports %>% 
    select(Name, City, Country, IATA, Latitude, Longitude) %>% 
    filter(IATA %in% (codes %>% select(Origin, Destination) %>% unlist))
DT::datatable(visited, elementId='AirportsTable',
              extensions=c('FixedHeader', 'Scroller'),
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

visitedOrdered <- visited %>% 
    slice(c(12, 1, 2, 8, 7, 5, 6, 5, 13, 3, 4, 13, 10, 11, 12))

For the first visualization let’s create a map using leaflet.

# initialize the widget
leaflet(data=visitedOrdered) %>% 
    # overlay map tiles
    addTiles() %>% 
    # plot lines from one point to another
    addPolylines(lng=~Longitude, lat=~Latitude) %>% 
    # add markers with city names
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~City)

Unfortunately, this doesn’t quite capture the directions of the flights as it makes it look like we flew back west to get to Papeete. So let’s try a globe instead using threejs.

We augment the edgelist of airports so that it has the latitude and longitude of the origin and destination airports for each flight.

flightPaths <- codes %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Origin'='IATA')) %>% 
    rename(oLong=Longitude, oLat=Latitude) %>% 
    left_join(visited %>% select(IATA, Longitude, Latitude), by=c('Destination'='IATA')) %>% 
    rename(dLong=Longitude, dLat=Latitude)
DT::datatable(visited, elementId='FlightPathLatLong',
              extensions=c('FixedHeader', 'Scroller'),
) %>% 
    DT::formatRound(columns=c('Latitude', 'Longitude'), digits=2)

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

    # the image to overlay on the globe
    # lat/long of visited airports
    lat=visited$Latitude, long=visited$Longitude,
    # lat/long of origin and destination
    arcs=flightPaths %>% select(oLat, oLong, dLat, dLong),
    # cosmetic adjustments
    arcsHeight=.4, arcsLwd=7, arcsColor="red", arcsOpacity=.95,
    atmosphere=FALSE, fov=30, rotationlat=0.3, rotationlong=.8*pi)

We now calculate the total distance traveled (not including car trips) using Haversine Distance to account for the curvature of the Earth.

distHaversine(visitedOrdered %>% select(Longitude, Latitude), r=3959) %>% sum
## [1] 28660.52

So we traveled 3,760 more miles than the circumference of the Earth.

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

This is the code for a webinar I gave with Dan Mbanga for Amazon’s AWS Webinar Series about deep learning using MXNet in R. The video is available on YouTube.

I am experimentally saving htmlwidget objects to HTML files then loading them with iframes so please excuse the scroll bars.


These are the packages we are using either by loading the entire package or using individual functions.


This dataset is about property lots in Manhattan and includes descriptive information as well as value. The original data are available from NYC Planning and the prepared files seen here at the Lander Analytics repo.

data_train <- readr::read_csv(file.path(dataDir, 'manhattan_Train.csv'))
data_validate <- readr::read_csv(file.path(dataDir, 'manhattan_Validate.csv'))
data_test <- readr::read_csv(file.path(dataDir, 'manhattan_Test.csv'))
# Remove some variables
data_train <- data_train %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_validate <- data_validate %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)
data_test <- data_test %>% 
    select(-ID, -TotalValue, -Borough, -ZoneDist4, 
           -SchoolDistrict, -Council, -PolicePrct, -HealthArea)

This is a glimpse of the data.

Here is a visualization of the class balance.

dataList <- list(data_train, data_validate, data_test)
dataList %>% 
    purrr::map(function(x) figure(width=600/NROW(dataList), height=500, legend_location=NULL) %>% 
                   ly_bar(x=High, color=factor(High), data=x, legend='code')
    ) %>% 
    grid_plot(nrow=1, ncol=NROW(dataList), same_axes=TRUE)

We use vtreat to do some automated feature engineering.

# The column name for the response
responseName <- 'High'
# The target value for the response
responseTarget <- TRUE
# The remaining columns are predictors
varNames <- setdiff(names(data_train), responseName)

# build the treatment design
treatmentDesign <- designTreatmentsC(dframe=data_train, varlist=varNames, 
## [1] "desigining treatments Mon Jun 26 01:23:38 2017"
## [1] "designing treatments Mon Jun 26 01:23:38 2017"
## [1] " have level statistics Mon Jun 26 01:23:39 2017"
## [1] "design var FireService Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist1 Mon Jun 26 01:23:39 2017"
## [1] "design var ZoneDist2 Mon Jun 26 01:23:40 2017"
## [1] "design var ZoneDist3 Mon Jun 26 01:23:40 2017"
## [1] "design var Class Mon Jun 26 01:23:41 2017"
## [1] "design var LandUse Mon Jun 26 01:23:42 2017"
## [1] "design var Easements Mon Jun 26 01:23:43 2017"
## [1] "design var OwnerType Mon Jun 26 01:23:43 2017"
## [1] "design var LotArea Mon Jun 26 01:23:43 2017"
## [1] "design var BldgArea Mon Jun 26 01:23:43 2017"
## [1] "design var ComArea Mon Jun 26 01:23:43 2017"
## [1] "design var ResArea Mon Jun 26 01:23:44 2017"
## [1] "design var OfficeArea Mon Jun 26 01:23:44 2017"
## [1] "design var RetailArea Mon Jun 26 01:23:44 2017"
## [1] "design var GarageArea Mon Jun 26 01:23:44 2017"
## [1] "design var StrgeArea Mon Jun 26 01:23:44 2017"
## [1] "design var FactryArea Mon Jun 26 01:23:44 2017"
## [1] "design var OtherArea Mon Jun 26 01:23:44 2017"
## [1] "design var NumBldgs Mon Jun 26 01:23:44 2017"
## [1] "design var NumFloors Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsRes Mon Jun 26 01:23:44 2017"
## [1] "design var UnitsTotal Mon Jun 26 01:23:44 2017"
## [1] "design var LotFront Mon Jun 26 01:23:44 2017"
## [1] "design var LotDepth Mon Jun 26 01:23:44 2017"
## [1] "design var BldgFront Mon Jun 26 01:23:44 2017"
## [1] "design var BldgDepth Mon Jun 26 01:23:45 2017"
## [1] "design var Extension Mon Jun 26 01:23:45 2017"
## [1] "design var Proximity Mon Jun 26 01:23:45 2017"
## [1] "design var IrregularLot Mon Jun 26 01:23:45 2017"
## [1] "design var LotType Mon Jun 26 01:23:46 2017"
## [1] "design var BasementType Mon Jun 26 01:23:46 2017"
## [1] "design var Landmark Mon Jun 26 01:23:47 2017"
## [1] "design var BuiltFAR Mon Jun 26 01:23:47 2017"
## [1] "design var ResidFAR Mon Jun 26 01:23:47 2017"
## [1] "design var CommFAR Mon Jun 26 01:23:47 2017"
## [1] "design var FacilFAR Mon Jun 26 01:23:47 2017"
## [1] "design var Built Mon Jun 26 01:23:47 2017"
## [1] "design var HistoricDistrict Mon Jun 26 01:23:48 2017"
## [1] " scoring treatments Mon Jun 26 01:23:48 2017"
## [1] "have treatment plan Mon Jun 26 01:24:19 2017"
## [1] "rescoring complex variables Mon Jun 26 01:24:19 2017"
## [1] "done rescoring complex variables Mon Jun 26 01:24:32 2017"

Then we create train, validate and test matrices.

# build design data.frames
dataTrain <- prepare(treatmentplan=treatmentDesign, dframe=data_train)
dataValidate <- prepare(treatmentplan=treatmentDesign, dframe=data_validate)
dataTest <- prepare(treatmentplan=treatmentDesign, dframe=data_test)

# use all the level names as predictors
predictorNames <- setdiff(names(dataTrain), responseName)

# training matrices
trainX <- data.matrix(dataTrain[, predictorNames])
trainY <- dataTrain[, responseName]

# validation matrices
validateX <- data.matrix(dataValidate[, predictorNames])
validateY <- dataValidate[, responseName]

# test matrices
testX <- data.matrix(dataTest[, predictorNames])
testY <- dataTest[, responseName]

# Sparse versions for some models
trainX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTrain)
validateX_sparse <- sparse.model.matrix(object=High ~ ., data=dataValidate)
testX_sparse <- sparse.model.matrix(object=High ~ ., data=dataTest)

Feedforward Network

Helper Functions

This is a function that allows mxnet to calculate log-loss based on the logloss function from the Metrics package.

# log-loss
mx.metric.mlogloss <- mx.metric.custom("mlogloss", function(label, pred){
    return(Metrics::logLoss(label, pred))

Network Formulation

We build the model symbolically. We use a feedforward network with two hidden layers. The first hidden layer has 256 units and the second has 128 units. We also use dropout and batch normalization for regularization. The last step is to use a logistic sigmoid (inverse logit) for the logistic regression output.

net <- mx.symbol.Variable('data') %>%
    # drop out 20% of predictors
    mx.symbol.Dropout(p=0.2, name='Predictor_Dropout') %>%
    # a fully connected layer with 256 units
    mx.symbol.FullyConnected(num_hidden=256, name='fc_1') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_1') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_1') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_1') %>%
    # a fully connected layer with 128 units
    mx.symbol.FullyConnected(num_hidden=128, name='fc_2') %>%
    # batch normalize the units
    mx.symbol.BatchNorm(name='bn_2') %>%
    # use the rectified linear unit (relu) for the activation function
    mx.symbol.Activation(act_type='relu', name='relu_2') %>%
    # drop out 50% of the units
    mx.symbol.Dropout(p=0.5, name='dropout_2') %>%
    # fully connect to the output layer which has just the 1 unit
    mx.symbol.FullyConnected(num_hidden=1, name='out') %>%
    # use the sigmoid output

Inspect the Network

By inspecting the symbolic network we see that it is actually just a C++ pointer. We also see its arguments and a visualization.

## C++ object <0000000018ed9aa0> of class 'MXSymbol' <0000000018c30d00>
##  [1] "data"         "fc_1_weight"  "fc_1_bias"    "bn_1_gamma"  
##  [5] "bn_1_beta"    "fc_2_weight"  "fc_2_bias"    "bn_2_gamma"  
##  [9] "bn_2_beta"    "out_weight"   "out_bias"     "output_label"

Network Training

With the data prepared and the network specified we now train the model. First we set the envinronment variable MXNET_CPU_WORKER_NTHREADS=4 since this demo is on a laptop with four threads. Using a GPU will speed up the computations. We also set the random seed with mx.set.seed for reproducibility.

We use the Adam optimization algorithm which has an adaptive learning rate which incorporates momentum.

# use four CPU threads

# set the random seed

# train the model
mod_net <- mx.model.FeedForward.create(
    symbol            = net,    # the symbolic network
    X                 = trainX, # the predictors
    y                 = trainY, # the response
    optimizer         = "adam", # using the Adam optimization method         = list(data=validateX, label=validateY), # validation data
    ctx               = mx.cpu(), # use the cpu for training
    eval.metric       = mx.metric.mlogloss, # evaluate with log-loss
    num.round         = 50,     # 50 epochs
    learning.rate     = 0.001,   # learning rate
    array.batch.size  = 256,    # batch size
    array.layout      = "rowmajor"  # the data is stored in row major format


Statisticians call this step prediction while the deep learning field calls it inference which has an entirely different meaning in statistics.

preds_net <- predict(mod_net, testX, array.layout="rowmajor") %>% t

Elastic Net

Model Training

We fit an Elastic Net model with glmnet.


mod_glmnet <- cv.glmnet(x=trainX_sparse, y=trainY, 
                        alpha=.5, family='binomial', 
                        nfolds=5, parallel=TRUE)


preds_glmnet <- predict(mod_glmnet, newx=testX_sparse, s='lambda.1se', type='response')


Model Training

We fit a random forest with xgboost.


trainXG <- xgb.DMatrix(data=trainX_sparse, label=trainY)
validateXG <- xgb.DMatrix(data=validateX_sparse, label=validateY)

watchlist <- list(train=trainXG, validate=validateXG)

mod_xgboost <- xgb.train(data=trainXG, 
                nrounds=1, nthread=4, 
                num_parallel_tree=500, subsample=0.5, colsample_bytree=0.5,
                eval_metric = "error", eval_metric = "logloss",
                print_every_n=1, watchlist=watchlist)
## [1]  train-error:0.104713    train-logloss:0.525032  validate-error:0.108254 validate-logloss:0.527535


preds_xgboost <- predict(mod_xgboost, newdata=testX_sparse)


Model Training

mod_svm <- e1071::svm(x=trainX_sparse, y=trainY, probability=TRUE, type='C')

This model did not train in a reasonable time.



rocData <- dplyr::bind_rows(
    cbind(data.frame(roc(testY, preds_glmnet, direction="<")[c('specificities', 'sensitivities')]), Model='glmnet'),
    cbind(data.frame(roc(testY, preds_xgboost, direction="<")[c('specificities', 'sensitivities')]), Model='xgboost'),
    cbind(data.frame(roc(testY, preds_net, direction="<")[c('specificities', 'sensitivities')]), Model='Net')
ggplotly(ggplot(rocData, aes(x=specificities, y=sensitivities)) + geom_line(aes(color=Model, group=Model)) + scale_x_reverse(), width=800, height=600)

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Shortly after I learned LaTeX I used it to write my resume (or CV if you will), freeing me from the headache of using Microsoft Word and the associated formatting troubles. Even that wasn’t enough though because different audiences needed different information and job listings. I could have stored all the information in the file and commented out bullet points I did not want to use, but that seemed sloppy. So instead I wrote an R package called resumer.

The trick is to store all of the data in a CSV, one row per bullet point.1

JobName Company Location Title Start End Bullet BulletName Type Description
Tech Startup Pied Piper New York, NY CTO 2013 Present Set up company’s computing platform 1 Job NA
Tech Startup Pied Piper New York, NY CTO 2013 Present Designed data strategy overseeing many datasources 2 Job NA
Tech Startup Pied Piper New York, NY CTO 2013 Present Constructed statistical models for predictive analytics of big data 3 Job NA
Large Bank Goliath National Bank New York, NY Quant 2011 2013 Built quantitative models for derivatives trades 1 Job NA
Large Bank Goliath National Bank New York, NY Quant 2011 2013 Wrote algorithms using the R statistical programming language 2 Job NA
Bank Intern Goliath National Bank New York, NY Intern 2010 NA Got coffee for senior staff 1 Job NA

Each row represents a detail about a job. So a job may take multiple rows.

The columns are:

  • JobName: Name identifying this job. This is identifying information used when selecting which jobs to display.
  • Company: Name of company.
  • Location: Physical location of job.
  • Title: Title held at job.
  • Start: Start date of job, usually represented by a year.
  • End: End date of job. This would ordinarily by a year, ‘Present’ or blank.
  • Bullet: The detail about the job.
  • BulletName: Identifier for this detail, used when selecting which details to display.
  • Type: Should be either Job or Research.
  • Description: Used for a quick blurb about research roles.

There are many parts to using this package which are all explained in the README and mostly reproduced here.

The yaml header holds your name, address, the location of the jobs CSV file, education information and any highlights. Remember, proper indenting is required for yaml.

The name and address fields are self explanatory. output takes the form of package::function which for this package is resumer::resumer.

The location of the jobs CSV is specified in the JobFile slot of the params entry. This should be the absolute path to the CSV.

These would look like this.

name: "Generic Name"
address: "New York"
output: resumer::resumer
    JobFile: "examples/jobs.csv"

Supplying education information is done as a list in the education entry, with each school containing slots for school, dates and optionally notes. Each slot of the list is started with a -. The notes slot starts with a | and each line (except the last line) must end with two spaces.

For example:

-   school: "Hudson University"
    dates: "2007--2009"
    notes: |
        GPA 3.955  
        Master of Arts in Statistics
-   school: "Smallville College"
    dates: "2000--2004"
    notes: |
        Cumulative GPA 3.838 Summa Cum Laude, Honors in Mathematics  
        Bachelor of Science in Mathematics, Journalism Minor  
        The Wayne Award for Excellence in Mathematics  
        Member of Pi Mu Epsilon, a national honorary mathematics society

To provide a highlights section set doHighlights: yes and create a highlights tag.

Each bullet in the highlights entry should be a list slot started by -. For example.

doHighlights: yes
-   bullet: Author of \emph{Pulitzer Prize} winning article
-   bullet: Organizer of \textbf{Glasses and Cowl} Meetup
-   bullet: Analyzed global survey by the \textbf{Surveyors Inc}
-   bullet: Professor of Journalism at \textbf{Hudson University}
-   bullet: Thesis on \textbf{Facial Recognition Errors}
-   bullet: Served as reporter in \textbf{Vientiane, Laos}

Jobs and details are selected for display by building a list of lists named jobList. Each inner list represents a job and should have three unnamed elements: – CompanyNameJobName – Vector of BulletNames

An example is:

jobList <- list(
    list("Pied Piper", "Tech Startup", c(1, 3)),
    list("Goliath National Bank", "Large Bank", 1:2),
    list("Goliath National Bank", "Bank Intern", 1:3),
    list("Surveyors Inc", "Survery Stats", 1:2),
    list("Daily Planet", "Reporting", 2:4),
    list("Hudson University", "Professor", c(1, 3:4)),
    list("Hooli", "Coding Intern", c(1:3))

Research is specified similarly in researchList.

# generate a list of lists of research that list the company name, job name and bullet
researchList <- list(
    list("Hudson University", "Oddie Research", 4:5),
    list("Daily Planet", "Winning Article", 2)

The job file is read into the jobs variable using read.csv2.

jobs <- read.csv2(params$JobFile, header=TRUE, sep=',', stringsAsFactors=FALSE)

The jobs and details are written to LaTeX using a code chunk with results='asis'.

Same with research details.

Regular LaTeX code can be used, such as in specifying an athletics section. Note that this uses a special rSection environment.

\textbf{Ice Hockey} \emph{Goaltender} | \textbf{Hudson University} | 2000--2004 \\
\textbf{Curling} \emph{Vice Skip} | \textbf{Hudson University} | 2000--2004

A complete template is available when creating a new file in RStudio.

Any suggestions or, even better, pull requests are welcome at the GitHub page.

  1. A helper function, createJobFile, creates a CSV with the correct headers.

Related Posts

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.