The inaugural Government & Public Sector R Conference took place virtually from December 2nd to December 4th. With over 240 attendees, 26 speakers, three panelists and a rum masterclass leader, the R|Gov conference was a place where data scientists could gather remotely to explore, share, and inspire ideas.

We had so many amazing speakers, whom we would like to thank: Lucy D’Agostino McGowan (Wake Forest University), Dr. Andrew Gelman (Columbia University), Dr. Graciela Chichilnisky (Global Thermostat), Dr. David Meza (NASA), Maj. Maxine Drake (US Army), Alex Gold (RStudio), Kimberly F. Sellers (Georgetown University; The U. S. Census Bureau), Dr. Tyler Morgan-Wall (Institute for Defense Analyses (IDA)), Imane El Idrissi & Dr. Anna Mantsoki (Foundation for Innovative New Diagnostics), Dr. Wendy Martinez (Bureau of Labor Statistics (BLS)), Col. Alfredo Corbett (US Air Force), Rose Martinez & Brooke Frye (New York City Council Data Team), Yvan Gauthier (Department of National Defence), Michael Jadoo (BLS), Tommy Jones (In-Q-Tel), Selina Carter (IDB), Refael Lav (Deloitte’s Federal Government Services teams), Dr. Abhijit Dasgupta (Zansors), Dr. Simina Boca (Georgetown University Medical Center), Dr. Wil Doane (IDA), Mo Johnson-León (Insight Lane), Dan Chen (Virginia Tech), Dr. Gwynn Sturdevant (HBS & R-Ladies DC), Marck Vaisman (Microsoft), Jonathan Hersh (Argyros School of Business), Kaz Sakamoto (Lander Analytics & Columbia University), Emily Martinez (NYC Department of Health and Mental Hygiene), Dan Whitenack (SIL International & Practical AI Podcast), Danya Murali (Arcadia Power), Malcolm Barrett (Teladoc Health) and myself.

All the talks will be shared on rstats.ai and the Lander Analytics YouTube channel in the very near future. Stay tuned!

Check out some of the highlights from the conference:

Graciela Chichilnisky explains how financial instruments can resolve climate change

One of my former professors at Columbia University, Dr. Graciela Chichilnisky, gave a presentation on how financial instruments can resolve climate change quickly and effectively by using existing capital markets to benefit high-income and, especially, low-income groups. The process Dr. Chichilnisky proposes is simple and can lead to a transformation of our capitalistic economy in the direction of human survival. Furthermore, it is realistic and profitable. Dr. Chichilnisky acted as the lead U.S. author on the Intergovernmental Panel on Climate Change, which received the 2007 Nobel Peace Prize for its work in deciding world policy with respect to climate change, and she worked extensively on the Kyoto Protocol, creating and designing the carbon market that became international law in 2005.

Another classic no-slides talk from Andrew Gelman on how his team and The Economist Magazine built a presidential election forecasting model

Another professor of mine, Andrew Gelman, told us he wanted to give a talk on how his team’s election forecasting succeeded brilliantly, failed miserably, or landed somewhere in between. To build the model, they combined national polls, state polls, and political and economic fundamentals. Because the results of the election were not yet known when he proposed the talk, he didn’t know which of the three he’d be talking about… So how did his election forecast perform? The model predicted 49 out of 50 states correctly… But that doesn’t mean the forecast was perfect… For some background, see this article.

Wendy Martinez inspires and shares lessons about the rocky road she traveled to using R at a U.S. Government agency

Wendy Martinez described some of her experiences — both successes and failures — using R at several U.S. government agencies. In addition to serving as the Director of the Mathematical Statistics Research Center at the Bureau of Labor Statistics (BLS) for the last eight years, she is currently the President of the American Statistical Association (ASA), and she also served in several research positions throughout the Department of Defense. She has also written two books on MATLAB! It’s nice to see that she switched to open source.

Colonel Alfredo Corbett Spoke On Air Combat Command Enterprise Data Improvements

Deputy Director of Communications of the United States Air Force Colonel Alfredo Corbett showed us why, in his work, data can be a warfighting asset, fundamental to how Air Combat Command (ACC) operates in, and supports, all domains of warfare. In coordination with the Department of Defense and the Department of the Air Force, ACC is working to improve its data governance, data architecture, data standards, and data talent & culture, implementing major improvements to the way it manages, acquires, ingests, stores, processes, exploits, analyzes, and delivers data to its almost 100,000 operators.

We Participated in Two Virtual Happy Hours!

At lunch on the first day of the conference, we took a dive into the history and distillation process of a legendary rum made at the longest continuously running distillery in the world. Mount Gay Brand Ambassador Darrio Prescod shared his knowledge and transported us to Barbados, from where he tuned in. Following the second day of the conference, members of the Mount Gay brand development team took us through a rum tasting and shook up a couple of cocktails. Attendees and speakers listened and hung out, drinking rum, matcha, soda or water during our virtual happy hour.

All proceeds from the A(R)T Auction went to the R Foundation

The A(R)T Auction was held in support of the R Foundation, featuring pieces by artists in the R Community. Artists included Nadieh Bremer, Selina Carter, Thomas Lin Pedersen, Will Chase and DiKayo Data.

Traditional R-Ladies Group Photo Happened Again

We took an R-Ladies group [virtual] selfie. We would like to note that more R-Ladies participated, but chose not to share video.

Jon Harmon, Selina Carter, Mayarí Montes de Oca & DiKayo Data win Raspberry Pis, Noise Cancelling Headphones, and Gaming Mechanical Keyboards for Most Active Tweeting
You can see the R|Gov 2020 R Shiny Scoreboard here! The contest, a custom started at DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day), has returned every year by popular demand. Congratulations to our winners!

52 Conference Attendees Participated in Pre-Conference Workshops

We ran the following workshops prior to the conference:

Moving from DCR to R|Gov
With the shift to remote, we realized we could welcome a global audience to our annual conference, as we did for the virtual New York R Conference in August. And that gave birth to R|Gov, the Government and Public Sector R Conference. This new industry-focused conference centered on work in government, defense, NGOs and the public sector, and we had speakers not only from the DC area, but also from Geneva, Switzerland; Nashville, Tennessee; Quebec, Canada; and Los Angeles, California. For next year, we are working to invite speakers from more levels of government: local, state and federal. You can read more about this choice here.

Like NYR, R|Gov featured many in-person components of the gathering, like networking sessions, speaker walk-on songs and fun facts, happy hours, lots of giveaways, the Twitter contest, and the auction.

Thank you, Lander Analytics Team!

Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.

Looking Forward to New York, R|Gov, and Dublin!

If you attended, we hope you had an incredible experience. If you did not attend this year’s conference, we hope to see you at the New York R Conference and R|Gov in 2021, and, soon, the first Dublin R Conference.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

My team at Lander Analytics has been putting together conferences for six years, and they’ve always had the same fun format, which the community has really enjoyed. There’s the NYR conference for New Yorkers and those who want to fly, drive or train to join the New York community, and there’s DCR, which gathers the DC-area community. The last DCR Conference at Georgetown University went really well, as you can see in this recap. With the shift to virtual gatherings brought on by the pandemic, our community has gone fully remote, including the monthly Open Statistical Programming Meetup. With that, we realized the DCR Conference didn’t need to be just for folks from the DC area anymore; instead, we could welcome a global audience like we did with this year’s NYR. And that gave birth to R|Gov, the Government and Public Sector R Conference.

R|Gov is really a new industry-focused conference. Instead of drawing on speakers from a particular city or area, the talks will focus on work done in specific fields, in this case government, defense, NGOs and the public sector. We have speakers not only from the DC area, but also from Geneva, Switzerland; Nashville, Tennessee; Quebec, Canada; and Los Angeles, California. For the last three years, we have been working with Data Community DC, R-Ladies DC, and the Statistical Programming DC Meetup to put on DCR, and we continue to do so for R|Gov as we find great speakers and organizations who want to collaborate in driving attendance and building the community.

Like NYR and DCR, the topics at R|Gov range from practical how-tos, to theoretical findings, to processes, to tooling. The speakers this year come from the Center for Army Analysis, NASA, Columbia University, the U.S. Bureau of Labor Statistics, the Inter-American Development Bank, the United States Census Bureau, Harvard Business School, In-Q-Tel, Virginia Tech, Deloitte, NYC Department of Health and Mental Hygiene and Georgetown University, among others. We will also be hosting two rum and gin master classes, including one with Mount Gay, which comes from the oldest continuously running rum distillery in the world, and which George Washington served at his inauguration!

The R Conference series is quite a bit different from other industry and academic conferences.  The talks are twenty minutes long with no audience questions with the exception of special talks from the likes of Andrew Gelman or Hadley Wickham. Whether in person or virtual, we play music, have prize giveaways and involve food in the programming. When they were in person, we prided ourselves on avocado toast, pizza, ice cream and beer. For prizewinners, we autographed books right on stage since the authors were either speakers or in the audience. With the virtual events we try to capture as much of that spirit as possible, and the community really enjoyed the virtual R Conference | NY in August. A very lively event remotely and in the flesh, it is also one of the more informative conferences I have ever seen.

This virtual conference will include much of the in-person format, just recreated virtually. We will have 24 talks, a panel, workshops, community and networking breaks, happy hours, prizes and giveaways, a Twitter Contest, Meet the Speaker series, Job Board access, and participation in the Art Auction. We hope to see you there December 2-4, on a comfy couch near you.

Contact my team if you would like to sponsor. To learn more about the speaker lineup, workshops, and agenda visit rstats.ai/gov and landeranalytics.com. Join our slack team and follow us at @rstatsdc and @landeranalytics.


In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

To illustrate, we use the same data as our previous post.

First, we load the packages we need and note the version numbers.

# list the packages that we load
# alphabetically for reproducibility
packages <- c('caret', 'coefplot', 'DT', 'xgboost')
# call library on each package
purrr::walk(packages, library, character.only=TRUE)

# some packages we will reference without actually loading
# they are listed here for complete documentation
packagesColon <- c('dplyr', 'dygraphs', 'knitr', 'magrittr', 'purrr', 'tibble', 'useful')
versions <- c(packages, packagesColon) %>% 
    purrr::map(packageVersion) %>% 
    purrr::map_chr(as.character)
# tibble::tibble replaces the deprecated tibble::data_frame
packageDF <- tibble::tibble(Package=c(packages, packagesColon), Version=versions) %>% 
    dplyr::arrange(Package)
knitr::kable(packageDF)
Package Version
caret 6.0.78
coefplot 1.2.6
dplyr 0.7.4
DT 0.2
dygraphs 1.1.1.4
knitr 1.18
magrittr 1.5
purrr 0.2.4
tibble 1.4.2
useful 1.2.3
xgboost 0.6.4

Then, we read the data. The data are available at http://www.jaredlander.com/data/manhattan_Train.rds with the CSV version at data.world. We also get validation data, which is helpful when fitting xgboost models.

manTrain <- readRDS(url('http://www.jaredlander.com/data/manhattan_Train.rds'))
manVal <- readRDS(url('http://www.jaredlander.com/data/manhattan_Validate.rds'))

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

datatable(manTrain %>% dplyr::sample_n(size=1000), elementId='TrainingSampled',
              rownames=FALSE,
              extensions=c('FixedHeader', 'Scroller'),
              options=list(
                  scroller=TRUE
              ))

While glmnet automatically standardizes the input data, xgboost does not, so we do that manually. We use preProcess from caret to compute the mean and standard deviation of each numeric column, then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))
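As a rough illustration of what centering and scaling do, the sketch below standardizes a toy data frame by hand with base R. The data and the names toyTrain, toyVal and standardize are invented for this example and are not part of the original analysis; preProcess takes care of this bookkeeping for us.

```r
# toy data, invented purely for illustration
toyTrain <- data.frame(LotArea=c(1000, 2000, 3000, 4000), NumFloors=c(1, 2, 3, 6))
toyVal <- data.frame(LotArea=c(2500, 5000), NumFloors=c(3, 4))

# learn the mean and standard deviation from the training data only
trainMeans <- vapply(toyTrain, mean, numeric(1))
trainSDs <- vapply(toyTrain, sd, numeric(1))

# hypothetical helper: apply the training statistics to data with the same columns
standardize <- function(data, means, sds)
{
    as.data.frame(scale(data, center=means, scale=sds))
}

toyTrainScaled <- standardize(toyTrain, trainMeans, trainSDs)
toyValScaled <- standardize(toyVal, trainMeans, trainSDs)
```

Note that the validation data are scaled with the training means and standard deviations, not their own, which is the same idea as calling predict with preProc on manVal.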

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicollinearity with xgboost, we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

valueFormula <- TotalValue ~ FireService + ZoneDist1 + ZoneDist2 +
    Class + LandUse + OwnerType + LotArea + BldgArea + ComArea + ResArea +
    OfficeArea + RetailArea + NumBldgs + NumFloors + UnitsRes + UnitsTotal + 
    LotDepth + LotFront + BldgFront + LotType + HistoricDistrict + Built + 
    Landmark
manX <- useful::build.x(valueFormula, data=predict(preProc, manTrain),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY <- useful::build.y(valueFormula, data=manTrain)

manX_val <- useful::build.x(valueFormula, data=predict(preProc, manVal),
                        # do not drop the baselines of factors
                        contrasts=FALSE,
                        # use a sparse matrix
                        sparse=TRUE)

manY_val <- useful::build.y(valueFormula, data=manVal)
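To see what keeping the baselines means, here is a small sketch using base R's model.matrix rather than build.x. The toy data frame and its column names are made up, and the result is a dense matrix, not a sparse one, so this only illustrates the effect of the contrasts on the columns.

```r
# toy data, invented purely for illustration
toy <- data.frame(
    Value=c(10, 20, 30, 40),
    Zone=factor(c('A', 'B', 'C', 'A')),
    Area=c(1.5, 2.0, 2.5, 3.0)
)

# default treatment contrasts drop the baseline level 'A'
xDropped <- model.matrix(Value ~ Zone + Area, data=toy)

# a full (identity) contrast matrix keeps an indicator column for every level
xFull <- model.matrix(
    Value ~ Zone + Area, data=toy,
    contrasts.arg=list(Zone=contrasts(toy$Zone, contrasts=FALSE))
)

colnames(xDropped)
colnames(xFull)
```

The second matrix has one more column than the first, the indicator for the baseline level, which a penalized model can shrink like any other coefficient.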

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

manXG <- xgb.DMatrix(data=manX, label=manY)
manXG_val <- xgb.DMatrix(data=manX_val, label=manY_val)

We are now ready to fit a model. All we need to do to fit a linear model instead of a tree is set booster='gblinear' and objective='reg:linear'.

mod1 <- xgb.train(
    # the X and Y training data
    data=manXG,
    # use a linear model
    booster='gblinear',
    # minimize a regression criterion
    objective='reg:linear',
    # use MAE as a measure of quality
    eval_metric=c('mae'),
    # boost for up to 500 rounds
    nrounds=500,
    # print out the eval_metric for both the train and validation data
    watchlist=list(train=manXG, validate=manXG_val),
    # print eval_metric every 10 rounds
    print_every_n=10,
    # if the validate eval_metric hasn't improved by this many rounds, stop early
    early_stopping_rounds=25,
    # penalty terms for the L2 portion of the Elastic Net
    lambda=10, lambda_bias=10,
    # penalty term for the L1 portion of the Elastic Net
    alpha=900000000,
    # randomly sample rows
    subsample=0.8,
    # randomly sample columns
    col_subsample=0.7,
    # set the learning rate for gradient descent
    eta=0.1
)
## [1]  train-mae:1190145.875000    validate-mae:1433464.750000 
## Multiple eval metrics are present. Will use validate_mae for early stopping.
## Will train until validate_mae hasn't improved in 25 rounds.
## 
## [11] train-mae:938069.937500 validate-mae:1257632.000000 
## [21] train-mae:932016.625000 validate-mae:1113554.625000 
## [31] train-mae:931483.500000 validate-mae:1062618.250000 
## [41] train-mae:931146.750000 validate-mae:1054833.625000 
## [51] train-mae:930707.312500 validate-mae:1062881.375000 
## [61] train-mae:930137.375000 validate-mae:1077038.875000 
## Stopping. Best iteration:
## [41] train-mae:931146.750000 validate-mae:1054833.625000
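For intuition about the lambda and alpha arguments, the following sketch writes out a textbook Elastic Net penalty in base R. The function names and toy data are hypothetical, the intercept is left unpenalized for simplicity (unlike the lambda_bias setting above), and xgboost's internal scaling of the penalties may differ, so treat this as an illustration of the idea, not of the library's implementation.

```r
# generic Elastic Net penalty: an L2 (ridge) term plus an L1 (lasso) term
elasticPenalty <- function(weights, lambda, alpha)
{
    0.5 * lambda * sum(weights^2) + alpha * sum(abs(weights))
}

# penalized squared error for a linear model with an unpenalized intercept
penalizedLoss <- function(weights, intercept, x, y, lambda, alpha)
{
    preds <- as.numeric(x %*% weights) + intercept
    mean((y - preds)^2) / 2 + elasticPenalty(weights, lambda, alpha)
}

# toy data, invented purely for illustration
x <- matrix(c(1, 2, 3, 4, 0, 1, 0, 1), ncol=2)
y <- c(2, 4, 6, 8)

# these weights fit y exactly, so the loss consists only of the penalty
lossFit <- penalizedLoss(c(2, 0), intercept=0, x=x, y=y, lambda=10, alpha=1)
# zero weights carry no penalty, so the loss consists only of the error
lossZero <- penalizedLoss(c(0, 0), intercept=0, x=x, y=y, lambda=10, alpha=1)
```

A larger alpha pushes more coefficients to exactly zero, while a larger lambda shrinks them smoothly toward zero.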

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)
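The early stopping rule used in the fit can be sketched in a few lines of base R. The helper earlyStop and the toy validation values are hypothetical; the sketch simply stops once the metric has gone a given number of rounds without improving, and reports the best round seen.

```r
# hypothetical helper: find where early stopping would halt
# metric: validation metric for each round (lower is better)
# patience: number of rounds without improvement to tolerate
earlyStop <- function(metric, patience)
{
    best <- Inf
    bestRound <- 0
    for(i in seq_along(metric))
    {
        if(metric[i] < best)
        {
            # new best value, so remember it and the round it occurred in
            best <- metric[i]
            bestRound <- i
        } else if(i - bestRound >= patience)
        {
            # no improvement for `patience` rounds, so stop
            return(list(stoppedAt=i, bestRound=bestRound, best=best))
        }
    }
    # ran through every round without triggering the rule
    list(stoppedAt=length(metric), bestRound=bestRound, best=best)
}

# toy validation errors, invented purely for illustration
valMAE <- c(10, 8, 7, 6.5, 6.6, 6.7, 6.9, 7.2)
stopInfo <- earlyStop(valMAE, patience=3)
```

With these toy values the best round is the fourth, and the rule halts three rounds later without looking at the final value.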

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify them with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty, so we do not need to specify a particular penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

I’m speaking in a few places over the next few weeks, so rather than just giving people a day’s notice I figured I should lay it out a bit. Right now I have three public talks lined up with a few more about to solidify. Soon I will update this map to have past talks too.


Talk                                Event                              City           Date
Modeling and Machine Learning in R  ODSC                               San Francisco  2017-03-01
Scraping and Analyzing NFL Data     Sloan Sports Analytics Conference  Boston         2017-03-03
Fun with R                          New York R Conference              New York       2017-04-21



Ohio State Buckeyes defensive lineman Joey Bosa

Been a busy few weeks with the New York R Conference, speaking engagements, writing the second edition of R for Everyone and coding open source packages.  The most exciting news involves the news as the Wall Street Journal wrote an article about my NFL Draft work.

It is a great piece with some nice quotes from the Vikings General Manager Rick Spielman and ESPN’s legendary John Clayton that succinctly sums up the work I did and runs the numbers on a few select players.

So now I’ve been in the news for pizza, the lottery and football.  Fun mix.


We are fighting the large complex data war on a many fronts from theoretical statistics to distributed computing to our own large complex datasets.  So time is tight.

Bill Cleveland


While President Obama made big news for his trip to Myanmar I would like to point out I rang the same bell as him (picture above) three years before he did.
