My team at Lander Analytics has been putting together conferences for six years, and they’ve always had the same fun format, which the community has really enjoyed. There’s the NYR conference for New Yorkers and those who want to fly, drive or train to join the New York community, and there’s DCR, which gathers the DC-area community. The last DCR Conference at Georgetown University went really well, as you can see in this recap. With the shift to virtual gatherings brought on by the pandemic, our community has gone fully remote, including the monthly Open Statistical Programming Meetup. With that, we realized the DCR Conference didn’t just need to be for folks from the DC-area anymore, instead, we could welcome a global audience like we did with this year’s NYR. And that gave birth to R|Gov, the Government and Public Sector R Conference.
R|Gov is really a new industry-focused conference. Instead of drawing on speakers from a particular city or area, the talks will focus on work done in specific fields. In this case, in government, defense, NGOs and the public sector, and we have speakers from not only the DC-area, but also from Geneva, Switzerland, Nashville, Tennessee, Quebec, Canada and Los Angeles, California. For the last three years, we have been working with Data Community DC, R-Ladies DC, and the Statistical Programming DC Meetup, to put on DCR, and continue to do so for R|Gov as we find great speakers and organizations who want to collaborate in driving attendance and building the community.
Like NYR and DCR, the topics at R|Gov range from practical how-tos, to theoretical findings, to processes, to tooling and the speakers this year come from the Center for Army Analysis, NASA, Columbia University, The U.S. Bureau of Labor Statistics, the Inter-American Development Bank, The United States Census Bureau, Harvard Business School, In-Q-Tel, Virginia Tech, Deloitte, NYC Department of Health and Mental Hygiene and Georgetown University, among others. We will also be hosting two rum and gin master classes, including one with Mount Gay, which comes from the oldest continuously running rum distillery in the world, and which George Washington served at his inauguration!
The R Conference series is quite a bit different from other industry and academic conferences. The talks are twenty minutes long with no audience questions with the exception of special talks from the likes of Andrew Gelman or Hadley Wickham. Whether in person or virtual, we play music, have prize giveaways and involve food in the programming. When they were in person, we prided ourselves on avocado toast, pizza, ice cream and beer. For prizewinners, we autographed books right on stage since the authors were either speakers or in the audience. With the virtual events we try to capture as much of that spirit as possible, and the community really enjoyed the virtual R Conference | NY in August. A very lively event remotely and in the flesh, it is also one of the more informative conferences I have ever seen.
This virtual conference will include much of the in-person format, just recreated virtually. We will have 24 talks, a panel, workshops, community and networking breaks, happy hours, prizes and giveaways, a Twitter Contest, Meet the Speaker series, Job Board access, and participation in the Art Auction. We hope to see you there December 2-4, on a comfy couch near you.
The sixth annual (and first virtual) “New York” R Conference took place August 5-6 & 12-15. Almost 300 attendees, and 30 speakers, plus a stand-up comedian and a whiskey masterclass leader, gathered remotely to explore, share, and inspire ideas.
Let’s take a look at some of the highlights from the conference:
Andrew Gelman Gave Another 40-Minute Talk (no slides, as always)
Our favorite quotes from Andrew Gelman’s talk, Truly Open Science: From Design and Data Collection to Analysis and Decision Making, which had no slides, as usual:
“Everyone training in statistics becomes a teacher.”
“The most important thing you should take away — put multiple graphs on a page.”
“Honesty and transparency are not enough.”
“Bad science doesn’t make someone a bad person.”
Laura Gabrysiak Shows us We Are Driven By Experience, and not Brand Loyalty…Hope you Folks had a Good Experience!
Laura’s talk on re-Inventing customer engagement with machine learning went through several interesting use cases from her time at Visa. In addition to being a data scientist, she is an active community organizer and the co-founder of R-Ladies Miami.
Adam Obeng Delivered a Talk on Adaptive Experimentation
One of my former students at Columbia University, Adam Obeng, gave a great presentation on his adaptive experimentation. We learned that adaptive experimentation is three things: The name of (1) a family of techniques, (2) Adam’s team at Facebook, and (3) an open source package produced by said team. He went through the applications which are hyper-parameter optimization for ML, experimentation with multiple continuous treatments, and physical experiments or manufacturing.
Dr. Jacqueline Nolis Invited Us to Crash Her Viral Website, Tweet Mashup
Jacqueline asked the crowd to crash her viral website,Tweet Mashup, and gave a great talk on her experience building it back in 2016. Her website that lets you combine the tweets of two different people. After spending a year making it in .NET, when she launched the site it became an immediate sensation. Years later, she was getting more and more frustrated maintaining the F# code and decided to see if I could recreate it in Shiny. Doing so would require having Shiny integrate with the Twitter API in ways that hadn’t been done by anyone before, and pushing the Twitter API beyond normal use cases.
Attendees Participated in Two Virtual Happy Hours Packed with Fun
At the Friday Happy Hour, we had a mathematical standup comedian for the first time in R Conference history. Comic and math major Rachel Lander (no relationship to me!) entertained us with awesome math and stats jokes.
Following the stand up, we had a Whiskey Master Class with our Vibe Sponsor Westland Distillery, and another one on Saturday with Bruichladdich Distillery (hard to pronounce and easy to drink). Attendees and speakers learned and drank together, whether it be their whiskey, matchas, soda or water.
All Proceeds from the A(R)T Auction went to the R Foundation Again
A newer tradition, the A(R)T Auction, took place again! We featured pieces by artists in the R Community, and all proceeds were donated to the R Foundation. The highest-selling piece at auction was Street Cred (2020) by Vivian Peng (Lander Analytics and Los Angeles Mayor’s Office, Innovation Team). The second highest was a piece by Jacqueline Nolis (Brightloom, and Build a Career in Data Science co-author), R Conference speaker, Designed by Allison Horst, artist in residence at RStudio.
The R-Ladies Group Photo Happened, Even Remotely!
As per tradition, we took an R-Ladies group photo, but, for the first time, remotely– as a screenshot! We would like to note that many more R-Ladies were present in the chat, but just chose not to share video.
Jon Harmon, Edna Mwenda, and Jessica Streeter win Raspberri Pis, Bluetooth Headphones, and Tenkeyless Keyboards for Most Active Tweeting During the Conference
This year’s Twitter Contest, in Malorie’s words, was a “ruthless but noble war.” You can see the NYR 2020 Dashboard here. A custom started that DCR 2018 by our Twitter scorekeeper Malorie Hughes (@data_all_day) has returned every year by popular demand, and now she’s stuck with it forever! Congratulations to our winners!
50+ Conference Attendees Participated in Pre-Conference Workshops Before
For the first time ever, workshops took place over the course of several days to promote work-life balance, and to give attendees the chance to take more than one course. We ran the following seven workshops:
We recreated as much of the in-person experience as possible with attendee networking sessions, the speaker walk-on songs and fun facts, abundant prizes and giveaways, the Twitter contest, an art auction, and happy hours. In addition to all of this, we mailed conference programs, hex stickers, and other swag to each attendee (in the U.S.), along with discount codes from our Vibe Sponsors, MatchaBar, Westland Distillery and Bruichladdich Distillery.
Thank you, Lander Analytics Team!
Even though it was virtual, there was a lot of work that went into the conference, and I want to thank my amazing team at Lander Analytics along with our producer, Bill Prickett, for making it all come together.
Looking Forward to D.C. and Dublin If you attended, we hope you had an incredible experience. If you did not, we hope to see you at the virtual DC R Conference in the fall, and at the first Dublin R Conference and the NYR next year!
The Second Annual DCR Conference made its way to the ICC Auditorium at Georgetown University last week on November 8th and 9th. A sold-out crowd of R enthusiasts and data scientists gathered to explore, share and inspire ideas.
As always, the food was delicious! Our caterer even surprised us with Lander cookies.
David Robinson shared his Ten Tremendous Tricks in the Tidyverse. Always enthusiastic, DRob did a great job showing both well known and obscure functions for an easier data workflow.
Elizabeth Sweeney gave an awesome talk on Visualizing the Environmental Impact of Beef Consumption using Plotly and Shiny. We explored the impact of eating different cuts of beef in terms of the number of animal lives, Co2 emissions, water usage, and land usage. Did you know that there is a big difference in the environmental impact of consuming 100 pounds of hanger steak versus the same weight in ground beef? She used plotly to make interactive graphics and R Shiny to make an interactive webpage to explore the data.
The integrated development environment, RStudio, fully integrated themselves into the environment.
As a father, I’ve earned the right to make dad jokes (see above). You can see the slides for my talk, Raising Baby with R. While babies are commonly called bundles of joy, they are also bundles of data. Being the child of a data scientist and neuroscientist my son was certain to be analyzed myriad ways. I discussed how we used data to narrow down possible names then looked at using time series methods to analyze his sleeping and eating patterns. All in the name of science.
The costs involved with a health insurance plan can be confusing so I perform an analysis of different options to find which plan is most cost effective
My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan.
For families there are two main plans: Value and Advantage. The primary differences between the plans are the following:
Value Plan Amount
Advantage Plan Amount
The amount we pay every other week in order to have insurance
$160 ($4,160 annually)
$240 ($6,240 annually)
Amount we pay directly to health providers before the insurance starts covering costs
After the deductible is met, we pay this percentage of medical bills
This is the most we will have to pay to health providers in a year (premiums do not count toward this max)
We put them into a tibble for use later.
# use tribble() to make a quick and dirty tibble
parameters <- tibble::tribble(
~Plan, ~Premiums, ~Deductible, ~Coinsurance, ~OOP_Maximum,
'Value', 160*26, 1000, 0.2, 6000,
'Advantage', 240*26, 800, 0.1, 5000
Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose.
This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation.
In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum.
#' @title cost
#' @description Given healthcare spend and other parameters, calculate the actual cost to the user
#' @details Uses the formula above to caluclate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum.
#' @author Jared P. Lander
#' @param spend A given amount of healthcare spending as a vector for multiple amounts
#' @param premiums The annual premiums for a given plan
#' @param deductible The deductible for a given plan
#' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum
#' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan
#' @return The total cost to the insured
#' cost(3000, 4160, 1000, .20, 6000)
#' cost(3000, 6240, 800, .10, 5000)
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
# spend is vectorized so we use pmin to get the min between oop_maximum and (deductible + coinsurance*(spend - deductible)) for each value of spend provided
# we can never pay more than oop_maximum so that is one side
# if we are under oop_maximum for a given amount of spend,
# this is the cost
pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0)
# we add the premiums since that factors into our cost
With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose.
For the rest of the code we need these R packages.
We call our cost function on each amount of spend for the Value and Advantage plans.
spending <- spending %>%
# use our function to calcuate the cost for the value plan
# use our function to calcuate the cost for the Advantage plan
# compute the difference in costs for each plan
# the winner for a given amount of spend is the cheaper plan
mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value'))
The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of Advantage – Value where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column which has the corresponding colors.
Of course, plotting often makes it easier to see what is happening.
select(Spend, Value, Advantage) %>%
# put the plot in longer format so ggplot can set the colors
gather(key=Plan, value=Cost, -Spend) %>%
ggplot(aes(x=Spend, y=Cost, color=Plan)) +
scale_color_brewer(type='qual', palette='Set1') +
labs(x='Healthcare Spending', y='Out-of-Pocket Costs') +
It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost.
ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) +
geom_hline(yintercept=0, linetype=2, color='grey50') +
y='Difference in Out-of-Pocket Costs Between the Two Plans'
scale_color_brewer(type='qual', palette='Set1') +
To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases.
The spending exceeds the point of maximum out-of-pocket spend for both plans
The spending does not exceed the point of maximum out-of-pocket spend for either plan
The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan
The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan
When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel so there will be no cross over point.
When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality.
To keep the equations smaller we use variables such as \(d_v\) for the Value plan deductible, \(c_a\) for the Advantage plan coinsurance and \(oop_v\) for the out-of-pocket maximum for the Value plan.
When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan.
#' @title calculate_crossover_points
#' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other.
#' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers.
#' @author Jared P. Lander
#' @param premiums_1 The annual premiums for plan 1
#' @param deductible_1 The deductible plan 1
#' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1
#' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1
#' @param premiums_2 The annual premiums for plan 2
#' @param deductible_2 The deductible plan 2
#' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2
#' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2
#' @return The amount of spend at which point one plan becomes more expensive than the other
#' 160, 1000, 0.2, 6000,
#' 240, 800, 0.1, 5000
calculate_crossover_points <- function(
premiums_1, deductible_1, coinsurance_1, oop_maximum_1,
premiums_2, deductible_2, coinsurance_2, oop_maximum_2
# calculate the crossover before either has maxed out
neither_maxed_out <- (premiums_2 - premiums_1 +
deductible_2*(1 - coinsurance_2) -
deductible_1*(1 - coinsurance_1)) /
(coinsurance_1 - coinsurance_2)
# calculate the crossover when one plan has maxed out but the other has not
one_maxed_out <- (oop_maximum_1 +
premiums_1 - premiums_2 +
# calculate the crossover for the reverse
other_maxed_out <- (oop_maximum_2 +
premiums_2 - premiums_1 +
# these are all possible points where the curves cross
all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out)
# now calculate the difference between the two plans to ensure that these are true crossover points
all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) -
cost(all_roots, premiums_2, deductible_2, coinsurance_2, oop_maximum_2)
# only when the difference between plans is 0 are the curves truly crossing
all_roots[all_differences == 0]
We then call the function with the parameters for both plans we are considering.
We see that the Advantage plan is only cheaper than the Value plan when spending between $20,000 and $32,000.
The next question is will our healthcare spending fall in that narrow band between $20,000 and $32,000 where the Advantage plan is the cheaper option?
Probability of Spending
This part gets tricky. I’d like to figure out the probability of spending between $20,000 and $32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing.
Assumptions and calculations:
Healthcare spending follows a log-normal distribution
We will work with New York State data which is possibly different than New York City data
We know the mean for New York spending in 2014
We will use the accompanying annual growth rate to estimate mean spending in 2019
We have the national standard deviation for spending in 2009
In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009
We use the growth rate from before to estimate the New York standard deviation in 2019
We then take just New York spending for 2014 and multiply it by the corresponding growth rate.
ny_spend <- health_spend %>%
# get just New York
filter(State_Name == 'New York') %>%
# this row holds overall spending information
filter(Item == 'Personal Health Care ($)') %>%
# we only need a few columns
select(Y2014, Growth=Average_Annual_Percent_Growth) %>%
# we have to calculate the spending for 2019 by accounting for growth
# after converting it to a percentage
mutate(Y2019=Y2014*(1 + (Growth/100))^5)
We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation and assume the same annual growth rate as the mean. Recall, we can multiply the standard deviation by a constant.
My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate.
So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective. Meaning we have a 97.65% probability that the Value plan will cost less over the course of a year.
We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We multiply each of those probabilities against their corresponding dollar amounts. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1000 to put it back on the dollar scale. Mathematically it looks like this.
The following code calculates the expected cost for each plan.
# calculate the point-wise estimated probabilities of the healthcare spending
# based on a log-normal distribution with the appropriate mean and standard deviation
# compute the expected cost for each plan
# and the difference between them
# exponentiate the numbers so they are on the original scale
# the spending data is in increments of 1000
# so multiply by 1000 to get them on the dollar scale
mutate_each(funs=~ .x * 1000)
This shows that overall the Value plan is cheaper by about $1,324 dollars on average.
We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about $600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%.
So unless our spending will be between $20,000 and $32,000, which it likely will not be, it is a better idea to choose the Value plan.
Since the Value plan is so likely to be cheaper than the Advantage plan I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some parts of the Advantage plan are lower than the Value plan, such as the deductible, coinsurance and out-of-pocket maximum. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely that would be. So they are hedging against a low probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.)
In the end, the Value plan is very likely going to be cheaper than the Advantage plan.
Try it at Home
I created a Shiny app to allow users to plug in the numbers for their own plans. It is rudimentary, but it gives a sense for the relative costs of different plans.
Data scientists and R enthusiasts gathered for the 5th annual New York R Conference held on May 9th-11th. In front of a crowd of more than 300 attendees, 24 speakers gave presentations on topics ranging from deep learning and building packages in R to football and hockey analytics.
This year marked the ten-year anniversary of the New York Open Statistical Programming Meetup. It has been incredible to see the growth of meetup over the years. We now have over 10,000 members around the world!
Let’s take a look at some of the highlights from the conference:
Jonah Gabry Kicked Off “R” Week at the New York Open Statistical Programming Meetup with a Talk on Using Stan in R
Jonah Gabry from the Stan Development Team kicked off “R” week with a talk on making Bayes easier in the R ecosystem. Jonah went over the packages (rstanarm, rstantools, bayesplot and loo) which emulate other R model-fitting functions, unify function naming across Stan-based R packages, and develop plotting functions using ggplot objects.
50 Conference Attendees Participated in Pre-Conference Workshops on Thursday before the Conference
On the Thursday before the two-day conference, more than 50 conference attendees arrived at Work-Bench a day early for a full day of workshops. This was the first year of the R Conference Workshop Series. Max Kuhn, Dan Chen, Elizabeth Sweeney and Kaz Sakamoto each led a workshop which covered the following topics:
Machine Learning with Caret (Max Kuhn)
Git for Data Science (Dan Chen)
Introduction to Survival Analysis (Elizabeth Sweeney)
Geospatial Statistics and Mapping in R (Kaz Sakamoto)
The Growth of R-Ladies Summed Up in Three Pictures…
We are so excited to see the growth of the R-Ladies community and we appreciate their support for the NY R Conference over the years. Congratulations ladies!
Dr. Andrew Gelman Delivers Keynote Speech on the Fallacy of P-Values and Thinking like a Statistician—All Without Slides
There wasn’t a soul in the crowd who wasn’t hanging on every word from Columbia professor Dr. Andrew Gelman. The only speaker with a 40-minute time slot, and the only speaker to not use slides, Dr. Gelman talked about life as a statistician, warned of the perils of p-values and stressed the importance of simulation—before data collection—to improve our understanding of possible real-life scenarios. “Only through simulating fake data, can you really have statistical confidence about whatever performance metric you’re aiming for,” Gelman noted.
While we try not to pick a favorite speaker, Dr. Gelman runs away with that title every time he comes to speak at the New York R Conference.
Jacqueline and Heather Nolis Taught Us to Not to Be Afraid of Deep Learning and Model Deployment in Production
The final talk on day one was perhaps the most entertaining and insightful from the weekend. Jacqueline Nolis taught us how developing a deep learning model is easier than we thought and how humor can help us understand a complex idea in a simple form. Our top five favorite neural network-generate pet names: Dia, Spok, Jori, Lule, and Timuse!
On Saturday morning, Heather Nolis showed us how we can deploy the model into production. Heather walked through the steps involved in preparing an R model for production using containers (Docker) and container orchestration (Kubernetes) to share models throughout an organization or for the public. How can we put a model into production without your laptop running 24/7? By running the code safely on a server in the cloud!
Emily Robinson and Honey Berk Win Headphones for Most Tweets During the Conference
If you’re not following Emily Robinson (@robinson_es) and Honey Berk (@honeyberk), you’re missing out! Emily and Honey led all conference attendees in Twitter mentions according to our Twitter scorekeeper Malorie Hughes (@data_all_day). Because of Emily and Honey’s presence on Twitter, those who were unable to attend the conference were able to follow along with all of our incredible speakers throughout the two-day event.
Jared Lander Debuts New-Born R Package Hex Sticker T-Shirts: Congratulations to Jared and Rebecca on the Birth of their Son, Lev
During my talk I debuted a custom R package hex sticker t-shirt with my wife Rebecca and son Lev. We R a very nerdy family.
Looking Forward to 2020
If you attended the 2019 New York R Conference, we hope you had an incredible experience. If you did not attend the conference, we hope to see you next year!
I started attending regularly and pretty soon Drew decided to serve pizza which later led to years of pizza data. He also designed a logo for the NYC Data Mafia, which made for a great t-shirt that we still sell. One time, a number of us were talking and realized we were all answering each other’s questions on StackOverflow. Our community was growing both in person and online. I fell in love with the group because it was a great place to learn and hang out with smart, welcoming people.
During the first two years our hosts included NYU, Columbia, AOL and a handful of others. At this time there were about 1,800 members with Drew as the sole organizer who was ready to focus on other parts of his life, so he asked Wes McKinney and me to take over as organizers. This was after Drew renamed the group the Open Statistical Programming Meetup as to include other open source languages like Python, Julia, Go and SQL. I was incredibly thrilled to organize this group which meant so much to me.
That night was also my fifth date with Rebecca Martin. We originally met during Michael Kane’s talk about PubMed then reconnected about a year later. We went on to get married and have a kid together. The New York Times used the nerdiest closing line ever for our wedding announcement: “The couple met in New York in May 2014 at a meet-up about statistical programming organized by the groom.”
These past ten years have been a collection of amazing experiences for me where I got to learn from some of the world’s best experts and develop lasting relationships with great people. This community means so much to me and I very much look forward to its continued growth over the next decade.
On Pi Day this year I was giving a keynote talk at DataFest in Scotland, so we celebrated Pi Day a week later, on the 21st. While it wasn’t the exact date, there’s never a bad time to eat pizza and Pi Cake.
For pizza we went to the new Lombardi’s in Chelsea. They use an amazing electric oven instead of coal, so if you look closely you can tell the difference, but the pizza was still great and the decor was fantastic.
After four sold-out years in New York City, the R Conference made its debut in Washington DC to a sold-out crowd of data scientists at the Ronald Reagan Building on November 8th & 9th. Our speakers shared presentations on a variety of R-related topics.
R Superstars Mara Averick, Roger Peng and Emily Robinson
A hallmark of our R conferences is that the speakers hang out with all the attendees and these three were crowd favorites.
Michael Powell Brings R to the aRmy
Major Michael Powell describes how R has brought efficiency to the Army Intelligence and Security Command by getting analysts out of Excel and into the Tidyverse. “Let me turn those 8 hours into 8 seconds for you,” says Powell.
Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions
After autographing his book, Applied Predictive Modeling, for a lucky attendee, Max Kuhn explains how Equivocals can be applied to individual predictions in order to avoid reporting predictions when there is significant uncertainty.
NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting
Emily Robinson tweeted the most at the 2018 NYR conference, winning her a WASD mechanical keyboard and at DCR she came in second so we gave her a limited edition NYR hoodie.
Max Richman Shows How SQL and R can Co-Exist
Max Richman, wearing the same shirt he wore when he spoke at the first NYR, shows parallels between dplyr and SQL.
Michael Garris Tells the Story of the MNIST Dataset
Michael Garris was a member of the team that built the original MNIST dataset, which set the standard for handwriting image classification in the early 1990s. This talk may have been the first time the origin story was ever told.
R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health
Roger Peng shows us how air pollution levels has fallen over the past 50 years resulting in dramatic improvements in air quality and health (with help from R).
Kelly O’Briant Combining R with Serverless Computing
On the first day I challenged the audience to analyze the tweets from the conference and Malorie Hughes, a data scientist with NPR, designed a Twitter analytics dashboard to track the attendee with the most tweets with the hashtag #rstatsdc. Seth Wenchel won a WASD keyboard for the best tweeting. And we presented Malorie wit a DCR speaker mug.
Strong Showing from the #RLadies!
The #rladies group is growing year after year and it is great seeing them in force at NYR and DCR!
A common theme is that getting everyone setup on their individual computers is very difficult. No matter how many instructions I provide, there are always a good number of people without a proper environment. This can mean not using RStudio projects, not having the right packages installed, not downloading the data and sometimes not even installing R.
After many experiments I finally came upon a solution. For every class I teach I now create a skeleton project hosted on GitHub with instructions for setup.
The instructions (in the README) consist of three blocks of code.
Copying the project structure from the repo (no git required)
All the user has to do is copy and paste these three blocks of code into the R console and they have the exact same environment as the instructor and other students.
Using this process, 95% of my students are prepared for class.
The inspiration for this idea came from a fun coffee with Hadley Wickham and Jenny Bryan during a conference in New Zealand and the implementation is made possible thanks to the usethis package.
Automating the Setup
Now that I found a good way to get students started, I wanted to make it easier for me to setup the repo. So I created an R package called RepoGenerator and put it on CRAN.
The first step to using the package is to create a GitHub Personal Access Token (instructions are in the README). Then you build a data.frame listing datasets you want the students to download. The data.frame needs at least the following three columns.
Local: The name, not path, the file should have on disk
Remote: The URL where the data files are stored online
Mode: The mode needed to write the file to disk, ‘w’ for regular text files, ‘wb’ for binary files such as Excel or rds files
An example data.frame is available in the RepoGenerator package.
After that you define the packages you want your students to use. There can be as few or as many as you want. In addition to any packages you list, rprojroot and usethis are added so that the instructions in the new repo will be certain to work.
Now all you need to do is call the createRepo() function.
# the name to use for the repo and project
# the location on disk to build the project
# the data.frame listing data files for the user to download
# vector of packages the user should install
# the GitHub username to create the repo for
# the new repo's README has the name of who is organizing the class
# the name of the environment variable storing the GitHub Personal Access Token
After this you will have a new repo setup for your users to copy, including instructions.
Reducing setup issues at the start of a training can really improve the experience for everyone and allow you to get straight into teaching.
Please check it out and let me know how it works for you.
Excited to announce that i am waiting in a LINE for the WOMEN’s restroom at a tech conference! Thanks @RLadiesNYC, @RLadiesGlobal, and @nyhackr for the opportunity, wouldn’t be here without your support
Emily Robinson Shows There is More to the Tidyverse than Hadley
Emily Robinson, otherwise known as ERob, gave an excellent talk showing how the Tidyverse is so much more than just Hadley and that there are many people inspired by him to contribute in the Tidy way.
Sean Taylor Forecasted the Future with Prophet
Sean Taylor, former New Yorker and unrepentant Eagles fan, demonstrate his powerful R and Python, package Prophet, for forecasting time series data. Facebook open sourced his work so we could all benefit.
Hadley Wickham showed us how to get into the internals of R and figure out how to examine objects from a memory perspective.
Jennifer Hill Demonstrated Awesome Machine Learning Techniques for Causal Inference
Following her sold-out meetup appearance in March, Jennifer continued to push the boundaries of causal inference.
I Made the Authors of Caret and scitkit-learn Show That R and Python Can Get Along
While both Andreas and Max gave great individual talks, I made them pose for this peace-making photo.
David Robinson Got the Upper Hand in a Sibling Twitter Duel
Given only about 30 minutes notice, David put together an entire slideshow on how to livetweet and how to compete with your sibling.
In the End Emily Robinson Beat Her Brother For Best Tweeting
Despite David’s headstart Emily was the best tweeter (as calculated by Max Kuhn and Mara Averick) so she won the WASD Code mechanical keyboard with MX Cherry Clear switches.
Silent Auction of Data Paintings
Thomas Levine made paintings of famous datasets that we auctioned off with the proceeds supporting the R Foundation and the Free Software Foundation. The Robinson family very graciously chipped in and bought the painting of the Pizza Poll data for me! I’m still floored by this and in love with the painting.