The costs involved with a health insurance plan can be confusing, so I analyze the different options to find which plan is most cost effective.
My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan.
For families there are two main plans: Value and Advantage. The primary differences between the plans are the following:
Item | Description | Value Plan Amount | Advantage Plan Amount
Premiums | The amount we pay every other week in order to have insurance | $160 ($4,160 annually) | $240 ($6,240 annually)
Deductible | Amount we pay directly to health providers before the insurance starts covering costs | $1,000 | $800
Coinsurance | After the deductible is met, we pay this percentage of medical bills | 20% | 10%
Out-of-Pocket Maximum | This is the most we will have to pay to health providers in a year (premiums do not count toward this max) | $6,000 | $5,000
We put them into a tibble for use later.
# use tribble() to make a quick and dirty tibble
parameters <- tibble::tribble(
    ~Plan, ~Premiums, ~Deductible, ~Coinsurance, ~OOP_Maximum,
    'Value', 160*26, 1000, 0.2, 6000,
    'Advantage', 240*26, 800, 0.1, 5000
)
Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose.
This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation.
In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum.
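As a compact statement of that relationship, here is a sketch of the all-in cost in the form used by the cost function's code. The symbols are shorthand introduced here for illustration: \(S\) is annual healthcare spend, \(p\) the annual premiums, \(d\) the deductible, \(c\) the coinsurance rate and \(oop\) the out-of-pocket maximum.

```latex
\text{cost}(S) = p + \min\bigl(oop,\; d + c\,(S - d)\bigr)
```

The \(\min\) caps what is paid to providers at the out-of-pocket maximum, and the premiums are added on top because they do not count toward that cap. This expression is most meaningful once spend has passed the deductible.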
#' @title cost
#' @description Given healthcare spend and other parameters, calculate the actual cost to the user
#' @details Uses the formula above to calculate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum.
#' @author Jared P. Lander
#' @param spend A given amount of healthcare spending as a vector for multiple amounts
#' @param premiums The annual premiums for a given plan
#' @param deductible The deductible for a given plan
#' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum
#' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan
#' @return The total cost to the insured
#' @examples
#' cost(3000, 4160, 1000, .20, 6000)
#' cost(3000, 6240, 800, .10, 5000)
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    # spend is vectorized so we use pmin to get the min between oop_maximum and (deductible + coinsurance*(spend - deductible)) for each value of spend provided
    pmin(
        # we can never pay more than oop_maximum so that is one side
        oop_maximum,
        # if we are under oop_maximum for a given amount of spend,
        # this is the cost
        deductible + coinsurance*(spend - deductible)
    ) +
        # we add the premiums since that factors into our cost
        premiums
}
With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose.
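As a quick check of the function, here is a small sketch that evaluates both plans at a few spending levels. The cost() definition is repeated in condensed form so the block runs on its own, and the three spending levels are arbitrary values chosen for illustration.

```r
# condensed copy of cost() so this block is self-contained
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    pmin(oop_maximum, deductible + coinsurance*(spend - deductible)) + premiums
}

# illustrative spending levels (assumed for this example)
example_spend <- c(3000, 26000, 50000)

# total annual cost under each plan, using the plan parameters from the table
value_cost <- cost(example_spend, 160*26, 1000, 0.2, 6000)
advantage_cost <- cost(example_spend, 240*26, 800, 0.1, 5000)

# positive Difference means the Value plan is cheaper
data.frame(
    Spend=example_spend,
    Value=value_cost,
    Advantage=advantage_cost,
    Difference=advantage_cost - value_cost
)
```

At $3,000 of spend the Value plan costs $5,560 against $7,260 for the Advantage plan, while at $26,000 the Advantage plan comes out $600 cheaper.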
For the rest of the code we need these R packages.
We call our cost function on each amount of spend for the Value and Advantage plans.
spending <- spending %>%
    # use our function to calculate the cost for the Value plan
    mutate(Value=cost(Spend, 160*26, 1000, 0.2, 6000)) %>%
    # use our function to calculate the cost for the Advantage plan
    mutate(Advantage=cost(Spend, 240*26, 800, 0.1, 5000)) %>%
    # compute the difference in costs for each plan
    mutate(Difference=Advantage - Value) %>%
    # the winner for a given amount of spend is the cheaper plan
    mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value'))
The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of Advantage – Value where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column which has the corresponding colors.
Of course, plotting often makes it easier to see what is happening.
spending %>%
    select(Spend, Value, Advantage) %>%
    # put the plot in longer format so ggplot can set the colors
    gather(key=Plan, value=Cost, -Spend) %>%
    ggplot(aes(x=Spend, y=Cost, color=Plan)) +
    geom_line() +
    scale_color_brewer(type='qual', palette='Set1') +
    labs(x='Healthcare Spending', y='Out-of-Pocket Costs')
It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost.
spending %>%
    ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) +
    geom_hline(yintercept=0, linetype=2, color='grey50') +
    geom_line() +
    labs(x='Healthcare Spending', y='Difference in Out-of-Pocket Costs Between the Two Plans') +
    scale_color_brewer(type='qual', palette='Set1')
To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases.
The spending exceeds the point of maximum out-of-pocket spend for both plans
The spending does not exceed the point of maximum out-of-pocket spend for either plan
The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan
The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan
When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel so there will be no cross over point.
When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality.
To keep the equations smaller we use variables such as \(d_v\) for the Value plan deductible, \(c_a\) for the Advantage plan coinsurance and \(oop_v\) for the out-of-pocket maximum for the Value plan.
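For the case where neither plan has reached its out-of-pocket maximum, the algebra can be sketched as follows. In addition to the symbols already named, \(p_v\) and \(p_a\) stand for the annual premiums and \(S\) for the amount of spend; those three labels are introduced here for illustration.

```latex
\begin{aligned}
p_v + d_v + c_v (S - d_v) &= p_a + d_a + c_a (S - d_a) \\
S\,(c_v - c_a) &= p_a - p_v + d_a (1 - c_a) - d_v (1 - c_v) \\
S &= \frac{p_a - p_v + d_a (1 - c_a) - d_v (1 - c_v)}{c_v - c_a}
\end{aligned}
```

Plugging in the plan parameters gives \(S = (2080 + 720 - 800) / 0.1 = 20{,}000\).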
When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan.
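With the same shorthand (\(p_v\), \(p_a\) for annual premiums and \(S\) for spend, labels introduced here for illustration), this case can be sketched as:

```latex
\begin{aligned}
p_v + oop_v &= p_a + d_a + c_a (S - d_a) \\
S &= \frac{oop_v + p_v - p_a + c_a d_a - d_a}{c_a}
\end{aligned}
```

Plugging in the plan parameters gives \(S = (6000 + 4160 - 6240 + 80 - 800) / 0.1 = 32{,}000\). The reverse case, where the Advantage plan has maxed out but the Value plan has not, follows by swapping the subscripts.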
#' @title calculate_crossover_points
#' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other.
#' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers.
#' @author Jared P. Lander
#' @param premiums_1 The annual premiums for plan 1
#' @param deductible_1 The deductible for plan 1
#' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1
#' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1
#' @param premiums_2 The annual premiums for plan 2
#' @param deductible_2 The deductible for plan 2
#' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2
#' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2
#' @return The amount of spend at which point one plan becomes more expensive than the other
#' @examples
#' calculate_crossover_points(
#' 160*26, 1000, 0.2, 6000,
#' 240*26, 800, 0.1, 5000
#' )
calculate_crossover_points <- function(
    premiums_1, deductible_1, coinsurance_1, oop_maximum_1,
    premiums_2, deductible_2, coinsurance_2, oop_maximum_2
)
{
    # calculate the crossover before either has maxed out
    neither_maxed_out <- (premiums_2 - premiums_1 +
                              deductible_2*(1 - coinsurance_2) -
                              deductible_1*(1 - coinsurance_1)) /
        (coinsurance_1 - coinsurance_2)
    # calculate the crossover when one plan has maxed out but the other has not
    one_maxed_out <- (oop_maximum_1 +
                          premiums_1 - premiums_2 +
                          coinsurance_2*deductible_2 -
                          deductible_2) /
        coinsurance_2
    # calculate the crossover for the reverse
    other_maxed_out <- (oop_maximum_2 +
                            premiums_2 - premiums_1 +
                            coinsurance_1*deductible_1 -
                            deductible_1) /
        coinsurance_1
    # these are all possible points where the curves cross
    all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out)
    # now calculate the difference between the two plans to ensure that these are true crossover points
    all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) -
        cost(all_roots, premiums_2, deductible_2, coinsurance_2, oop_maximum_2)
    # only when the difference between plans is 0 are the curves truly crossing
    all_roots[all_differences == 0]
}
We then call the function with the parameters for both plans we are considering.
We see that the Advantage plan is only cheaper than the Value plan when spending between $20,000 and $32,000.
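As a cross-check on the closed-form solution, the crossover points can also be found numerically. The sketch below swaps in base R's uniroot() on the difference between the two cost curves, which is a different technique from the algebra above. The bracketing intervals are assumptions read off the difference plot, cost_difference() is a helper name made up for this example, and cost() is repeated in condensed form so the block runs on its own.

```r
# condensed copy of cost() so this block is self-contained
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    pmin(oop_maximum, deductible + coinsurance*(spend - deductible)) + premiums
}

# hypothetical helper: Advantage cost minus Value cost at a given spend
cost_difference <- function(spend)
{
    cost(spend, 240*26, 800, 0.1, 5000) - cost(spend, 160*26, 1000, 0.2, 6000)
}

# assumed brackets: the difference changes sign exactly once inside each one
lower_crossover <- uniroot(cost_difference, interval=c(10000, 26000))$root
upper_crossover <- uniroot(cost_difference, interval=c(26000, 40000))$root

c(lower_crossover, upper_crossover)
```

The two roots should land at roughly 20,000 and 32,000, agreeing with the band found analytically. A numeric search like this also avoids relying on an exact floating-point comparison with zero.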
The next question is whether our healthcare spending will fall in that narrow band between $20,000 and $32,000 where the Advantage plan is the cheaper option.
Probability of Spending
This part gets tricky. I’d like to figure out the probability of spending between $20,000 and $32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing.
Assumptions and calculations:
Healthcare spending follows a log-normal distribution
We will work with New York State data which is possibly different than New York City data
We know the mean for New York spending in 2014
We will use the accompanying annual growth rate to estimate mean spending in 2019
We have the national standard deviation for spending in 2009
In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009
We use the growth rate from before to estimate the New York standard deviation in 2019
We then take just New York spending for 2014 and multiply it by the corresponding growth rate.
ny_spend <- health_spend %>%
    # get just New York
    filter(State_Name == 'New York') %>%
    # this row holds overall spending information
    filter(Item == 'Personal Health Care ($)') %>%
    # we only need a few columns
    select(Y2014, Growth=Average_Annual_Percent_Growth) %>%
    # we have to calculate the spending for 2019 by accounting for growth
    # after converting it to a percentage
    mutate(Y2019=Y2014*(1 + (Growth/100))^5)
We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation and assume the same annual growth rate as the mean. Recall, we can multiply the standard deviation by a constant.
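The scaling step can be sketched in symbols. The notation here is introduced for illustration only: \(k\) is the ratio of the New York mean to the national mean, \(\sigma_{US,2009}\) is the national standard deviation, \(g\) is the annual growth rate as a fraction, and \(n\) is the number of years of growth applied (ten, if growth is compounded from 2009 to 2019).

```latex
\begin{aligned}
k &= \frac{\mu_{NY}}{\mu_{US}} \approx 1.4187464 \\
\operatorname{sd}(kX) &= k \operatorname{sd}(X) \quad \text{for } k > 0 \\
\hat{\sigma}_{NY,2019} &\approx k \, \sigma_{US,2009} \, (1 + g)^{n}
\end{aligned}
```

The middle line is the property being recalled: multiplying a random variable by a positive constant multiplies its standard deviation by the same constant.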
My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate.
So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective, meaning we have a 97.65% probability that the Value plan will cost less over the course of a year.
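For readers who want to try this kind of band probability with their own numbers, here is a minimal sketch using base R's plnorm(). The mean and standard deviation are hypothetical placeholders, not the estimates from this post, and converting them to log-scale parameters by moment matching is an assumption of the sketch, so the result will not reproduce the 2.35% figure.

```r
# hypothetical inputs for illustration only (not the post's estimates)
mean_spend <- 12000
sd_spend <- 15000

# moment matching (an assumption of this sketch): pick meanlog and sdlog
# so that the log-normal distribution has the mean and sd given above
sdlog <- sqrt(log(1 + (sd_spend/mean_spend)^2))
meanlog <- log(mean_spend) - sdlog^2/2

# probability that spending lands in the band where the Advantage plan is cheaper
prob_in_band <- plnorm(32000, meanlog, sdlog) - plnorm(20000, meanlog, sdlog)
prob_in_band
```

Working with the cumulative distribution function gives the probability of the whole band directly, without summing point-wise density estimates.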
We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We multiply each of those probabilities against their corresponding dollar amounts. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1000 to put it back on the dollar scale. Mathematically it looks like this.
The following code calculates the expected cost for each plan.
# calculate the point-wise estimated probabilities of the healthcare spending
# based on a log-normal distribution with the appropriate mean and standard deviation
# compute the expected cost for each plan
# and the difference between them
# exponentiate the numbers so they are on the original scale
# the spending data is in increments of 1000
# so multiply by 1000 to get them on the dollar scale
mutate_each(funs=~ .x * 1000)
This shows that overall the Value plan is cheaper by about $1,324 on average.
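For comparison, one standard way to approximate an expected cost is to average the cost function over evenly spaced quantiles of the spending distribution. The sketch below does that with qlnorm(); it is not the calculation used in this post, and the log-normal parameters are hypothetical placeholders, so it will not reproduce the $1,324 figure.

```r
# condensed copy of cost() so this block is self-contained
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
    pmin(oop_maximum, deductible + coinsurance*(spend - deductible)) + premiums
}

# hypothetical log-normal parameters on the log-dollar scale, for illustration only
meanlog <- 9
sdlog <- 1

# midpoints of equal-probability slices of the distribution
n <- 100000
u <- (seq_len(n) - 0.5)/n
spend_quantiles <- qlnorm(u, meanlog, sdlog)

# averaging cost over equal-probability quantiles approximates the expected cost
expected_value <- mean(cost(spend_quantiles, 160*26, 1000, 0.2, 6000))
expected_advantage <- mean(cost(spend_quantiles, 240*26, 800, 0.1, 5000))

# positive means the Value plan is cheaper on average under these assumptions
expected_advantage - expected_value
```

Because every slice carries the same probability, a plain mean is enough and no explicit density weights are needed.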
We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about $600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%.
So unless our spending will be between $20,000 and $32,000, which it likely will not be, it is a better idea to choose the Value plan.
Since the Value plan is so likely to be cheaper than the Advantage plan I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some parts of the Advantage plan are lower than the Value plan, such as the deductible, coinsurance and out-of-pocket maximum. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely that would be. So they are hedging against a low probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.)
In the end, the Value plan is very likely going to be cheaper than the Advantage plan.
Try it at Home
I created a Shiny app to allow users to plug in the numbers for their own plans. It is rudimentary, but it gives a sense for the relative costs of different plans.
After four sold-out years in New York City, the R Conference made its debut in Washington DC to a sold-out crowd of data scientists at the Ronald Reagan Building on November 8th & 9th. Our speakers shared presentations on a variety of R-related topics.
R Superstars Mara Averick, Roger Peng and Emily Robinson
A hallmark of our R conferences is that the speakers hang out with all the attendees and these three were crowd favorites.
Michael Powell Brings R to the aRmy
Major Michael Powell describes how R has brought efficiency to the Army Intelligence and Security Command by getting analysts out of Excel and into the Tidyverse. “Let me turn those 8 hours into 8 seconds for you,” says Powell.
Max Kuhn Explains the Applications of Equivocals to Apply Levels of Certainty to Predictions
After autographing his book, Applied Predictive Modeling, for a lucky attendee, Max Kuhn explains how Equivocals can be applied to individual predictions in order to avoid reporting predictions when there is significant uncertainty.
NYR and DCR Speaker Emily Robinson Getting an NYR Hoodie for her Awesome Tweeting
Emily Robinson tweeted the most at the 2018 NYR conference, winning a WASD mechanical keyboard. At DCR she came in second, so we gave her a limited edition NYR hoodie.
Max Richman Shows How SQL and R can Co-Exist
Max Richman, wearing the same shirt he wore when he spoke at the first NYR, shows parallels between dplyr and SQL.
Michael Garris Tells the Story of the MNIST Dataset
Michael Garris was a member of the team that built the original MNIST dataset, which set the standard for handwriting image classification in the early 1990s. This talk may have been the first time the origin story was ever told.
R Stats Luminary Roger Peng Explains Relationship Between Air Pollution and Public Health
Roger Peng shows us how air pollution levels have fallen over the past 50 years, resulting in dramatic improvements in air quality and health (with help from R).
Kelly O’Briant Combining R with Serverless Computing
On the first day I challenged the audience to analyze the tweets from the conference and Malorie Hughes, a data scientist with NPR, designed a Twitter analytics dashboard to track the attendee with the most tweets with the hashtag #rstatsdc. Seth Wenchel won a WASD keyboard for the best tweeting. And we presented Malorie with a DCR speaker mug.
Strong Showing from the #RLadies!
The #rladies group is growing year after year and it is great seeing them in force at NYR and DCR!
You might be asking yourself, “How was the 2016 New York R Conference?”
Well, if we had to sum it up in one picture, it would look a lot like this (thank you to Drew Conway for the slide & delivering the battle cry for data science in NYC):
Our 2nd annual, sold-out New York R Conference was back this year on April 8th & 9th at Work-Bench. Co-hosted with our friends at Lander Analytics, this year’s conference was bigger and better than ever, with over 250 attendees, and speakers from Airbnb, AT&T, Columbia University, eBay, Etsy, RStudio, Socure, and Tamr. In case you missed the conference or want to relive the excitement, all of the talks and slides are now live on the R Conference website.
With 30 talks, each 20 minutes long and two forty-minute keynotes, the topics of the presentations were just as diverse as the speakers. Vivian Peng gave an emotional talk on data visualization using non-visual senses and “The Feels.” Bryan Lewis measured the shadows of audience members to demonstrate the pros and cons of projection methods, and Daniel Lee talked about life, love, Stan, and March Madness. But, even with 32 presentations from a diverse selection of speakers, two dominant themes emerged: 1) Community and 2) Writing better code.
Given the amazing caliber of speakers and attendees, community was on everyone’s mind from the start. Drew Conway emoted the past, present, and future of data science in NYC, and spoke to the dangers of tearing down the tent we built. Joe Rickert from Microsoft discussed the R Consortium and how to become involved. Wes McKinney talked about community efforts in improving interoperability between data science languages with the new Feather data frame file format under the Apache Arrow project. Elena Grewal discussed how Airbnb’s data science team made changes to the hiring process to increase the number of female hires, and Andrew Gelman even talked about how your political opinions are shaped by those around you in his talk about Social Penumbras.
Writing better code also proved to be a dominant theme throughout the two day conference. Dan Chen of Lander Analytics talked about implementing tests in R. Similarly, Neal Richardson and Mike Malecki of Crunch.io talked about how they learned to stop munging and love tests, and Ben Lerner discussed how to optimize Python code using profilers and Cython. The perfect intersection of themes came from Bas van Schaik of Semmle who discussed how to use data science to write better code by treating code as data. While everyone had some amazing insights, these were our top five highlights:
JJ Allaire Releases a New Preview of RStudio
JJ Allaire, the second speaker of the conference, got the crowd fired up by announcing new features of RStudio and new packages. Particularly exciting was bookdown for authoring large documents, R Notebooks for interactive Markdown files and shared sessions so multiple people can code together from separate computers.
Andrew Gelman Discusses the Political Impact of the Social Penumbra
As always, Dr. Andrew Gelman wowed the crowd with his breakdown of how political opinions are shaped by those around us. He utilized his trademark visualizations and wit to convey the findings of complex models.
Vivian Peng Helps Kick off the Second Day with a Punch to the Gut
On the morning of the second day of the conference, Vivian Peng gave a heartfelt talk on using data visualization and non-visual senses to drive emotional reaction and shape public opinion on everything from the Syrian civil war to drug resistance statistics.
Ivor Cribben Studies Brain Activity with Time Varying Networks
University of Alberta Professor Ivor Cribben demonstrated his techniques for analyzing fMRI data. His use of network graphs, time series and extremograms brought an academic rigor to the conference.
Elena Grewal Talks About Scaling Data Science at Airbnb
After a jam-packed 2 full days, Elena Grewal helped wind down the conference with a thoughtful introspection on how Airbnb has grown their data science team from 5 to 70 people, with a focus on increasing diversity and eliminating bias in the hiring process.
My time slot was up against Nate Silver so I didn’t expect many people to attend. Much to my surprise when I entered the room every seat was taken, people were lining the walls and sitting in the aisles.
My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of which receiver was targeted. It is available at the talks section of my site.
This was the first time we utilized three instructors (as opposed to a main instructor and assistants which we often use for large classes) and it led to an amazing dynamic. Bob laid the theoretical foundation for Markov chain Monte Carlo (MCMC), explaining both with math and geometry, and discussed the computational considerations of performing simulation draws. Daniel led the participants through hands-on examples with Stan, covering everything from how to describe a model, to efficient computation to debugging. Andrew gave his usual crowd-dazzling performance, using previous work as case studies of when and how to use Bayesian methods.
It was an intensive three days of training with an incredible amount of information. Everyone walked away knowing a lot more about Bayes, MCMC and Stan, eager to try out their new skills, and with an autographed copy of Andrew’s book, BDA3.
A big help, as always, was Daniel Chen, who put in so much effort making the class run smoothly, from securing the space to physically moving furniture and running all the technology.
On April 24th and 25th Lander Analytics and Work-Bench co-organized the (sold-out) inaugural New York R Conference. It was an amazing weekend of nerding out over R and data, with a little Python and Julia mixed in for good measure. People from all across the R community gathered to see rockstars discuss their latest and greatest efforts.
However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience.
Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice.
After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices I found that the mean price was $2.33 with a standard deviation of $0.52 and a median price of $2.45. The base subway fare is $2.50 but is actually $2.38 after the 5% bonus for putting at least $5 on a MetroCard.
So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride ($2.38).
Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represent the price of a subway ride with and without the bonus. We see that the price of a slice in Manhattan lines up almost exactly with the subway fare.
MenuPages even broke down Queens Neighborhoods so we can have a more specific plot.
The code for downloading the menus and the calculations is after the break.
After two years of writing and editing and proof reading and checking my book, R for Everyone is finally out!
There are so many people who helped me along the way, especially my editor Debra Williams, production editor Caroline Senay and the man who recruited me to write it in the first place, Paul Dix. Even more people helped throughout the long process, but with so many to mention I’ll leave that in the acknowledgements page.