### The costs involved with a health insurance plan can be confusing so I perform an analysis of different options to find which plan is most cost effective

My wife and I recently brought a new R programmer into our family so we had to update our health insurance. Becky is a researcher in neuroscience and psychology at NYU so we decided to choose an NYU insurance plan.

For families there are two main plans: Value and Advantage. The primary differences between the plans are the following:

Item Explanation Value Plan Amount Advantage Plan Amount
Bi-Weekly Premiums The amount we pay every other week in order to have insurance $160 ($4,160 annually) $240 ($6,240 annually)
Deductible Amount we pay directly to health providers before the insurance starts covering costs $1,000$800
Coinsurance After the deductible is met, we pay this percentage of medical bills 20% 10%
Out-of-Pocket Maximum This is the most we will have to pay to health providers in a year (premiums do not count toward this max) $6,000$5,000

We put them into a tibble for use later.

# use tribble() to make a quick and dirty tibble
parameters <- tibble::tribble(
'Value', 160*26, 1000, 0.2, 6000,
)

Other than these cost differences, there is not any particular benefit of either plan over the other. That means whichever plan is cheaper is the best to choose.

This blog post walks through the steps of evaluating the plans to figure out which to select. Code is included so anyone can repeat, and improve on, the analysis for their given situation.

# Cost

In order to figure out which plan to select we need to figure out the all-in cost, which is a function of how much we spend on healthcare in a year (we have to estimate our annual spending) and the aforementioned premiums, deductible, coinsurance and out-of-pocket maximum.

$\text{cost} = f(\text{spend}; \text{premiums}, \text{deductible}, \text{coinsurance}, \text{oop_maximum}) = \\ \text{min}(\text{oop_maximum}, \text{deductible} + \text{coinsurance}*(\text{spend}-\text{deductible}))+\text{premiums}$

This can be written as an R function like this.

#' @title cost
#' @description Given healthcare spend and other parameters, calculate the actual cost to the user
#' @details Uses the formula above to caluclate total costs given a certain level of spending. This is the premiums plus either the out-of-pocket maximum, the actual spend level if the deductible has not been met, or the amount of the deductible plus the coinsurance for spend above the deductible but below the out-of-pocket maximum.
#' @author Jared P. Lander
#' @param spend A given amount of healthcare spending as a vector for multiple amounts
#' @param deductible The deductible for a given plan
#' @param coinsurance The coinsurance percentage for spend beyond the deductible but below the out-of-pocket maximum
#' @param oop_maximum The maximum amount of money (not including premiums) that the insured will pay under a given plan
#' @return The total cost to the insured
#' @examples
#' cost(3000, 4160, 1000, .20, 6000)
#' cost(3000, 6240, 800, .10, 5000)
#'
cost <- function(spend, premiums, deductible, coinsurance, oop_maximum)
{
# spend is vectorized so we use pmin to get the min between oop_maximum and (deductible + coinsurance*(spend - deductible)) for each value of spend provided
pmin(
# we can never pay more than oop_maximum so that is one side
oop_maximum,
# if we are under oop_maximum for a given amount of spend,
# this is the cost
pmin(spend, deductible) + coinsurance*pmax(spend - deductible, 0)
) +
}

With this function we can see if one plan is always, or mostly, cheaper than the other plan and that’s the one we would choose.

# R Packages

For the rest of the code we need these R packages.

library(dplyr)
library(ggplot2)
library(tidyr)
library(formattable)
library(readxl)

# Spending

To see our out-of-pocket cost at varying levels of healthcare spend we build a grid in $1,000 increments from$1,000 to $70,000. spending <- tibble::tibble(Spend=seq(1000, 70000, by=1000)) We call our cost function on each amount of spend for the Value and Advantage plans. spending <- spending %>% # use our function to calcuate the cost for the value plan mutate(Value=cost( spend=Spend, premiums=parameters$Premiums[1],
deductible=parameters$Deductible[1], coinsurance=parameters$Coinsurance[1],
oop_maximum=parameters$OOP_Maximum[1] ) ) %>% # use our function to calcuate the cost for the Advantage plan mutate(Advantage=cost( spend=Spend, premiums=parameters$Premiums[2],
deductible=parameters$Deductible[2], coinsurance=parameters$Coinsurance[2],
oop_maximum=parameters$OOP_Maximum[2] ) ) %>% # compute the difference in costs for each plan mutate(Difference=Advantage-Value) %>% # the winner for a given amount of spend is the cheaper plan mutate(Winner=if_else(Advantage < Value, 'Advantage', 'Value')) The results are in the following table, showing every other row to save space. The Spend column is a theoretical amount of spending with a red bar giving a visual sense for the increasing amounts. The Value and Advantage columns are the corresponding overall costs of the plans for the given amount of Spend. The Difference column is the result of AdvantageValue where positive numbers in blue mean that the Value plan is cheaper while negative numbers in red mean that the Advantage plan is cheaper. This is further indicated in the Winner column which has the corresponding colors. Spend Value Advantage Difference Winner$2,000 $5,360$7,160 1800 Value
$4,000$5,760 $7,360 1600 Value$6,000 $6,160$7,560 1400 Value
$8,000$6,560 $7,760 1200 Value$10,000 $6,960$7,960 1000 Value
$12,000$7,360 $8,160 800 Value$14,000 $7,760$8,360 600 Value
$16,000$8,160 $8,560 400 Value$18,000 $8,560$8,760 200 Value
$20,000$8,960 $8,960 0 Value$22,000 $9,360$9,160 -200 Advantage
$24,000$9,760 $9,360 -400 Advantage$26,000 $10,160$9,560 -600 Advantage
$28,000$10,160 $9,760 -400 Advantage$30,000 $10,160$9,960 -200 Advantage
$32,000$10,160 $10,160 0 Value$34,000 $10,160$10,360 200 Value
$36,000$10,160 $10,560 400 Value$38,000 $10,160$10,760 600 Value
$40,000$10,160 $10,960 800 Value$42,000 $10,160$11,160 1000 Value
$44,000$10,160 $11,240 1080 Value$46,000 $10,160$11,240 1080 Value
$48,000$10,160 $11,240 1080 Value$50,000 $10,160$11,240 1080 Value
$52,000$10,160 $11,240 1080 Value$54,000 $10,160$11,240 1080 Value
$56,000$10,160 $11,240 1080 Value$58,000 $10,160$11,240 1080 Value
$60,000$10,160 $11,240 1080 Value$62,000 $10,160$11,240 1080 Value
$64,000$10,160 $11,240 1080 Value$66,000 $10,160$11,240 1080 Value
$68,000$10,160 $11,240 1080 Value$70,000 $10,160$11,240 1080 Value

Of course, plotting often makes it easier to see what is happening.

spending %>%
# put the plot in longer format so ggplot can set the colors
gather(key=Plan, value=Cost, -Spend) %>%
ggplot(aes(x=Spend, y=Cost, color=Plan)) +
geom_line(size=1) +
scale_x_continuous(labels=scales::dollar) +
scale_y_continuous(labels=scales::dollar) +
scale_color_brewer(type='qual', palette='Set1') +
labs(x='Healthcare Spending', y='Out-of-Pocket Costs') +
theme(
legend.position='top',
axis.title=element_text(face='bold')
)

It looks like there is only a small window where the Advantage plan is cheaper than the Value plan. This will be more obvious if we draw a plot of the difference in cost.

spending %>%
ggplot(aes(x=Spend, y=Difference, color=Winner, group=1)) +
geom_hline(yintercept=0, linetype=2, color='grey50') +
geom_line(size=1) +
scale_x_continuous(labels=scales::dollar) +
scale_y_continuous(labels=scales::dollar) +
labs(
x='Healthcare Spending',
y='Difference in Out-of-Pocket Costs Between the Two Plans'
) +
scale_color_brewer(type='qual', palette='Set1') +
theme(
legend.position='top',
axis.title=element_text(face='bold')
)

To calculate the exact cutoff points where one plan becomes cheaper than the other plan we have to solve for where the two curves intersect. Due to the out-of-pocket maximums the curves are non-linear so we need to consider four cases.

1. The spending exceeds the point of maximum out-of-pocket spend for both plans
2. The spending does not exceed the point of maximum out-of-pocket spend for either plan
3. The spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan
4. The spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan

When the spending exceeds the point of maximum out-of-pocket spend for both plans the curves are parallel so there will be no cross over point.

When the spending does not exceed the point of maximum out-of-pocket spend for either plan we set the cost calculations (not including the out-of-pocket maximum) for each plan equal to each other and solve for the amount of spend that creates the equality.

To keep the equations smaller we use variables such as $$d_v$$ for the Value plan deductible, $$c_a$$ for the Advantage plan coinsurance and $$oop_v$$ for the out-of-pocket maximum for the Value plan.

$d_v + c_v(S – d_v) + p_v = d_a + c_a(S – d_a) + p_a \\ c_v(S – D_v) – c_a(S-d_a) = d_a – d_v + p_a – p_v \\ c_vS – c_vd_v – c_aS + c_ad_a = d_a – d_v + p_a – p_v \\ S(c_v – c_a) = d_a – c_ad_a – d_v + c_vd_v + p_a – p_v \\ S(c_v – c_a) = d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v \\ S = \frac{d_a(1 – c_a) – d_v(1 – c_v) + p_a – p_v}{(c_v – c_a)}$

When the spending exceeds the point of maximum out-of-pocket spend for the Value plan but not the Advantage plan, we set the out-of-pocket maximum plus premiums for the Value plan equal to the cost calculation of the Advantage plan.

$oop_v + p_v = d_a + c_a(S – d_a) + p_a \\ d_a + c_a(S – d_a) + p_a = oop_v + p_v \\ c_aS – c_ad_a = oop_v + p_v – p_a – d_a \\ c_aS = oop_v + p_v – p_a + c_ad_a – d_a \\ S = \frac{oop_v + p_v – p_a + c_ad_a – d_a}{c_a}$

When the spending exceeds the point of maximum out-of-pocket spend for the Advantage plan but not the Value plan, the solution is just the opposite of the previous equation.

$oop_a + p_a = d_v + c_v(S – d_v) + p_v \\ d_v + c_v(S – d_v) + p_v = oop_a + p_a \\ c_vS – c_vd_v = oop_a + p_a – p_v – d_v \\ c_vS = oop_a + p_a – p_v + c_vd_v – d_v \\ S = \frac{oop_a + p_a – p_v + c_vd_v – d_v}{c_v}$

As an R function it looks like this.

#' @title calculate_crossover_points
#' @description Given healthcare parameters for two plans, calculate when one plan becomes more expensive than the other.
#' @details Calculates the potential crossover points for different scenarios and returns the ones that are true crossovers.
#' @author Jared P. Lander
#' @param deductible_1 The deductible plan 1
#' @param coinsurance_1 The coinsurance percentage for spend beyond the deductible for plan 1
#' @param oop_maximum_1 The maximum amount of money (not including premiums) that the insured will pay under plan 1
#' @param deductible_2 The deductible plan 2
#' @param coinsurance_2 The coinsurance percentage for spend beyond the deductible for plan 2
#' @param oop_maximum_2 The maximum amount of money (not including premiums) that the insured will pay under plan 2
#' @return The amount of spend at which point one plan becomes more expensive than the other
#' @examples
#' calculate_crossover_points(
#' 160, 1000, 0.2, 6000,
#' 240, 800, 0.1, 5000
#' )
#'
calculate_crossover_points <- function(
)
{
# calculate the crossover before either has maxed out
deductible_2*(1 - coinsurance_2) -
deductible_1*(1 - coinsurance_1)) /
(coinsurance_1 - coinsurance_2)

# calculate the crossover when one plan has maxed out but the other has not
one_maxed_out <- (oop_maximum_1 +
coinsurance_2*deductible_2 -
deductible_2) /
coinsurance_2

# calculate the crossover for the reverse
other_maxed_out <- (oop_maximum_2 +
coinsurance_1*deductible_1 -
deductible_1) /
coinsurance_1

# these are all possible points where the curves cross
all_roots <- c(neither_maxed_out, one_maxed_out, other_maxed_out)

# now calculate the difference between the two plans to ensure that these are true crossover points
all_differences <- cost(all_roots, premiums_1, deductible_1, coinsurance_1, oop_maximum_1) -

# only when the difference between plans is 0 are the curves truly crossing
all_roots[all_differences == 0]
}

We then call the function with the parameters for both plans we are considering.

crossovers <- calculate_crossover_points(
parameters$Premiums[1], parameters$Deductible[1], parameters$Coinsurance[1], parameters$OOP_Maximum[1],
parameters$Premiums[2], parameters$Deductible[2], parameters$Coinsurance[2], parameters$OOP_Maximum[2]
)

crossovers
## [1] 20000 32000

We see that the Advantage plan is only cheaper than the Value plan when spending between $20,000 and$32,000.

The next question is will our healthcare spending fall in that narrow band between $20,000 and$32,000 where the Advantage plan is the cheaper option?

# Probability of Spending

This part gets tricky. I’d like to figure out the probability of spending between $20,000 and$32,000. Unfortunately, it is not easy to find healthcare spending data due to the opaque healthcare system. So I am going to make a number of assumptions. This will likely violate a few principles, but it is better than nothing.

Assumptions and calculations:

• Healthcare spending follows a log-normal distribution
• We will work with New York State data which is possibly different than New York City data
• We know the mean for New York spending in 2014
• We will use the accompanying annual growth rate to estimate mean spending in 2019
• We have the national standard deviation for spending in 2009
• In order to figure out the standard deviation for New York, we calculate how different the New York mean is from the national mean as a multiple, then multiply the national standard deviation by that number to approximate the New York standard deviation in 2009
• We use the growth rate from before to estimate the New York standard deviation in 2019

First, we calculate the mean. The Centers for Medicare & Medicaid Services has data on total and per capita medical expenditures by state from 1991 to 2014 and includes the average annual percentage growth. Since the data are bundled in a zip with other files, I posted them on my site for easy access.

spend_data_url <- 'https://jaredlander.com/data/healthcare_spending_per_capita_1991_2014.csv'
health_spend <- read_csv(spend_data_url)

We then take just New York spending for 2014 and multiply it by the corresponding growth rate.

ny_spend <- health_spend %>%
# get just New York
filter(State_Name == 'New York') %>%
# this row holds overall spending information
filter(Item == 'Personal Health Care ($)') %>% # we only need a few columns select(Y2014, Growth=Average_Annual_Percent_Growth) %>% # we have to calculate the spending for 2019 by accounting for growth # after converting it to a percentage mutate(Y2019=Y2014*(1 + (Growth/100))^5) ny_spend Y2014 Growth Y2019 9778 5 12479.48 The standard deviation is trickier. The best I can find was the standard deviation on the national level in 2009. In 2013 the Centers for Medicare & Medicaid Services wrote in Volume 3, Number 4 of Medicare & Medicaid Research Review an article titled Modeling Per Capita State Health Expenditure Variation: State-Level Characteristics Matter. Exhibit 2 shows that the standard deviation of healthcare spending was$1,241 for the entire country in 2009. We need to estimate the New York standard deviation from this and then account for growth into 2019.

Next, we figure out the difference between the New York State spending mean and the national mean as a multiple.

nation_spend <- health_spend %>%
filter(Item == 'Personal Health Care ($)') %>% filter(Region_Name == 'United States') %>% pull(Y2009) ny_multiple <- ny_spend$Y2014/nation_spend

ny_multiple
## [1] 1.418746

We see that the New York average is 1.4187464 times the national average. So we multiply the national standard deviation from 2009 by this amount to estimate the New York State standard deviation and assume the same annual growth rate as the mean. Recall, we can multiply the standard deviation by a constant.

\begin{align} \text{var}(x*c) &= c^2*\text{var}(x) \\ \text{sd}(x*c) &= c*\text{sd}(x) \end{align}

ny_spend <- ny_spend %>%
mutate(SD2019=1241*ny_multiple*(1 + (Growth/100))^10)

ny_spend
Y2014 Growth Y2019 SD2019
9778 5 12479.48 2867.937

My original assumption was that spending would follow a normal distribution, but New York’s resident agricultural economist, JD Long, suggested that the spending distribution would have a floor at zero (a person cannot spend a negative amount) and a long right tail (there will be many people with lower levels of spending and a few people with very high levels of spending), so a log-normal distribution seems more appropriate.

$\text{spending} \sim \text{lognormal}(\text{log}(12479), \text{log}(2868)^2)$

Visualized it looks like this.

draws <- tibble(
Value=rlnorm(
n=1200,
meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)
)
)

ggplot(draws, aes(x=Value)) + geom_density() + xlim(0, 75000)

We can see that there is a very long right tail which means there are many low values and few high values.

Then the probability of spending between $20,000 and$32,000 can be calculated with plnorm().

plnorm(crossovers[2], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)) -
plnorm(crossovers[1], meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019))
## [1] 0.02345586

So we only have a 2.35% probability of our spending falling in that band where the Advantage plan is more cost effective. Meaning we have a 97.65% probability that the Value plan will cost less over the course of a year.

We can also calculate the expected cost under each plan. We do this by first calculating the probability of spending each (thousand) dollar amount (since the log-normal is a continuous distribution this is an estimated probability). We multiply each of those probabilities against their corresponding dollar amounts. Since the distribution is log-normal we need to exponentiate the resulting number. The data are on the thousands scale, so we multiply by 1000 to put it back on the dollar scale. Mathematically it looks like this.

$\mathbb{E}_{\text{Value}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Value}} \right\} \\ \mathbb{E}_{\text{Advantage}} \left[ \text{cost} \right] = 1000*\text{exp} \left\{ \sum p(\text{spend})*\text{cost}_{\text{Advantage}} \right\}$

The following code calculates the expected cost for each plan.

spending %>%
# calculate the point-wise estimated probabilities of the healthcare spending
# based on a log-normal distribution with the appropriate mean and standard deviation
mutate(
SpendProbability=dlnorm(
Spend,
meanlog=log(ny_spend$Y2019), sdlog=log(ny_spend$SD2019)
)
) %>%
# compute the expected cost for each plan
# and the difference between them
summarize(
ValueExpectedCost=sum(Value*SpendProbability),
ExpectedDifference=sum(Difference*SpendProbability)
) %>%
# exponentiate the numbers so they are on the original scale
mutate_each(funs=exp) %>%
# the spending data is in increments of 1000
# so multiply by 1000 to get them on the dollar scale
mutate_each(funs=~ .x * 1000)
5422.768 7179.485 1323.952

This shows that overall the Value plan is cheaper by about $1,324 dollars on average. # Conclusion We see that there is a very small window of healthcare spending where the Advantage plan would be cheaper, and at most it would be about$600 cheaper than the Value plan. Further, the probability of falling in that small window of savings is just 2.35%.

So unless our spending will be between $20,000 and$32,000, which it likely will not be, it is a better idea to choose the Value plan.

Since the Value plan is so likely to be cheaper than the Advantage plan I wondered who would pick the Advantage plan. Economist Jon Hersh invokes behavioral economics to explain why people may select the Advantage plan. Some parts of the Advantage plan are lower than the Value plan, such as the deductible, coinsurance and out-of-pocket maximum. People see that under certain circumstances the Advantage plan would save them money and are enticed by that, not realizing how unlikely that would be. So they are hedging against a low probability situation. (A consideration I have not accounted for is family size. The number of members in a family can have a big impact on the overall spend and whether or not it falls into the narrow band where the Advantage plan is cheaper.)

In the end, the Value plan is very likely going to be cheaper than the Advantage plan.

# Try it at Home

I created a Shiny app to allow users to plug in the numbers for their own plans. It is rudimentary, but it gives a sense for the relative costs of different plans.

# Thanks

A big thanks to Jon Hersh, JD Long, Kaz Sakamoto, Rebecca Martin and Adam Hogan for reviewing this post.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

A friend recently posted the following the problem:

There are 10 green balls, 20 red balls, and 25 blues balls in a a jar. I choose a ball at random. If I choose a green then I take out all the green balls, if i choose a red ball then i take out all the red balls, and if I choose, a blue ball I take out all the blue balls, What is the probability that I will choose a red ball on my second try?

The math works out fairly easily. It’s the probability of first drawing a green ball AND then drawing a red ball, OR the probability of drawing a blue ball AND then drawing a red ball.

$\frac{10}{10+20+25} * \frac{20}{20+25} + \frac{25}{10+20+25} * \frac{20}{10+20} = 0.3838$

But I always prefer simulations over probability so let’s break out the R code like we did for the Monty Hall Problem and calculating lottery odds.  The results are after the break.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Michael Malecki recently shared a link to a Business Insider article that discussed the Monty Hall Problem.

The problem starts with three doors, one of which has a car and two of which have a goat. You choose one door at random and then the host reveals one door (not the one you chose) that holds a goat. You can then choose to stick with your door or choose the third, remaining door.

Probability theory states that people who switch win the car two-thirds of the time and those who don’t switch only win one-third of time.

But people often still do not believe they should switch based on the probability argument alone. So let’s run some simulations.

This function randomly assigns goats and cars behind three doors, chooses a door at random, reveals a goat door, then either switches doors or does not.

monty <- function(switch=TRUE)
{
# randomly assign goats and cars
doors <- sample(x=c("Car", "Goat", "Goat"), size=3, replace=FALSE)

# randomly choose a door
doorChoice <- sample(1:3, size=1)

# get goat doors
goatDoors <- which(doors == "Goat")
# show a door with a goat
goatDoor <- goatDoors[which(goatDoors != doorChoice)][1]

if(switch)
# if we are switching choose the other remaining door
{
return(doors[-c(doorChoice, goatDoor)])
}else
# otherwise keep the current door
{
return(doors[doorChoice])
}
}


Now we simulate switching 10,000 times and not switching 10,0000 times

withSwitching <- replicate(n = 10000, expr = monty(switch = TRUE), simplify = TRUE)
withoutSwitching <- replicate(n = 10000, expr = monty(switch = FALSE), simplify = TRUE)


## [1] "Goat" "Car"  "Car"  "Goat" "Car"  "Goat"

head(withoutSwitching)

## [1] "Goat" "Car"  "Car"  "Car"  "Car"  "Car"


mean(withSwitching == "Car")

## [1] 0.6678

mean(withoutSwitching == "Car")

## [1] 0.3408


Plotting the results really shows the difference.

require(ggplot2)

## Loading required package: ggplot2

require(scales)

## Loading required package: scales

qplot(withSwitching, geom = "bar", fill = withSwitching) + scale_fill_manual("Prize",
values = c(Car = muted("blue"), Goat = "orange")) + xlab("Switch") + ggtitle("Monty Hall with Switching")


qplot(withoutSwitching, geom = "bar", fill = withoutSwitching) + scale_fill_manual("Prize",
values = c(Car = muted("blue"), Goat = "orange")) + xlab("Don't Switch") +
ggtitle("Monty Hall without Switching")


(How are these colors? I’m trying out some new combinations.)

This clearly shows that switching is the best strategy.

The New York Times has a nice simulator that lets you play with actual doors.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Thanks to Rachel Schutt, who I’m teaching with at Columbia, and Cathy O’Neil from MathBabe I had the opportunity to go on TV and talk about the statistics of tonight’s Powerball lottery.

There’s an article with a brief quote from me and a video where I may a very quick appearance at the 1:14 mark.  My interview during the live broadcast actually went on for about three minutes but I can’t find that online.  If I can transfer the video from my DVR, I’ll post that too.

In the longer interview I discussed the probability of winning and the expected value of a given ticket and other such statistical nuggets.  In particular I broke down how choosing numbers based on birthdays eliminates any number higher than 31 mean you are missing out on 28 of the 59 possible numbers that are uniformly distributed.  Hopefully I’ll find that longer cut.

The video can be found here:  Video

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

With tonight’s Mega Millions jackpot estimated to be over \$640 million there are long lines of people waiting to buy tickets.  Of course you always hear about the probability of winning which is easy enough to calculate:  Five numbers ranging from 1 through 56 are drawn (without replacement) then a sixth ball is pulled from a set of 1 through 46.  That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations.  That is why people are constantly reminded of how unlikely they are to win.

But I want to see how likely it is that SOMEONE will win tonight.  So let’s break out R and ggplot!

As of this afternoon it was reported (sorry no source) that two tickets were sold for every American.  So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.

So we shouldn’t be surprised if there are multiple winners tonight.

The R code:

winners <- rbinom(n=1000, size=600000000, prob=1/175000000)
qplot(winners, geom="histogram", binwidth=1, xlab="Number of Winners")

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Shortly after the Giants fantastic defeat of the Patriots in Super Bowl XLVI (I was a little disappointed that Eli, Coughlin and the Vince Lombardi Trophy all got off the parade route early and the views of City Hall were obstructed by construction trailers, but Steve Weatherford was awesome as always) a friend asked me to settle a debate amongst some people in a Super Bowl pool.

He writes:

We have 10 participants in a superbowl pool.  The pool is a “pick the player who scores first” type pool.  In a hat, there are 10 Giants players.  Each participant picks 1 player out of the hat (in no particular order) until the hat is emptied.  Then 10 Patriots players go in the hat and each participant picks again.

In the end, each of the 10 participants has 1 Giants player and 1 Patriots player.  No one has any duplicate players as 10 different players from each team were selected.  Pool looks as follows:

 Participant 1 Giant A Patriot Q Participant 2 Giant B Patriot R Participant 3 Giant C Patriot S Participant 4 Giant D Patriot T Participant 5 Giant E Patriot U Participant 6 Giant F Patriot V Participant 7 Giant G Patriot W Participant 8 Giant H Patriot X Participant 9 Giant I Patriot Y Participant 10 Giant J Patriot Z

Winners = First Player to score wins half the pot.  First player to score in 2nd half wins the remaining half of the pot.

The question is, what are the odds that someone wins Both the 1st and 2nd half.  Remember, the picks were random.

Before anyone asks about the safety, one of the slots was for Special Teams/Defense.

There are two probabilistic ways of thinking about this.  Both hinge on the fact that whoever scores first in each half is both independent and not mutually exclusive.

First, let’s look at the two halves individually.  In a given half any of 20 players can score first (10 from the Giants and 10 from the Patriots) and an individual participant can win with two of those.  So a participant has a 2/20 = 1/10 chance of winning a half.  Thus that participant has a (1/10) * (1/10) = 1/100 chance of winning both halves.  Since there are 10 participants there is an overall probability of 10 * (1/100) = 1/10 of any single participant winning both halves.

The other way is to think a little more combinatorically.  There are 20 * 20 = 400 different combinations of players scoring first in each half.  A participant has two players which are each valid for each half giving them four of the possible combinations leading to a 4 / 400 = 1/100 probability that a single participant will win both halves.  Again, there are 10 participants giving an overall 10% chance of any one participant winning both halves.

Since both methods agreed I am pretty confidant in the results, but just in case I ran some simulations in R which you can find after the break.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

With the Super Bowl only hours away now is your last chance to buy your boxes.  Assuming the last digits are not assigned randomly you can maximize your chances with a little analysis.  While I’ve seen plenty of sites giving the raw numbers, I thought a little visualization was in order.

In the graph above (made using ggplot2 in R, of course) the bigger squares represent greater frequency.  The axes are labelled “Home” and “Away” for orientation, but in the Super Bowl that probably doesn’t matter too much, especially considering that Indianapolis is (Peyton) Manning territory so the locals will most likely be rooting for the Giants.  Further, I believe Super Bowl XLII, featuring the same two teams, had a disproportionate number of Giants fans.  Bias disclaimer:  GO BIG BLUE!!!

Below is the same graph broken down by year to see how the distribution has changed over the past 20 years.

All the data was scraped from Pro Football Reference.  All of my code and other graphs that didn’t make the cut are at my github site.

As always, send any questions my way.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

A post on Slashdot caught my attention.  It was about a microchip from Lyric Semiconductor that does calculations using analog probabilities instead of digital bits of 1’s and 0’s.

The article says that this will both make flash storage more efficient and make statistical calculations quicker.  I doubt it will help with fitting simple regressions where have a fixed formula, but the first thing that came to mind were Bayesian problems, especially a Markov chain Monte Carlo (MCMC).  Using BUGS to run these simulations can be VERY time consuming, so a faster approach would make the lives of many statisticians much easier.  The article did mention that the chip uses Bayesian NAND gates as opposed to digital NAND gates, but I don’t know how that relates to MCMC’s.

I reached out to my favorite Bayesian, Andy Gelman, to see what he thinks.  I’ll report back on what he says.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.