Shortly after the Giants fantastic defeat of the Patriots in Super Bowl XLVI (I was a little disappointed that Eli, Coughlin and the Vince Lombardi Trophy all got off the parade route early and the views of City Hall were obstructed by construction trailers, but Steve Weatherford was awesome as always) a friend asked me to settle a debate amongst some people in a Super Bowl pool.

He writes:

We have 10 participants in a superbowl pool. The pool is a “pick the player who scores first” type pool. In a hat, there are 10 Giants players. Each participant picks 1 player out of the hat (in no particular order) until the hat is emptied. Then 10 Patriots players go in the hat and each participant picks again.

In the end, each of the 10 participants has 1 Giants player and 1 Patriots player. No one has any duplicate players as 10 different players from each team were selected. Pool looks as follows:

Participant 1 | Giant A | Patriot Q |

Participant 2 | Giant B | Patriot R |

Participant 3 | Giant C | Patriot S |

Participant 4 | Giant D | Patriot T |

Participant 5 | Giant E | Patriot U |

Participant 6 | Giant F | Patriot V |

Participant 7 | Giant G | Patriot W |

Participant 8 | Giant H | Patriot X |

Participant 9 | Giant I | Patriot Y |

Participant 10 | Giant J | Patriot Z |

Winners = First Player to score wins half the pot. First player to score in 2nd half wins the remaining half of the pot.

The question is, what are the odds that someone wins

Boththe 1st and 2nd half. Remember, the picks were random.

Before anyone asks about the safety, one of the slots was for Special Teams/Defense.

There are two probabilistic ways of thinking about this. Both hinge on the fact that whoever scores first in each half is both independent and not mutually exclusive.

First, let’s look at the two halves individually. In a given half any of 20 players can score first (10 from the Giants and 10 from the Patriots) and an individual participant can win with two of those. So a participant has a 2/20 = 1/10 chance of winning a half. Thus that participant has a (1/10) * (1/10) = 1/100 chance of winning both halves. Since there are 10 participants there is an overall probability of 10 * (1/100) = 1/10 of any single participant winning both halves.

The other way is to think a little more combinatorically. There are 20 * 20 = 400 different combinations of players scoring first in each half. A participant has two players which are each valid for each half giving them four of the possible combinations leading to a 4 / 400 = 1/100 probability that a single participant will win both halves. Again, there are 10 participants giving an overall 10% chance of any one participant winning both halves.

Since both methods agreed I am pretty confidant in the results, but just in case I ran some simulations in R which you can find after the break.

For the simulation I built a function that randomly assigned a player from the Giants and a player from the Patriots to each participant, then randomly chose–from those 20 players–someone to score first and someone to score second, allowing the two to be the same. If the two scorers belonged to the same participant the function returned 1, otherwise 0. The code is below.

runGame <- function(participants, pick1, pick2) { # build data.frame assigning a random selection from each team to each player choices <- data.frame(Participants=participants, Pick1=sample(x=pick1, size=length(pick1), replace=FALSE), Pick2=sample(x=pick2, size=length(pick2), replace=FALSE)) # randomly pick a player to score first firstScore <- sample(c(pick1, pick2), 1) # randomly pick a player to score second secondScore <- sample(c(pick1, pick2), 1) # see if the players who scored were both "owned" by the same participant if(max(which(choices$Pick1 == firstScore), which(choices$Pick2 == firstScore)) == max(which(choices$Pick1 == secondScore), which(choices$Pick2 == secondScore))) { return(1) }else { return(0) } }

A second function run the simulation however many times are desired.

runSim <- function(participants, pick1, pick2, n=10000) { # number of simulations # hold the results results <- rep(NA, n) for(i in 1:n) { results[i] <- runGame(participants=participants, pick1=pick1, pick2=pick2) } # find percentage return(results) }

Running this simulation 10,000 times, I found that the average rate for the same participant to win both parts of the pool was .1041 with a standard error of .003.

So, the simulations confirm the probability theory that about 10% of the time the same participant will both sections of the pool, pissing off all of his friends.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

A quick way to check the logic – the odds of a given player winning the second half is 10%. Since you don’t care who wins, just that it’s the same person both times, you’ve got

P(same person wins both times) = P(someone wins the first half) * P(same person wins the second half) = 1.0 * 0.1 = 10%.

@Ben Nice way to think of it. So it’s pretty certain that 10% is the right answer after three methods agree.

Enjoying your book (got it yesterday!) and thank you for posting your R code. I was able to look over your function and using the R documentation, figure out how it worked. Once I read this site’s table, I used the imported data.frame with your function and was able to duplicate your results!

I also feel compelled to mention that I unsure about the 10% assertion. The model assumes that each player selection has an equal chance of scoring, however if one slot is Special Teams/Defense I wonder if this assumption is sound. For example, is the probability of scoring equal for a wide receiver and Special Teams/Defense? Empirical data seems to indicate touchdowns as much more probable than safety’s.

Thanks for your awesome book!

@Mark Thanks for getting the book, I hope it proves very helpful to you.

While you’re right that about the probability of scoring in the game, our situation concerns the probability of assignment for at pool participants.

it did not work, and the error message showed that

Error in (function (classes, fdef, mtable) :

unable to find an inherited method for function ‘readHTMLTable’ for signature ‘”NULL”‘

In addition: Warning message:

XML content does not seem to be XML: ‘https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool’

Is it (coding rule) there any change in recent 3 years in R ?

When I wrote this blog post my website used http and the code worked properly. Since then, I switched the site to https, and the XML package no longer works. It is best to now use the rvest package.

Hello. I’m studying your book in Korea and thank you for your good book for R.

In stuyding, I got some problem now. I executed the code below.

require(XML)

theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"

bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE, stringsAsFactors = FALSE)

But there's an error like this.

Error: failed to load external entity "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"

How can I fix it?

I’m glad you’re liking the book!

This post was written when my website used http. Since then, I switched the site to https and the XML package doesn’t support https for some reason. So you’re better off using the rvest package.

What would the code look like? I tried using the vest package and I am receiving error messages.

library(rvest)

theURL <- "https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"

bowlPool <- readHTMLTable(theURL, which = 1, header= FALSE, stringsAsFactors= FALSE)

bowlPool

Error in (function (classes, fdef, mtable) :

unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’

In addition: Warning message:

XML content does not seem to be XML: 'https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/'

The

`readHTMLTable`

function comes from the`XML`

package and unfortunately does not work with https websites. My website has since switched to HTML so using that function on my webpage will not work. You can instead use the`read_html`

function from the`xml2`

package then parse tables using`html_node`

and`html_table`

like so.read_html('https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool') %>%

html_node('table') %>%

html_table()