Distribution of Lottery Winners based on 1,000 Simulations

With tonight’s Mega Millions jackpot estimated at over $640 million, there are long lines of people waiting to buy tickets.  Of course, you always hear about the probability of winning, which is easy enough to calculate:  Five numbers ranging from 1 through 56 are drawn (without replacement), then a sixth ball is pulled from a set of 1 through 46.  That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations.  That is why people are constantly reminded of how unlikely they are to win.
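That count is easy to verify in a couple of lines (a Python sketch; the code later in this post is R, but the arithmetic is the same):

```python
import math

# 5 white balls drawn from 56 without replacement (order doesn't matter),
# times 46 choices for the sixth ball
combinations = math.comb(56, 5) * 46
print(combinations)  # 175711536
```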

But I want to see how likely it is that SOMEONE will win tonight.  So let’s break out R and ggplot!

As of this afternoon it was reported (sorry, no source) that two tickets had been sold for every American.  So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.

So we shouldn’t be surprised if there are multiple winners tonight.
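A quick back-of-the-envelope check of the same binomial model makes the point directly (a Python sketch, using the rough 600 million ticket figure from above):

```python
n = 600_000_000        # roughly two tickets per American
p = 1 / 175_711_536    # probability that a single ticket wins

expected_winners = n * p             # mean of the binomial distribution
p_at_least_one = 1 - (1 - p) ** n    # chance that someone wins at all
print(expected_winners, p_at_least_one)
```

With these numbers we expect about 3.4 winners on average, and the probability of at least one winner is roughly 97%, so multiple winners really shouldn’t surprise anyone.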

The R code:

library(ggplot2)

# 1,000 simulated drawings: roughly 600 million tickets, each an
# independent Bernoulli trial with probability 1/175,711,536 of winning
winners <- rbinom(n=1000, size=600000000, prob=1/175711536)
qplot(winners, geom="histogram", binwidth=1, xlab="Number of Winners")

Shortly after the Giants’ fantastic defeat of the Patriots in Super Bowl XLVI (I was a little disappointed that Eli, Coughlin and the Vince Lombardi Trophy all got off the parade route early and that the views of City Hall were obstructed by construction trailers, but Steve Weatherford was awesome as always), a friend asked me to settle a debate amongst some people in a Super Bowl pool.

He writes:

We have 10 participants in a Super Bowl pool.  The pool is a “pick the player who scores first” type pool.  In a hat, there are 10 Giants players.  Each participant picks 1 player out of the hat (in no particular order) until the hat is emptied.  Then 10 Patriots players go in the hat and each participant picks again.

In the end, each of the 10 participants has 1 Giants player and 1 Patriots player.  No one has any duplicate players as 10 different players from each team were selected.  Pool looks as follows:

Participant 1: Giant A, Patriot Q
Participant 2: Giant B, Patriot R
Participant 3: Giant C, Patriot S
Participant 4: Giant D, Patriot T
Participant 5: Giant E, Patriot U
Participant 6: Giant F, Patriot V
Participant 7: Giant G, Patriot W
Participant 8: Giant H, Patriot X
Participant 9: Giant I, Patriot Y
Participant 10: Giant J, Patriot Z

Winners = First Player to score wins half the pot.  First player to score in 2nd half wins the remaining half of the pot.

The question is, what are the odds that someone wins both the 1st and 2nd half?  Remember, the picks were random.

Before anyone asks about the safety, one of the slots was for Special Teams/Defense.

There are two probabilistic ways of thinking about this.  Both hinge on two facts: who scores first in one half is independent of who scores first in the other, and winning the first half does not preclude winning the second, so the two wins are not mutually exclusive events.

First, let’s look at the two halves individually.  In a given half any of 20 players can score first (10 from the Giants and 10 from the Patriots), and an individual participant holds two of those players, so a participant has a 2/20 = 1/10 chance of winning a half.  By independence, that participant has a (1/10) * (1/10) = 1/100 chance of winning both halves.  Since at most one participant can win both halves, these events are mutually exclusive across the 10 participants, giving an overall probability of 10 * (1/100) = 1/10 that some participant wins both halves.

The other way is to think a little more combinatorially.  There are 20 * 20 = 400 equally likely combinations of first scorers in the two halves.  A participant holds two players, each of whom can score first in either half, covering 2 * 2 = 4 of those combinations and giving a 4/400 = 1/100 probability that that participant wins both halves.  Again, with 10 participants there is an overall 10% chance that someone wins both halves.

Since both methods agree I am pretty confident in the results, but just in case I ran some simulations in R, which you can find after the break.
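For completeness, here is a minimal Monte Carlo sketch of the pool (this is not the R code mentioned above; it is a hypothetical Python equivalent, with an arbitrary seed and trial count):

```python
import random

def simulate_pool(n_trials=100_000, seed=1):
    """Estimate the chance that some participant wins both halves.

    Players 0-9 are Giants, 10-19 are Patriots.  Relabel so participant i
    holds Giants player i; the draw from the hat then amounts to randomly
    pairing the 10 Patriots players with the 10 participants.
    """
    rng = random.Random(seed)
    both = 0
    for _ in range(n_trials):
        patriots_owner = list(range(10))
        rng.shuffle(patriots_owner)
        # owner[p] = participant holding player p
        owner = list(range(10)) + patriots_owner
        first_half = rng.randrange(20)    # first scorer of the game
        second_half = rng.randrange(20)   # first scorer of the 2nd half
        if owner[first_half] == owner[second_half]:
            both += 1
    return both / n_trials

print(simulate_pool())  # should land near 0.10
```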


A new study, reported in the New York Times, tracked population movements in post-earthquake Haiti using cell phone data.  The article grabbed my attention because one of the authors, Richard Garfield (whom I have done numerous projects with and who has his own Wikipedia entry!), had told me about this very study just a few months ago.

Over dinner in New York’s Little India he explained how the largest cell phone company in Haiti provided him with anonymized cell tower records.  As many people are aware, cell phones–even those without GPS–report their locations back to cell towers at regular intervals.  By tracking the daily position of the phones before and after the earthquake they were able to determine that 20% of Port-au-Prince’s population had left the capital within 19 days of the disaster.

They used plenty of solid math in the analysis and amazingly did it all without resorting to spatial statistics.  They have some nice map-based visualizations but I’ve been meaning to get the data from Dr. Garfield so I can attempt something similar to the amazing work done by the NYC Data Mafia on the WikiLeaks Afghanistan data.  Though I don’t promise anything nearly as good.

It is also worth noting that they did this at a fraction of the cost and time of an extensive UN survey.  That survey had only about 2,500 respondents, whereas the cell phone project incorporated around 1.9 million people without any of them having to spend valuable time with an interviewer.

The FBI has put out a public request for help cracking a code.  The code above was found in the pants of a murder victim over 10 years ago.  Despite some of the best code breakers in the world giving it a shot, they have not been able to break the code.  I wonder if the NSA had a go at it.  Couldn’t they try brute force like in Dan Brown’s Digital Fortress?  Yes, I referenced Dan Brown in the same paragraph as the NSA; deal with it.

If you think you can help, send a letter to:

FBI Laboratory
Cryptanalysis and Racketeering Records Unit
2501 Investigation Parkway
Quantico, VA 22135
Attn: Ricky McCormick Case

There’s no reward but you’d be helping your country.

Pi Day Celebrants

As mentioned earlier, yesterday was Pi Day, so a bunch of statisticians and other such nerds celebrated at the new(ish) Artichoke Basille near the High Line.  We had three pies:  the signature Artichoke, the Margherita and the Anchovy, which was delicious but which only some of us ate.  And of course we had our custom cake from Chrissie Cook.

The photos were taken by John.

Pi Cake 2011
NYC Data Mafia

Pi Cake

Happy Pi Day everybody!  I’ll be out celebrating with the rest of the NYC Data Mafia eating pizza and devouring the above Pi Cake, custom baked by Chrissie Cook.

Today is also Albert Einstein’s birthday so there are plenty of reasons to have fun.

The cake below was my first ever Pi Cake in what is sure to become an annual tradition.

Pi Cake 2009

Update: Drew Conway does far more justice to our fair, irrational, transcendental number.

Update 2:  Engadget posted this awesome video of “What Pi Sounds Like.”

The Father of Gerrymandering

The Wall Street Journal is reporting that, even with all the concern around gerrymandering, the upcoming redistricting probably won’t have much effect on upcoming elections.  Gary King is mentioned as having written a paper “that helped demonstrate the relative impotence of partisan redistricting,” yet “he favors the efforts to create a statistical method that would replace it.”  I personally am always for using math and hard numbers to solve any problem whenever possible.

The article also mentioned that at a “conference last year in Washington, D.C., researchers proposed alternatives.”  David Epstein presented a paper at that conference that Andy Gelman and I worked on.

While the article quoted one of Dr. Gelman’s papers, it unfortunately did not mention him, or any of us, by name.  However, the accompanying blog post did mention both Drs. Gelman and Epstein, with specific quotes from them and their work.

Temple professor John Allen Paulos has an article in the New York Times, which got Slashdotted today, suggesting that people be wary of all the metrics that fill our daily lives.

His first contention is whether assumptions about categorization are correct.  This is certainly important, but hopefully qualified statisticians, social scientists, doctors and the like are making these decisions and properly counting the results.

Next he discusses whether the numbers you are looking at have been aggregated properly and arrived at using the proper choices of criteria, protocols and weights.  He gives articles such as “The 10 Friendliest Colleges” and “The 20 Most Lovable Neighborhoods” as examples.  Having done a lot of work where variable selection and shrinkage are important, I can say that I, for one, let the data speak for themselves and use various statistical methods to arrive at the correct decision.

Dr. Paulos makes more points, but I’ll let you read the article for yourself.  The important takeaway–at least to me–is that when looking at reported statistics and measurements, you should try to figure out what methods were used.  That’s why I am always disappointed when articles do not report their methods.  I realize that understanding the techniques might be beyond the average person, but that’s when you ask your statistician friend.

Steven Strogatz is writing a column for the New York Times where he discusses math, starting with basic concepts and working his way up to the complex and cerebral.

I, and a lot of people, love his column.  However, last week’s piece on probability was not received so well by the statistics community, particularly on Andy Gelman’s blog and Junk Charts’ sister blog, Numbers Rule Your World.