Shortly after the Giants' fantastic defeat of the Patriots in Super Bowl XLVI (I was a little disappointed that Eli, Coughlin and the Vince Lombardi Trophy all got off the parade route early and the views of City Hall were obstructed by construction trailers, but Steve Weatherford was awesome as always), a friend asked me to settle a debate among some people in a Super Bowl pool.

He writes:

We have 10 participants in a superbowl pool.  The pool is a “pick the player who scores first” type pool.  In a hat, there are 10 Giants players.  Each participant picks 1 player out of the hat (in no particular order) until the hat is emptied.  Then 10 Patriots players go in the hat and each participant picks again.

In the end, each of the 10 participants has 1 Giants player and 1 Patriots player.  No one has any duplicate players as 10 different players from each team were selected.  Pool looks as follows:

Participant       Giants player   Patriots player
Participant 1     Giant A         Patriot Q
Participant 2     Giant B         Patriot R
Participant 3     Giant C         Patriot S
Participant 4     Giant D         Patriot T
Participant 5     Giant E         Patriot U
Participant 6     Giant F         Patriot V
Participant 7     Giant G         Patriot W
Participant 8     Giant H         Patriot X
Participant 9     Giant I         Patriot Y
Participant 10    Giant J         Patriot Z

Winners = First Player to score wins half the pot.  First player to score in 2nd half wins the remaining half of the pot.

The question is, what are the odds that someone wins Both the 1st and 2nd half.  Remember, the picks were random.

Before anyone asks about the safety, one of the slots was for Special Teams/Defense.

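There are two probabilistic ways of thinking about this.  Both hinge on the assumption that who scores first in one half is independent of who scores first in the other, and that the two events are not mutually exclusive (the same player can score first in both halves).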

First, let’s look at the two halves individually.  In a given half any of 20 players can score first (10 from the Giants and 10 from the Patriots), and an individual participant can win with two of those.  Treating each of the 20 as equally likely, a participant has a 2/20 = 1/10 chance of winning a half.  Thus that participant has a (1/10) * (1/10) = 1/100 chance of winning both halves.  Since there are 10 participants, and at most one of them can win both halves, the probabilities add: there is an overall probability of 10 * (1/100) = 1/10 that some participant wins both halves.

The other way is to think a little more combinatorially.  There are 20 * 20 = 400 different combinations of players scoring first in each half.  A participant has two players, each of which is valid for each half, giving them four of the possible combinations and a 4 / 400 = 1/100 probability that a single participant will win both halves.  Again, there are 10 participants, giving an overall 10% chance that some participant wins both halves.

Since both methods agreed I am pretty confident in the results, but just in case I ran some simulations in R which you can find after the break.
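As a quick sanity check on the arithmetic, here is a minimal sketch in Python (it is not the R simulation mentioned above).  It assumes, as the reasoning does, that each of the 20 players is equally likely to score first and that the two halves are independent.  The numbering of players and the helper names are made up for illustration.

```python
import random
from itertools import product

N_PARTICIPANTS = 10
N_PLAYERS = 2 * N_PARTICIPANTS  # 10 Giants (0-9) and 10 Patriots (10-19)


def owner(player: int) -> int:
    """Participant p holds Giants player p and Patriots player p + 10 (an arbitrary labeling)."""
    return player % N_PARTICIPANTS


def exact_probability() -> float:
    """Enumerate all 20 * 20 (first-half scorer, second-half scorer) pairs."""
    players = range(N_PLAYERS)
    same_owner = sum(owner(a) == owner(b) for a, b in product(players, repeat=2))
    return same_owner / N_PLAYERS ** 2


def simulated_probability(trials: int = 100_000, seed: int = 46) -> float:
    """Draw the first scorer of each half uniformly and independently."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        first_half = rng.randrange(N_PLAYERS)
        second_half = rng.randrange(N_PLAYERS)
        hits += owner(first_half) == owner(second_half)
    return hits / trials


if __name__ == "__main__":
    print("exact:    ", exact_probability())
    print("simulated:", simulated_probability())
```

The enumeration counts 40 winning pairs out of 400, and the simulated estimate should land close to the same 10% figure.  In a real game the players are of course not equally likely to score, so this only checks the math, not the football.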

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.

With the Super Bowl only hours away, now is your last chance to buy your boxes.  Assuming the last digits are not assigned randomly, you can maximize your chances with a little analysis.  While I’ve seen plenty of sites giving the raw numbers, I thought a little visualization was in order.

In the graph above (made using ggplot2 in R, of course) the bigger squares represent greater frequency.  The axes are labelled “Home” and “Away” for orientation, but in the Super Bowl that probably doesn’t matter too much, especially considering that Indianapolis is (Peyton) Manning territory so the locals will most likely be rooting for the Giants.  Further, I believe Super Bowl XLII, featuring the same two teams, had a disproportionate number of Giants fans.  Bias disclaimer:  GO BIG BLUE!!!

Below is the same graph broken down by year to see how the distribution has changed over the past 20 years.

All the data was scraped from Pro Football Reference.  All of my code and other graphs that didn’t make the cut are at my github site.
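For readers who just want the gist of the tabulation behind the squares, here is a minimal Python sketch (the real code is in R, at the GitHub site mentioned above).  It counts how often each pair of last digits occurs in a list of scores.  The scores below are invented for illustration and are not the scraped Pro Football Reference data; the function names are hypothetical too.

```python
from collections import Counter


def last_digit_counts(scores):
    """Count occurrences of each (home last digit, away last digit) pair."""
    return Counter((home % 10, away % 10) for home, away in scores)


def as_grid(counts):
    """Lay the counts out as a 10 x 10 grid: rows are home digits, columns are away digits."""
    return [[counts.get((h, a), 0) for a in range(10)] for h in range(10)]


if __name__ == "__main__":
    # Made-up (home, away) scores, for illustration only.
    sample_scores = [(17, 14), (27, 24), (21, 17), (20, 10), (30, 0)]
    grid = as_grid(last_digit_counts(sample_scores))
    for row in grid:
        print(" ".join(str(cell) for cell in row))
```

Each cell of the grid corresponds to one box in the pool, and the size of a square in the graph would be driven by that cell's count.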

As always, send any questions my way.


While playing Words with Friends my randomly chosen opponent played “radiale” as her first word.  Since that used up all of her tiles, she received a bonus on top of all the points the word itself got, resulting in a one-move score of 53 points!  Rather than being impressed I was upset at the large deficit I would have to overcome.

To combat this I did what comes naturally:  Write an R script to find the perfect word!

Needing to combine my seven letters with one of her letters, there were two routes I could take.  The first would be, for each combination of my seven letters and one of hers, to find all 40,320 (8!) permutations and then hit dictionary.com to see if each one is a real word, for a total of 282,240 (8!*7) HTTP calls.  That seemed a bit excessive and impractical so I moved on to the next idea.

So the first thing I did was pull a list of common eight-letter words.  Then for each combination of my letters and one of hers (only 7 iterations) I checked if those letters (in any order) matched the letters in any of the possible words.  Once a match was found there was a check for the counts of the letters, and if that passed then the word was recorded as a true match.
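The idea can be sketched in a few lines of Python (the actual script was written in R and is after the break).  This version folds the two checks, same letters and same counts, into one by comparing sorted letters.  The rack and the tiny word list are made up for the example, since the post does not say which letters were mine and which were hers, and `find_plays` is a hypothetical name.

```python
from math import factorial


def find_plays(rack, board_letters, words):
    """Map each usable board letter to the words spelled by the rack plus that letter."""
    # Index the candidate words by their sorted letters, so anagrams share a key
    # and letter counts are compared automatically.
    by_signature = {}
    for word in words:
        by_signature.setdefault("".join(sorted(word)), []).append(word)

    plays = {}
    # Repeated board letters give the same candidate words, so try each distinct letter once.
    for letter in sorted(set(board_letters)):
        signature = "".join(sorted(rack + letter))
        if signature in by_signature:
            plays[letter] = by_signature[signature]
    return plays


if __name__ == "__main__":
    # Cost of the brute-force route: 8! orderings for each of her 7 letters.
    print(factorial(8) * 7)

    # Hypothetical rack and word list, chosen so the example finds "headrace".
    rack = "headace"
    her_word = "radiale"
    eight_letter_words = ["headrace", "notebook", "calendar"]
    print(find_plays(rack, her_word, eight_letter_words))
```

Building the index once means each of the handful of rack-plus-letter combinations is a single dictionary lookup, rather than a scan of the whole word list.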

The algorithm took about 17 seconds to run and found me one possible word for my letters combined with one of hers:  “headrace”, for 63 points!  Perhaps I should have been able to figure that out on my own, but where would be the fun in that?  Find the code after the break.


Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.

More interesting is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis exclude her and family members because they would bias the gender effect. Being a good statistician I ran a Poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.

Further analysis and a how-to after the break.


Last night, Harlan Harris and I gave a talk at the NY Predictive Analytics meetup.  Despite the rain there was a good turnout and people seemed to both enjoy and benefit from the presentation.

As requested I have posted the presentation for all to see.  Please feel free to contact me with any questions.  The data and R code are also posted and we will post at least the presentation on the Meetup page.  Everything is also available in one convenient package at GitHub.

Update:  Harlan wrote up a great summary of the night.


Tonight I will be giving a talk with Harlan Harris at the Predictive Analytics and Machine Learning Meetup in New York.  It is going to be an introduction to Multilevel Models with examples in R and from previous projects I have worked on.

Here are the details for the talk.


A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right.  I’ve seen people on Twitter asking how to build this and there has been an option available using Andy Gelman’s coefplot() in the arm package.  Not knowing this I built my own (as seen in this post about taste testing tomatoes) and they both suffered from the same problems:  long coefficient names often got cut off by the left margin of the graph, and the name of the variable was prepended to all the levels of a factor.  One big difference between his and mine is that his does not include the Intercept by default.  Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions.  Now the levels of factors are displayed alone, without being prepended by the factor name.  As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham which deals with the margins better than I do.
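To give a flavor of the regular-expression fix, here is a minimal sketch in Python (plotCoef() itself is written in R, and this is not its code).  It strips a known factor name from the front of each coefficient name.  The coefficient names, the factor name and `strip_factor_prefix` are all invented for the example.

```python
import re


def strip_factor_prefix(coef_names, factor_names):
    """Remove a leading factor name from each coefficient name, leaving just the level."""
    # Try longer factor names first so a name that is a prefix of another does not win early.
    ordered = sorted(factor_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookahead keeps a coefficient that is exactly a factor name from becoming empty.
    pattern = re.compile(rf"^(?:{alternatives})(?=.)")
    return [pattern.sub("", name) for name in coef_names]


if __name__ == "__main__":
    # Hypothetical coefficient names in the style R produces for a factor called "Type".
    names = ["(Intercept)", "Price", "TypeSan Marzano", "TypeDomestic"]
    print(strip_factor_prefix(names, ["Type"]))
```

This simple version only handles main effects; interaction terms such as “Male:Epoch” would need each side of the colon cleaned separately.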

Both of these changes made for a vast improvement over what I had available before.  Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file and is called plotCoef() and is very customizable, down to the color and line thickness.  I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be.  I sent the code to Dr. Gelman to hopefully be incorporated into his function which I’m sure gets used by a lot more people than mine will.  Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed.  Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander.


Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste.  While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.

The first thing to note is that there are only 16 data points, so multiple regression is not an option.  We can all thank the Curse of Dimensionality for that.  So I stuck to simpler methods and visualizations.  If I can get the raw data from Slice, I can get a little more advanced.

For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult.  As usual, most of the graphics were made using ggplot2 by Hadley Wickham.  The coefficient plots were made using a little function I wrote.  Here is the code.  Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left hand margin of the plot.  And as always, all the work was done in R.

The most obvious relationship we want to test is Overall Quality vs. Price.  As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.

More after the break.


Less than a month ago, Drew Conway suggested that our R user group present an analysis of the WikiLeaks data.  In that short time he, Mike Dewar, John Myles White and Harlan Harris have put together a beautiful visualization of attacks in Afghanistan.  The static image you see here has since been animated which is a really nice touch.

Within a few hours of them posting their initial results the work spread across the internet, even getting written up in Wired’s Danger Room.  Today, they got picked up by the New York Times where you can see the animation.

The bulk of the work was, of course, done in R.  I remember talking with them about how they were going to scrape the data from the WikiLeaks documents, but I am not certain how they did it in the end.  As is natural for these guys, they made their code available on GitHub so you can recreate their results after you’ve downloaded the data yourself from WikiLeaks.

Briefly looking at their code I can see they used Hadley Wickham’s ggplot and plyr packages (which are almost standard for most R users) as well as R’s mapping packages.  If you want to learn more about how they did this fantastic job come to the next R Meetup where they will present their findings.


Today, Google announced two new services that are sure to be loved by data geeks.  First is their BigQuery which lets you analyze “Terabytes of data, trillions of records.”  This is great for people with large datasets.  I wonder if a program like R (my favorite statistical analysis package) can read it?  If so, would R just pull down the data like it would from any other database?  That would most likely result in a data.frame that is far too large for a standard computer to handle.  Maybe R can be run in a way that it hits the BigQuery service and leaves the data in there.  Maybe even the processing can be done on Google’s end, allowing for much better computation time.  This is something I’ve been dreaming of for a while now.

Further, can BigQuery produce graphics?  If so, this might be a real shot at Business Intelligence tools like QlikView or Cognos that specialize in handling LARGE datasets.
