For a d3 bar plot visit http://www.jaredlander.com/plots/PizzaPollPlot.html.

I finally compiled the data from all the pizza polling I’ve been doing at the New York R meetups. The data are available as JSON at http://www.jaredlander.com/data/PizzaPollData.php.

This is easy enough to plot in R using ggplot2.

require(rjson)
require(plyr)
pizzaJson <- fromJSON(file = "http://jaredlander.com/data/PizzaPollData.php")
pizza <- ldply(pizzaJson, as.data.frame)

##   polla_qid      Answer Votes pollq_id                Question
## 1         2   Excellent     0        2  How was Pizza Mercato?
## 2         2        Good     6        2  How was Pizza Mercato?
## 3         2     Average     4        2  How was Pizza Mercato?
## 4         2        Poor     1        2  How was Pizza Mercato?
## 5         2 Never Again     2        2  How was Pizza Mercato?
## 6         3   Excellent     1        3 How was Maffei's Pizza?
## 1  Pizza Mercato 1.344e+09         13  0.0000
## 2  Pizza Mercato 1.344e+09         13  0.4615
## 3  Pizza Mercato 1.344e+09         13  0.3077
## 4  Pizza Mercato 1.344e+09         13  0.0769
## 5  Pizza Mercato 1.344e+09         13  0.1538
## 6 Maffei's Pizza 1.348e+09          7  0.1429
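
The printed output includes Place and Percent columns that the raw JSON does not obviously provide. A minimal sketch of how they could be derived, using only the Pizza Mercato rows shown above: the regular expression and the TotalVotes column name are assumptions for illustration, not necessarily what was actually run.

```r
# Rebuild the Pizza Mercato rows from the printed output above
pizzaSample <- data.frame(
    Answer = c("Excellent", "Good", "Average", "Poor", "Never Again"),
    Votes = c(0, 6, 4, 1, 2),
    Question = "How was Pizza Mercato?",
    stringsAsFactors = FALSE
)

# Pull the place name out of the question text (assumed pattern)
pizzaSample$Place <- sub("^How was (.*)\\?$", "\\1", pizzaSample$Question)

# Total votes per question, repeated on every row of that question
pizzaSample$TotalVotes <- ave(pizzaSample$Votes, pizzaSample$Question, FUN = sum)

# Share of the votes each answer received
pizzaSample$Percent <- pizzaSample$Votes / pizzaSample$TotalVotes
pizzaSample
```

The resulting percentages agree with the printed ones, for example 6 of 13 votes for "Good".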

require(ggplot2)
ggplot(pizza, aes(x = Place, y = Percent, group = Answer, color = Answer)) +
    geom_line() +
    theme(axis.text.x = element_text(angle = 46, hjust = 1), legend.position = "bottom") +
    labs(x = "Pizza Place", title = "Pizza Poll Results")


But given that this is live data that will change as more polls are added, I thought it best to use a plot that updates automatically and is interactive. So this gave me my first chance to use rCharts by Ramnath Vaidyanathan, as seen at October’s meetup.

require(rCharts)
pizzaPlot <- nPlot(Percent ~ Place, data = pizza, type = "multiBarChart", group = "Answer")
pizzaPlot$xAxis(axisLabel = "Pizza Place", rotateLabels = -45)
pizzaPlot$yAxis(axisLabel = "Percent")
pizzaPlot$chart(reduceXTicks = FALSE)
pizzaPlot$print("chart1", include_assets = TRUE)


Unfortunately I cannot figure out how to insert this in WordPress so please see the chart at http://www.jaredlander.com/plots/PizzaPollPlot.html. Or see the badly sized one below.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on linecharts and inserting chart titles. I found a workaround for the categorical x-axis by using tickFormat but that is not pretty. I also would like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

The wonderful people at Gilt are having me teach an introductory course on R this Friday.

The class starts with the very basics such as variable types, vectors, data.frames and matrices.  After that we explore munging data with aggregate, plyr and reshape2.  Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.

Most of the material comes from my upcoming book R for Everyone.

Participants are encouraged to bring computers so they can code along with the live examples.  They should also have R and RStudio preinstalled.

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (position_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's, height with width and vice versa. It's hacky and not tested but seems to work as I'll show below.

First the new functions:

require(proto)

## Loading required package: proto

collidev <- function(data, height = NULL, name, strategy, check.height = TRUE) {
    if (!is.null(height)) {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data <- within(data, {
                ymin <- y - height/2
                ymax <- y + height/2
            })
        }
    } else {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data$ymin <- data$y
            data$ymax <- data$y
        }
        heights <- unique(with(data, ymax - ymin))
        heights <- heights[!is.na(heights)]
        if (!zero_range(range(heights))) {
            warning(name, " requires constant height: output may be incorrect",
                call. = FALSE)
        }
        height <- heights[1]
    }

    data <- data[order(data$ymin), ]
    intervals <- as.numeric(t(unique(data[c("ymin", "ymax")])))
    intervals <- intervals[!is.na(intervals)]
    if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < -1e-06)) {
        warning(name, " requires non-overlapping y intervals", call. = FALSE)
    }

    if (!is.null(data$xmax)) {
        ddply(data, .(ymin), strategy, height = height)
    } else if (!is.null(data$x)) {
        message("xmax not defined: adjusting position using x instead")
        transform(ddply(transform(data, xmax = x), .(ymin), strategy, height = height),
            x = xmax)
    } else {
        stop("Neither x nor xmax defined")
    }
}

pos_dodgev <- function(df, height) {
    n <- length(unique(df$group))
    if (n == 1)
        return(df)
    if (!all(c("ymin", "ymax") %in% names(df))) {
        df$ymin <- df$y
        df$ymax <- df$y
    }
    d_width <- max(df$ymax - df$ymin)
    diff <- height - d_width
    groupidx <- match(df$group, sort(unique(df$group)))
    df$y <- df$y + height * ((groupidx - 0.5)/n - 0.5)
    df$ymin <- df$y - d_width/n/2
    df$ymax <- df$y + d_width/n/2
    df
}
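
The heart of pos_dodgev is the offset added to y for each group. As a small worked sketch of just that arithmetic, here it is on its own, with a hypothetical height of 0.9 and three made-up group names:

```r
# Hypothetical inputs, chosen only to illustrate the offset formula
height <- 0.9
groups <- c("Base", "Interaction", "Other")

n <- length(unique(groups))
# Position of each group within the sorted set of groups (1, 2, ..., n)
groupidx <- match(groups, sort(unique(groups)))

# Same expression as in pos_dodgev: offsets are centered on zero and
# spread evenly across the total dodge height
offsets <- height * ((groupidx - 0.5) / n - 0.5)
offsets
```

With three groups the middle one stays put and the other two move an equal distance down and up, so the dodged geoms straddle the original y position.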

position_dodgev <- function(width = NULL, height = NULL) {
    PositionDodgev$new(width = width, height = height)
}

PositionDodgev <- proto(ggplot2:::Position, {
    objname <- "dodgev"

    adjust <- function(., data) {
        if (empty(data))
            return(data.frame())
        check_required_aesthetics("y", names(data), "position_dodgev")

        collidev(data, .$height, .$my_name(), pos_dodgev, check.height = TRUE)
    }
})

Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.

# get tips data
data(tips, package = "reshape2")

# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)

# build data.frames with coefficients and confidence intervals and combine
# them into one data.frame
require(coefplot)

## Loading required package: coefplot

## Loading required package: ggplot2

df1 <- coefplot(mod1, plot = FALSE, name = "Base", shorten = FALSE)
df2 <- coefplot(model = mod2, plot = FALSE, name = "Interaction", shorten = FALSE)
theDF <- rbind(df1, df2)
theDF

##    LowOuter HighOuter LowInner HighInner     Coef            Name Checkers
## 1    1.9803    3.3065  2.31183    2.9750  2.64340     (Intercept)  Numeric
## 2   -0.4685    0.9325 -0.11822    0.5822  0.23202          daySat      day
## 3   -0.2335    1.1921  0.12291    0.8357  0.47929          daySun      day
## 4   -0.6790    0.7672 -0.31745    0.4056  0.04408         dayThur      day
## 5   -0.2053    0.5524 -0.01589    0.3630  0.17354         sexMale      sex
## 6    1.8592    3.7030  2.32016    3.2421  2.78111     (Intercept)  Numeric
## 7   -1.0391    1.0804 -0.50921    0.5506  0.02067          daySat      day
## 8   -0.5430    1.7152  0.02156    1.1507  0.58611          daySun      day
## 9   -1.2490    0.8380 -0.72725    0.3163 -0.20549         dayThur      day
## 10  -1.3589    1.1827 -0.72349    0.5473 -0.08811         sexMale      sex
## 11  -1.0502    1.7907 -0.34000    1.0804  0.37022  daySat:sexMale  day:sex
## 12  -1.5324    1.4149 -0.79560    0.6781 -0.05877  daySun:sexMale  day:sex
## 13  -0.9594    1.9450 -0.23328    1.2189  0.49282 dayThur:sexMale  day:sex
##          CoefShort       Model
## 1      (Intercept)        Base
## 2           daySat        Base
## 3           daySun        Base
## 4          dayThur        Base
## 5          sexMale        Base
## 6      (Intercept) Interaction
## 7           daySat Interaction
## 8           daySun Interaction
## 9          dayThur Interaction
## 10         sexMale Interaction
## 11  daySat:sexMale Interaction
## 12  daySun:sexMale Interaction
## 13 dayThur:sexMale Interaction

# build the plot
require(ggplot2)
require(plyr)

## Loading required package: plyr

ggplot(theDF, aes(y = Name, x = Coef, color = Model)) +
    geom_vline(xintercept = 0, linetype = 2, color = "grey") +
    geom_errorbarh(aes(xmin = LowOuter, xmax = HighOuter), height = 0, lwd = 0,
        position = position_dodgev(height = 1)) +
    geom_errorbarh(aes(xmin = LowInner, xmax = HighInner), height = 0, lwd = 1,
        position = position_dodgev(height = 1)) +
    geom_point(position = position_dodgev(height = 1), aes(xmax = Coef))

Compare that to the multiplot function in coefplot that was built using position_dodge and coord_flip.

multiplot(mod1, mod2, shorten = FALSE, names = c("Base", "Interaction"))

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip. Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

Continuing with the newly available football data and inspired by a question from Drew Conway, I decided to look at play selection based on down by the Giants for the past 10 years.

Visually, we see that until 2011 the Giants preferred to run on first and second down. Third down is usually a do-or-die down so passes will dominate on third-and-long. The grey vertical lines mark Super Bowls XLII and XLVI.

Code for the graph after the break.

With the recent availability of play-by-play NFL data I got to analyzing my favorite team, the New York Giants, with some very hasty EDA.

From the above graph you can see that on 1st down Eli preferred to throw to Hakeem Nicks and on 2nd and 3rd downs he slightly favored Victor Cruz.

The code for the analysis is after the break.
A friend of mine has told me on numerous occasions that since 1960 the Yankees have not won a World Series while a Republican was President. Upon hearing this my Republican friends (both Yankee and Red Sox fans) turn incredulous and say that this is ridiculous. So I decided to investigate. To be clear, this in no way shows causality; it just checks the numbers.

The data was easily attainable so it really came down to plotting. The plot above shows every Yankee win (and loss) since 1960 and the party of the President at the time. It is clear to see that all nine Yankees World Series wins came while a Democrat inhabited the White House.

The fluctuation plot below shows Yankee wins both before and after 1960, and the complete lack of a block for Republican/Post-1960 makes the case.

There are similar plots for the American League after the jump.

With tonight’s Mega Millions jackpot estimated to be over $640 million there are long lines of people waiting to buy tickets.  Of course you always hear about the probability of winning, which is easy enough to calculate:  Five numbers ranging from 1 through 56 are drawn (without replacement) then a sixth ball is pulled from a set of 1 through 46.  That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations.  That is why people are constantly reminded of how unlikely they are to win.

But I want to see how likely it is that SOMEONE will win tonight.  So let’s break out R and ggplot!

As of this afternoon it was reported (sorry no source) that two tickets were sold for every American.  So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.

So we shouldn’t be surprised if there are multiple winners tonight.

The R code:

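require(ggplot2)
# 1,000 simulated drawings: 600 million tickets, each winning with probability 1/175,711,536
winners <- rbinom(n=1000, size=600000000, prob=1/175711536)
qplot(winners, geom="histogram", binwidth=1, xlab="Number of Winners")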
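
As a cross-check on the simulation, the same quantities can be worked out directly from the binomial distribution. This is a minimal sketch under the same assumptions as above (600 million independent tickets, one winning combination); the variable names are mine.

```r
# Number of possible combinations: 5 of 56 white balls times 46 Mega Balls
combos <- choose(56, 5) * 46

# Assumes two tickets sold per American, each an independent trial
tickets <- 600000000
p <- 1 / combos

# Expected number of winning tickets
expectedWinners <- tickets * p

# Probability that at least one ticket wins
pAtLeastOne <- 1 - pbinom(0, size = tickets, prob = p)

expectedWinners
pAtLeastOne
```

The expected count lands a little above three winners, which lines up with where the histogram of simulated draws is centered.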

With the Super Bowl only hours away now is your last chance to buy your boxes.  Assuming the last digits are not assigned randomly you can maximize your chances with a little analysis.  While I’ve seen plenty of sites giving the raw numbers, I thought a little visualization was in order.

In the graph above (made using ggplot2 in R, of course) the bigger squares represent greater frequency.  The axes are labelled “Home” and “Away” for orientation, but in the Super Bowl that probably doesn’t matter too much, especially considering that Indianapolis is (Peyton) Manning territory so the locals will most likely be rooting for the Giants.  Further, I believe Super Bowl XLII, featuring the same two teams, had a disproportionate number of Giants fans.  Bias disclaimer:  GO BIG BLUE!!!

Below is the same graph broken down by year to see how the distribution has changed over the past 20 years.

All the data was scraped from Pro Football Reference.  All of my code and other graphs that didn’t make the cut are at my github site.
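
For anyone wanting to build the counts behind a chart like this, the tabulation comes down to taking each score modulo 10. Here is a rough sketch with a handful of made-up final scores (hypothetical values, not the scraped data); the column names are also just for illustration.

```r
# Hypothetical final scores, invented purely for illustration
games <- data.frame(
    Home = c(21, 17, 31, 24, 27, 20),
    Away = c(17, 14, 17, 21, 23, 17)
)

# Last digit of each score, keeping all ten digits as levels
games$HomeDigit <- factor(games$Home %% 10, levels = 0:9)
games$AwayDigit <- factor(games$Away %% 10, levels = 0:9)

# 10 x 10 table of how often each pair of last digits occurred
digitCounts <- table(Home = games$HomeDigit, Away = games$AwayDigit)

# Long format with one row per observed square, ready to feed a plot
# where point size is mapped to Freq
squares <- as.data.frame(digitCounts, responseName = "Freq")
squares <- squares[squares$Freq > 0, ]
squares
```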

As always, send any questions my way.

Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.
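
A comparison of this kind takes only a few lines. The sketch below uses invented monthly counts (not my actual data) and a one-sided Welch test, which is one reasonable choice among several rather than a record of the exact test that was run.

```r
# Hypothetical monthly message counts, invented purely for illustration
before <- c(300, 320, 310, 290, 305, 315)
after <- c(450, 470, 460, 440, 480, 455)

# Welch two-sample t-test; alternative hypothesis: mean(after) > mean(before)
tt <- t.test(after, before, alternative = "greater")
tt$statistic
tt$p.value
```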

More interesting is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis exclude her and family members because they would bias the gender effect. Being a good statistician I ran a Poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.
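
To show how coefficients from a Poisson model with an interaction turn into multiplicative changes, here is a small sketch on four invented counts. The data and the 0/1 coding of Male and Epoch are assumptions made for illustration; the real model may have been specified differently.

```r
# Hypothetical message counts, invented for illustration (not the real data)
texts <- data.frame(
    Male  = c(0, 0, 1, 1),
    Epoch = c(0, 1, 0, 1),
    Count = c(10, 20, 15, 45)
)

# Poisson regression with a Male-by-Epoch interaction
mod <- glm(Count ~ Male * Epoch, data = texts, family = poisson)
b <- coef(mod)

# Coefficients are on the log scale, so sums of coefficients become
# ratios of expected counts once exponentiated
femaleChange <- exp(b["Epoch"])                       # females: after vs. before
maleChange <- exp(b["Epoch"] + b["Male:Epoch"])       # males: after vs. before
maleAfterVsBaseline <- exp(b["Male"] + b["Epoch"] + b["Male:Epoch"])  # males after vs. females before

c(femaleChange, maleChange, maleAfterVsBaseline)
```

Which sum is the right one depends on the comparison being made: the first two are before-and-after ratios within a group, while the third compares against the baseline group.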

Further analysis and a how-to after the break.

A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right.  I’ve seen people on Twitter asking how to build this and there has been an option available using Andy Gelman’s coefplot() in the arm package.  Not knowing this I built my own (as seen in this post about taste testing tomatoes) and they both suffered the same problems.  Long coefficient names often got cut off by the left margin of the graph and the name of the variable was prepended to all the levels of a factor.  One big difference between his and mine is that his does not include the Intercept by default.  Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions.  Now the levels of factors are displayed alone, without being prefixed with the factor name.  As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham, which deals with the margins better than I do.
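
The regular expression idea is simple enough to sketch in a few lines. This is an illustration of the approach rather than the code inside plotCoef(); the coefficient and factor names are hypothetical examples.

```r
# Hypothetical coefficient names as they come out of a fitted model
coefNames <- c("(Intercept)", "daySat", "daySun", "sexMale")

# Names of the factor variables in the model (assumed known)
factorVars <- c("day", "sex")

# Build a pattern matching any factor name at the start of the string
pattern <- paste0("^(", paste(factorVars, collapse = "|"), ")")

# Strip the factor name so only the level remains
cleanNames <- sub(pattern, "", coefNames)
cleanNames
```

Anchoring the pattern with ^ keeps it from touching a factor name that happens to appear in the middle of a level.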

Both of these changes made for a vast improvement over what I had available before.  Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file and is called plotCoef() and is very customizable, down to the color and line thickness.  I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be.  I sent the code to Dr. Gelman to hopefully be incorporated into his function, which I’m sure gets used by a lot more people than mine will.  Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed.  Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander.