plot of chunk plot-ggplot

For a d3 bar plot visit http://www.jaredlander.com/plots/PizzaPollPlot.html.



I finally compiled the data from all the pizza polling I’ve been doing at the New York R meetups. The data are available as json at http://www.jaredlander.com/data/PizzaPollData.php.

This is easy enough to plot in R using ggplot2.

require(rjson)
require(plyr)
pizzaJson <- fromJSON(file = "http://jaredlander.com/data/PizzaPollData.php")
pizza <- ldply(pizzaJson, as.data.frame)
head(pizza)
##   polla_qid      Answer Votes pollq_id                Question
## 1         2   Excellent     0        2  How was Pizza Mercato?
## 2         2        Good     6        2  How was Pizza Mercato?
## 3         2     Average     4        2  How was Pizza Mercato?
## 4         2        Poor     1        2  How was Pizza Mercato?
## 5         2 Never Again     2        2  How was Pizza Mercato?
## 6         3   Excellent     1        3 How was Maffei's Pizza?
##            Place      Time TotalVotes Percent
## 1  Pizza Mercato 1.344e+09         13  0.0000
## 2  Pizza Mercato 1.344e+09         13  0.4615
## 3  Pizza Mercato 1.344e+09         13  0.3077
## 4  Pizza Mercato 1.344e+09         13  0.0769
## 5  Pizza Mercato 1.344e+09         13  0.1538
## 6 Maffei's Pizza 1.348e+09          7  0.1429
require(ggplot2)
ggplot(pizza, aes(x = Place, y = Percent, group = Answer, color = Answer)) + 
    geom_line() + theme(axis.text.x = element_text(angle = 46, hjust = 1), legend.position = "bottom") + 
    labs(x = "Pizza Place", title = "Pizza Poll Results")

plot of chunk plot-ggplot

But given this is live data that will change as more polls are added I thought it best to use a plot that automatically updates and is interactive. So this gave me my first chance to need rCharts by Ramnath Vaidyanathan as seen at October’s meetup.

require(rCharts)
pizzaPlot <- nPlot(Percent ~ Place, data = pizza, type = "multiBarChart", group = "Answer")
pizzaPlot$xAxis(axisLabel = "Pizza Place", rotateLabels = -45)
pizzaPlot$yAxis(axisLabel = "Percent")
pizzaPlot$chart(reduceXTicks = FALSE)
pizzaPlot$print("chart1", include_assets = TRUE)

Unfortunately I cannot figure out how to insert this in WordPress so please see the chart at http://www.jaredlander.com/plots/PizzaPollPlot.html. Or see the badly sized one below.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on linecharts and inserting chart titles. I found a workaround for the categorical x-axis by using tickFormat but that is not pretty. I also would like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

The wonderful people at Gilt are having me teach an introductory course on R this Friday.

The class starts with the very basics such as variable types, vectors, data.frames and matrices.  After that we explore munging data with aggregate, plyr and reshape2.  Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.

Most of the material comes from my upcoming book R for Everyone.

Participants are encouraged to bring computers so they can code along with the live examples.  They should also have R and RStudio preinstalled.

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (geom_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's, height with width and vice versa. It's hacky and not tested but seems to work as I'll show below.

First the new functions:

require(proto)
## Loading required package: proto
collidev <- function(data, height = NULL, name, strategy, check.height = TRUE) {
    if (!is.null(height)) {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data <- within(data, {
                ymin <- y - height/2
                ymax <- y + height/2
            })
        }
    } else {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data$ymin <- data$y
            data$ymax <- data$y
        }
        heights <- unique(with(data, ymax - ymin))
        heights <- heights[!is.na(heights)]
        if (!zero_range(range(heights))) {
            warning(name, " requires constant height: output may be incorrect", 
                call. = FALSE)
        }
        height <- heights[1]
    }
    data <- data[order(data$ymin), ]
    intervals <- as.numeric(t(unique(data[c("ymin", "ymax")])))
    intervals <- intervals[!is.na(intervals)]
    if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < -1e-06)) {
        warning(name, " requires non-overlapping y intervals", call. = FALSE)
    }
    if (!is.null(data$xmax)) {
        ddply(data, .(ymin), strategy, height = height)
    } else if (!is.null(data$x)) {
        message("xmax not defined: adjusting position using x instead")
        transform(ddply(transform(data, xmax = x), .(ymin), strategy, height = height), 
            x = xmax)
    } else {
        stop("Neither x nor xmax defined")
    }
}

pos_dodgev <- function(df, height) {
    n <- length(unique(df$group))
    if (n == 1) 
        return(df)
    if (!all(c("ymin", "ymax") %in% names(df))) {
        df$ymin <- df$y
        df$ymax <- df$y
    }
    d_width <- max(df$ymax - df$ymin)
    diff <- height - d_width
    groupidx <- match(df$group, sort(unique(df$group)))
    df$y <- df$y + height * ((groupidx - 0.5)/n - 0.5)
    df$ymin <- df$y - d_width/n/2
    df$ymax <- df$y + d_width/n/2
    df
}

position_dodgev <- function(width = NULL, height = NULL) {
    PositionDodgev$new(width = width, height = height)
}

PositionDodgev <- proto(ggplot2:::Position, {
    objname <- "dodgev"

    adjust <- function(., data) {
        if (empty(data)) 
            return(data.frame())
        check_required_aesthetics("y", names(data), "position_dodgev")

        collidev(data, .$height, .$my_name(), pos_dodgev, check.height = TRUE)
    }

})

Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.

# get tips data
data(tips, package = "reshape2")

# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)

# build data/frame with coefficients and confidence intervals and combine
# them into one data.frame
require(coefplot)
## Loading required package: coefplot
## Loading required package: ggplot2
df1 <- coefplot(mod1, plot = FALSE, name = "Base", shorten = FALSE)
df2 <- coefplot(model = mod2, plot = FALSE, name = "Interaction", shorten = FALSE)
theDF <- rbind(df1, df2)
theDF
##    LowOuter HighOuter LowInner HighInner     Coef            Name Checkers
## 1    1.9803    3.3065  2.31183    2.9750  2.64340     (Intercept)  Numeric
## 2   -0.4685    0.9325 -0.11822    0.5822  0.23202          daySat      day
## 3   -0.2335    1.1921  0.12291    0.8357  0.47929          daySun      day
## 4   -0.6790    0.7672 -0.31745    0.4056  0.04408         dayThur      day
## 5   -0.2053    0.5524 -0.01589    0.3630  0.17354         sexMale      sex
## 6    1.8592    3.7030  2.32016    3.2421  2.78111     (Intercept)  Numeric
## 7   -1.0391    1.0804 -0.50921    0.5506  0.02067          daySat      day
## 8   -0.5430    1.7152  0.02156    1.1507  0.58611          daySun      day
## 9   -1.2490    0.8380 -0.72725    0.3163 -0.20549         dayThur      day
## 10  -1.3589    1.1827 -0.72349    0.5473 -0.08811         sexMale      sex
## 11  -1.0502    1.7907 -0.34000    1.0804  0.37022  daySat:sexMale  day:sex
## 12  -1.5324    1.4149 -0.79560    0.6781 -0.05877  daySun:sexMale  day:sex
## 13  -0.9594    1.9450 -0.23328    1.2189  0.49282 dayThur:sexMale  day:sex
##          CoefShort       Model
## 1      (Intercept)        Base
## 2           daySat        Base
## 3           daySun        Base
## 4          dayThur        Base
## 5          sexMale        Base
## 6      (Intercept) Interaction
## 7           daySat Interaction
## 8           daySun Interaction
## 9          dayThur Interaction
## 10         sexMale Interaction
## 11  daySat:sexMale Interaction
## 12  daySun:sexMale Interaction
## 13 dayThur:sexMale Interaction
# build the plot
require(ggplot2)
require(plyr)
## Loading required package: plyr
ggplot(theDF, aes(y = Name, x = Coef, color = Model)) + geom_vline(xintercept = 0, 
    linetype = 2, color = "grey") + geom_errorbarh(aes(xmin = LowOuter, xmax = HighOuter), 
    height = 0, lwd = 0, position = position_dodgev(height = 1)) + geom_errorbarh(aes(xmin = LowInner, 
    xmax = HighInner), height = 0, lwd = 1, position = position_dodgev(height = 1)) + 
    geom_point(position = position_dodgev(height = 1), aes(xmax = Coef))

plot of chunk make-Plot

Compare that to the multiplot function in coefplot that was built using geom_dodge and coord_flip.

multiplot(mod1, mod2, shorten = F, names = c("Base", "Interaction"))

plot of chunk multiplot

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.

Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

plot of chunk plot-play-by-down

Continuing with the newly available football data and inspired by a question from Drew Conway I decided to look at play selection based on down by the Giants for the past 10 years.

Visually, we see that until 2011 the Giants preferred to run on first and second down.  Third down is usually a do-or-die down so passes will dominate on third-and-long.  The grey vertical lines mark Super Bowls XLII and XLVI.

Code for the graph after the break.

Continue reading

plot of chunk make-graph

With the recent availability of play-by-play NFL data I got to analyzing my favorite team, the New York Giants with some very hasty EDA.

From the above graph you can see that on 1st down Eli preferred to throw to Hakim Nicks and on 2nd and 3rd downs he slightly favored Victor Cruz.

The code for the analysis is after the break.

Continue reading

A friend of mine has told me on numerous occasions that since 1960 the Yankees have not won a World Series while a Republican was President.  Upon hearing this my Republican friends (both Yankee and Red Sox fans) turn incredulous and say that this is ridiculous.  So I decided to investigate.  To be clear this is in no way shows causality, but just checks the numbers.

The data was easily attainable so it really came down to plotting.

The plot above shows every Yankee win (and loss) since 1960 and the party of the President at the time.  It is clear to see that all nine Yankees World Series wins came while a Democrat inhabited the White House.  The fluctuation plot below shows Yankee wins both before and after 1960 and the complete lack of a block for Republican/Post-1960 simply makes the case.

There are similar plots for the American League after the jump.

Continue reading


Distribution of Lottery Winners based on 1,000 Simulations

With tonight’s Mega Millions jackpot estimated to be over $640 million there are long lines of people waiting to buy tickets.  Of course you always hear about the probability of winning which is easy enough to calculate:  Five numbers ranging from 1 through 56 are drawn (without replacement) then a sixth ball is pulled from a set of 1 through 46.  That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations.  That is why people are constantly reminded of how unlikely they are to win.

But I want to see how likely it is that SOMEONE will win tonight.  So let’s break out R and ggplot!

As of this afternoon it was reported (sorry no source) that two tickets were sold for every American.  So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.

So we shouldn’t be surprised if there are multiple winners tonight.

The R code:

winners <- rbinom(n=1000, size=600000000, prob=1/175000000)
qplot(winners, geom="histogram", binwidth=1, xlab="Number of Winners")

With the Super Bowl only hours away now is your last chance to buy your boxes.  Assuming the last digits are not assigned randomly you can maximize your chances with a little analysis.  While I’ve seen plenty of sites giving the raw numbers, I thought a little visualization was in order.

In the graph above (made using ggplot2 in R, of course) the bigger squares represent greater frequency.  The axes are labelled “Home” and “Away” for orientation, but in the Super Bowl that probably doesn’t matter too much, especially considering that Indianapolis is (Peyton) Manning territory so the locals will most likely be rooting for the Giants.  Further, I believe Super Bowl XLII, featuring the same two teams, had a disproportionate number of Giants fans.  Bias disclaimer:  GO BIG BLUE!!!

Below is the same graph broken down by year to see how the distribution has changed over the past 20 years.

All the data was scraped from Pro Football Reference.  All of my code and other graphs that didn’t make the cut are at my github site.

As always, send any questions my way.

This graphs shows received and sent texts by month.  Notice the spike in July 2010.
Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.


Fig. 2: This graph shows my text messaging pattern over time for both men and women. Notice the crossover around August 2010.

More interestingly, is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis excludes her and family members because they would bias the gender effect. Being a good statistician I ran a poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.


Fig. 3: Here the “Male” coefficient seems statistically insignificant but its direction makes sense so it is left in the model. The “Intercept” is interpreted as the texting rate with females before the breakup, “Epoch” is the increase with females after the breakup, “Intercept” plus “Male” is the rate with males before the breakup. “Epoch” combined with “Male:Epoch” is the change in rate for texts with males after the breakup.

Further analysis and a how-to after the break.

Continue reading

A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right.  I’ve seen people on Twitter asking how to build this and there has been an option available using Andy Gelman’s coefplot() in the arm package.  Not knowing this I built my own (as seen in this post about taste testing tomatoes) and they both suffered the same problems:.  Long coefficient names often got cut off by the left margin of the graph and the name of the variable was appended to all the levels of a factor.  One big difference between his and mine is that his does not include the Intercept by default.  Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions.  Now the levels of factors are displayed alone, without being prepended by the factor name.  As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham which deals with the margins better than I do.

Both of these changes made for a vast improvement over what I had avialable before.  Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file and is called plotCoef() and is very customizable, down to the color and line thickness.  I kept my old version, plotCoefBase(), in the file in case some people are adverse to using ggplot, though no one should be.  I sent the code to Dr. Gelman to hopefully be incorporated into his function which I’m sure gets used by a lot more people than mine will.  Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed.  Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander. Continue reading