Continuing with the newly available football data (new link) and inspired by a question from Drew Conway, I decided to look at the Giants’ play selection by down over the past 10 years.
Visually, we see that until 2011 the Giants preferred to run on first and second down. Third down is usually a do-or-die down so passes will dominate on third-and-long. The grey vertical lines mark Super Bowls XLII and XLVI.
Code for the graph after the break.
The packages to be used are:
require(plyr)
require(stringr)
require(ggplot2)
require(reshape2)
require(parallel)
require(doParallel)
We will be running this in parallel so let’s register a parallel backend.
cl <- makeCluster(2)
registerDoParallel(cl)
First we read the data for all available years.
theFiles <- file.path("../data", dir("../data/"))
# read the csvs in parallel
system.time(allGames <- adply(theFiles, .margins = 1, read.csv2,
                              header = TRUE, sep = ",",
                              stringsAsFactors = FALSE, .parallel = TRUE))
# took 10.98 seconds on a dual core 2,66 GHz machine
# save it so this does not have to be done again
save(allGames, file = "../data/allGames.rdata")
write.table(allGames, file = "../data/allGames.csv", row.names = FALSE, sep = ",")
# stop the cluster
stopCluster(cl)
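For readers without a parallel backend, the same read-and-stack step can be sketched in base R. This is only an illustration under stated assumptions: the two toy files, their columns and the gameid values are invented here and are not the real data set.

```r
# Hypothetical toy files written to a fresh temporary directory,
# standing in for the real ../data folder
tmpDir <- tempfile("toyGames")
dir.create(tmpDir)
write.csv(data.frame(gameid = "20110911_NYG@WAS", off = "NYG", down = 1),
          file.path(tmpDir, "2011.csv"), row.names = FALSE)
write.csv(data.frame(gameid = "20120905_DAL@NYG", off = "DAL", down = 2),
          file.path(tmpDir, "2012.csv"), row.names = FALSE)

# list the csvs, read each one, then stack them into a single data.frame
toyFiles <- list.files(tmpDir, pattern = "\\.csv$", full.names = TRUE)
toyList <- lapply(toyFiles, read.csv, stringsAsFactors = FALSE)
toyGames <- do.call(rbind, toyList)
toyGames
```

This runs serially, which is usually fine for a handful of files; the parallel version mostly pays off when there are many large files.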
We are only interested in the Giants on offense, so let’s narrow the data down to them.
# get just giants games
nyg <- allGames[str_detect(string = allGames$gameid, "NYG"), ]
# get just when they are on offense
nygOff <- nyg[nyg$off == "NYG", ]
To determine which plays were passes, runs, kickoffs, punts and field goals we need to process the description column a bit. To do this we will add a logical indicator column for each type of play, along with indicators for penalties and aborted plays.
# If the word pass is used, it was a pass
nygOff$Pass <- str_detect(string = nygOff$description, pattern = ignore.case(" pass "))
# same for punt
nygOff$Punt <- str_detect(string = nygOff$description, pattern = ignore.case(" punts "))
# and field goal
nygOff$FieldGoal <- str_detect(string = nygOff$description, pattern = ignore.case(" field goal "))
# and kick
nygOff$Kick <- str_detect(string = nygOff$description, pattern = ignore.case(" kicks "))
# This is for a penalty which we assume blows the play dead
nygOff$Penalty <- str_detect(string = nygOff$description, pattern = "^PENALTY")
# for cases where the intended play was aborted
nygOff$Aborted <- str_detect(string = nygOff$description, pattern = ignore.case("aborted"))
# if none of the other cases are true we assume it was a run
nygOff$Run <- rowSums(nygOff[, c("Pass", "Punt", "FieldGoal", "Kick", "Penalty", "Aborted")]) == 0
# which(rowSums(nygOff[, c('Pass', 'Punt', 'FieldGoal', 'Kick', 'Penalty', 'Aborted', 'Run')]) != 1)
# View(nygOff[which(rowSums(nygOff[, c('Pass', 'Punt', 'FieldGoal', 'Kick', 'Penalty', 'Aborted', 'Run')]) != 1), ])
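To see how this classification behaves, here is a minimal sketch on a few made-up play descriptions. The descriptions are invented for illustration and only imitate the style of the description column; base R's grepl stands in for str_detect so the snippet has no package dependencies, and only three of the indicator columns are built.

```r
# Invented descriptions (not taken from the real data)
toyPlays <- data.frame(
  description = c(
    "(12:05) E.Manning pass short right to H.Nicks for 8 yards",
    "(9:41) B.Jacobs up the middle for 3 yards",
    "(4:15) S.Weatherford punts 45 yards",
    "PENALTY on NYG, False Start, 5 yards"
  ),
  stringsAsFactors = FALSE
)

# one logical indicator per type of play
toyPlays$Pass <- grepl(" pass ", toyPlays$description, ignore.case = TRUE)
toyPlays$Punt <- grepl(" punts ", toyPlays$description, ignore.case = TRUE)
toyPlays$Penalty <- grepl("^PENALTY", toyPlays$description)

# anything not matched by another indicator is assumed to be a run
toyPlays$Run <- rowSums(toyPlays[, c("Pass", "Punt", "Penalty")]) == 0

# every row should have exactly one indicator set to TRUE
flagCount <- rowSums(toyPlays[, c("Pass", "Punt", "Penalty", "Run")])
toyPlays[, c("Pass", "Punt", "Penalty", "Run")]
```

Rows where flagCount is not 1 are the ones that need a manual look, which is the same check applied to the real data.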
After all that processing we end up with four rows that do not have exactly one of the indicator columns set to TRUE.
The first is the play that ended the horrible 2003 playoff game against the 49ers, where the Giants blew a huge halftime lead. This play was a muffed field goal attempt in which the holder, Matt Allen, attempted to pass the ball downfield, where Rich Seubert was the first player to touch it. The referees mistakenly thought he was an ineligible receiver and called a penalty, ending the game. The league later admitted that he was indeed eligible and that defensive pass interference should have been called. Since a game cannot end on a defensive penalty, the Giants should have had another field goal opportunity from much better field position. Clearly, this is still a sore point for Giants fans. Since this is a muffed field goal we will eliminate it from the data.
The next two are plays where the quarterback (Kerry Collins and Eli Manning) got a fumbled snap and then threw an incomplete pass. We will classify these plays as passes.
The last play is a fumble by St. Louis quarterback Sam Bradford which was recovered by Michael Boley for a Giants touchdown. That will be eliminated as well.
# find out the row numbers for the bad plays
badRows <- which(rowSums(nygOff[, c("Pass", "Punt", "FieldGoal", "Kick", "Penalty", "Aborted", "Run")]) != 1)
# the middle two are to be classified as passes so we set Aborted to FALSE
nygOff[badRows[2:3], "Aborted"] <- FALSE
# check which rows have the indicator variables summing to one and only keep those
nygOff <- nygOff[which(rowSums(nygOff[, c("Pass", "Punt", "FieldGoal", "Kick", "Penalty", "Aborted", "Run")]) == 1), ]
# check it worked
which(rowSums(nygOff[, c("Pass", "Punt", "FieldGoal", "Kick", "Penalty", "Aborted", "Run")]) != 1)
## named integer(0)
Now we narrow it down to just run and pass plays and count up each by season and down.
nygPassRun <- nygOff[nygOff$Pass | nygOff$Run, ]
playCount <- aggregate(cbind(Pass, Run) ~ season + down, nygPassRun, sum)
playCount$Plays <- with(playCount, Pass + Run)
# calculate the percent of each type of play
playCount$PassPct <- with(playCount, Pass/Plays)
playCount$RunPct <- with(playCount, Run/Plays)
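As a quick sanity check of the counting step, here is a small sketch on toy data. The six rows below are invented and only mimic the season, down, Pass and Run columns; they are not real Giants plays.

```r
# Invented plays: three first-down plays in each of two seasons
toyPassRun <- data.frame(
  season = c(2010, 2010, 2010, 2011, 2011, 2011),
  down   = c(1, 1, 1, 1, 1, 1),
  Pass   = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  Run    = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# summing logicals counts the TRUE values within each season/down group
toyCount <- aggregate(cbind(Pass, Run) ~ season + down, toyPassRun, sum)
toyCount$Plays <- with(toyCount, Pass + Run)
toyCount$PassPct <- with(toyCount, Pass / Plays)
toyCount$RunPct <- with(toyCount, Run / Plays)
toyCount
```

With this toy input, 2010 comes out at one pass in three plays and 2011 at two passes in three, so PassPct and RunPct always sum to one within a season and down.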
Time to plot the data.
playMelt <- melt(playCount[, c("season", "down", "PassPct", "RunPct")],
                 id.vars = c("season", "down"),
                 value.name = "Percent", variable.name = "Play")
playMelt$Play <- as.character(playMelt$Play)
playMelt$Play[playMelt$Play == "PassPct"] <- "Pass"
playMelt$Play[playMelt$Play == "RunPct"] <- "Run"
playMelt$Play <- factor(playMelt$Play, levels = c("Run", "Pass"))
ggplot(playMelt, aes(x = season, y = Percent, group = Play, color = Play)) +
    geom_line() +
    facet_wrap(~down, ncol = 1, scales = "free_y") +
    ggtitle("Type of Play by Down") +
    labs(x = "Season") +
    geom_vline(xintercept = c(2007, 2011), color = "grey", linetype = 2)
From this we can see that the Giants, traditionally known as a running team, mostly preferred the run over the pass on both first and second down, until the 2011 season when Eli became a truly dominant quarterback and passed more on those downs.
This does not take into account the time left in the game, the score or the yards to go, but that’s for another day.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences, and author of R for Everyone.