Analyzing Play-by-Play Data

Statistician

Data Scientist

Math & Statistics

\[ f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{x-\mu}{\sigma^2}} \]

Substantive Experience

Hacking Skills

Play-by-Play

Questions?

Pass vs Rush
Favored Receiver
Rush Direction
Blitzing

S

John Chambers

John Chambers * Bell Labs * 1976

R

gentleman ihaka

Robert Gentleman & Ross Ihaka * University of Auckland, New Zealand * 1993

Statistics

Modeling

Machine Learning

Hierarchical clustering of wine data

Graphics

The Data

Interesting Part

Read the Data

g14 <- read.csv2(file.path(dataDir, 'football', 'pbp-2014.csv'), 
                 sep=',', header=TRUE, stringsAsFactors=FALSE)
g15 <- read.csv2(file.path(dataDir, 'football', 'pbp-2015.csv'), 
                 sep=',', header=TRUE, stringsAsFactors=FALSE)

games <- bind_rows(g14, g15)

Giants

Pass vs Rush

Focus on One Team's Offense

oneOff <- games %>%
    filter(OffenseTeam == 'NYG', PlayType %in% c('PASS', 'RUSH')) %>%
    mutate(PlayType=factor(PlayType, levels=c('RUSH', 'PASS')), 
           Down=factor(Down, levels=c(1, 2, 3, 4)))

Pass vs Rush

ggplot(oneOff, aes(x=PlayType)) + geom_bar(fill='darkgrey')

Pass vs Rush by Down

ggplot(oneOff, aes(x=PlayType)) + geom_bar(fill='darkgrey') + facet_wrap(~Down)

Binary Regression

\[ p(y_i=1) = \text{logit}^{-1}(\boldsymbol{X}_i\boldsymbol{\beta}) \]

\[ \text{logit}^{-1}(x) = \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} \]

Pass vs Rush Modeled

passRushMod1 <- glm(PlayType ~ Down - 1, data=oneOff, family=binomial)
summary(passRushMod1)

Call:
glm(formula = PlayType ~ Down - 1, family = binomial, data = oneOff)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9174  -1.1486   0.5887   0.9610   1.2065  

Coefficients:
      Estimate Std. Error z value Pr(>|z|)    
Down1 -0.06811    0.06527  -1.044  0.29671    
Down2  0.53305    0.07709   6.914  4.7e-12 ***
Down3  1.66501    0.13034  12.774  < 2e-16 ***
Down4  1.27297    0.42762   2.977  0.00291 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2958.4  on 2134  degrees of freedom
Residual deviance: 2672.6  on 2130  degrees of freedom
AIC: 2680.6

Number of Fisher Scoring iterations: 4

Pass vs Rush Modeled

coefplot(passRushMod1, title='Probability of Pass')

Inverse Logit

invlogit <- function(x)
{
    1/(1 + exp(-x))
}

round(invlogit(-6:6), 2)

 [1] 0.00 0.01 0.02 0.05 0.12 0.27 0.50 0.73 0.88 0.95 0.98 0.99 1.00

Probabilities

coefplot(passRushMod1, trans=invlogit, title='Probability of Pass')

With Yards to Gain

passRushMod2 <- glm(PlayType ~ Down + ToGo - 1, data=oneOff, family=binomial)
coefplot(passRushMod2, trans=invlogit, title='Probability of Pass')

Scenarios

# make grid of scenarios
scenarios <- expand.grid(ToGo=1:15, Down=1:4) %>% as.tbl %>% 
    mutate(Down=factor(Down, levels=c(1, 2, 3, 4)))
# make prediction based on model
scenarioPredict <- predict(passRushMod2, 
                           newdata=scenarios, type='response', se.fit=TRUE)
# build confidence intervals
scenarios <- scenarios %>% mutate(Prediction=scenarioPredict$fit, 
                                  Lower=Prediction - 2*scenarioPredict$se.fit,
                                  Upper=Prediction + 2*scenarioPredict$se.fit)

Probability of Pass

ggplot(scenarios, aes(x=ToGo)) + scale_y_continuous(label=scales::percent) +
    geom_ribbon(aes(ymin=Lower, ymax=Upper), fill='lightgrey') +
    geom_line(aes(y=Prediction)) + facet_wrap(~Down, nrow=2)

Who Gets the Pass?

onePass <- oneOff %>% filter(PlayType == 'PASS') %>%
    mutate(Receiver=str_extract_all(Description, pattern=' \\d{1,2}-\\w\\.\\w+( |\\.)',
                                    simplify=TRUE) %>% `[`(, 2), 
           Receiver=str_replace_all(Receiver, '(\\d{1,2}-)|(\\.$)', ''), 
           Receiver= str_trim(Receiver)) %>% 
                      filter(Receiver %in% 
                                 c("L.DONNELL", "V.CRUZ", "R.RANDLE", 
                                   "O.BECKHAM", "H.NICKS"))

Who Gets the Pass?

Decision Tree

\[ \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k) \]

Decision Tree

passTree <- rpart(Receiver ~ Down + ToGo + Quarter + Minute + Second + 
                      YardLine + YardLineDirection + Formation, 
                  data=onePass)

Decision Tree

rpart.plot(passTree, extra=104, branch=1, uniform=FALSE, varlen=0, faclen=9, under=TRUE,
           box.col=c(8), border.col=1, shadow.col='darkgray', col=4, eq=": ",
           branch.col='blue', nspace=.1, minbranch=.9, cex=.8)

Decision Tree

What Else?

Direction of Rush
Length of Pass

Better Models

Include More Variables
More Advanced Machine Learning

Game Management

When to Blitz
When to Call Timeouts
When to Go For it on Fourth Down
When to Go For Two

Personnel

Better Drafting
Free Agent Valuation
Salary Cap Management
Player Comparison

Jared P. Lander

Chief Data Scientist of Lander Analytics
Author of R for Everyone
Adjunct Professor at Columbia University
Organizer of New York Open Statistical Programming (The R) Meetup
Website: http://www.jaredlander.com

Statistician

Data Scientist

Math & Statistics

Substantive Experience

Hacking Skills

Play-by-Play

Questions?

S

R

Statistics

Modeling

Machine Learning

Graphics

The Data

Interesting Part

Read the Data

Giants

Pass vs Rush

Focus on One Team's Offense

Pass vs Rush

Pass vs Rush by Down

Binary Regression

Pass vs Rush Modeled

Pass vs Rush Modeled

Inverse Logit

Probabilities

With Yards to Gain

Scenarios

Probability of Pass

Who Gets the Pass?

Who Gets the Pass?

Who Gets the Pass?

Decision Tree

Decision Tree

Decision Tree

Decision Tree

What Else?

Better Models

Game Management

Personnel

Jared P. Lander

Packages

The Tools