Based on data collected from polls conducted at the beginning of the New York Open Statistical Programming meetups.

# Category Archives: Statistics

# My Book is Out

After two years of writing, editing, proofreading and checking, my book, R for Everyone, is finally out!

There are so many people who helped me along the way, especially my editor Debra Williams, production editor Caroline Senay and the man who recruited me to write it in the first place, Paul Dix. Even more people helped throughout the long process, but with so many to mention I’ll leave that to the acknowledgements page.

Online resources for the book are available (http://www.jaredlander.com/r-for-everyone/) and will continue to be updated.

As of now the three major sites to purchase the book are Amazon, Barnes & Noble (available in stores January 3rd) and InformIT. And of course digital versions are available.

# Drawing Balls From an Urn

A friend recently posted the following problem:

There are 10 green balls, 20 red balls, and 25 blue balls in a jar. I choose a ball at random. If I choose a green ball then I take out all the green balls, if I choose a red ball then I take out all the red balls, and if I choose a blue ball I take out all the blue balls. What is the probability that I will choose a red ball on my second try?

The math works out fairly easily. It’s the probability of first drawing a green ball **AND** then drawing a red ball, **OR** the probability of drawing a blue ball **AND** then drawing a red ball.

\[
\frac{10}{10+20+25} \times \frac{20}{20+25} + \frac{25}{10+20+25} \times \frac{20}{10+20} \approx 0.3838
\]

But I always prefer simulations over probability so let’s break out the R code like we did for the Monty Hall Problem and calculating lottery odds. The results are after the break.
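In the meantime, here is a minimal sketch of how such a simulation might look. The function name `drawUrn`, the seed and the number of replications are assumptions made for this illustration, not necessarily what appears in the full post.

```r
# Hypothetical helper: simulate one round of the urn game and
# return the color of the second ball drawn.
drawUrn <- function(counts = c(green = 10, red = 20, blue = 25))
{
    # first draw: pick a color with probability proportional to its count
    first <- sample(names(counts), size = 1, prob = counts)
    # remove every ball of the color just drawn
    remaining <- counts[names(counts) != first]
    # second draw from what is left in the jar
    sample(names(remaining), size = 1, prob = remaining)
}

set.seed(1234)  # arbitrary seed so the run is repeatable
secondDraw <- replicate(n = 10000, expr = drawUrn())
# proportion of second draws that are red; should land near 0.3838
mean(secondDraw == "red")
```

Because a color is removed entirely after the first draw, the second draw only ever sees two colors, which is exactly what the two terms in the formula above account for.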

# Pizza Poll Results

For a d3 bar plot visit http://www.jaredlander.com/plots/PizzaPollPlot.html.

I finally compiled the data from all the pizza polling I’ve been doing at the New York R meetups. The data are available as json at http://www.jaredlander.com/data/PizzaPollData.php.

This is easy enough to plot in R using ggplot2.

```
require(rjson)
require(plyr)
pizzaJson <- fromJSON(file = "http://jaredlander.com/data/PizzaPollData.php")
pizza <- ldply(pizzaJson, as.data.frame)
head(pizza)
```

```
##   polla_qid      Answer Votes pollq_id                Question
## 1         2   Excellent     0        2  How was Pizza Mercato?
## 2         2        Good     6        2  How was Pizza Mercato?
## 3         2     Average     4        2  How was Pizza Mercato?
## 4         2        Poor     1        2  How was Pizza Mercato?
## 5         2 Never Again     2        2  How was Pizza Mercato?
## 6         3   Excellent     1        3 How was Maffei's Pizza?
##            Place      Time TotalVotes Percent
## 1  Pizza Mercato 1.344e+09         13  0.0000
## 2  Pizza Mercato 1.344e+09         13  0.4615
## 3  Pizza Mercato 1.344e+09         13  0.3077
## 4  Pizza Mercato 1.344e+09         13  0.0769
## 5  Pizza Mercato 1.344e+09         13  0.1538
## 6 Maffei's Pizza 1.348e+09          7  0.1429
```

```
require(ggplot2)
ggplot(pizza, aes(x = Place, y = Percent, group = Answer, color = Answer)) +
    geom_line() +
    theme(axis.text.x = element_text(angle = 46, hjust = 1), legend.position = "bottom") +
    labs(x = "Pizza Place", title = "Pizza Poll Results")
```

But given this is live data that will change as more polls are added I thought it best to use a plot that automatically updates and is interactive. So this gave me my first chance to *need* rCharts by Ramnath Vaidyanathan as seen at October’s meetup.

```
require(rCharts)
pizzaPlot <- nPlot(Percent ~ Place, data = pizza, type = "multiBarChart", group = "Answer")
pizzaPlot$xAxis(axisLabel = "Pizza Place", rotateLabels = -45)
pizzaPlot$yAxis(axisLabel = "Percent")
pizzaPlot$chart(reduceXTicks = FALSE)
pizzaPlot$print("chart1", include_assets = TRUE)
```

Unfortunately I cannot figure out how to insert this in WordPress so please see the chart at http://www.jaredlander.com/plots/PizzaPollPlot.html. Or see the badly sized one below.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on line charts and how to insert chart titles. I found a workaround for the categorical x-axis by using `tickFormat`, but that is not pretty. I also would like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

# Books from the NYC Data Mafia

Attending this week’s Strata conference, it was easy to see just how prolific the NYC Data Mafia is when it comes to writing. Some of the books I found:

Books from the #nycdatamafia @drewconway @johnmyleswhite http://t.co/EuV4FF6JA7 pic.twitter.com/Oi8tVcjPYE

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @mikedewar http://t.co/w2oCS2jLvN pic.twitter.com/yiq9x6SG3y

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @wesmckinn http://t.co/jhUPSrtTOE pic.twitter.com/ri5eUhWwY0

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia Rachel Schutt @mathbabedotorg http://t.co/EVI6HanjUb pic.twitter.com/yTL0fXQGBK

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @HarlanH @wahalulu http://t.co/6CjAvGsHRL pic.twitter.com/0DwMqSmNve

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @qethanm http://t.co/Hy82gz4tGe pic.twitter.com/Uba15XIhLT

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @pauldix http://t.co/Tdw0MSF5B7 pic.twitter.com/4rmpk5UuYf

— NYC Data Hackers (@nyhackr) October 29, 2013

And, of course, my book will be out soon to join them.

# Class at Gilt

The wonderful people at Gilt are having me teach an introductory course on R this Friday.

The class starts with the very basics such as variable types, vectors, data.frames and matrices. After that we explore munging data with aggregate, plyr and reshape2. Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.

Most of the material comes from my upcoming book R for Everyone.

Participants are encouraged to bring computers so they can code along with the live examples. They should also have R and RStudio preinstalled.

# The Monty Hall Problem

Michael Malecki recently shared a link to a Business Insider article that discussed the Monty Hall Problem.

The problem starts with three doors, one of which has a car and two of which have a goat. You choose one door at random and then the host reveals one door (not the one you chose) that holds a goat. You can then choose to stick with your door or choose the third, remaining door.

Probability theory states that people who switch win the car two-thirds of the time and those who don’t switch win only one-third of the time.
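Those fractions can be checked by brute force before running any simulation. The sketch below enumerates the nine equally likely combinations of car location and initial pick; the names `outcomes`, `stayWins` and `switchWins` are just for illustration.

```r
# every combination of where the car is and which door is picked first;
# all nine combinations are equally likely
outcomes <- expand.grid(car = 1:3, pick = 1:3)
# staying wins only when the first pick was already the car
stayWins <- outcomes$car == outcomes$pick
# the host always removes a goat, so switching wins whenever
# the first pick was NOT the car
switchWins <- !stayWins
mean(stayWins)    # 3 of the 9 cases
mean(switchWins)  # 6 of the 9 cases
```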

But people often still do not believe they should switch based on the probability argument alone. So let’s run some simulations.

This function randomly assigns goats and cars behind three doors, chooses a door at random, reveals a goat door, then either switches doors or does not.

```
monty <- function(switch=TRUE)
{
    # randomly assign goats and cars
    doors <- sample(x=c("Car", "Goat", "Goat"), size=3, replace=FALSE)
    # randomly choose a door
    doorChoice <- sample(1:3, size=1)
    # get goat doors
    goatDoors <- which(doors == "Goat")
    # show a door with a goat
    goatDoor <- goatDoors[which(goatDoors != doorChoice)][1]
    if(switch)
    {
        # if we are switching choose the other remaining door
        return(doors[-c(doorChoice, goatDoor)])
    }else
    {
        # otherwise keep the current door
        return(doors[doorChoice])
    }
}
```

Now we simulate switching 10,000 times and not switching 10,000 times.

```
withSwitching <- replicate(n = 10000, expr = monty(switch = TRUE), simplify = TRUE)
withoutSwitching <- replicate(n = 10000, expr = monty(switch = FALSE), simplify = TRUE)
head(withSwitching)
```

```
## [1] "Goat" "Car" "Car" "Goat" "Car" "Goat"
```

```
head(withoutSwitching)
```

```
## [1] "Goat" "Car" "Car" "Car" "Car" "Car"
```

```
mean(withSwitching == "Car")
```

```
## [1] 0.6678
```

```
mean(withoutSwitching == "Car")
```

```
## [1] 0.3408
```
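How close should 10,000 simulated games land to the theoretical values? A rough check uses the usual binomial standard error and treats each game as independent; the variable names here are only for illustration.

```r
n <- 10000
# theoretical win probabilities for switching and for staying
theory <- c(switch = 2/3, stay = 1/3)
# simulated proportions reported above
observed <- c(switch = 0.6678, stay = 0.3408)
# binomial standard error of a proportion based on n games
se <- sqrt(theory * (1 - theory) / n)
# distance from theory, measured in standard errors
(observed - theory) / se
```

Both simulated proportions fall within two standard errors of the theoretical values, which is about what random noise alone would produce.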

Plotting the results really shows the difference.

```
require(ggplot2)
```

```
## Loading required package: ggplot2
```

```
require(scales)
```

```
## Loading required package: scales
```

```
qplot(withSwitching, geom = "bar", fill = withSwitching) +
    scale_fill_manual("Prize", values = c(Car = muted("blue"), Goat = "orange")) +
    xlab("Switch") + ggtitle("Monty Hall with Switching")
```

```
qplot(withoutSwitching, geom = "bar", fill = withoutSwitching) +
    scale_fill_manual("Prize", values = c(Car = muted("blue"), Goat = "orange")) +
    xlab("Don't Switch") + ggtitle("Monty Hall without Switching")
```

(How are these colors? I’m trying out some new combinations.)

This clearly shows that switching is the best strategy.

The New York Times has a nice simulator that lets you play with actual doors.

# NYC Evacuation Map in R

Given the warnings for today’s winter storm, or lack of panic, I thought it would be a good time to plot the NYC evacuation maps using R. Of course these are already available online, provided by the city, but why not build them in R as well?

I obtained the shapefiles from NYC Open Data on February 28th, so it’s possible they are the new shapefiles redrawn after Hurricane Sandy, but I am not certain.

First we need the appropriate packages which are mostly included in maptools, rgeos and ggplot2.

```
require(maptools)
```

```
## Loading required package: maptools
## Loading required package: foreign
## Loading required package: sp
## Loading required package: lattice
## Checking rgeos availability: TRUE
```

```
require(rgeos)
```

```
## Loading required package: rgeos
## Loading required package: stringr
## Loading required package: plyr
## rgeos: (SVN revision 348) GEOS runtime version: 3.3.5-CAPI-1.7.5 Polygon
## checking: TRUE
```

```
require(ggplot2)
```

```
## Loading required package: ggplot2
```

```
require(plyr)
require(grid)
```

```
## Loading required package: grid
```

Then we read in the shape files, fortify them to turn them into a data.frame for easy plotting then join that back into the original data to get zone information.

```
# read the shape file
evac <- readShapeSpatial("../data/Evac_Zones_with_Additions_20121026/Evac_Zones_with_Additions_20121026.shp")
# necessary for some of our work
gpclibPermit()
```

```
## [1] TRUE
```

```
# create ID variable
evac@data$id <- rownames(evac@data)
# fortify the shape file
evac.points <- fortify(evac, region = "id")
# join in info from data
evac.df <- join(evac.points, evac@data, by = "id")
# modified data
head(evac.df)
```

```
##      long    lat order  hole piece group id Neighbrhd CAT1NNE Shape_Leng
## 1 1003293 239790     1 FALSE     1   0.1  0      <NA>       A       9121
## 2 1003313 239782     2 FALSE     1   0.1  0      <NA>       A       9121
## 3 1003312 239797     3 FALSE     1   0.1  0      <NA>       A       9121
## 4 1003301 240165     4 FALSE     1   0.1  0      <NA>       A       9121
## 5 1003337 240528     5 FALSE     1   0.1  0      <NA>       A       9121
## 6 1003340 240550     6 FALSE     1   0.1  0      <NA>       A       9121
##   Shape_Area
## 1    2019091
## 2    2019091
## 3    2019091
## 4    2019091
## 5    2019091
## 6    2019091
```

```
# as opposed to the original data
head(evac@data)
```

```
##   Neighbrhd CAT1NNE Shape_Leng Shape_Area id
## 0      <NA>       A       9121    2019091  0
## 1      <NA>       A      12250      54770  1
## 2      <NA>       A      10013    1041886  2
## 3      <NA>       B      11985    3462377  3
## 4      <NA>       B       5816    1515518  4
## 5      <NA>       B       5286     986675  5
```

I’ve begun working on a package to make this step, and later ones, easier, but it is far from ready for production. For those who want to see it (and contribute) it is available at https://github.com/jaredlander/mapping. The idea is to make mapping (including faceting!) doable with one or two lines of code.

Now it is time for the plot.

```
ggplot(evac.df, aes(x = long, y = lat)) +
    geom_path(aes(group = group)) +
    geom_polygon(aes(group = group, fill = CAT1NNE)) +
    list(theme(panel.grid.major = element_blank(),
               panel.grid.minor = element_blank(),
               axis.text.x = element_blank(),
               axis.text.y = element_blank(),
               axis.ticks = element_blank(),
               panel.background = element_blank())) +
    coord_equal() +
    labs(x = NULL, y = NULL) +
    theme(plot.margin = unit(c(1, 1, 1, 1), "mm")) +
    scale_fill_discrete("Zone")
```

There are clearly a number of things I would change about this plot, including filling in the non-evacuation regions, connecting the borders and shrinking the margins. Perhaps some of this can be accomplished by combining this information with another shapefile of the city, but that is beyond today’s code.

# Vertical Dodging in ggplot2

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (position_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's and height with width. It's hacky and not tested but seems to work, as I'll show below.

First the new functions:

```
require(proto)
```

```
## Loading required package: proto
```

```
collidev <- function(data, height = NULL, name, strategy, check.height = TRUE) {
    if (!is.null(height)) {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data <- within(data, {
                ymin <- y - height/2
                ymax <- y + height/2
            })
        }
    } else {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data$ymin <- data$y
            data$ymax <- data$y
        }
        heights <- unique(with(data, ymax - ymin))
        heights <- heights[!is.na(heights)]
        if (!zero_range(range(heights))) {
            warning(name, " requires constant height: output may be incorrect",
                call. = FALSE)
        }
        height <- heights[1]
    }
    data <- data[order(data$ymin), ]
    intervals <- as.numeric(t(unique(data[c("ymin", "ymax")])))
    intervals <- intervals[!is.na(intervals)]
    if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < -1e-06)) {
        warning(name, " requires non-overlapping y intervals", call. = FALSE)
    }
    if (!is.null(data$xmax)) {
        ddply(data, .(ymin), strategy, height = height)
    } else if (!is.null(data$x)) {
        message("xmax not defined: adjusting position using x instead")
        transform(ddply(transform(data, xmax = x), .(ymin), strategy, height = height),
            x = xmax)
    } else {
        stop("Neither x nor xmax defined")
    }
}

pos_dodgev <- function(df, height) {
    n <- length(unique(df$group))
    if (n == 1)
        return(df)
    if (!all(c("ymin", "ymax") %in% names(df))) {
        df$ymin <- df$y
        df$ymax <- df$y
    }
    d_width <- max(df$ymax - df$ymin)
    diff <- height - d_width
    groupidx <- match(df$group, sort(unique(df$group)))
    df$y <- df$y + height * ((groupidx - 0.5)/n - 0.5)
    df$ymin <- df$y - d_width/n/2
    df$ymax <- df$y + d_width/n/2
    df
}

position_dodgev <- function(width = NULL, height = NULL) {
    PositionDodgev$new(width = width, height = height)
}

PositionDodgev <- proto(ggplot2:::Position, {
    objname <- "dodgev"
    adjust <- function(., data) {
        if (empty(data))
            return(data.frame())
        check_required_aesthetics("y", names(data), "position_dodgev")
        collidev(data, .$height, .$my_name(), pos_dodgev, check.height = TRUE)
    }
})
```
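The heart of `pos_dodgev` is the line that shifts each group's `y` value. To make the arithmetic concrete, here is that calculation pulled out into a standalone helper; `dodgeOffsets` is a hypothetical name used only for this illustration and is not part of ggplot2.

```r
# Hypothetical helper mirroring the shift applied in pos_dodgev:
# height * ((groupidx - 0.5) / n - 0.5)
dodgeOffsets <- function(n, height = 1)
{
    groupidx <- seq_len(n)
    height * ((groupidx - 0.5) / n - 0.5)
}

dodgeOffsets(2)  # two groups: -0.25 and 0.25
dodgeOffsets(3)  # three groups: -1/3, 0 and 1/3
```

The offsets are symmetric around zero and stay inside a band of the given `height`, so with `height = 1` two models sit a quarter of a unit below and above the original position on the categorical axis.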

Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.

```
# get tips data
data(tips, package = "reshape2")
# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)
# build data/frame with coefficients and confidence intervals and combine
# them into one data.frame
require(coefplot)
```

```
## Loading required package: coefplot
```

```
## Loading required package: ggplot2
```

```
df1 <- coefplot(mod1, plot = FALSE, name = "Base", shorten = FALSE)
df2 <- coefplot(model = mod2, plot = FALSE, name = "Interaction", shorten = FALSE)
theDF <- rbind(df1, df2)
theDF
```

```
##    LowOuter HighOuter LowInner HighInner     Coef            Name Checkers
## 1    1.9803    3.3065  2.31183    2.9750  2.64340     (Intercept)  Numeric
## 2   -0.4685    0.9325 -0.11822    0.5822  0.23202          daySat      day
## 3   -0.2335    1.1921  0.12291    0.8357  0.47929          daySun      day
## 4   -0.6790    0.7672 -0.31745    0.4056  0.04408         dayThur      day
## 5   -0.2053    0.5524 -0.01589    0.3630  0.17354         sexMale      sex
## 6    1.8592    3.7030  2.32016    3.2421  2.78111     (Intercept)  Numeric
## 7   -1.0391    1.0804 -0.50921    0.5506  0.02067          daySat      day
## 8   -0.5430    1.7152  0.02156    1.1507  0.58611          daySun      day
## 9   -1.2490    0.8380 -0.72725    0.3163 -0.20549         dayThur      day
## 10  -1.3589    1.1827 -0.72349    0.5473 -0.08811         sexMale      sex
## 11  -1.0502    1.7907 -0.34000    1.0804  0.37022  daySat:sexMale  day:sex
## 12  -1.5324    1.4149 -0.79560    0.6781 -0.05877  daySun:sexMale  day:sex
## 13  -0.9594    1.9450 -0.23328    1.2189  0.49282 dayThur:sexMale  day:sex
##          CoefShort       Model
## 1      (Intercept)        Base
## 2           daySat        Base
## 3           daySun        Base
## 4          dayThur        Base
## 5          sexMale        Base
## 6      (Intercept) Interaction
## 7           daySat Interaction
## 8           daySun Interaction
## 9          dayThur Interaction
## 10         sexMale Interaction
## 11  daySat:sexMale Interaction
## 12  daySun:sexMale Interaction
## 13 dayThur:sexMale Interaction
```

```
# build the plot
require(ggplot2)
require(plyr)
```

```
## Loading required package: plyr
```

```
ggplot(theDF, aes(y = Name, x = Coef, color = Model)) +
    geom_vline(xintercept = 0, linetype = 2, color = "grey") +
    geom_errorbarh(aes(xmin = LowOuter, xmax = HighOuter),
                   height = 0, lwd = 0, position = position_dodgev(height = 1)) +
    geom_errorbarh(aes(xmin = LowInner, xmax = HighInner),
                   height = 0, lwd = 1, position = position_dodgev(height = 1)) +
    geom_point(position = position_dodgev(height = 1), aes(xmax = Coef))
```

Compare that to the multiplot function in coefplot, which was built using position_dodge and coord_flip.

```
multiplot(mod1, mod2, shorten = FALSE, names = c("Base", "Interaction"))
```

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.

Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

# Play Selection by Down

Continuing with the newly available football data and inspired by a question from Drew Conway, I decided to look at the Giants’ play selection by down over the past 10 years.

Visually, we see that until 2011 the Giants preferred to run on first and second down. Third down is usually a do-or-die down so passes will dominate on third-and-long. The grey vertical lines mark Super Bowls XLII and XLVI.

Code for the graph after the break.