Michael Malecki recently shared a link to a Business Insider article that discussed the Monty Hall Problem.

The problem starts with three doors, one of which has a car and two of which have a goat. You choose one door at random and then the host reveals one door (not the one you chose) that holds a goat. You can then choose to stick with your door or choose the third, remaining door.

Probability theory states that people who switch win the car two-thirds of the time and those who don’t switch only win one-third of time.

But people often still do not believe they should switch based on the probability argument alone. So let’s run some simulations.

This function randomly assigns goats and cars behind three doors, chooses a door at random, reveals a goat door, then either switches doors or does not.

``````monty <- function(switch=TRUE)
{
# randomly assign goats and cars
doors <- sample(x=c("Car", "Goat", "Goat"), size=3, replace=FALSE)

# randomly choose a door
doorChoice <- sample(1:3, size=1)

# get goat doors
goatDoors <- which(doors == "Goat")
# show a door with a goat
goatDoor <- goatDoors[which(goatDoors != doorChoice)][1]

if(switch)
# if we are switching choose the other remaining door
{
return(doors[-c(doorChoice, goatDoor)])
}else
# otherwise keep the current door
{
return(doors[doorChoice])
}
}
``````

Now we simulate switching 10,000 times and not switching 10,0000 times

``````withSwitching <- replicate(n = 10000, expr = monty(switch = TRUE), simplify = TRUE)
withoutSwitching <- replicate(n = 10000, expr = monty(switch = FALSE), simplify = TRUE)

head(withSwitching)
``````
``````## [1] "Goat" "Car"  "Car"  "Goat" "Car"  "Goat"
``````
``````head(withoutSwitching)
``````
``````## [1] "Goat" "Car"  "Car"  "Car"  "Car"  "Car"
``````
``````
mean(withSwitching == "Car")
``````
``````## [1] 0.6678
``````
``````mean(withoutSwitching == "Car")
``````
``````## [1] 0.3408
``````

Plotting the results really shows the difference.

``````require(ggplot2)
``````
``````## Loading required package: ggplot2
``````
``````require(scales)
``````
``````## Loading required package: scales
``````
``````qplot(withSwitching, geom = "bar", fill = withSwitching) + scale_fill_manual("Prize",
values = c(Car = muted("blue"), Goat = "orange")) + xlab("Switch") + ggtitle("Monty Hall with Switching")
``````

``````qplot(withoutSwitching, geom = "bar", fill = withoutSwitching) + scale_fill_manual("Prize",
values = c(Car = muted("blue"), Goat = "orange")) + xlab("Don't Switch") +
ggtitle("Monty Hall without Switching")
``````

(How are these colors? I’m trying out some new combinations.)

This clearly shows that switching is the best strategy.

The New York Times has a nice simulator that lets you play with actual doors.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Given the warnings for today’s winter storm, or lack of panic, I thought it would be a good time to plot the NYC evacuation maps using R. Of course these are already available online, provided by the city, but why not build them in R as well?

I obtained the shapefiles from NYC Open Data on February 28th, so it’s possible they are the new shapefiles redrawn after Hurricane Sandy, but I am not certain.

First we need the appropriate packages which are mostly included in maptools, rgeos and ggplot2.

``require(maptools) ``
``## Loading required package: maptools ``
``## Loading required package: foreign ``
``## Loading required package: sp ``
``## Loading required package: lattice ``
``## Checking rgeos availability: TRUE ``
``require(rgeos) ``
``## Loading required package: rgeos ``
``## Loading required package: stringr ``
``## Loading required package: plyr ``
``## rgeos: (SVN revision 348) GEOS runtime version: 3.3.5-CAPI-1.7.5 Polygon ## checking: TRUE ``
``require(ggplot2) ``
``## Loading required package: ggplot2 ``
``require(plyr) require(grid) ``
``## Loading required package: grid ``

Then we read in the shape files, fortify them to turn them into a data.frame for easy plotting then join that back into the original data to get zone information.

``# read the shape file evac <- readShapeSpatial("../data/Evac_Zones_with_Additions_20121026/Evac_Zones_with_Additions_20121026.shp") # necessary for some of our work gpclibPermit() ``
``## [1] TRUE ``
``# create ID variable evac@data\$id <- rownames(evac@data) # fortify the shape file evac.points <- fortify(evac, region = "id") # join in info from data evac.df <- join(evac.points, evac@data, by = "id") # modified data head(evac.df) ``
``## long lat order hole piece group id Neighbrhd CAT1NNE Shape_Leng ## 1 1003293 239790 1 FALSE 1 0.1 0 <NA> A 9121 ## 2 1003313 239782 2 FALSE 1 0.1 0 <NA> A 9121 ## 3 1003312 239797 3 FALSE 1 0.1 0 <NA> A 9121 ## 4 1003301 240165 4 FALSE 1 0.1 0 <NA> A 9121 ## 5 1003337 240528 5 FALSE 1 0.1 0 <NA> A 9121 ## 6 1003340 240550 6 FALSE 1 0.1 0 <NA> A 9121 ## Shape_Area ## 1 2019091 ## 2 2019091 ## 3 2019091 ## 4 2019091 ## 5 2019091 ## 6 2019091 ``
``# as opposed to the original data head(evac@data) ``
``## Neighbrhd CAT1NNE Shape_Leng Shape_Area id ## 0 <NA> A 9121 2019091 0 ## 1 <NA> A 12250 54770 1 ## 2 <NA> A 10013 1041886 2 ## 3 <NA> B 11985 3462377 3 ## 4 <NA> B 5816 1515518 4 ## 5 <NA> B 5286 986675 5 ``

Now, I’ve begun working on a package to make this step, and later ones easier, but it’s far from being close to ready for production. For those who want to see it (and contribute) it is available at https://github.com/jaredlander/mapping. The idea is to make mapping (including faceting!) doable with one or two lines of code.

Now it is time for the plot.

``ggplot(evac.df, aes(x = long, y = lat)) + geom_path(aes(group = group)) + geom_polygon(aes(group = group, fill = CAT1NNE)) + list(theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), panel.background = element_blank())) + coord_equal() + labs(x = NULL, y = NULL) + theme(plot.margin = unit(c(1, 1, 1, 1), "mm")) + scale_fill_discrete("Zone") ``

There are clearly a number of things I would change about this plot including filling in the non-evacuation regions, connecting borders and smaller margins. Perhaps some of this can be accomplished by combining this information with another shapefile of the city, but that is beyond today’s code.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (geom_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's, height with width and vice versa. It's hacky and not tested but seems to work as I'll show below.

First the new functions:

``````require(proto)
``````
``````## Loading required package: proto
``````
``````collidev <- function(data, height = NULL, name, strategy, check.height = TRUE) {
if (!is.null(height)) {
if (!(all(c("ymin", "ymax") %in% names(data)))) {
data <- within(data, {
ymin <- y - height/2
ymax <- y + height/2
})
}
} else {
if (!(all(c("ymin", "ymax") %in% names(data)))) {
data\$ymin <- data\$y
data\$ymax <- data\$y
}
heights <- unique(with(data, ymax - ymin))
heights <- heights[!is.na(heights)]
if (!zero_range(range(heights))) {
warning(name, " requires constant height: output may be incorrect",
call. = FALSE)
}
height <- heights[1]
}
data <- data[order(data\$ymin), ]
intervals <- as.numeric(t(unique(data[c("ymin", "ymax")])))
intervals <- intervals[!is.na(intervals)]
if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < -1e-06)) {
warning(name, " requires non-overlapping y intervals", call. = FALSE)
}
if (!is.null(data\$xmax)) {
ddply(data, .(ymin), strategy, height = height)
} else if (!is.null(data\$x)) {
message("xmax not defined: adjusting position using x instead")
transform(ddply(transform(data, xmax = x), .(ymin), strategy, height = height),
x = xmax)
} else {
stop("Neither x nor xmax defined")
}
}

pos_dodgev <- function(df, height) {
n <- length(unique(df\$group))
if (n == 1)
return(df)
if (!all(c("ymin", "ymax") %in% names(df))) {
df\$ymin <- df\$y
df\$ymax <- df\$y
}
d_width <- max(df\$ymax - df\$ymin)
diff <- height - d_width
groupidx <- match(df\$group, sort(unique(df\$group)))
df\$y <- df\$y + height * ((groupidx - 0.5)/n - 0.5)
df\$ymin <- df\$y - d_width/n/2
df\$ymax <- df\$y + d_width/n/2
df
}

position_dodgev <- function(width = NULL, height = NULL) {
PositionDodgev\$new(width = width, height = height)
}

PositionDodgev <- proto(ggplot2:::Position, {
objname <- "dodgev"

adjust <- function(., data) {
if (empty(data))
return(data.frame())
check_required_aesthetics("y", names(data), "position_dodgev")

collidev(data, .\$height, .\$my_name(), pos_dodgev, check.height = TRUE)
}

})
``````

Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.

``````# get tips data
data(tips, package = "reshape2")

# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)

# build data/frame with coefficients and confidence intervals and combine
# them into one data.frame
require(coefplot)
``````
``````## Loading required package: coefplot
``````
``````## Loading required package: ggplot2
``````
``````df1 <- coefplot(mod1, plot = FALSE, name = "Base", shorten = FALSE)
df2 <- coefplot(model = mod2, plot = FALSE, name = "Interaction", shorten = FALSE)
theDF <- rbind(df1, df2)
theDF
``````
``````##    LowOuter HighOuter LowInner HighInner     Coef            Name Checkers
## 1    1.9803    3.3065  2.31183    2.9750  2.64340     (Intercept)  Numeric
## 2   -0.4685    0.9325 -0.11822    0.5822  0.23202          daySat      day
## 3   -0.2335    1.1921  0.12291    0.8357  0.47929          daySun      day
## 4   -0.6790    0.7672 -0.31745    0.4056  0.04408         dayThur      day
## 5   -0.2053    0.5524 -0.01589    0.3630  0.17354         sexMale      sex
## 6    1.8592    3.7030  2.32016    3.2421  2.78111     (Intercept)  Numeric
## 7   -1.0391    1.0804 -0.50921    0.5506  0.02067          daySat      day
## 8   -0.5430    1.7152  0.02156    1.1507  0.58611          daySun      day
## 9   -1.2490    0.8380 -0.72725    0.3163 -0.20549         dayThur      day
## 10  -1.3589    1.1827 -0.72349    0.5473 -0.08811         sexMale      sex
## 11  -1.0502    1.7907 -0.34000    1.0804  0.37022  daySat:sexMale  day:sex
## 12  -1.5324    1.4149 -0.79560    0.6781 -0.05877  daySun:sexMale  day:sex
## 13  -0.9594    1.9450 -0.23328    1.2189  0.49282 dayThur:sexMale  day:sex
##          CoefShort       Model
## 1      (Intercept)        Base
## 2           daySat        Base
## 3           daySun        Base
## 4          dayThur        Base
## 5          sexMale        Base
## 6      (Intercept) Interaction
## 7           daySat Interaction
## 8           daySun Interaction
## 9          dayThur Interaction
## 10         sexMale Interaction
## 11  daySat:sexMale Interaction
## 12  daySun:sexMale Interaction
## 13 dayThur:sexMale Interaction
``````
``````# build the plot
require(ggplot2)
require(plyr)
``````
``````## Loading required package: plyr
``````
``````ggplot(theDF, aes(y = Name, x = Coef, color = Model)) + geom_vline(xintercept = 0,
linetype = 2, color = "grey") + geom_errorbarh(aes(xmin = LowOuter, xmax = HighOuter),
height = 0, lwd = 0, position = position_dodgev(height = 1)) + geom_errorbarh(aes(xmin = LowInner,
xmax = HighInner), height = 0, lwd = 1, position = position_dodgev(height = 1)) +
geom_point(position = position_dodgev(height = 1), aes(xmax = Coef))
``````

Compare that to the multiplot function in coefplot that was built using geom_dodge and coord_flip.

``````multiplot(mod1, mod2, shorten = F, names = c("Base", "Interaction"))
``````

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.

Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Continuing with the newly available football data (new link) and inspired by a question from Drew Conway I decided to look at play selection based on down by the Giants for the past 10 years.

Visually, we see that until 2011 the Giants preferred to run on first and second down.  Third down is usually a do-or-die down so passes will dominate on third-and-long.  The grey vertical lines mark Super Bowls XLII and XLVI.

Code for the graph after the break.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

With the recent availability (new link) of play-by-play NFL data I got to analyzing my favorite team, the New York Giants with some very hasty EDA.

From the above graph you can see that on 1st down Eli preferred to throw to Hakim Nicks and on 2nd and 3rd downs he slightly favored Victor Cruz.

The code for the analysis is after the break.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

This Monday I’ll be talking at the Amsterdam R meetup, better known as amst-R-dam.  At their request I’ll discuss the differences between the New York and Silicon Valley data scenes.  Time permitting I’ll also go over some topic that I’ll let the audience choose.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

How was Patzeria?

• Poor (100%, 1 Votes)
• Excellent (0%, 0 Votes)
• Good (0%, 0 Votes)
• Average (0%, 0 Votes)
• Never Again (0%, 0 Votes)

Total Voters: 1

Loading …

Aggregated results.

Results from individual previous polls are below. Continue reading

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

A friend of mine has told me on numerous occasions that since 1960 the Yankees have not won a World Series while a Republican was President.  Upon hearing this my Republican friends (both Yankee and Red Sox fans) turn incredulous and say that this is ridiculous.  So I decided to investigate.  To be clear this is in no way shows causality, but just checks the numbers.

The data was easily attainable so it really came down to plotting.

The plot above shows every Yankee win (and loss) since 1960 and the party of the President at the time.  It is clear to see that all nine Yankees World Series wins came while a Democrat inhabited the White House.  The fluctuation plot below shows Yankee wins both before and after 1960 and the complete lack of a block for Republican/Post-1960 simply makes the case.

There are similar plots for the American League after the jump.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Wes McKinney and I are hosting our first ever Open Statistical Programming meetup tomorrow night after taking over for Drew Conway.  Please attend, have some pizza, enjoy the talk then come out for some beer.

This meetup is about EDA, Visualization and Collaboration on the Web and will be presented by Carlos Scheidegger from AT&T Labs.

This month’s pizza will be from Pizza Mercato in the Village.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

With tonight’s Mega Millions jackpot estimated to be over \$640 million there are long lines of people waiting to buy tickets.  Of course you always hear about the probability of winning which is easy enough to calculate:  Five numbers ranging from 1 through 56 are drawn (without replacement) then a sixth ball is pulled from a set of 1 through 46.  That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations.  That is why people are constantly reminded of how unlikely they are to win.

But I want to see how likely it is that SOMEONE will win tonight.  So let’s break out R and ggplot!

As of this afternoon it was reported (sorry no source) that two tickets were sold for every American.  So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.

So we shouldn’t be surprised if there are multiple winners tonight.

The R code:

```winners <- rbinom(n=1000, size=600000000, prob=1/175000000)
qplot(winners, geom="histogram", binwidth=1, xlab="Number of Winners")```

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.