The class starts with the very basics such as variable types, vectors, data.frames and matrices. After that we explore munging data with aggregate, plyr and reshape2. Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.
The problem starts with three doors, one of which has a car and two of which have a goat. You choose one door at random and then the host reveals one door (not the one you chose) that holds a goat. You can then choose to stick with your door or choose the third, remaining door.
Probability theory states that people who switch win the car two-thirds of the time and those who don’t switch only win one-third of time.
But people often still do not believe they should switch based on the probability argument alone. So let’s run some simulations.
This function randomly assigns goats and cars behind three doors, chooses a door at random, reveals a goat door, then either switches doors or does not.
monty <- function(switch=TRUE)
# randomly assign goats and cars
doors <- sample(x=c("Car", "Goat", "Goat"), size=3, replace=FALSE)
# randomly choose a door
doorChoice <- sample(1:3, size=1)
# get goat doors
goatDoors <- which(doors == "Goat")
# show a door with a goat
goatDoor <- goatDoors[which(goatDoors != doorChoice)]
# if we are switching choose the other remaining door
# otherwise keep the current door
Now we simulate switching 10,000 times and not switching 10,0000 times
Continuing the annualtradition of Pi Cakes from Chrissie Cook we have gotten another Pi Cake! This year we let Drew Conway’s wife pick the flavors and she went with vanilla and red velvet (the blue color is to cause some cognitive dissonance). Looking forward to enjoying this tonight after some pizza.
Given the warnings for today’s winter storm, or lack of panic, I thought it would be a good time to plot the NYC evacuation maps using R. Of course these are already available online, provided by the city, but why not build them in R as well?
Then we read in the shape files, fortify them to turn them into a data.frame for easy plotting then join that back into the original data to get zone information.
# read the shape file evac <- readShapeSpatial("../data/Evac_Zones_with_Additions_20121026/Evac_Zones_with_Additions_20121026.shp") # necessary for some of our work gpclibPermit()
##  TRUE
# create ID variable evac@data$id <- rownames(evac@data) # fortify the shape file evac.points <- fortify(evac, region = "id") # join in info from data evac.df <- join(evac.points, evac@data, by = "id") # modified data head(evac.df)
## long lat order hole piece group id Neighbrhd CAT1NNE Shape_Leng ## 1 1003293 239790 1 FALSE 1 0.1 0 <NA> A 9121 ## 2 1003313 239782 2 FALSE 1 0.1 0 <NA> A 9121 ## 3 1003312 239797 3 FALSE 1 0.1 0 <NA> A 9121 ## 4 1003301 240165 4 FALSE 1 0.1 0 <NA> A 9121 ## 5 1003337 240528 5 FALSE 1 0.1 0 <NA> A 9121 ## 6 1003340 240550 6 FALSE 1 0.1 0 <NA> A 9121 ## Shape_Area ## 1 2019091 ## 2 2019091 ## 3 2019091 ## 4 2019091 ## 5 2019091 ## 6 2019091
# as opposed to the original data head(evac@data)
## Neighbrhd CAT1NNE Shape_Leng Shape_Area id ## 0 <NA> A 9121 2019091 0 ## 1 <NA> A 12250 54770 1 ## 2 <NA> A 10013 1041886 2 ## 3 <NA> B 11985 3462377 3 ## 4 <NA> B 5816 1515518 4 ## 5 <NA> B 5286 986675 5
Now, I’ve begun working on a package to make this step, and later ones easier, but it’s far from being close to ready for production. For those who want to see it (and contribute) it is available at https://github.com/jaredlander/mapping. The idea is to make mapping (including faceting!) doable with one or two lines of code.
There are clearly a number of things I would change about this plot including filling in the non-evacuation regions, connecting borders and smaller margins. Perhaps some of this can be accomplished by combining this information with another shapefile of the city, but that is beyond today’s code.
An often requested feature for Hadley Wickham'sggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.
All I did was copy the old functions (geom_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's, height with width and vice versa. It's hacky and not tested but seems to work as I'll show below.
Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.
# get tips data
data(tips, package = "reshape2")
# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)
# build data/frame with coefficients and confidence intervals and combine
# them into one data.frame
With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.
Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.
Visually, we see that until 2011 the Giants preferred to run on first and second down. Third down is usually a do-or-die down so passes will dominate on third-and-long. The grey vertical lines mark Super Bowls XLII and XLVI.
This Monday I’ll be talking at the Amsterdam R meetup, better known as amst-R-dam. At their request I’ll discuss the differences between the New York and Silicon Valley data scenes. Time permitting I’ll also go over some topic that I’ll let the audience choose.