After two years of writing, editing, proofreading and checking, my book, R for Everyone, is finally out!

So many people helped me along the way, especially my editor Debra Williams, production editor Caroline Senay and the man who recruited me to write it in the first place, Paul Dix.  Even more people helped throughout the long process, but with so many to mention I’ll leave that to the acknowledgements page.

Online resources for the book are available (http://www.jaredlander.com/r-for-everyone/) and will continue to be updated.

As of now the three major sites to purchase the book are Amazon, Barnes & Noble (available in stores January 3rd) and InformIT.  And of course digital versions are available.


A friend recently posted the following problem:

There are 10 green balls, 20 red balls, and 25 blue balls in a jar. I choose a ball at random. If I choose a green ball then I take out all the green balls, if I choose a red ball then I take out all the red balls, and if I choose a blue ball I take out all the blue balls. What is the probability that I will choose a red ball on my second try?

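The math works out fairly easily. It’s the probability of first drawing a green ball AND then drawing a red ball, OR the probability of first drawing a blue ball AND then drawing a red ball. (Drawing a red ball first removes every red ball from the jar, so that case contributes nothing.)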

\[
\frac{10}{10+20+25} \times \frac{20}{20+25} + \frac{25}{10+20+25} \times \frac{20}{10+20} \approx 0.3838
\]

But I always prefer simulations over probability so let’s break out the R code like we did for the Monty Hall Problem and calculating lottery odds.  The results are after the break.
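
The full simulation is behind the break and is not reproduced here. As a stand-in, here is a minimal sketch of how such a simulation could look. The function name drawSecond, the counts argument and the seed are all assumptions made for illustration, not the code from the post.

```r
# Hypothetical helper (not from the original post): play the jar game once
# and return the color of the second ball drawn.
drawSecond <- function(counts = c(green = 10, red = 20, blue = 25))
{
    # first draw: each color is chosen in proportion to its number of balls
    first <- sample(names(counts), size = 1, prob = counts)

    # take out every ball of the color that was drawn
    remaining <- counts[names(counts) != first]

    # second draw from the two colors that are left
    sample(names(remaining), size = 1, prob = remaining)
}

# seed chosen arbitrarily so the run is repeatable
set.seed(42)
secondDraws <- replicate(n = 10000, expr = drawSecond())

# share of simulated games where the second ball is red
simulated <- mean(secondDraws == "red")

# exact answer from the formula above, for comparison
exact <- 10/55 * 20/45 + 25/55 * 20/30

simulated
exact
```

With 10,000 repetitions the simulated share should land close to the exact value, though it will vary a little from run to run.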

plot of chunk plot-ggplot

For a d3 bar plot visit http://www.jaredlander.com/plots/PizzaPollPlot.html.



I finally compiled the data from all the pizza polling I’ve been doing at the New York R meetups. The data are available as JSON at http://www.jaredlander.com/data/PizzaPollData.php.

This is easy enough to plot in R using ggplot2.

require(rjson)
require(plyr)
pizzaJson <- fromJSON(file = "http://jaredlander.com/data/PizzaPollData.php")
pizza <- ldply(pizzaJson, as.data.frame)
head(pizza)
##   polla_qid      Answer Votes pollq_id                Question
## 1         2   Excellent     0        2  How was Pizza Mercato?
## 2         2        Good     6        2  How was Pizza Mercato?
## 3         2     Average     4        2  How was Pizza Mercato?
## 4         2        Poor     1        2  How was Pizza Mercato?
## 5         2 Never Again     2        2  How was Pizza Mercato?
## 6         3   Excellent     1        3 How was Maffei's Pizza?
##            Place      Time TotalVotes Percent
## 1  Pizza Mercato 1.344e+09         13  0.0000
## 2  Pizza Mercato 1.344e+09         13  0.4615
## 3  Pizza Mercato 1.344e+09         13  0.3077
## 4  Pizza Mercato 1.344e+09         13  0.0769
## 5  Pizza Mercato 1.344e+09         13  0.1538
## 6 Maffei's Pizza 1.348e+09          7  0.1429
require(ggplot2)
ggplot(pizza, aes(x = Place, y = Percent, group = Answer, color = Answer)) + 
    geom_line() + theme(axis.text.x = element_text(angle = 46, hjust = 1), legend.position = "bottom") + 
    labs(x = "Pizza Place", title = "Pizza Poll Results")

plot of chunk plot-ggplot
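
For anyone who cannot reach the URL, the list-to-data.frame step can be sketched offline. The two records below are a hypothetical stand-in for the parsed JSON, with values copied from the head(pizza) output above and only a few of the fields kept. The sketch uses base R in place of ldply, and the names records and pizzaSmall are made up for illustration.

```r
# hypothetical stand-in for two parsed JSON records (values taken from head(pizza))
records <- list(
    list(Answer = "Good", Votes = 6, Place = "Pizza Mercato", TotalVotes = 13),
    list(Answer = "Average", Votes = 4, Place = "Pizza Mercato", TotalVotes = 13)
)

# turn each record into a one-row data.frame, then stack the rows
pizzaSmall <- do.call(rbind, lapply(records, as.data.frame))

# Percent in the printed output matches Votes divided by TotalVotes
pizzaSmall$Percent <- pizzaSmall$Votes / pizzaSmall$TotalVotes
pizzaSmall
```

This is roughly the work ldply(pizzaJson, as.data.frame) does in a single call.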

But given this is live data that will change as more polls are added, I thought it best to use a plot that automatically updates and is interactive. So this gave me my first chance to use rCharts by Ramnath Vaidyanathan, as seen at October’s meetup.

require(rCharts)
pizzaPlot <- nPlot(Percent ~ Place, data = pizza, type = "multiBarChart", group = "Answer")
pizzaPlot$xAxis(axisLabel = "Pizza Place", rotateLabels = -45)
pizzaPlot$yAxis(axisLabel = "Percent")
pizzaPlot$chart(reduceXTicks = FALSE)
pizzaPlot$print("chart1", include_assets = TRUE)

Unfortunately I cannot figure out how to insert this in WordPress so please see the chart at http://www.jaredlander.com/plots/PizzaPollPlot.html. Or see the badly sized one below.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on line charts and how to insert chart titles. I found a workaround for the categorical x-axis by using tickFormat but that is not pretty. I also would like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

Attending this week’s Strata conference, it was easy to see just how prolific the NYC Data Mafia is when it comes to writing.  Some of the books I found:

And, of course, my book will be out soon to join them.

The wonderful people at Gilt are having me teach an introductory course on R this Friday.

The class starts with the very basics such as variable types, vectors, data.frames and matrices.  After that we explore munging data with aggregate, plyr and reshape2.  Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.
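
To give a flavor of that workflow, here is a small sketch using only base R and the built-in mtcars data. The dataset, the variables and the object names are assumptions chosen for illustration, not the actual class material.

```r
# built-in dataset, used here purely as a stand-in
data(mtcars)

# munging: average miles per gallon for each cylinder count
mpgByCyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpgByCyl

# linear model: fuel economy as a function of weight
fitLm <- lm(mpg ~ wt, data = mtcars)
coef(fitLm)

# logistic regression: manual transmission (am is 0/1) as a function of weight
fitGlm <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fitGlm)
```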

Most of the material comes from my upcoming book R for Everyone.

Participants are encouraged to bring computers so they can code along with the live examples.  They should also have R and RStudio preinstalled.

Michael Malecki recently shared a link to a Business Insider article that discussed the Monty Hall Problem.

The problem starts with three doors, one of which has a car and two of which have a goat. You choose one door at random and then the host reveals one door (not the one you chose) that holds a goat. You can then choose to stick with your door or choose the third, remaining door.

Probability theory states that people who switch win the car two-thirds of the time and those who don’t switch only win one-third of the time.
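
A short sketch of the standard argument behind those fractions: the first pick is the car with probability one-third, and because the host always reveals a goat, switching wins exactly when the first pick was a goat.

\[
P(\text{win} \mid \text{stay}) = P(\text{first pick is the car}) = \frac{1}{3}, \qquad P(\text{win} \mid \text{switch}) = P(\text{first pick is a goat}) = \frac{2}{3}
\]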

But people often still do not believe they should switch based on the probability argument alone. So let’s run some simulations.

This function randomly assigns goats and cars behind three doors, chooses a door at random, reveals a goat door, then either switches doors or does not.

monty <- function(switch=TRUE)
{
    # randomly assign goats and cars
    doors <- sample(x=c("Car", "Goat", "Goat"), size=3, replace=FALSE)

    # randomly choose a door
    doorChoice <- sample(1:3, size=1)

    # get goat doors
    goatDoors <- which(doors == "Goat")
    # show a door with a goat
    goatDoor <- goatDoors[which(goatDoors != doorChoice)][1]

    if(switch)
    {
        # if we are switching choose the other remaining door
        return(doors[-c(doorChoice, goatDoor)])
    }else
    {
        # otherwise keep the current door
        return(doors[doorChoice])
    }
}

Now we simulate switching 10,000 times and not switching 10,000 times.

withSwitching <- replicate(n = 10000, expr = monty(switch = TRUE), simplify = TRUE)
withoutSwitching <- replicate(n = 10000, expr = monty(switch = FALSE), simplify = TRUE)

head(withSwitching)
## [1] "Goat" "Car"  "Car"  "Goat" "Car"  "Goat"
head(withoutSwitching)
## [1] "Goat" "Car"  "Car"  "Car"  "Car"  "Car"

mean(withSwitching == "Car")
## [1] 0.6678
mean(withoutSwitching == "Car")
## [1] 0.3408
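
As a rough check that these proportions agree with theory, one option is an exact binomial test. This sketch is not part of the original simulation. It plugs in the win counts implied by the output above (6,678 and 3,408 out of 10,000), and the object names are illustrative.

```r
# wins implied by the printed proportions (0.6678 and 0.3408 of 10,000 runs)
switchTest <- binom.test(x = 6678, n = 10000, p = 2/3)
stayTest <- binom.test(x = 3408, n = 10000, p = 1/3)

# 95% confidence intervals for the true win rates
switchTest$conf.int
stayTest$conf.int
```

If the simulation agrees with theory, the first interval should cover two-thirds and the second should cover one-third.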

Plotting the results really shows the difference.

require(ggplot2)
## Loading required package: ggplot2
require(scales)
## Loading required package: scales
qplot(withSwitching, geom = "bar", fill = withSwitching) + scale_fill_manual("Prize", 
    values = c(Car = muted("blue"), Goat = "orange")) + xlab("Switch") + ggtitle("Monty Hall with Switching")

qplot(withoutSwitching, geom = "bar", fill = withoutSwitching) + scale_fill_manual("Prize", 
    values = c(Car = muted("blue"), Goat = "orange")) + xlab("Don't Switch") + 
    ggtitle("Monty Hall without Switching")

(How are these colors? I’m trying out some new combinations.)

This clearly shows that switching is the best strategy.

The New York Times has a nice simulator that lets you play with actual doors.