Highlights include:

Videos of the talks are available at http://www.rstats.nyc/#speakers with slides being added frequently.

A big thank you to sponsors RStudio, Revolution Analytics, DataKind, Pearson, Brewla Bars and Twillory.

This year we celebrated Mega Pi Day with the date (3/14/15) covering the first five digits of Pi (3.1415). And of course, we unveiled the Pi Cake at 9:26 to get the next three digits. This year the cake came from Empire Cakes and was peanut butter flavored. We even had the bakery put as many digits as would fit around the cake.

A large group from the NYC Data Mafia came out and Scott Wiener of Scott’s Pizza Tours ensured we had the perfect assortment and quantity of pizza.

A look at Pi Cakes from previous years:

So far this year I have logged many miles in the air and on the rails. In between trips to Minneapolis and Boston I spent about a month traveling through India and Southeast Asia, mainly to conduct R courses in Singapore and Kuala Lumpur for the likes of Intel, Micron, Celcom, Maxis, DBS and other similar companies. The training courses were organized through Revolution Analytics’ Singapore office. Given the success of the classes, there will be more opportunities this spring or summer in Singapore, Kuala Lumpur and also in Australia.

Quite a lot of material was covered based on the offerings of my company, Lander Analytics, and the content of my book, *R for Everyone*.

- Getting and installing R
- The RStudio Environment
- The basics of R
    - Variables
    - Data Types
    - Reading data
    - Calling functions
    - Missing Data

- Basic Math
- Advanced Data Structures
    - data.frames
    - lists
    - matrices
    - arrays

- Reading Data into R
    - read.table
    - RODBC
    - Binary data

- Matrix Calculations
- Data Munging
- Writing functions
- Conditionals
- Loops
- String manipulation and regular expressions
- Visualization
    - Base R
    - ggplot2

- Basic Statistics
    - Probability Distributions
    - Averages, standard deviations and correlations
    - t-test

- Linear Models
- Generalized Linear Models
    - Logistic Regression
    - Poisson Regression

- Survival Analysis
- Assessing Model Quality
    - MSE
    - AIC
    - BIC
    - Residual Analysis

- Time Series
- Variable Selection

- Variable selection for high dimensional data with glmnet
- Reduce uncertainty with weakly informative priors and Bayesian regression
- K-Means clustering
- Hierarchical clustering
- Multidimensional scaling
- Decision Trees for classification
- Random Forests for ensembling decision trees
- Bootstrap for measuring uncertainty
- Cross validation for model assessment
- Support Vector Machines
- Neural Networks

- Reproducible reports using knitr
- Basic Introduction to Markdown
- Using knitr to automatically generate reports with embedded analytics
- Using Markdown and knitr to automatically generate websites with embedded analytics
- Using Markdown and knitr to make HTML5 slideshows with embedded analytics
- Advanced plotting
- Building R Packages
- Shiny Overview

- Benchmarking code using microbenchmark
- The different speeds of various aggregation functions
    - aggregate
    - tapply
    - plyr
    - data.table

- Fast manipulation using dplyr
- Running dplyr commands in a database
- Parallel Code
- Integrating C++

Given my extensive time abroad I thought it would be good to look at it all on a map using the Leaflet package in R.

Using the Google Maps API we can look up the latitude and longitude of the visited cities.

```
library(XML)
library(plyr)
# cities visited on the trip
cities <- c('Hong Kong', 'Haripal, India', 'Kolkata, India', 'Jaipur, India', 'Agra, India', 'Delhi, India',
            'Singapore', 'Kuala Lumpur, Malaysia', 'Geroge Town, Malaysia')
# look up the latitude and longitude of one place with the Google Maps geocoding API
lat.long <- function(place)
{
    theURL <- sprintf('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=%s', place)
    doc <- xmlToList(theURL)
    data.frame(Place=place,
               Latitude=as.numeric(doc$result$geometry$location$lat),
               Longitude=as.numeric(doc$result$geometry$location$lng),
               stringsAsFactors=FALSE)
}
# geocode each city and stack the results into one data.frame
places <- adply(cities, 1, lat.long)
```

`knitr::kable(places[, -1], digits=3, row.names=FALSE)`

Place | Latitude | Longitude |
---|---|---|
Hong Kong | 22.396 | 114.109 |
Haripal, India | 22.817 | 88.105 |
Kolkata, India | 22.573 | 88.364 |
Jaipur, India | 26.912 | 75.787 |
Agra, India | 27.177 | 78.008 |
Delhi, India | 28.614 | 77.209 |
Singapore | 1.352 | 103.820 |
Kuala Lumpur, Malaysia | 3.139 | 101.687 |
Geroge Town, Malaysia | 5.415 | 100.330 |

Now that we have the coordinates we use Leaflet to plot them.

```
library(leaflet)
leaflet(data=places) %>%
    addTiles() %>%
    setView(90, 15, zoom=4) %>%
    addPopups(lng=~Longitude, lat=~Latitude, popup=~Place) %>%
    addPolylines(~Longitude, ~Latitude, data=places[c(1, 3, 2:9, 1), ]) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({iconUrl: 'http://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))
```

Calculating all the miles traveled could be as simple as looking it up on TripIt, or we could do some quick Haversine distance calculations with the geosphere package.

First, we get the coordinates for New York, Minneapolis and Boston to have a complete picture of the distance.

```
newCities <- adply(c('New York, NY', 'Minneapolis, MN', 'Boston, MA'), 1, lat.long)
allPlaces <- rbind(newCities[c(1, 2, 1), ], places[c(1, 3, 2:9, 1), ], newCities[c(1, 3, 1), ])
```

Then, in order to use `distHaversine`, we need to set up a to and from relationship between the places. The easiest way will be to just shift the columns.

`library(useful)`

`## Loading required package: ggplot2`

`shiftedPlaces <- shift.column(data=allPlaces, columns=names(places)[-1], newNames=c('To', 'Lat2', 'Long2'))`
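To make the shift concrete, here is a minimal base-R sketch of the same idea on a three-stop route. It assumes that `shift.column` pairs each row with the values from the following row and drops the final row, which is consistent with the distance table in this post. The `route` and `legs` names are made up for this illustration.

```r
# a small route: New York -> Minneapolis -> New York (coordinates rounded)
route <- data.frame(
    Place = c("New York, NY", "Minneapolis, MN", "New York, NY"),
    Latitude = c(40.713, 44.978, 40.713),
    Longitude = c(-74.006, -93.265, -74.006),
    stringsAsFactors = FALSE
)

n <- nrow(route)
# pair every stop (except the last) with the stop that follows it
legs <- data.frame(
    route[-n, ],
    To = route$Place[-1],
    Lat2 = route$Latitude[-1],
    Long2 = route$Longitude[-1],
    stringsAsFactors = FALSE
)
legs
```

Each row of `legs` is now one leg of the trip, with the origin and destination coordinates side by side, which is the shape a pairwise distance function needs.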

Now we can calculate the distance. This assumes that all trips followed a great circle, which might not be the case, especially for the car and rail portions of the trip.

`library(geosphere)`

`## Loading required package: sp`

`shiftedPlaces$Distance <- distHaversine(shiftedPlaces[, c("Longitude", "Latitude")], shiftedPlaces[, c("Long2", "Lat2")], r=3959)`
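For anyone curious what is happening under the hood, below is a minimal base-R sketch of the haversine formula. It is my own illustration rather than the geosphere source code, and `haversineMiles` is a made-up name. As in the call above, a radius of 3959 gives the answer in miles.

```r
# great-circle distance via the haversine formula
# lon/lat are in degrees; r is the radius of the sphere (3959 gives miles)
haversineMiles <- function(lon1, lat1, lon2, lat2, r = 3959) {
    toRad <- pi / 180
    dLat <- (lat2 - lat1) * toRad
    dLon <- (lon2 - lon1) * toRad
    a <- sin(dLat / 2)^2 +
        cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
    # pmin guards against rounding pushing sqrt(a) slightly above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# New York to Boston, using the rounded coordinates from this post
nyBoston <- haversineMiles(-74.006, 40.713, -71.059, 42.360)
nyBoston
```

With the rounded coordinates this comes out at roughly 190 miles, in line with the New York to Boston leg in the table.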

In total this led to 25,727 miles traveled.

`knitr::kable(shiftedPlaces[, -1], digits=c(1, 3, 3, 1, 3, 3, 0), row.names=FALSE)`

Place | Latitude | Longitude | To | Lat2 | Long2 | Distance |
---|---|---|---|---|---|---|
New York, NY | 40.713 | -74.006 | Minneapolis, MN | 44.978 | -93.265 | 1016 |
Minneapolis, MN | 44.978 | -93.265 | New York, NY | 40.713 | -74.006 | 1016 |
New York, NY | 40.713 | -74.006 | Hong Kong | 22.396 | 114.109 | 8046 |
Hong Kong | 22.396 | 114.109 | Kolkata, India | 22.573 | 88.364 | 1642 |
Kolkata, India | 22.573 | 88.364 | Haripal, India | 22.817 | 88.105 | 24 |
Haripal, India | 22.817 | 88.105 | Kolkata, India | 22.573 | 88.364 | 24 |
Kolkata, India | 22.573 | 88.364 | Jaipur, India | 26.912 | 75.787 | 844 |
Jaipur, India | 26.912 | 75.787 | Agra, India | 27.177 | 78.008 | 138 |
Agra, India | 27.177 | 78.008 | Delhi, India | 28.614 | 77.209 | 111 |
Delhi, India | 28.614 | 77.209 | Singapore | 1.352 | 103.820 | 2574 |
Singapore | 1.352 | 103.820 | Kuala Lumpur, Malaysia | 3.139 | 101.687 | 192 |
Kuala Lumpur, Malaysia | 3.139 | 101.687 | Geroge Town, Malaysia | 5.415 | 100.330 | 183 |
Geroge Town, Malaysia | 5.415 | 100.330 | Hong Kong | 22.396 | 114.109 | 1491 |
Hong Kong | 22.396 | 114.109 | New York, NY | 40.713 | -74.006 | 8046 |
New York, NY | 40.713 | -74.006 | Boston, MA | 42.360 | -71.059 | 190 |
Boston, MA | 42.360 | -71.059 | New York, NY | 40.713 | -74.006 | 190 |

```
leaflet(data=allPlaces) %>%
    addTiles() %>%
    setView(80, 20, zoom = 3) %>%
    addPolylines(~Longitude, ~Latitude) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({iconUrl: 'http://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))
```

The other night I attended a talk about the history of Brooklyn pizza at the Brooklyn Historical Society by Scott Wiener of Scott’s Pizza Tours. Toward the end, a woman stated she had a theory that pizza slice prices stay in rough lockstep with New York City subway fares. Of course, this is a well-known relationship that even has its own Wikipedia entry, so Scott referred her to a New York Times article from 1995 that mentioned the phenomenon.

However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience.

Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice.

After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices I found that the mean price was $2.33 with a standard deviation of $0.52 and a median price of $2.45. The base subway fare is $2.50 but is actually $2.38 after the 5% bonus for putting at least $5 on a MetroCard.
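The $2.38 figure comes from treating the bonus as extra ride value: every dollar loaded buys $1.05 of fares, so each ride effectively costs the base fare divided by 1.05. A quick check in R, using only the numbers quoted above (the variable names are mine):

```r
baseFare <- 2.50   # base subway fare in dollars
bonus <- 0.05      # 5% MetroCard bonus
# each dollar loaded buys (1 + bonus) dollars of rides
effectiveFare <- baseFare / (1 + bonus)
round(effectiveFare, 2)

# gap between the effective fare and the mean slice price of $2.33
meanSlice <- 2.33
round(effectiveFare - meanSlice, 2)
```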

So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride ($2.38).

Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represent the price of a subway ride with and without the bonus. We see that the price of a slice in Manhattan lines up almost exactly with the subway fare.

MenuPages even broke down Queens neighborhoods so we can have a more specific plot.

The code for downloading the menus and the calculations is after the break.

Based on data collected from polls conducted at the beginning of the New York Open Statistical Programming meetups.

After two years of writing and editing and proofreading and checking my book, *R for Everyone* is finally out!

There are so many people who helped me along the way, especially my editor Debra Williams, production editor Caroline Senay and the man who recruited me to write it in the first place, Paul Dix. Even more people helped throughout the long process, but with so many to mention I’ll leave that to the acknowledgements page.

Online resources for the book are available (http://www.jaredlander.com/r-for-everyone/) and will continue to be updated.

As of now the three major sites to purchase the book are Amazon, Barnes & Noble (available in stores January 3rd) and InformIT. And of course digital versions are available.

A friend recently posted the following problem:

There are 10 green balls, 20 red balls, and 25 blue balls in a jar. I choose a ball at random. If I choose a green ball then I take out all the green balls, if I choose a red ball then I take out all the red balls, and if I choose a blue ball I take out all the blue balls. What is the probability that I will choose a red ball on my second try?

The math works out fairly easily. It’s the probability of first drawing a green ball **AND** then drawing a red ball, **OR** the probability of drawing a blue ball **AND** then drawing a red ball.

\[
\frac{10}{10+20+25} \times \frac{20}{20+25} + \frac{25}{10+20+25} \times \frac{20}{10+20} \approx 0.3838
\]

But I always prefer simulations over probability so let’s break out the R code like we did for the Monty Hall Problem and calculating lottery odds. The results are after the break.
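As a taste of the approach, here is a minimal sketch of what such a simulation could look like. It is an illustration written for this passage and not necessarily the code after the break; `drawSecondRed` and the other names are made up.

```r
set.seed(1234)  # arbitrary seed so the run is repeatable

# number of balls of each color in the jar
balls <- c(green = 10, red = 20, blue = 25)

# play the game once and report whether the second draw is red
drawSecondRed <- function(counts) {
    # first draw: each color is chosen in proportion to its count
    first <- sample(names(counts), size = 1, prob = counts)
    # a red first draw removes every red ball, so the second cannot be red
    if (first == "red") return(FALSE)
    # otherwise remove all balls of the drawn color and draw again
    remaining <- counts[names(counts) != first]
    second <- sample(names(remaining), size = 1, prob = remaining)
    second == "red"
}

simResults <- replicate(100000, drawSecondRed(balls))
mean(simResults)

# exact answer from the formula, for comparison
exact <- 10/55 * 20/45 + 25/55 * 20/30
exact
```

With this many replications the simulated proportion should land within about a percentage point of the exact value of roughly 0.3838.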

For a d3 bar plot visit http://www.jaredlander.com/plots/PizzaPollPlot.html.

I finally compiled the data from all the pizza polling I’ve been doing at the New York R meetups. The data are available as JSON at http://www.jaredlander.com/data/PizzaPollData.php.

This is easy enough to plot in R using ggplot2.

```
require(rjson)
require(plyr)
pizzaJson <- fromJSON(file = "http://jaredlander.com/data/PizzaPollData.php")
pizza <- ldply(pizzaJson, as.data.frame)
head(pizza)
```

```
## polla_qid Answer Votes pollq_id Question
## 1 2 Excellent 0 2 How was Pizza Mercato?
## 2 2 Good 6 2 How was Pizza Mercato?
## 3 2 Average 4 2 How was Pizza Mercato?
## 4 2 Poor 1 2 How was Pizza Mercato?
## 5 2 Never Again 2 2 How was Pizza Mercato?
## 6 3 Excellent 1 3 How was Maffei's Pizza?
## Place Time TotalVotes Percent
## 1 Pizza Mercato 1.344e+09 13 0.0000
## 2 Pizza Mercato 1.344e+09 13 0.4615
## 3 Pizza Mercato 1.344e+09 13 0.3077
## 4 Pizza Mercato 1.344e+09 13 0.0769
## 5 Pizza Mercato 1.344e+09 13 0.1538
## 6 Maffei's Pizza 1.348e+09 7 0.1429
```
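The `Percent` column appears to be each answer's share of the votes cast for that place. A small sketch that re-creates it for Pizza Mercato from the rows printed above (the `mercato` data.frame is typed in by hand for illustration and is not pulled from the live feed):

```r
# votes for Pizza Mercato, copied from the printed output
mercato <- data.frame(
    Answer = c("Excellent", "Good", "Average", "Poor", "Never Again"),
    Votes = c(0, 6, 4, 1, 2),
    stringsAsFactors = FALSE
)

# total votes for the place, then each answer's share of that total
mercato$TotalVotes <- sum(mercato$Votes)
mercato$Percent <- round(mercato$Votes / mercato$TotalVotes, 4)
mercato
```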

```
require(ggplot2)
ggplot(pizza, aes(x = Place, y = Percent, group = Answer, color = Answer)) +
    geom_line() +
    theme(axis.text.x = element_text(angle = 46, hjust = 1), legend.position = "bottom") +
    labs(x = "Pizza Place", title = "Pizza Poll Results")
```

But given this is live data that will change as more polls are added I thought it best to use a plot that automatically updates and is interactive. So this gave me my first chance to *need* rCharts by Ramnath Vaidyanathan as seen at October’s meetup.

```
require(rCharts)
pizzaPlot <- nPlot(Percent ~ Place, data = pizza, type = "multiBarChart", group = "Answer")
pizzaPlot$xAxis(axisLabel = "Pizza Place", rotateLabels = -45)
pizzaPlot$yAxis(axisLabel = "Percent")
pizzaPlot$chart(reduceXTicks = FALSE)
pizzaPlot$print("chart1", include_assets = TRUE)
```

Unfortunately I cannot figure out how to insert this in WordPress so please see the chart at http://www.jaredlander.com/plots/PizzaPollPlot.html. Or see the badly sized one below.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on linecharts and inserting chart titles. I found a workaround for the categorical x-axis by using `tickFormat` but that is not pretty. I also would like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

Attending this week’s Strata conference, it was easy to see just how prolific the NYC Data Mafia is when it comes to writing. Some of the books I found:

Books from the #nycdatamafia @drewconway @johnmyleswhite http://t.co/EuV4FF6JA7 pic.twitter.com/Oi8tVcjPYE

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @mikedewar http://t.co/w2oCS2jLvN pic.twitter.com/yiq9x6SG3y

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @wesmckinn http://t.co/jhUPSrtTOE pic.twitter.com/ri5eUhWwY0

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia Rachel Schutt @mathbabedotorg http://t.co/EVI6HanjUb pic.twitter.com/yTL0fXQGBK

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @HarlanH @wahalulu http://t.co/6CjAvGsHRL pic.twitter.com/0DwMqSmNve

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @qethanm http://t.co/Hy82gz4tGe pic.twitter.com/Uba15XIhLT

— NYC Data Hackers (@nyhackr) October 29, 2013

Books from the #nycdatamafia @pauldix http://t.co/Tdw0MSF5B7 pic.twitter.com/4rmpk5UuYf

— NYC Data Hackers (@nyhackr) October 29, 2013

And, of course, my book will be out soon to join them.