Been a busy few weeks with the New York R Conference, speaking engagements, writing the second edition of R for Everyone and coding open source packages.  The most exciting news involves the news as the Wall Street Journal wrote an article about my NFL Draft work.

It is a great piece with some nice quotes from the Vikings General Manager Rick Spielman and ESPN’s legendary John Clayton that succinctly sums up the work I did and runs the numbers on a few select players.

So now I’ve been in the news for pizza, the lottery and football.  Fun mix.

Last year, as I embarked on my NFL sports statistics work, I attended the Sloan Sports Analytics Conference for the first time. A year later, after a very successful draft, I was invited to present an R workshop at the conference.

My time slot was up against Nate Silver so I didn’t expect many people to attend. Much to my surprise, when I entered the room every seat was taken, with people lining the walls and sitting in the aisles.

My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of which receiver was targeted.  It is available at the talks section of my site.

After the talk I spent the rest of the day fielding questions and gave away copies of R for Everyone and an NYC Data Mafia shirt.

Last night we celebrated Rounded Pi Day by rounding at the ten-thousandths digit to get 3.1416, which works nicely with the date 3/14/16. This was great after Mega Pi Day worked out so perfectly last year. And this all built upon previous years’ celebrations.

We ate a large quantity of pizza at Lombardi’s, and for the second year in a row we got the Pi Cake from Empire Cakes with peanut butter and chocolate flavors. The base was inscribed with historic approximations of Pi: 25/8, 256/81, 339/108, 223/71, 377/120, 3927/1250, 355/113, 62832/20000, 22/7.
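Out of curiosity, the approximations on the cake can be compared against R's built-in `pi`. This is just a quick sketch; the names `approximations` and `piError` are my own choices for illustration.

```r
# Historic approximations of pi, as inscribed on the cake
approximations <- c(
    '25/8'=25/8, '256/81'=256/81, '339/108'=339/108, '223/71'=223/71,
    '377/120'=377/120, '3927/1250'=3927/1250, '355/113'=355/113,
    '62832/20000'=62832/20000, '22/7'=22/7
)

# Absolute error of each approximation relative to R's built-in constant
piError <- abs(approximations - pi)

# Sort from closest to farthest
sort(piError)

# Rounded Pi Day: pi rounded to four decimal places
round(pi, 4)
```

Of these, 355/113 comes closest to pi, while 3927/1250 and 62832/20000 both equal the rounded value 3.1416.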

Some pictures from the fantastic night:

Previous years’ Pi Cakes:

Earlier this week, my company, Lander Analytics, organized our first public Bayesian short course, taught by Andrew Gelman, Bob Carpenter and Daniel Lee.  Needless to say the class sold out very quickly and left a long wait list.  So we will schedule another public training (exactly when tbd) and will make the same course available for private training.

This was the first time we utilized three instructors (as opposed to a main instructor and assistants, which we often use for large classes) and it led to an amazing dynamic. Bob laid the theoretical foundation for Markov chain Monte Carlo (MCMC), explaining it with both math and geometry, and discussed the computational considerations of performing simulation draws. Daniel led the participants through hands-on examples with Stan, covering everything from how to describe a model to efficient computation to debugging. Andrew gave his usual crowd-dazzling performance, using previous work as case studies of when and how to use Bayesian methods.

It was an intensive three days of training with an incredible amount of information. Everyone walked away knowing a lot more about Bayes, MCMC and Stan, eager to try out their new skills, and with an autographed copy of Andrew’s book, BDA3.

A big help, as always, was Daniel Chen, who put in so much effort making the class run smoothly, from securing the space to physically moving furniture and running all the technology.

On April 24th and 25th Lander Analytics and Work-Bench co-organized the (sold-out) inaugural New York R Conference. It was an amazing weekend of nerding out over R and data, with a little Python and Julia mixed in for good measure. People from all across the R community gathered to see rockstars discuss their latest and greatest efforts.

Highlights include:

Bryan Lewis wowing the crowd (there were literally gasps) with rthreejs implemented with htmlwidgets.

Hilary Parker receiving spontaneous applause in the middle of her talk about reproducible research at Etsy for her explainr, catsplainr and mansplainr packages.

James Powell speaking flawless Mandarin in a talk tangentially about Python.

Vivian Peng also receiving spontaneous applause for her discussion of storytelling with data.

Wes McKinney showing love for data.frames in all languages and sporting an awesome R t-shirt.

Dan Chen using Shiny to study Ebola data.

Andrew Gelman blowing away everyone with his keynote about Bayesian methods with particular applications in politics.

Videos of the talks are available at http://www.rstats.nyc/#speakers with slides being added frequently.

A big thank you to sponsors RStudio, Revolution Analytics, DataKind, Pearson, Brewla Bars and Twillory.

This year we celebrated Mega Pi Day with the date (3/14/15) covering the first four digits of Pi. And of course, we unveiled the Pi Cake at 9:26 to get the next three digits.  This year the cake came from Empire Cakes and was peanut butter flavored.  We even had the bakery put as many digits as would fit around the cake.

A large group from the NYC Data Mafia came out and Scott Wiener of Scott’s Pizza Tours ensured we had the perfect assortment and quantity of pizza.

A look at Pi Cakes from previous years:

So far this year I have logged many miles in the air and on the rails. In between trips to Minneapolis and Boston I spent about a month traveling through India and Southeast Asia, mainly to conduct R courses in Singapore and Kuala Lumpur for the likes of Intel, Micron, Celcom, Maxis, DBS and other similar companies. The training courses were organized through Revolution Analytics’ Singapore office. Given the success of the classes, there will be more opportunities this spring or summer in Singapore, Kuala Lumpur and also in Australia.

Quite a lot of material was covered, based on the offerings of my company, Lander Analytics, and the content of my book, R for Everyone.

## Day 1 – Basics

• Getting and installing R
• The RStudio Environment
• The basics of R
• Variables
• Data Types
• Calling functions
• Missing Data
• Basic Math
• data.frames
• lists
• matrices
• arrays
• RODBC
• Binary data
• Matrix Calculations
• Data Munging
• Writing functions
• Conditionals
• Loops
• String manipulation and regular expressions
• Visualization

## Day 2 – Modeling

• Basic Statistics
• Probability Distributions
• Averages, standard deviations and correlations
• t-test
• Linear Models
• Generalized Linear Models
• Survival Analysis
• Assessing Model Quality
• MSE
• AIC
• BIC
• Residual Analysis
• Time Series
• Variable Selection

## Day 4 – Data Presentation and Portability

• Reproducible reports using knitr
• Basic Introduction to Markdown
• Using knitr to automatically generate reports with embedded analytics
• Using Markdown and knitr to automatically generate websites with embedded analytics
• Using Markdown and knitr to make HTML5 slideshows with embedded analytics
• Building R Packages
• Shiny Overview

## Day 5 – High Performance Computing with R

• Benchmarking code using microbenchmark
• The different speeds of various aggregation functions
• Fast manipulation using dplyr
• Running dplyr commands in a database
• Parallel Code
• Integrating C++

Given my extensive time abroad I thought it would be good to look at it all on a map using the Leaflet package in R.

Using the Google Maps API we can look up the latitude and longitude of the visited cities.

library(XML)
library(plyr)

cities <- c('Hong Kong', 'Haripal, India', 'Kolkata, India', 'Jaipur, India', 'Agra, India', 'Delhi, India',
'Singapore', 'Kuala Lumpur, Malaysia', 'Geroge Town, Malaysia')
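lat.long <- function(place)
{
    # Assumption: the line defining theURL is missing from this listing. It is rebuilt here
    # for the Google Geocoding XML endpoint, which now requires HTTPS and an API key.
    theURL <- sprintf('http://maps.googleapis.com/maps/api/geocode/xml?address=%s', URLencode(place))
    doc <- xmlToList(theURL)
    data.frame(Place=place, Latitude=as.numeric(doc$result$geometry$location$lat),
               Longitude=as.numeric(doc$result$geometry$location$lng), stringsAsFactors=FALSE)
}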

places <- adply(cities, 1, lat.long)
knitr::kable(places[, -1], digits=3, row.names=FALSE)

| Place | Latitude | Longitude |
|:---|---:|---:|
| Hong Kong | 22.396 | 114.109 |
| Haripal, India | 22.817 | 88.105 |
| Kolkata, India | 22.573 | 88.364 |
| Jaipur, India | 26.912 | 75.787 |
| Agra, India | 27.177 | 78.008 |
| Delhi, India | 28.614 | 77.209 |
| Singapore | 1.352 | 103.820 |
| Kuala Lumpur, Malaysia | 3.139 | 101.687 |
| Geroge Town, Malaysia | 5.415 | 100.330 |

Now that we have the coordinates we use Leaflet to plot them.

library(leaflet)
leaflet(data=places) %>%
    addTiles() %>%
    setView(90, 15, zoom=4) %>%
    addPopups(lng=~Longitude, lat=~Latitude, popup=~Place) %>%
    addPolylines(~Longitude, ~Latitude, data=places[c(1, 3, 2:9, 1), ]) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({iconUrl: 'http://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))

Calculating all the miles traveled could be as simple as looking it up on TripIt, or we could do some quick Haversine distance calculations with the geosphere package.
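For anyone curious what a Haversine calculation involves, here is a minimal base-R sketch of the formula. It is an illustration rather than the geosphere implementation; the function name `haversine` and its argument names are my own, and it assumes a perfectly spherical Earth with a radius of 3959 miles.

```r
# Great-circle distance via the Haversine formula.
# lat/long are in decimal degrees; r is the sphere's radius (3959 is roughly
# the Earth's radius in miles), so the result is in the same unit as r.
haversine <- function(lat1, long1, lat2, long2, r=3959)
{
    toRad <- pi / 180
    dLat <- (lat2 - lat1) * toRad
    dLong <- (long2 - long1) * toRad
    a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLong / 2)^2
    # pmin guards against rounding pushing sqrt(a) just above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# New York to Boston, using coordinates rounded to three decimal places
haversine(40.713, -74.006, 42.360, -71.059)
```

The New York to Boston call should come out at roughly 190 miles. Because the function is vectorized, whole columns of latitudes and longitudes can be passed in at once.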

First, we get the coordinates for New York, Minneapolis and Boston to have a complete picture of the distance.

newCities <- adply(c('New York, NY', 'Minneapolis, MN', 'Boston, MA'), 1, lat.long)
allPlaces <- rbind(newCities[c(1, 2, 1), ], places[c(1, 3, 2:9, 1), ], newCities[c(1, 3, 1), ])

Then in order to use distHaversine we need to set up a to and from relationship between the places. The easiest way will be to just shift the columns.

library(useful)
## Loading required package: ggplot2
shiftedPlaces <- shift.column(data=allPlaces, columns=names(places)[-1], newNames=c('To', 'Lat2', 'Long2'))
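To see what the shift is doing, here is a minimal base-R sketch of the same idea on a toy route. It is not the actual implementation of shift.column; the toy data and the names `route` and `legs` are made up for illustration.

```r
# A toy three-stop route (made-up coordinates)
route <- data.frame(
    Place=c('A', 'B', 'C'),
    Latitude=c(10, 20, 30),
    Longitude=c(100, 110, 120),
    stringsAsFactors=FALSE
)

# Pair each row with the row that follows it:
# head(x, -1) drops the last element, tail(x, -1) drops the first
legs <- data.frame(
    Place=head(route$Place, -1),
    Latitude=head(route$Latitude, -1),
    Longitude=head(route$Longitude, -1),
    To=tail(route$Place, -1),
    Lat2=tail(route$Latitude, -1),
    Long2=tail(route$Longitude, -1),
    stringsAsFactors=FALSE
)

legs
```

A route with n stops yields n - 1 legs, with each row holding both the origin and the destination of one leg.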

Now we can calculate the distance. This assumes that all trips followed a great circle, which might not be the case, especially for the car and rail portions of the trip.

library(geosphere)
## Loading required package: sp
shiftedPlaces$Distance <- distHaversine(shiftedPlaces[, c("Longitude", "Latitude")], shiftedPlaces[, c("Long2", "Lat2")], r=3959)

In total this led to 25,727 miles traveled.

knitr::kable(shiftedPlaces[, -1], digits=c(1, 3, 3, 1, 3, 3, 0), row.names=FALSE)

| Place | Latitude | Longitude | To | Lat2 | Long2 | Distance |
|:---|---:|---:|:---|---:|---:|---:|
| New York, NY | 40.713 | -74.006 | Minneapolis, MN | 44.978 | -93.265 | 1016 |
| Minneapolis, MN | 44.978 | -93.265 | New York, NY | 40.713 | -74.006 | 1016 |
| New York, NY | 40.713 | -74.006 | Hong Kong | 22.396 | 114.109 | 8046 |
| Hong Kong | 22.396 | 114.109 | Kolkata, India | 22.573 | 88.364 | 1642 |
| Kolkata, India | 22.573 | 88.364 | Haripal, India | 22.817 | 88.105 | 24 |
| Haripal, India | 22.817 | 88.105 | Kolkata, India | 22.573 | 88.364 | 24 |
| Kolkata, India | 22.573 | 88.364 | Jaipur, India | 26.912 | 75.787 | 844 |
| Jaipur, India | 26.912 | 75.787 | Agra, India | 27.177 | 78.008 | 138 |
| Agra, India | 27.177 | 78.008 | Delhi, India | 28.614 | 77.209 | 111 |
| Delhi, India | 28.614 | 77.209 | Singapore | 1.352 | 103.820 | 2574 |
| Singapore | 1.352 | 103.820 | Kuala Lumpur, Malaysia | 3.139 | 101.687 | 192 |
| Kuala Lumpur, Malaysia | 3.139 | 101.687 | Geroge Town, Malaysia | 5.415 | 100.330 | 183 |
| Geroge Town, Malaysia | 5.415 | 100.330 | Hong Kong | 22.396 | 114.109 | 1491 |
| Hong Kong | 22.396 | 114.109 | New York, NY | 40.713 | -74.006 | 8046 |
| New York, NY | 40.713 | -74.006 | Boston, MA | 42.360 | -71.059 | 190 |
| Boston, MA | 42.360 | -71.059 | New York, NY | 40.713 | -74.006 | 190 |

leaflet(data=allPlaces) %>%
    addTiles() %>%
    setView(80, 20, zoom = 3) %>%
    addPolylines(~Longitude, ~Latitude) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({ iconUrl: 'http://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))

The other night I attended a talk about the history of Brooklyn pizza at the Brooklyn Historical Society by Scott Wiener of Scott’s Pizza Tours. Toward the end, a woman stated she had a theory that pizza slice prices stay in rough lockstep with New York City subway fares.
Of course, this is a well-known relationship that even has its own Wikipedia entry, so Scott referred her to a New York Times article from 1995 that mentioned the phenomenon. However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience.

Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice.

After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices I found that the mean price was $2.33 with a standard deviation of $0.52 and a median price of $2.45. The base subway fare is $2.50 but is actually $2.38 after the 5% bonus for putting at least $5 on a MetroCard.

So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride ($2.38).
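The fare arithmetic is simple enough to check in R. This sketch assumes the bonus is applied as 5% extra value on the amount loaded, so each dollar paid buys $1.05 worth of rides; the variable names are my own.

```r
baseFare <- 2.50   # base subway fare in dollars
bonus <- 0.05      # 5% MetroCard bonus on the amount loaded

# Effective cost of one ride once the bonus is accounted for
effectiveFare <- baseFare / (1 + bonus)
round(effectiveFare, 2)

# Gap between the effective fare and the mean slice price
meanSlice <- 2.33
round(effectiveFare - meanSlice, 2)
```

Under that assumption the effective fare rounds to $2.38, about a nickel more than the mean slice.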

Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represent the price of a subway ride with and without the bonus. We see that the price of a slice in Manhattan is right in line with the subway fare.

MenuPages even broke down Queens by neighborhood, so we can have a more specific plot.