Jared Lander

Last year, as I embarked on my NFL sports statistics work, I attended the Sloan Sports Analytics Conference for the first time. A year later, after a very successful draft, I was invited to present an R workshop at the conference.

My time slot was up against Nate Silver, so I didn’t expect many people to attend. Much to my surprise, when I entered the room every seat was taken and people were lining the walls and sitting in the aisles.

My presentation, which was unrelated to the work I did, analyzed the Giants’ probability of passing versus rushing and the probability of each receiver being targeted. It is available in the talks section of my site.

After the talk I spent the rest of the day fielding questions and gave away copies of R for Everyone and an NYC Data Mafia shirt.

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science and AI firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Government Data Science and AI Conferences; and author of R for Everyone.

Last night we celebrated Rounded Pi Day by rounding Pi to the ten-thousandths place to get 3.1416, which works nicely with the date 3/14/16. This was great after Mega Pi Day worked out so perfectly last year. And this all built upon previous years’ celebrations.

We ate a large quantity of pizza at Lombardi’s, and for the second year in a row we got the Pi Cake from Empire Cakes with peanut butter and chocolate flavors. The base was inscribed with historic approximations of Pi: 25/8, 256/81, 339/108, 223/71, 377/120, 3927/1250, 355/113, 62832/20000, 22/7.
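For the curious, here is a minimal sketch in R of how close each fraction on the cake comes to Pi. The variable names `approximations` and `errors` are illustrative only; the fractions themselves are the ones listed above, and the comparison uses R's built-in `pi` constant.

```r
# Historic approximations of Pi from the base of the cake
approximations <- c('25/8'=25/8, '256/81'=256/81, '339/108'=339/108,
                    '223/71'=223/71, '377/120'=377/120, '3927/1250'=3927/1250,
                    '355/113'=355/113, '62832/20000'=62832/20000, '22/7'=22/7)

# Absolute error of each fraction relative to R's built-in pi
errors <- abs(approximations - pi)

# Sort from most accurate to least accurate
print(sort(errors))

# Rounding pi to four decimal places gives the Rounded Pi Day value
print(round(pi, 4))
```

Of the fractions listed, 355/113 comes closest, while 3927/1250 and 62832/20000 both reduce to the rounded value 3.1416.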


Earlier this week, my company, Lander Analytics, organized our first public Bayesian short course, taught by Andrew Gelman, Bob Carpenter and Daniel Lee. Needless to say, the class sold out very quickly and left a long wait list. So we will schedule another public training (exact date TBD) and will make the same course available for private training.

This was the first time we utilized three instructors (as opposed to a main instructor and assistants, which we often use for large classes) and it led to an amazing dynamic. Bob laid the theoretical foundation for Markov chain Monte Carlo (MCMC), explaining it with both math and geometry, and discussed the computational considerations of performing simulation draws. Daniel led the participants through hands-on examples with Stan, covering everything from how to describe a model to efficient computation and debugging. Andrew gave his usual crowd-dazzling performance, using previous work as case studies of when and how to use Bayesian methods.

It was an intensive three days of training with an incredible amount of information. Everyone walked away knowing a lot more about Bayes, MCMC and Stan, eager to try out their new skills, and with an autographed copy of Andrew’s book, BDA3.

A big help, as always, was Daniel Chen, who put in so much effort making the class run smoothly, from securing the space to physically moving furniture and running all the technology.



On April 24th and 25th Lander Analytics and Work-Bench co-organized the (sold-out) inaugural New York R Conference. It was an amazing weekend of nerding out over R and data, with a little Python and Julia mixed in for good measure. People from all across the R community gathered to see rockstars discuss their latest and greatest efforts.

Highlights include:

Bryan Lewis wowing the crowd (there were literally gasps) with rthreejs implemented with htmlwidgets.

Hilary Parker receiving spontaneous applause in the middle of her talk about reproducible research at Etsy for her explainr, catsplainr and mansplainr packages.

James Powell speaking flawless Mandarin in a talk tangentially about Python.

Vivian Peng also receiving spontaneous applause for her discussion of storytelling with data.

Wes McKinney showing love for data.frames in all languages and sporting an awesome R t-shirt.

Dan Chen using Shiny to study Ebola data.

Andrew Gelman blowing away everyone with his keynote about Bayesian methods with particular applications in politics.

Videos of the talks are available at http://www.rstats.nyc/#speakers with slides being added frequently.

A big thank you to sponsors RStudio, Revolution Analytics, DataKind, Pearson, Brewla Bars and Twillory.

Next year’s conference is already being planned for April. To inquire about sponsoring or speaking please get in touch.

This year we celebrated Mega Pi Day with the date (3/14/15) covering the first five digits of Pi (3.1415). And of course, we unveiled the Pi Cake at 9:26 to get the next three digits. This year the cake came from Empire Cakes and was peanut butter flavored. We even had the bakery put as many digits as would fit around the cake.

A large group from the NYC Data Mafia came out and Scott Wiener of Scott’s Pizza Tours ensured we had the perfect assortment and quantity of pizza.

Scott making sure they get just the right pizza


So far this year I have logged many miles in the air and on the rails. In between trips to Minneapolis and Boston I spent about a month traveling through India and Southeast Asia, mainly to conduct R courses in Singapore and Kuala Lumpur for the likes of Intel, Micron, Celcom, Maxis, DBS and other similar companies. The training courses were organized through Revolution Analytics’ Singapore office. Given the success of the classes, there will be more opportunities this spring or summer in Singapore, Kuala Lumpur and also in Australia.

Quite a lot of material was covered, based on the offerings of my company, Lander Analytics, and the content of my book, R for Everyone.

Day 1 – Basics

Getting and installing R
The RStudio Environment
The basics of R
- Variables
- Data Types
- Reading data
- Calling functions
- Missing Data
Basic Math
Advanced Data Structures
- data.frames
- lists
- matrices
- arrays
Reading Data into R
- read.table
- RODBC
- Binary data
Matrix Calculations
Data Munging
- Base R
- plyr
- reshape2
Writing functions
Conditionals
Loops
String manipulation and regular expressions
Visualization
- Base R
- ggplot2

Day 2 – Modeling

Basic Statistics
- Probability Distributions
- Averages, standard deviations and correlations
- t-test
Linear Models
- Simple linear regression
- Multiple Regression
Generalized Linear Models
- Logistic Regression
- Poisson Regression
Survival Analysis
Assessing Model Quality
- MSE
- AIC
- BIC
- Residual Analysis
Time Series
Variable Selection

Day 3 – Machine Learning

Variable selection for high dimensional data with glmnet
Reduce uncertainty with weakly informative priors and Bayesian regression
K-Means clustering
Hierarchical clustering
Multidimensional scaling
Decision Trees for classification
Random Forests for ensembling decision trees
Bootstrap for measuring uncertainty
Cross validation for model assessment
Support Vector Machines
Neural Networks

Day 4 – Data Presentation and Portability

Reproducible reports using knitr
Basic Introduction to Markdown
Using knitr to automatically generate reports with embedded analytics
Using Markdown and knitr to automatically generate websites with embedded analytics
Using Markdown and knitr to make HTML5 slideshows with embedded analytics
Advanced plotting
Building R Packages
Shiny Overview

Day 5 – High Performance Computing with R

Benchmarking code using microbenchmark
The different speeds of various aggregation functions
- aggregate
- tapply
- plyr
- data.table
Fast manipulation using dplyr
Running dplyr commands in a database
Parallel Code
- foreach
- doParallel
- plyr
Integrating C++
- Rcpp

Given my extensive time abroad I thought it would be good to look at it all on a map using the Leaflet package in R.

Using the Google Maps API we can look up the latitude and longitude of the visited cities.

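library(XML)
library(plyr)

# Cities visited, in the order they are listed in the table below
cities <- c('Hong Kong', 'Haripal, India', 'Kolkata, India', 'Jaipur, India', 'Agra, India', 'Delhi, India', 
            'Singapore', 'Kuala Lumpur, Malaysia', 'George Town, Malaysia')

# Look up one place and return a one-row data.frame of its coordinates.
# Note: this keyless endpoint is the one used when the post was written;
# the service may now require HTTPS and an API key.
lat.long <- function(place)
{
    # URLencode() escapes the spaces in place names before they go into the query string
    theURL <- sprintf('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=%s', URLencode(place))
    doc <- xmlToList(theURL)
    data.frame(Place=place,
               Latitude=as.numeric(doc$result$geometry$location$lat),
               Longitude=as.numeric(doc$result$geometry$location$lng),
               stringsAsFactors=FALSE)
}

# Geocode each city and stack the results into one data.frame
places <- adply(cities, 1, lat.long)

# Drop the index column that adply adds and print the table
knitr::kable(places[, -1], digits=3, row.names=FALSE)

Place	Latitude	Longitude
Hong Kong	22.396	114.109
Haripal, India	22.817	88.105
Kolkata, India	22.573	88.364
Jaipur, India	26.912	75.787
Agra, India	27.177	78.008
Delhi, India	28.614	77.209
Singapore	1.352	103.820
Kuala Lumpur, Malaysia	3.139	101.687
George Town, Malaysia	5.415	100.330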
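The apply-and-stack pattern that adply provides above can also be sketched in base R. This is only an illustration: `offline.lat.long` and `known` are hypothetical stand-ins for the geocoder so the example runs without network access, and the coordinates are copied from the table above.

```r
# Hypothetical lookup table standing in for the geocoding service;
# the coordinates are copied from the table above
known <- data.frame(
    Place=c('Hong Kong', 'Singapore', 'Kuala Lumpur, Malaysia'),
    Latitude=c(22.396, 1.352, 3.139),
    Longitude=c(114.109, 103.820, 101.687),
    stringsAsFactors=FALSE
)

# Hypothetical offline replacement for lat.long(): returns a one-row data.frame
offline.lat.long <- function(place)
{
    known[known$Place == place, ]
}

# Call the function once per city, then stack the one-row results
someCities <- c('Hong Kong', 'Singapore')
stacked <- do.call(rbind, lapply(someCities, offline.lat.long))
rownames(stacked) <- NULL
print(stacked)
```

Unlike adply, this version does not add an index column, so there is nothing to drop before printing.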

Now that we have the coordinates we use Leaflet to plot them.

library(leaflet)
leaflet(data=places) %>%
    addTiles() %>%
    setView(90, 15, zoom=4) %>%
    addPopups(lng=~Longitude, lat=~Latitude, popup=~Place) %>%
    addPolylines(~Longitude, ~Latitude, data=places[c(1, 3, 2:9, 1), ]) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({iconUrl: 'https://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))

Calculating all the miles traveled could be as simple as looking it up on TripIt, or we could do some quick Haversine distance calculations with the geosphere package.

First, we get the coordinates for New York, Minneapolis and Boston to have a complete picture of the distance.

newCities <- adply(c('New York, NY', 'Minneapolis, MN', 'Boston, MA'), 1, lat.long)
allPlaces <- rbind(newCities[c(1, 2, 1), ], places[c(1, 3, 2:9, 1), ], newCities[c(1, 3, 1), ])

Then, in order to use distHaversine, we need to set up a from-and-to relationship between the places. The easiest way is to just shift the columns.

library(useful)

## Loading required package: ggplot2

shiftedPlaces <- shift.column(data=allPlaces, columns=names(places)[-1], newNames=c('To', 'Lat2', 'Long2'))

Now we can calculate the distance. This assumes that all trips followed a great circle, which might not be the case, especially for the car and rail portions of the trip.

library(geosphere)

## Loading required package: sp

shiftedPlaces$Distance <- distHaversine(shiftedPlaces[, c("Longitude", "Latitude")], shiftedPlaces[, c("Long2", "Lat2")], r=3959)

In total this led to 25,727 miles traveled.

knitr::kable(shiftedPlaces[, -1], digits=c(1, 3, 3, 1, 3, 3, 0), row.names=FALSE)

Place	Latitude	Longitude	To	Lat2	Long2	Distance
New York, NY	40.713	-74.006	Minneapolis, MN	44.978	-93.265	1016
Minneapolis, MN	44.978	-93.265	New York, NY	40.713	-74.006	1016
New York, NY	40.713	-74.006	Hong Kong	22.396	114.109	8046
Hong Kong	22.396	114.109	Kolkata, India	22.573	88.364	1642
Kolkata, India	22.573	88.364	Haripal, India	22.817	88.105	24
Haripal, India	22.817	88.105	Kolkata, India	22.573	88.364	24
Kolkata, India	22.573	88.364	Jaipur, India	26.912	75.787	844
Jaipur, India	26.912	75.787	Agra, India	27.177	78.008	138
Agra, India	27.177	78.008	Delhi, India	28.614	77.209	111
Delhi, India	28.614	77.209	Singapore	1.352	103.820	2574
Singapore	1.352	103.820	Kuala Lumpur, Malaysia	3.139	101.687	192
Kuala Lumpur, Malaysia	3.139	101.687	George Town, Malaysia	5.415	100.330	183
George Town, Malaysia	5.415	100.330	Hong Kong	22.396	114.109	1491
Hong Kong	22.396	114.109	New York, NY	40.713	-74.006	8046
New York, NY	40.713	-74.006	Boston, MA	42.360	-71.059	190
Boston, MA	42.360	-71.059	New York, NY	40.713	-74.006	190
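To make the shifting and the distance calculation concrete, here is a minimal base R sketch of the same idea. It is not the geosphere implementation: `haversine` is a hand-written helper using the standard haversine formula, `stops` holds rounded coordinates copied from the table above, and the radius of 3959 miles matches the value passed to distHaversine.

```r
# Great-circle distance via the haversine formula; r is the Earth's radius in miles
haversine <- function(lon1, lat1, lon2, lat2, r=3959)
{
    toRad <- pi / 180
    dLat <- (lat2 - lat1) * toRad
    dLon <- (lon2 - lon1) * toRad
    a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
    # pmin guards against rounding pushing sqrt(a) slightly above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# The New York to Boston round trip, with coordinates copied from the table above
stops <- data.frame(
    Place=c('New York, NY', 'Boston, MA', 'New York, NY'),
    Latitude=c(40.713, 42.360, 40.713),
    Longitude=c(-74.006, -71.059, -74.006),
    stringsAsFactors=FALSE
)

# Shift the rows by one so each stop is paired with the next stop
from <- stops[-nrow(stops), ]
to <- stops[-1, ]

# Distance of each leg and the total for the round trip
legs <- haversine(from$Longitude, from$Latitude, to$Longitude, to$Latitude)
print(round(legs))
print(round(sum(legs)))
```

Each leg should land close to the 190 miles shown for New York to Boston in the table, with small differences possible because the coordinates here are rounded to three decimals.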

leaflet(data=allPlaces) %>%
    addTiles() %>%
    setView(80, 20, zoom=3) %>%
    addPolylines(~Longitude, ~Latitude) %>%
    addMarkers(lng=~Longitude, lat=~Latitude, popup=~Place,
               icon=JS("L.icon({iconUrl: 'https://www.jaredlander.com/images/jaredlanderfavicon.png', iconSize: [20, 20]})"))

The other night I attended a talk about the history of Brooklyn pizza at the Brooklyn Historical Society by Scott Wiener of Scott’s Pizza Tours. Toward the end, a woman stated she had a theory that pizza slice prices stay in rough lockstep with New York City subway fares. Of course, this is a well-known relationship that even has its own Wikipedia entry, so Scott referred her to a New York Times article from 1995 that mentioned the phenomenon.

However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience.

Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice.

After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices, I found that the mean price was $2.33 with a standard deviation of $0.52 and a median price of $2.45. The base subway fare is $2.50 but is effectively $2.38 after the 5% bonus for putting at least $5 on a MetroCard.

So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride ($2.38).
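The fare comparison can be reproduced with a few lines of R. This is a sketch under one assumption: that the $2.38 figure comes from treating the 5% bonus as extra ride value, so each ride effectively costs the base fare divided by 1.05. The variable names are illustrative; the dollar amounts are the ones quoted above.

```r
base.fare <- 2.50    # base subway fare quoted in the post
bonus <- 0.05        # 5% MetroCard bonus
mean.slice <- 2.33   # mean slice price reported in the post

# Assumption: the bonus adds 5% of ride value, so every dollar paid
# buys $1.05 worth of rides and the effective fare is base / 1.05
effective.fare <- base.fare / (1 + bonus)
print(round(effective.fare, 2))

# Gap between the effective fare and the mean slice price
print(round(effective.fare - mean.slice, 2))
```

Under that assumption the effective fare rounds to $2.38, about five cents above the mean slice.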

Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represent the price of a subway ride with and without the bonus. We see that the price of a slice in Manhattan is right in line with the subway fare.

The average price of a slice in each borough. The dots are the means and the error bars are the two-standard-deviation confidence intervals. The two vertical lines represent the discounted subway fare and the base fare, respectively.

MenuPages even broke down Queens neighborhoods, so we can have a more specific plot.

The average price of a slice in each Manhattan, Brooklyn and Queens neighborhood. The dots are the means and the error bars are the two-standard-deviation confidence intervals. The two vertical lines represent the discounted subway fare and the base fare, respectively.

The code for downloading the menus and the calculations is after the break.


It’s here! Another Pi Day, another Pi Cake, as tradition demands. Once again from Chrissie Cook.

Previous cakes in the gallery after the break.


Based on data collected from polls conducted at the beginning of the New York Open Statistical Programming meetups.

After two years of writing, editing, proofreading and checking, my book, R for Everyone, is finally out!

There are so many people who helped me along the way, especially my editor Debra Williams, production editor Caroline Senay and the man who recruited me to write it in the first place, Paul Dix. Even more people helped throughout the long process, but with so many to mention I’ll leave that to the acknowledgements page.

Online resources for the book are available (https://www.jaredlander.com/r-for-everyone/) and will continue to be updated.

As of now the three major sites to purchase the book are Amazon, Barnes & Noble (available in stores January 3rd) and InformIT. And of course digital versions are available.

MIT Sloan Sports Analytics Conference

Pi Cake 2016

First Bayesian Short Course

2015 New York R Conference

Pi Cake 2015

Teaching R in Asia

Average Cost of a New York Slice in 2014

Pi Cake 2014

Pizza Ratings

My Book is Out