Fiore Subway Car

The other night I attended a talk about the history of Brooklyn pizza at the Brooklyn Historical Society by Scott Wiener of Scott’s Pizza Tours. Toward the end, a woman stated she had a theory that pizza slice prices stay in rough lockstep with New York City subway fares. Of course, this is a well known relationship that even has its own Wikipedia entry, so Scott referred her to a New York Times article from 1995 that mentioned the phenomenon.

However, he wondered if the preponderance of dollar slice shops has dropped the price of a slice below that of the subway and playfully joked that he wished there was a statistician in the audience.

Naturally, that night I set off to calculate the current price of a slice in New York City using listings from MenuPages. I used R’s XML package to pull the menus for over 1,800 places tagged as “Pizza” in Manhattan, Brooklyn and Queens (there was no data for Staten Island or The Bronx) and find the price of a cheese slice.

After cleaning up the data and doing my best to find prices for just cheese/plain/regular slices I found that the mean price was $2.33 with a standard deviation of $0.52 and a median price of $2.45. The base subway fare is $2.50 but is actually $2.38 after the 5% bonus for putting at least $5 on a MetroCard.

So, even with the proliferation of dollar slice joints, the average slice of pizza ($2.33) lines up pretty nicely with the cost of a subway ride ($2.38).

Taking it a step further, I broke down the price of a slice in Manhattan, Queens and Brooklyn. The vertical lines represented the price of a subway ride with and without the bonus.  We see that the price of a slice in Manhattan is perfectly right there with the subway fare.

The average price of a slice in each borough.  The dots are the means and the error bars are the two standard deviation confidence intervals.  The two vertical lines represent the discounted subway fare and the base far, respectively.

MenuPages even broke down Queens Neighborhoods so we can have a more specific plot. The average price of a slice in each Manhattan, Brooklyn and Queens neighborhoods.  The dots are the means and the error bars are the two standard deviation confidence intervals.  The two vertical lines represent the discounted subway fare and the base far, respectively.

The code for downloading the menus and the calculations is after the break.

First we have to get the URLs for each place tagged as “Pizza” which we get from http://www.menupages.com/restaurants/all-areas/all-neighborhoods/pizza and all of its subsequent pages. My XML isn’t the greatest so this might be unnecessarily complicated.

require(XML)
require(plyr)
require(stringr)

numPages <- 19

get.restaurants <- function(address)
{
    # parse the html so it's ready to use
    doc <- htmlParse(address)

    # build data.frame of the URL for the restaurant page and the name of the place
    # do not do recursive so it does not grab the span with the row number
    placeNameLink <- ldply(xpathApply(doc, "//table/tr/td[@class='name-address']/a[@class='link']", fun = function(x){ c(Name=xmlValue(x, recursive = F), Link=xmlAttrs(x)[2]) }))

    # Figure out how many stars the place is rated based on the URL for the star image
    placeRatingImg <- ldply(xpathApply(doc, "//table/tr/td[@class='rating']/img", fun=function(x) c(Stars=xmlAttrs(x)[1])))
    placeRating <- str_extract(placeRatingImg$Stars.src, pattern="\\d(_\\d)?")
    placeRating <- as.numeric(str_replace(placeRating, "_", "."))

    # Figure out the number of dolar signs
    placePrice <- xpathApply(doc, "//table/tr/td[@class='price']", fun=function(x) nchar(xmlValue(x)))
    placePrice <- sapply(placePrice, identity)

    # Find the number of reviews
    placeReviews <- xpathApply(doc, "//table/tr/td[@class='reviews']", xmlValue)
    placeReviews <- sapply(placeReviews, as.numeric)

    # combine them together
    places <- cbind(placeNameLink, Rating=placeRating, Price=placePrice, Reviews=placeReviews)

    return(places)
}

urls <- sprintf("http://www.menupages.com/restaurants/all-areas/all-neighborhoods/pizza/%s/", seq_len(numPages))
places <- adply(urls, 1, get.restaurants)

Then we need to download each of those pages individually. To speed things up we do this in parallel using the foreach and doParallel packages.

require(foreach)
require(doParallel)

dataDir <- "data"
subFolder <- "menuDir"

# get rid of listing with special character
places <- places[which(!str_detect(places$Link.href, "Ã|Â")), ]
rownames(places) <- NULL

# get root url for file name
places$FileName <- file.path(dataDir, subFolder, sprintf("%s.%s", str_replace_all(places$Link.href, "(/restaurants/)|(/$)", ""), "rdata"))

# spell out the full URL
places$Link.href <- sprintf("http://www.menupages.com%smenu", places$Link.href)

# setup the cluster
cl <- makeCluster(2)
registerDoParallel(cl)

foreach(i=539:nrow(places)) %dopar%
{
    thePage <- readLines(places$Link.href[i])
    save(thePage, file=places$FileName[i])
}

# stop the cluster
stopCluster(cl)

Now we only keep the places that at least mention the word slice in the menu.

require(dplyr)
# clean up the data a bit
places <- places[which(!str_detect(places$FileName, "\\t")), ]
places <- places %>% select(-X1)

# setup the cluster
cl <- makeCluster(2)
registerDoParallel(cl)

# see if slice is mentioned at least once
containsSlice <- parSapply(cl, places$FileName, function(x){ load(x); sum(stringr::str_detect(thePage, stringr::ignore.case("slice"))) })

# stop the cluster
stopCluster(cl)

places$Slice <- containsSlice

places <- places %>% filter(Slice >= 1)

To parse the menus we need a couple of helper functions.

extract.price <- function(x)
{
    as.numeric(str_replace_all(x, "[^0-9.]", ""))
}

get.items <- function(place, item="slice")
{
    # load place html
    load(place)
    # render it to grab info
    pageRender <- htmlParse(thePage)
    # get address info
    address <- xpathApply(pageRender, "//li[@class='address adr']/span[@class='addr street-address']", fun=xmlValue)[[1]]
    city <- xpathApply(pageRender, "//li[@class='address adr']/span/span[@class='locality']", fun=xmlValue)[[1]]
    state <- xpathApply(pageRender, "//li[@class='address adr']/span/span[@class='region hide-microformat']", fun=xmlValue)[[1]]
    zip <- xpathApply(pageRender, "//li[@class='address adr']/span/span[@class='postal-code']", fun=xmlValue)[[1]]

    # sometimes the slices are mentioned in the header block, not in the table itself
    headers <- xpathSApply(pageRender, "//*[@id='restaurant-menu']/h3", xmlValue)

    # get items listed in tables
    items <- xpathSApply(pageRender, "//table[starts-with(@class, 'prices-')]")
    items <- lapply(items, readHTMLTable, stringsAsFactors=FALSE)

    # if the number of headers don't match up with the number of tables, just return nothing
    if(length(headers) != length(items))
        return(NULL)

    # combine headers with tables
    names(items) <- headers
    items <- ldply(items)
    names(items) <- c("Category", "Item", "Open1", "Open2", "Price")

    # only keep rows that mention slice
    items <- items %>% filter(str_detect(Item, ignore.case(item)) | str_detect(Category, ignore.case(item)))

    # if no rows do return nothing
    if(nrow(items) == 0)
        return(NULL)

    # get pricing information
    items$Open1 <- extract.price(items$Open1)
    items$Open2 <- extract.price(items$Open2)
    items$Price <- extract.price(items$Price)

    # get ride of special characters
    items$Item <- str_replace_all(items$Item, "Â", "")

    # combine address info with menu items
    cbind(Name=place, Address=address, Boro=city, Zip=zip, items, stringsAsFactors=FALSE)
}

We apply those functions to each of the menus. In this case sequential processing was faster than parallel.

allItems <- adply(places$FileName, 1, function(x){ print(x); get.items(x) })

With all of this information we are almost ready to run some numbers. But first we need to get rid of any items that are not cheese slices so we build up a lexicon of things that don’t look right such as references to mushrooms, meat, vegetables, pepperoni, sandwich, etc. . .. We also eliminate anything priced over $6 as this usually indicates a different type of item all.together.

allItems$Name <- str_replace_all(allItems$Name, "(data/menuDir/)|(\\.rdata)", "")
allItems$Name <- str_replace_all(allItems$Name, "-", " ")

allItems$Name <- str_replace_all(allItems$Name, "(data/menuDir/)|(\\.rdata)", "")
allItems$Name <- str_replace_all(allItems$Name, "-", " ")
specialNames <- c("grandma", "sicilian", "white", "topping", "salad", "meat", "extra", "Vege", "pepper", "pepp", "sausage", "mushroom", "onion", "anchov", "sand", "roll", "chicken", "strombol", "pan", "beef", "artichoke", "two", "three", "pineapple", "spinach", "black", "olive", "special", "Linguine", "veal", "pasta", "wheat", "fresh", "korea", "tripple", "calzone", "breakfast", "wrap", "soda", "can", "fruit", "square", "brocc", "marg", "nea", "prosc", "vodka", "hazel", "buffalo", "pork", "beet", "almond", "nut", "grand", "grand", "egg", "hawaii", "Bruschetta", "arugula", "apple", "tart", "steak", "chicago", "froze", "app", "deep", "bake", "ziti", "tomato", "napoli", "gourment", "fried", "double", "bbq", "feta", "hero", "broil", "ham", "fry", "knot", "bread", "supreme", "ricotta", "cookie", "shashim", "bistro", "Lasagna", "avoca", "bianca", "gourmet", "comb", "bacon", "cake", "pollo", "pastram", "\\d", "matzo", "taco", "Sfincioni", "stuff", "banana", "potato", "works", "garlic", "macaroni", "Magarita", "Designer", "nonna", "item", "Neopolitan", "Parmigiana", "granola", "pie", "Focaccia", "lime", "mango", "carrot", "blanca", "kirby", "Primavera", "delia", "Focaccia", "Buff", "chick", "Napoleon", "day", "V\\.I\\.P\\.", "Primavera", "fresca", "VIP", "ravioli", "pinwheel", "Jalapeno", "gyro", "shrimp", "old", "Jalapeno", "Marinara", "ronni", "Gramma", "tirami", "jumbo", "upside", "down", "seafood", "sea", "Calamari", "Sweet", "spice", "spicy", "Kishke", "falafel", "french", "Sfingione", "Green", "lox", "greek", "Barbacoa", "peper", "pepper", "Spigoli", "Napoletana", "baby", "Bianoco", "farm", "Casalinga", "Verdu", "madrid", "Sirignese", "penne", "pesto", "qua", "Tiramisu", "garden", "Parmigiano", "veg", "Brooklyn Pizza Slice", "New Yorker Pizza Slice", "gluten", "Zucchini", "salmon", "Sorrento", "jun", "house", "Contadina", "Caprese", "Arrabiata", "Fiorentina", "Fajita", "bagel", "stop", "salami", "lamb", "V.I.P", "Pedazo", "Foccaccia", "Roni", "Carpaccio", "filet", "grama", "kid", "Bologna", "Turkey", "swiss", "provo", "Salami", "Sopressata", "Sopressata", "Alpine", "Jarlsberg", "Muenster", "Cheddar", "Delicious", "Four", "4", "Portobello", "Foccacia", "cheeseless", "every", "Cucumbers", "soup", "Choco", "bite", "Foccacia", "jalep", "Strawberries", "cracker", "gramm", "Foccacia", "pepsi", "Mona", "Gouda", "Brie", "havarti", "Mortadella", "nona", "Crispino", "jalap", "thick", "Sicilain", "whole")
pattern <- sprintf("(%s)", paste(specialNames, collapse=")|("))
slices <- allItems %>% select(-X1) %>% filter(!str_detect(allItems$Item, ignore.case(pattern))) %>% filter(Price <= 6 | is.na(Price))
dim(slices)
slices$Price[which(is.na(slices$Price))] <- as.numeric(str_extract(slices$Category[which(is.na(slices$Price))], "\\d(\\.\\d{2})?"))
slices$Price[is.na(slices$Price)] <- as.numeric(str_extract(slices$`NA`[is.na(slices$Price)], "\\d(\\.\\d{2})?"))

slices$Borough <- ifelse(slices$Boro == "New York", "Manhattan", ifelse(slices$Boro == "Brooklyn", "Brooklyn", "Queens"))

Finally time to run some numbers.

mean(slices$Price, na.rm=TRUE)
## [1] 2.332
median(slices$Price, na.rm=TRUE)
## [1] 2.45
sliceBorough <- slices %>% group_by(Borough) %>% dplyr::summarize(Mean=mean(Price, na.rm=TRUE), Median=median(Price, na.rm=TRUE), SD=sd(Price, na.rm=TRUE), count=length(Price)) %>% mutate(Lower=Mean - 2*SD, Upper=Mean + 2*SD)
sliceHood <- slices %>% group_by(Boro, Borough) %>% dplyr::summarize(Mean=mean(Price, na.rm=TRUE), Median=median(Price, na.rm=TRUE), SD=sd(Price, na.rm=TRUE), count=length(Price)) %>% mutate(Lower=Mean - 2*SD, Upper=Mean + 2*SD)

sliceBorough
## Source: local data frame [3 x 7]
## 
##     Borough  Mean Median     SD count Lower Upper
## 1  Brooklyn 2.301   2.25 0.4854   219 1.330 3.272
## 2 Manhattan 2.456   2.50 0.6201   240 1.216 3.696
## 3    Queens 2.210   2.25 0.3508   192 1.509 2.912
sliceHood
## Source: local data frame [37 x 8]
## Groups: Boro
## 
##                   Boro   Borough  Mean Median      SD count  Lower Upper
## 1              Astoria    Queens 2.281  2.250 0.28864    31 1.7034 2.858
## 2              Bayside    Queens 2.375  2.350 0.09574     4 2.1835 2.566
## 3            Bellerose    Queens 2.450  2.450     NaN     1    NaN   NaN
## 4             Brooklyn  Brooklyn 2.301  2.250 0.48541   219 1.3305 3.272
## 5        College Point    Queens 2.000  2.000 0.00000     2 2.0000 2.000
## 6               Corona    Queens 2.185  2.165 0.20345     6 1.7781 2.592
## 7        East Elmhurst    Queens 2.300  2.300     NaN     2    NaN   NaN
## 8             Elmhurst    Queens 1.950  2.000 0.08660     3 1.7768 2.123
## 9         Far Rockaway    Queens 2.500  2.500 0.70711     2 1.0858 3.914
## 10         Floral Park    Queens 2.350  2.350     NaN     1    NaN   NaN
## 11            Flushing    Queens 2.435  2.400 0.42051    13 1.5936 3.276
## 12        Forest Hills    Queens 2.393  2.500 0.19670     7 1.9995 2.786
## 13       Fresh Meadows    Queens 2.500  2.500 0.00000     4 2.5000 2.500
## 14              Hollis    Queens 2.650  2.650     NaN     1    NaN   NaN
## 15        Howard Beach    Queens 2.750  2.750 0.00000     2 2.7500 2.750
## 16     Jackson Heights    Queens 2.131  2.000 0.26059    11 1.6097 2.652
## 17             Jamaica    Queens 2.136  2.000 0.49653    12 1.1433 3.129
## 18         Kew Gardens    Queens 2.250  2.250     NaN     1    NaN   NaN
## 19         Little Neck    Queens 2.500  2.500     NaN     2    NaN   NaN
## 20    Long Island City    Queens 2.212  2.250 0.33885     8 1.5348 2.890
## 21             Maspeth    Queens 1.939  2.250 0.65768    10 0.6236 3.254
## 22      Middle Village    Queens 2.125  2.250 0.25000     4 1.6250 2.625
## 23            New York Manhattan 2.456  2.500 0.62013   240 1.2160 3.696
## 24     Oakland Gardens    Queens 2.250  2.250 0.00000     3 2.2500 2.250
## 25          Ozone Park    Queens 2.036  2.000 0.33630     7 1.3631 2.708
## 26      Queens Village    Queens 2.250  2.250     NaN     1    NaN   NaN
## 27           Rego Park    Queens 2.375  2.375 0.17678     2 2.0214 2.729
## 28       Richmond Hill    Queens 2.220  2.250 0.13038     5 1.9592 2.481
## 29           Ridgewood    Queens 2.106  2.250 0.32983    18 1.4459 2.765
## 30            Rosedale    Queens 2.050  1.900 0.39686     3 1.2563 2.844
## 31    South Ozone Park    Queens 2.750  2.750     NaN     1    NaN   NaN
## 32 South Richmond Hill    Queens 2.250  2.250 0.35355     2 1.5429 2.957
## 33           St Albans    Queens 2.250  2.250     NaN     1    NaN   NaN
## 34           Sunnyside    Queens 2.200  2.250 0.20917     5 1.7817 2.618
## 35          Whitestone    Queens 2.050  2.000 0.44721     5 1.1556 2.944
## 36           Woodhaven    Queens 1.917  1.750 0.28868     3 1.3393 2.494
## 37            Woodside    Queens 2.128  2.200 0.13489     9 1.8580 2.398

And now we use ggplot to make the plots.

require(ggplot2)
require(scales)
ggplot(sliceBorough, aes(x=Mean, y=reorder(Borough, Mean))) + geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=0, color="blue") + geom_point(color="blue") + geom_vline(xintercept=c(2.38, 2.50), linetype=2, color="grey") + ylab("Neighborhood") + ggtitle("Slice Price") + scale_x_continuous(labels=dollar)

plot of chunk pizza-plots

ggplot(sliceHood, aes(x=Mean, y=reorder(Boro, Mean), color=Borough)) + geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=0) + geom_point() + geom_vline(xintercept=c(2.38, 2.50), linetype=2, color="grey") + ylab("Neighborhood") + ggtitle("Slice Price") + scale_x_continuous(labels=dollar)

plot of chunk pizza-plots

Related Posts



Jared Lander is the Founder and CEO of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York R Conference and author of R for Everyone.

4 thoughts on “Average Cost of a New York Slice in 2014

  1. Hey Jared,

    Very interesting and nicely done! But could you tell me why the prices rise together? is there a reason that the cost rises together? like does the head of the metro really like pizza or something?

    Thanks
    Rich

    Reply

Leave a reply

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> 

required