Slice recently reported that Fark user “Certainly You Jest” tabulated a list of the 25 most mentioned pizzerias.  Naturally, I decided to play with the numbers.  Rather than write up another formal paper, I did some quick ad hoc analysis for posting on this blog and I will skip some of the more technical aspects.

First, I augmented the data with the price of a typical plain pie that could feed two to four people and the pizzeria’s distance from New York City.  Adding the distance meant I had to remove the multi-state chains, like Monical’s, from the data.

While the number of times a pizzeria is mentioned is count data, it doesn’t quite fit a poisson distribution, and the poisson regression didn’t seem to be a good fit.  This makes sense since I have three predictors (distance from New York, price and their interaction).  You can see this in the two histograms below.

Since the regression didn’t fit the data well, I decided to run a few clustering algorithms on the data.  Typically clustering is done on datasets with a lot more variables, but this is for fun, so why not.

To figure the proper number of clusters to use I employed a measure known as Hartigan’s Rule.  This provides a general rule of thumb for how many clusters to choose:  If the Hartigan’s Number is greater than 10, then it is worthwhile to add another cluster. So we want the highest number of clusters that still score above a 10 in Hartigan’s Rule.  Looking at the graph below, you see that three or four clusters seem ideal.

Since clustering algorithms such as k-Means rely on Euclidean distance between objects it is no surprise that the biggest separating characteristic for the pizzerias is their distance from New York, which has the largest spread of the three variables.  I tried minimizing the affect of geographic distance by standardizing the variable and by taking its log.  This expectedly led to the same results.

The graph below shows the result of clustering with three means.  This clearly shows that the geographic component was the strongest.  The three distinct clusters are the New York area (including Pittsburgh, Trenton, New Haven and Boston), the Chicago area and pizzerias in Oakland and Phoenix.  Using four clusters nicely separates the Boston and Pittsburgh locations away from the Trenton-New York-New Haven group.

Another clustering method, Partitioning around Medoids (PAM), is more robust to outliers so I tried that with similar results.  I included a slightly different visualization below.

An interesting next step would be to perform a social network analysis.  This would require knowing what sites the pizzerias were mentioned on and would be even better if the posters were identifiable.

To wrap things up, given the data at hand, it becomes clear that biggest separating factor between the pizzerias is distance from New York and price has little affect.  Another interesting project would be to throw down the statistical gauntlet between New York and Chicago. Certainly, collecting the data will be a lot of fun.

For in-depth information on Hartigan’s Rule, k-Means Clustering, and PAM see this extensive presentation by David Madigan at Columbia.  The data I used is available here and my adhoc R code is also online.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

## 4 thoughts on “Top Pizzerias by Blog Shout-Outs”

1. I am a little confused.
In the computation of the Hartigan numbers, there is N (that is, the number of entities in the data set) with multiplication.
So, if N is very large (for example, N = 1000), the number of clusters will be very high. What do you think about this case ?

I have just found this problem of the Hartigan Rule interesting to me. I tested the Rule with a sample data set (where the number of objects is about 10000), and noticed that the number of natural clusters suggested by Hartigan Rule is really high.

I think, the Within Sum of Squares (often call W_k) is important. But, in the Hartigan Rule, the division (W_k / W_{k + 1}) tends to destroy the effect of the Within Sum of Squares. The ratio (W_k / W_{k + 1}) will not be very small. So, when multiplied by (N – K – 1), the suggested number of clusters is very high. What do you think ?