Slice recently reported that Fark user “Certainly You Jest” tabulated a list of the 25 most mentioned pizzerias. Naturally, I decided to play with the numbers. Rather than write up another formal paper, I did some quick ad hoc analysis for posting on this blog and I will skip some of the more technical aspects.

First, I augmented the data with the price of a typical plain pie that could feed two to four people and the pizzeria’s distance from New York City. Adding the distance meant I had to remove the multi-state chains, like Monical’s, from the data.

While the number of times a pizzeria is mentioned is count data, it doesn’t quite fit a poisson distribution, and the poisson regression didn’t seem to be a good fit. This makes sense since I have three predictors (distance from New York, price and their interaction). You can see this in the two histograms below.

Since the regression didn’t fit the data well, I decided to run a few clustering algorithms on the data. Typically clustering is done on datasets with a lot more variables, but this is for fun, so why not.

To figure the proper number of clusters to use I employed a measure known as Hartigan’s Rule. This provides a general rule of thumb for how many clusters to choose: If the Hartigan’s Number is greater than 10, then it is worthwhile to add another cluster. So we want the highest number of clusters that still score above a 10 in Hartigan’s Rule. Looking at the graph below, you see that three or four clusters seem ideal.

Since clustering algorithms such as k-Means rely on Euclidean distance between objects it is no surprise that the biggest separating characteristic for the pizzerias is their distance from New York, which has the largest spread of the three variables. I tried minimizing the affect of geographic distance by standardizing the variable and by taking its log. This expectedly led to the same results.

The graph below shows the result of clustering with three means. This clearly shows that the geographic component was the strongest. The three distinct clusters are the New York area (including Pittsburgh, Trenton, New Haven and Boston), the Chicago area and pizzerias in Oakland and Phoenix. Using four clusters nicely separates the Boston and Pittsburgh locations away from the Trenton-New York-New Haven group.

Another clustering method, Partitioning around Medoids (PAM), is more robust to outliers so I tried that with similar results. I included a slightly different visualization below.

An interesting next step would be to perform a social network analysis. This would require knowing what sites the pizzerias were mentioned on and would be even better if the posters were identifiable.

To wrap things up, given the data at hand, it becomes clear that biggest separating factor between the pizzerias is distance from New York and price has little affect. Another interesting project would be to throw down the statistical gauntlet between New York and Chicago. Certainly, collecting the data will be a lot of fun.

For in-depth information on Hartigan’s Rule, k-Means Clustering, and PAM see this extensive presentation by David Madigan at Columbia. The data I used is available here and my adhoc R code is also online.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

I am a little confused.

In the computation of the Hartigan numbers, there is N (that is, the number of entities in the data set) with multiplication.

So, if N is very large (for example, N = 1000), the number of clusters will be very high. What do you think about this case ?

That is a fair point. But it is there to see if the change in Within Sum of Squares is significant or just due to large data. See this great PowerPoint from David Madigan: http://www.stat.columbia.edu/~madigan/DM08/descriptive.ppt.pdf

thanks for your discussion, and for the link 🙂

I have just found this problem of the Hartigan Rule interesting to me. I tested the Rule with a sample data set (where the number of objects is about 10000), and noticed that the number of natural clusters suggested by Hartigan Rule is really high.

I think, the Within Sum of Squares (often call W_k) is important. But, in the Hartigan Rule, the division (W_k / W_{k + 1}) tends to destroy the effect of the Within Sum of Squares. The ratio (W_k / W_{k + 1}) will not be very small. So, when multiplied by (N – K – 1), the suggested number of clusters is very high. What do you think ?

The division is there to compare a clustering of k groups versus a clustering of k+1 groups, so I don’t think the division destroys the effect. It’s sort of like a likelihood ratio test. And then the (N-k-1) is a standardizing constant.

I believe, W_{k+1} should be larger than W_k, so the ratio will be less than one which is indeed small, and multiplying by (N-k-1) helps see if that change is really significant.

I do get what you’re saying, but I think the ratio is typically less than 1 (which I think it is), so you have to think about it a little differently.

Then again, I could be off on the ratio, so check into that before taking what I say at face value.