Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste. While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.
The first thing to note is that there are only 16 data points, so multiple regression is not an option. We can all thank the Curse of Dimensionality for that. So I stuck to simpler methods and visualizations. If I can get the raw data from Slice, I can get a little more advanced.
For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult. As usual, most of the graphics were made using ggplot2 by Hadley Wickham. The coefficient plots were made using a little function I wrote. Here is the code. Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left-hand margin of the plot. And as always, all the work was done in R.
The most obvious relationship we want to test is Overall Quality vs. Price. As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.
Further, fitting a linear regression of quality on price yields a statistically insignificant result. The coefficient plot is a graphical representation of the regression results. Controlling for the round of taste testing did not significantly change the results. With so few observations, adding any more variables than that would have overfit the model.
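For anyone curious what sits behind a coefficient plot, here is a minimal sketch of a one-variable regression and the rough "estimate plus or minus two standard errors" interval such a plot typically draws. It is written in plain Python rather than the R used for this post, the function name simple_ols is my own invention for illustration, and the numbers are made-up placeholders, not the Slice data.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, standard error of the slope).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 points")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)

    return slope, intercept, se_slope


# Hypothetical placeholder values, NOT the taste test data
price = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
quality = [5.1, 6.3, 4.8, 6.0, 5.5, 6.1, 4.9, 5.7]

slope, intercept, se = simple_ols(price, quality)
low, high = slope - 2 * se, slope + 2 * se

print(f"slope = {slope:.3f}, rough interval = ({low:.3f}, {high:.3f})")
print("interval covers zero:", low <= 0 <= high)
```

If the interval covers zero, the slope is not distinguishable from no effect at roughly the 5% level. The two-standard-error band is a rule of thumb; with this few points a proper t-based interval would be somewhat wider.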
In the original Slice post Adam was concerned that the round of testing would have an impact on the results. As seen below, the round alone did not have a significant effect on quality.
A lot of attention is paid to the merits of San Marzano tomatoes. The boxplot shows a comparison of San Marzano and regular tomatoes. Since some statisticians don’t like boxplots, I made a similar plot with just dots.
As can be seen from the coefficient plot, San Marzano tomatoes are not significantly better than other tomatoes.
Similarly, being DOP certified (only San Marzano tomatoes can be certified thus) does not make a difference.
Below is a matrix comparing quality for varying prices, test rounds, the San Marzano strain and DOP certification. I did not perform any statistical tests as there were too many variables to get a proper fit.
Testing whether DOP certified San Marzano tomatoes are better than non-DOP certified San Marzano tomatoes again returns insignificant results.
We can see from the scatterplot below that the source of the tomatoes does not impact the overall quality. I did fit a regression to test this but neither ggplot nor my own coefficient plot could visualize this well. I put the coefficient plot below to illustrate how my margins cut off the labels. I am working on code to better display the names of factors, but it’s not ready yet.
Given more time I would use some nonparametric methods, but that’s for another day. The data (which have been modified a little from the original in order to be read properly from the CSV) and code are available for everyone.
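As one example of what a nonparametric approach could look like, here is a minimal sketch of a two-sided permutation test for a difference in group means. This is not something I ran on the tomato data. It is plain Python rather than R, the function name permutation_test and its parameters are my own choices for illustration, and the scores are made-up placeholders.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating point ties
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical placeholder scores, NOT the taste test data
san_marzano_scores = [5.8, 6.1, 4.9, 5.5, 6.0]
other_scores = [5.2, 6.4, 5.0, 5.9, 5.6, 4.7]

p_value = permutation_test(san_marzano_scores, other_scores)
print(f"estimated p-value = {p_value:.3f}")
```

The appeal with a sample this small is that the test makes no assumption about the scores being normally distributed; it only asks how unusual the observed gap is among all the ways the labels could have been shuffled.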
So after all that, what can we determine? That cheap tomatoes can be just as good as or better than expensive tomatoes. That the San Marzano strain is not necessarily superior to other tomatoes and that among San Marzano tomatoes, having DOP certification does not guarantee quality.
If you ask, I’ll send a zip file of visualizations of the other measurements such as Sweetness, Acidity and Texture. WordPress just makes it too difficult for me to add the pictures one at a time. For each measure there are two graphs: one using Source as a facet and Round as a color, the other using Round as a facet and Source as a color. I did this because sometimes it’s easier to find a pattern when the data are displayed in different ways.
Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.