The other week I finally made it to the Ed Tufte exhibit in Chelsea. The gallery is a collection of his art and is not about data. As he tells it, data is not what matters; information is, and his art conveys information of all kinds. Going on a Saturday means you’ll get a tour from the artist himself. Hearing him describe his art and the way the eye and mind see it is fascinating.
We had a chance to chat briefly about data (how could I resist?) and he reinforced the notion that the medium, whether code or graphics, doesn’t matter. He “would use sock puppets to get his point across” if that was necessary. That is something all data visualists should keep in mind.
The Wall Street Journal is reporting that, despite all the concern around gerrymandering, the upcoming redistricting probably won’t have much effect on upcoming elections. Gary King is mentioned as having written a paper “that helped demonstrate the relative impotence of partisan redistricting” yet “he favors the efforts to create a statistical method that would replace it.” I personally am always for using math and hard numbers to solve a problem whenever possible.
The article also mentioned that at a “conference last year in Washington, D.C., researchers proposed alternatives.” David Epstein presented a paper at that conference that Andy Gelman and I worked on.
While the article quoted one of Dr. Gelman’s papers, it unfortunately did not mention him, or any of us, by name. However, the accompanying blog post did mention both Drs. Gelman and Epstein, with specific quotes from them and their work.
As requested, I have posted the presentation for all to see. Please feel free to contact me with any questions. The data and R code are also posted, and we will post at least the presentation on the Meetup page. Everything is also available in one convenient package at GitHub.
Update: Harlan wrote up a great summary of the night.
A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right. I’ve seen people on Twitter asking how to build this, and there has been an option available using Andy Gelman’s coefplot() in the arm package. Not knowing this, I built my own (as seen in this post about taste testing tomatoes), and they both suffered from the same problems: long coefficient names often got cut off by the left margin of the graph, and the name of the variable was appended to all the levels of a factor. One big difference between his and mine is that his does not include the Intercept by default. Mine includes the intercept with the option of excluding it.
I managed to solve the latter problem pretty quickly using some regular expressions. Now the levels of factors are displayed alone, without being prefixed by the factor name. As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham, which deals with the margins better than I do.
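The regular-expression step can be sketched roughly as follows. This is not the code from plotCoef(): the helper name strip_factor_prefix() and the toy data frame are made up for illustration, and the sketch assumes an lm fit whose factor predictors are listed in fit$xlevels and whose variable names contain no regex metacharacters.

```r
# Toy data, invented purely for illustration: one numeric predictor and one factor
set.seed(1)
df <- data.frame(
  y     = rnorm(30),
  x     = rnorm(30),
  color = factor(rep(c("red", "green", "blue"), 10))
)

fit <- lm(y ~ x + color, data = df)

# Coefficient names as R reports them: factor levels carry the variable name
coef_names <- names(coef(fit))
# "(Intercept)" "x" "colorgreen" "colorred"

# Hypothetical helper: remove each factor's name from the start of the
# coefficient names, leaving only the level
strip_factor_prefix <- function(coef_names, factor_vars) {
  for (v in factor_vars) {
    # "^" anchors the match so only a leading occurrence is removed
    coef_names <- sub(paste0("^", v), "", coef_names)
  }
  coef_names
}

# fit$xlevels is a named list holding the levels of each factor in the model
clean_names <- strip_factor_prefix(coef_names, names(fit$xlevels))
print(clean_names)
# "(Intercept)" "x" "green" "red"
```

One caveat with this simple approach: a numeric predictor whose name happens to start with a factor's name (say, a hypothetical colorfulness alongside color) would be trimmed too, so a more careful version would match only full variable-plus-level combinations.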
Both of these changes made for a vast improvement over what I had available before. Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.
The function is in this file and is called plotCoef(). It is very customizable, down to the color and line thickness. I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be. I sent the code to Dr. Gelman in the hope that it will be incorporated into his function, which I’m sure gets used by a lot more people than mine will. Examples of my old version and of Dr. Gelman’s are after the break.
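For readers who have not seen one built, here is a bare-bones sketch of a coefficient plot in ggplot2. It is not plotCoef() or coefplot(); it uses the built-in mtcars data, and the choice of 95% confidence intervals from confint() is my assumption for the illustration rather than what either function draws.

```r
library(ggplot2)

# Any fitted regression will do; mtcars ships with R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Point estimates and 95% confidence intervals (interval choice is an assumption)
ci <- confint(fit, level = 0.95)
coef_df <- data.frame(
  term      = names(coef(fit)),
  estimate  = unname(coef(fit)),
  lower     = ci[, 1],
  upper     = ci[, 2],
  row.names = NULL
)

# Optionally drop the intercept, which often dwarfs the other coefficients
coef_df <- coef_df[coef_df$term != "(Intercept)", ]

# One row per coefficient: a dot for the estimate, a line for the interval,
# and a dashed reference line at zero
p <- ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_segment(aes(x = lower, xend = upper, y = term, yend = term)) +
  geom_point() +
  labs(x = "Estimate", y = NULL)

print(p)
```

Because the coefficient names sit on a ggplot2 axis, the plot sizes the left margin to fit them, which is the behavior that fixed the cut-off labels.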
Last Wednesday I made a trip to Di Fara in Midwood, Brooklyn. Since that place is well covered and lauded I won’t talk about the pizza, as amazing as it is.
I gave Dom a copy of my thesis (pdf) on NYC pizza and he loved that his place was one of the few pizzerias mentioned by name (along with Lombardi’s and Otto Enoteca, two of my favorites) in the paper. My friend captured these great photos and I’m extremely thankful to Dom for letting me in his kitchen.
And to make the trip all the more surreal, Avenue J was lined with lulav and etrog vendors trying to clear out stock before Sukkot started. The juxtaposition of Di Fara and the surrounding Orthodox neighborhood was striking and really shows the beauty of New York City.
After years of waiting for a new Artichoke Basille’s, our dreams have been answered. The new spot, which hasn’t even been added to the website yet, is at 17th and 10th and opened this weekend. Unlike the original location, this one has seats and wait service and only sells pies, albeit smaller than the originals. There is a side shop where they sell slices, but I didn’t venture in there.
The pie, seen below in the blurry iPhone shot, is just a smaller version of the pies at the original shop and was just as tasty. One pie was too much food for a friend and me, so figure one pie ($17) for 2.5 to 3 people.
Additionally, they have a larger selection of pies and non-pizza food, such as salads and the like. They don’t have beer or liquor yet, but should soon.
The first thing to note is that there are only 16 data points, so multiple regression is not an option. We can all thank the Curse of Dimensionality for that. So I stuck to simpler methods and visualizations. If I can get the raw data from Slice, I can get a little more advanced.
For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult. As usual, most of the graphics were made using ggplot2 by Hadley Wickham. The coefficient plots were made using a little function I wrote. Here is the code. Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left-hand margin of the plot. And as always, all the work was done in R.
The most obvious relationship we want to test is Overall Quality vs. Price. As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.
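For anyone wanting to draw this kind of plot themselves, a minimal sketch of a scatterplot with a loess curve in ggplot2 follows. It uses the built-in mtcars data as a stand-in, not the tomato data, so the variables and the shape of the curve are unrelated to the taste test.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R and is NOT the tomato data
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # loess fits a local regression, so the curve can bend where a straight line cannot
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

print(p)

# The same smoother is available in base R for inspecting the fitted values
lo <- loess(mpg ~ wt, data = mtcars)
head(fitted(lo))
```

If the loess curve bends noticeably rather than tracking a straight line, that is a visual hint that a simple linear fit may not describe the relationship well.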