The Father of Gerrymandering

The Wall Street Journal is reporting that, despite all the concern around gerrymandering, the upcoming redistricting probably won’t have much effect on upcoming elections.  Gary King is mentioned as having written a paper “that helped demonstrate the relative impotence of partisan redistricting” yet “he favors the efforts to create a statistical method that would replace it.”  I personally am always for using math and hard numbers to solve a problem whenever possible.

The article also mentioned a “conference last year in Washington, D.C., researchers proposed alternatives.”  David Epstein presented a paper at that conference that Andy Gelman and I worked on.

While the article quoted one of Dr. Gelman’s papers, it unfortunately did not mention him, or any of us, by name.  However, the accompanying blog post did mention both Drs. Gelman and Epstein, with specific quotes from them and their work.


Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.

Today is World Statistics Day as declared by the United Nations.  There are events all over the world including a mourning for the Canadian census.  The official US event (pdf) is in Washington, DC but a bunch of New Yorkers are celebrating at the bit.ly hack.a.bit.

Drew Conway has some ideas how to celebrate.

Ban Ki-Moon’s (UN Secretary General) message (pdf) on World Statistics Day:

On this first World Statistics Day I encourage the international community to work with the United Nations to enable all countries to meet their statistical needs.
 


Last night, Harlan Harris and I gave a talk at the NY Predictive Analytics meetup.  Despite the rain there was a good turnout, and people seemed to both enjoy and benefit from the presentation.

As requested I have posted the presentation for all to see.  Please feel free to contact me with any questions.  The data and R code are also posted and we will post at least the presentation on the Meetup page.  Everything is also available in one convenient package at GitHub.

Update:  Harlan wrote up a great summary of the night.


Tonight I will be giving a talk with Harlan Harris at the Predictive Analytics and Machine Learning Meetup in New York.  It is going to be an introduction to Multilevel Models with examples in R and from previous projects I have worked on.

Here are the details for the talk.


A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right.  I’ve seen people on Twitter asking how to build this, and there has been an option available using Andy Gelman’s coefplot() in the arm package.  Not knowing this, I built my own (as seen in this post about taste testing tomatoes), and both suffered from the same problems: long coefficient names often got cut off by the left margin of the graph, and the name of the variable was prepended to all the levels of a factor.  One big difference between his and mine is that his does not include the Intercept by default.  Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions.  Now the levels of factors are displayed alone, without the factor name prepended.  As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham, which deals with the margins better than I do.
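As a rough illustration of the regular-expression step, here is a minimal sketch.  It is written in Python rather than R, it is not the code from plotCoef(), and the function name strip_factor_prefix and the example coefficient names are hypothetical.  It assumes the factor names are known up front and that each factor-level coefficient is named by gluing the factor name onto the front of the level, the way R's lm() does.

```python
import re


def strip_factor_prefix(coef_names, factor_names):
    """Drop a leading factor name from each coefficient name, if present.

    Hypothetical helper for illustration only; not part of any package.
    """
    # Try longer factor names first so a short name cannot shadow a longer one.
    ordered = sorted(factor_names, key=len, reverse=True)
    # re.escape guards against names containing regex metacharacters such as "."
    alternatives = "|".join(re.escape(f) for f in ordered)
    # The lookahead (?=.) only strips the prefix when something is left over,
    # so a coefficient named exactly like a factor is not reduced to "".
    pattern = re.compile("^(?:" + alternatives + ")(?=.)")
    return [pattern.sub("", name) for name in coef_names]


# Made-up coefficient names in the style R produces for factor levels.
names = ["(Intercept)", "price", "cuisineItalian", "cuisineThai", "boroughQueens"]
print(strip_factor_prefix(names, ["cuisine", "borough"]))
```

With these made-up names, the intercept and the numeric variable are left alone and only the factor levels lose their prefix.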

Both of these changes made for a vast improvement over what I had available before.  Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file, is called plotCoef() and is very customizable, down to the color and line thickness.  I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be.  I sent the code to Dr. Gelman to hopefully be incorporated into his function, which I’m sure gets used by a lot more people than mine will.  Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed.  Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander.


Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste.  While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.

The first thing to note is that there are only 16 data points, so multiple regression is not an option.  We can all thank the Curse of Dimensionality for that.  So I stuck to simpler methods and visualizations.  If I can get the raw data from Slice, I can get a little more advanced.

For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult.  As usual, most of the graphics were made using ggplot2 by Hadley Wickham.  The coefficient plots were made using a little function I wrote.  Here is the code.  Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left-hand margin of the plot.  And as always, all the work was done in R.

The most obvious relationship we want to test is Overall Quality vs. Price.  As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.

More after the break.


Less than a month ago, Drew Conway suggested that our R user group present an analysis of the WikiLeaks data.  In that short time he, Mike Dewar, John Myles White and Harlan Harris have put together a beautiful visualization of attacks in Afghanistan.  The static image you see here has since been animated, which is a really nice touch.

Within a few hours of them posting their initial results the work spread across the internet, even getting written up in Wired’s Danger Room.  Today, they got picked up by the New York Times where you can see the animation.

The bulk of the work was, of course, done in R.  I remember talking with them about how they were going to scrape the data from the WikiLeaks documents, but I am not certain how they did it in the end.  As is natural for these guys, they made their code available on GitHub so you can recreate their results after you’ve downloaded the data yourself from WikiLeaks.

Briefly looking at their code I can see they used Hadley Wickham’s ggplot and plyr packages (which are almost standard for most R users) as well as R’s mapping packages.  If you want to learn more about how they did this fantastic job come to the next R Meetup where they will present their findings.


A post on Slashdot caught my attention.  It was about a microchip from Lyric Semiconductor that does calculations using analog probabilities instead of digital bits of 1’s and 0’s.

The article says that this will both make flash storage more efficient and make statistical calculations quicker.  I doubt it will help with fitting simple regressions where we have a fixed formula, but the first thing that came to mind was Bayesian problems, especially Markov chain Monte Carlo (MCMC).  Using BUGS to run these simulations can be VERY time consuming, so a faster approach would make the lives of many statisticians much easier.  The article did mention that the chip uses Bayesian NAND gates as opposed to digital NAND gates, but I don’t know how that relates to MCMC.

I reached out to my favorite Bayesian, Andy Gelman, to see what he thinks.  I’ll report back on what he says.


Eye Heart New York has a post with a graph showing the distribution of health code violations and the letter grades they received.  Kaiser at Junk Charts takes the original data and makes a few graphs of his own.  Based on those visualizations it seems that there is not much difference by borough or by cuisine.

This is similar to a system in LA and Singapore, though something tells me an ‘A’ in NY is still only a ‘B’ in Singapore.  The picture below is from an ‘A’ restaurant in Singapore which was so clean that I had no problem eating off a banana leaf.

New Yorkers, known for being tough, might not be deterred even by ‘C’ grades.  Commenters on Serious Eats seemed to relish eating in a ‘C’ joint as it lends greasy, authentic goodness to a place.


Turns out the Census cost less than expected.  I’ve always admired the Census Bureau for their good work, but now in this time of runaway government spending they came in 11% (NY Times) under budget.  That’s truly good work.

According to the Times, the massive advertising campaign helped get people to mail in their forms.  The lack of natural disasters and epidemics helped too.  Now we can look forward to the deluge of data that I imagine social scientists will go to town on.
