Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste.  While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.

The first thing to note is that there are only 16 data points, so multiple regression is not an option.  We can all thank the Curse of Dimensionality for that.  So I stuck to simpler methods and visualizations.  If I can get the raw data from Slice, I can get a little more advanced.

For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult.  As usual, most of the graphics were made using ggplot2 by Hadley Wickham.  The coefficient plots were made using a little function I wrote.  Here is the code.  Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left-hand margin of the plot.  And as always, all the work was done in R.

The most obvious relationship we want to test is Overall Quality vs. Price.  As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.
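The curve in the post was fit in R with ggplot2, but to give a rough idea of what a loess-style smoother actually does, here is a minimal sketch in Python.  This is my own illustration of the general technique (local weighted linear regression with tricube weights, no robustness iterations), not the code used for the post, and the toy data in the test is made up.

```python
def lowess(x, y, frac=0.6):
    """Return locally weighted fitted values for y at each x.

    A minimal loess-style smoother: at each point, fit a weighted
    linear regression to nearby points using tricube weights.
    """
    n = len(x)
    k = max(2, int(frac * n))  # number of neighbours in each local fit
    fitted = []
    for xi in x:
        # bandwidth = distance to the k-th nearest neighbour
        h = sorted(abs(xj - xi) for xj in x)[k - 1] or 1.0
        # tricube weights: close points count most, far points drop to 0
        w = [max(0.0, 1 - (abs(xj - xi) / h) ** 3) ** 3 for xj in x]
        # weighted least squares for the local line a + b*x
        sw = sum(w)
        swx = sum(wi * xj for wi, xj in zip(w, x))
        swy = sum(wi * yj for wi, yj in zip(w, y))
        swxx = sum(wi * xj * xj for wi, xj in zip(w, x))
        swxy = sum(wi * xj * yj for wi, xj, yj in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:  # degenerate local fit: fall back to weighted mean
            fitted.append(swy / sw)
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            fitted.append(a + b * xi)
    return fitted
```

In R this is essentially what `geom_smooth(method = "loess")` does for you; when the fitted curve bends away from a straight line, as it did for quality versus price, that is what suggests the relationship is not linear.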



Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming Meetup and the New York and Washington DC R Conferences; and author of R for Everyone.

Less than a month ago, Drew Conway suggested that our R user group present an analysis of the WikiLeaks data.  In that short time he, Mike Dewar, John Myles White and Harlan Harris have put together a beautiful visualization of attacks in Afghanistan.  The static image you see here has since been animated which is a really nice touch.

Within a few hours of their posting the initial results, the work spread across the internet, even getting written up in Wired’s Danger Room.  Today, they got picked up by the New York Times, where you can see the animation.

The bulk of the work was, of course, done in R.  I remember talking with them about how they were going to scrape the data from the WikiLeaks documents, but I am not certain how they did it in the end.  As is natural for these guys, they made their code available on GitHub so you can recreate their results, after you’ve downloaded the data yourself from WikiLeaks.

Briefly looking at their code I can see they used Hadley Wickham’s ggplot and plyr packages (which are almost standard for most R users) as well as R’s mapping packages.  If you want to learn more about how they did this fantastic job come to the next R Meetup where they will present their findings.


A post on Slashdot caught my attention.  It was about a microchip from Lyric Semiconductor that does calculations using analog probabilities instead of digital bits of 1’s and 0’s.

The article says that this will both make flash storage more efficient and make statistical calculations quicker.  I doubt it will help with fitting simple regressions, where we have a closed-form solution, but the first thing that came to mind was Bayesian problems, especially Markov chain Monte Carlo (MCMC) simulations.  Using BUGS to run these simulations can be VERY time consuming, so a faster approach would make the lives of many statisticians much easier.  The article did mention that the chip uses Bayesian NAND gates as opposed to digital NAND gates, but I don’t know how that relates to MCMC.
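To show the kind of computation such a chip might speed up, here is a toy Metropolis sampler in Python.  This is entirely my own illustration, not anything described in the article: each step requires density evaluations and a random accept/reject decision, and the steps are inherently sequential, which is a big part of why long MCMC runs in BUGS are so slow.

```python
import math
import random

def metropolis(log_density, n_samples, start=0.0, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D target density."""
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)  # random-walk proposal
        diff = log_density(proposal) - log_density(current)
        # accept with probability min(1, p(proposal) / p(current))
        if diff >= 0 or rng.random() < math.exp(diff):
            current = proposal
        samples.append(current)
    return samples

# sample from a standard normal, whose log-density (up to a constant)
# is -x^2 / 2
samples = metropolis(lambda x: -0.5 * x * x, n_samples=20000)
```

Every iteration depends on the one before it, so a hardware speedup per step translates directly into shorter runs.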

I reached out to my favorite Bayesian, Andy Gelman, to see what he thinks.  I’ll report back on what he says.


I’m a few days behind on my posts, so please excuse my tardiness and the slew of posts that should be forthcoming.

A-Rod finally reached 600 home runs a couple of weeks ago.  While that may have relieved the pressure on him, now people are looking toward Jeter’s 3,000th hit.  The Wall Street Journal ran a piece predicting that Jeter should hit the 3,000 mark around June 6th next year.

They looked at his historical numbers, took into account the 27 other players to reach that milestone, and determined that Jeter should get a hit every 3.66 at-bats next season.  I’m not sure what method they used to calculate 3.66, but I would guess some sort of simple average.  Then, based on how many hits he needs (128 at the time of the article), his average number of at-bats per game, the average number of games he plays a season and the Yankees’ typical schedule, they arrived at the June 6th date.
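A back-of-the-envelope version of that arithmetic is easy to sketch.  The hits needed (128) and the rate of one hit per 3.66 at-bats come from the article; the 4 at-bats per game is my own rough assumption, so this reproduces the method rather than their exact June 6th date (which also credits Jeter with hits over the rest of the 2010 season).

```python
import math

def games_to_milestone(hits_needed, ab_per_hit, ab_per_game):
    """Estimate how many more games until the milestone hit."""
    at_bats_needed = hits_needed * ab_per_hit  # expected at-bats required
    return math.ceil(at_bats_needed / ab_per_game)

# 128 hits needed, one hit per 3.66 at-bats, assuming ~4 at-bats a game
games = games_to_milestone(hits_needed=128, ab_per_hit=3.66, ab_per_game=4)
```

Mapping that game count onto the team’s actual schedule is the step that turns it into a calendar date.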

I don’t really have much to add other than that this seems like a solid method.  What do the sabermetricians think?  By the way, that looks like an awesome cast.


Eye Heart New York has a post with a graph showing the distribution of health code violations and the letter grades they received.  Kaiser at Junk Charts takes the original data and makes a few graphs of his own.  Based on those visualizations it seems that there is not much difference by borough or by cuisine.

This is similar to a system in LA and Singapore, though something tells me an ‘A’ in NY is still only a ‘B’ in Singapore.  The picture below is from an ‘A’ restaurant in Singapore which was so clean that I had no problem eating off a banana leaf.

New Yorkers, known for being tough, might not be deterred even by ‘C’ grades.  Commenters on Serious Eats seemed to relish eating in a ‘C’ joint as it lends greasy, authentic goodness to a place.


Turns out the Census cost less than expected.  I’ve always admired the Census Bureau for their good work, and now, in this time of runaway government spending, they came in 11% under budget (NY Times).  That’s truly good work.

According to the Times, the massive advertising campaign helped get people to mail in their forms.  The lack of natural disasters and epidemics helped too.  Now we can look forward to the deluge of data, which I imagine social scientists will go to town on.


Both the Journal and the Times reported on a study about New York City traffic which someone has called the “most statistically ambitious ever undertaken by a U.S. city.”  That just sounds awesome to me, both as a statistician and as a pedestrian.  According to the report, New York is one of the safest cities in America to travel in but trails a number of major European and Asian cities.

One takeaway from the report is that, contrary to common belief, taxis are responsible for very few accidents.  This was always my feeling, since cabbies are the experts of New York City streets and are under heavy scrutiny from the police and the TLC.  They have more incentive to be alert and cautious than private drivers.

It also found that Manhattan is more dangerous than the other boroughs.  I hope that doesn’t encourage congestion pricing though.  That’s an idea I still can’t get behind.

The Bloomberg administration is likely to use the report to further its (popular) street reforms.  As a biker, I like the dedicated bike lanes that use a column of parked cars, and sometimes a concrete median, to separate cyclists from moving traffic.  As a pedestrian, I like the countdown crossing signals already in place near Union Square and Greenwich Avenue.  Hopefully Union Square will also be getting its own pedestrian plaza.


Thanks to some early data from Pizza Girl, of Slice fame, I have some very preliminary findings.

There are a few different ways to tip: check (only one person did this), credit card at the door, pre-tipping with a credit card, and cash.  As seen in these boxplots, cash tippers tipped the most, on average.  Pre-tippers, who really are tipping based on feeling rather than performance, had the greatest variability.  There was even someone who pre-tipped only a dollar.  Pre-tipping a large amount might be a good idea, kind of like greasing a palm at a restaurant to get a table, but I don’t see how a small pre-tip helps.
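Numerically, the boxplot comparison boils down to grouping tips by payment method and comparing each group’s center and spread.  Here is a quick sketch in Python; the amounts below are made-up toy values for illustration only, NOT Pizza Girl’s data.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(tips):
    """tips: list of (method, amount) pairs -> {method: (mean, sd)}."""
    by_method = defaultdict(list)
    for method, amount in tips:
        by_method[method].append(amount)
    # standard deviation needs at least two observations
    return {m: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for m, v in by_method.items()}

# toy illustration: cash tips higher on average, pre-tips more variable
toy_tips = [("cash", 4.0), ("cash", 5.0), ("cash", 4.5),
            ("card at door", 3.0), ("card at door", 3.5),
            ("pre-tip", 1.0), ("pre-tip", 5.0), ("pre-tip", 3.0)]
summary = summarize(toy_tips)
```

With the real raw data, the same grouping is what feeds the boxplots: the boxes show each group’s center and the whiskers its spread.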

I wonder why people give bigger tips with cash than with credit cards.  I would have thought it would be the other way around.

This is just the beginning.  Pizza Girl is providing more data as the weeks go on.  And as I get more data the analysis will become more sophisticated, so stay tuned as we unravel the world of pizza delivery.  In the meantime, check out Pizza Girl’s third installment of her findings on Slice.


Temple professor John Allen Paulos has an article in the New York Times that got Slashdotted today, suggesting people be wary of all the metrics that fill our daily lives.

His first contention concerns whether assumptions about categorization are correct.  This is certainly important, but hopefully qualified statisticians, social scientists, doctors, etc., are making these decisions and properly counting the results.

Next he discusses whether the numbers you are looking at have been aggregated properly and were arrived at using sensible choices of criteria, protocols and weights.  He gives articles such as “The 10 Friendliest Colleges” and “The 20 Most Lovable Neighborhoods” as examples.  Having done a lot of work where variable selection and shrinkage are important, I can say that I, for one, let the data speak for themselves and use various statistical methods to arrive at the correct decision.

Dr. Paulos makes more points, but I’ll let you read the article for yourself.  The important takeaway, at least to me, is that when looking at reported statistics and measurements, try to figure out what methods were used.  That’s why I am always disappointed when articles do not report their methods.  I realize that understanding the techniques might be beyond the average person, but that’s when you ask your statistician friend.


Today, Google announced two new services that are sure to be loved by data geeks.  First is BigQuery, which lets you analyze “Terabytes of data, trillions of records.”  This is great for people with large datasets.  I wonder if a program like R (my favorite statistical analysis package) can read it.  If so, would R just pull down the data like it would from any other database?  That would most likely result in a data.frame far too large for a standard computer to handle.  Maybe R could be run in a way that it queries the BigQuery service and leaves the data there.  Maybe even the processing could be done on Google’s end, allowing for much better computation time.  This is something I’ve been dreaming of for a while now.

Further, can BigQuery produce graphics?  If so, this might be a real shot at Business Intelligence tools like QlikView or Cognos that specialize in handling LARGE datasets.
