Temple professor John Allen Paulos has an article in the New York Times that got Slashdotted today suggesting people be wary of all the metrics that fill our daily lives.

His first point concerns whether assumptions about categorization are correct.  This is certainly important, but hopefully qualified statisticians, social scientists, doctors, etc. are making these decisions and properly counting the results.

Next he discusses whether the numbers you are looking at have been aggregated properly and were arrived at using proper choices of criteria, protocols and weights.  He gives articles such as “The 10 Friendliest Colleges” and “The 20 Most Lovable Neighborhoods” as examples.  Having done a lot of work where variable selection and shrinkage are important, I can say that I, for one, allow the data to speak for itself and use various statistical methods to arrive at the correct decision.

Dr. Paulos makes more points, but I’ll let you read the article for yourself.  The important takeaway, at least to me, is that when looking at reported statistics and measurements, you should try to figure out what methods were used.  That’s why I am always disappointed when articles do not report their methods.  I realize that understanding the techniques might be beyond the average person, but that’s when you ask your statistician friend.

Today, Google announced two new services that are sure to be loved by data geeks.  First is BigQuery, which lets you analyze “Terabytes of data, trillions of records.”  This is great for people with large datasets.  I wonder if a program like R (my favorite statistical analysis package) can read it?  If so, would R just pull down the data like it would from any other database?  That would most likely result in a data.frame that is far too large for a standard computer to handle.  Maybe R can be run in a way that it hits the BigQuery service and leaves the data there.  Maybe even the processing can be done on Google’s end, allowing for much better computation time.  This is something I’ve been dreaming of for a while now.
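To make the distinction between pulling the data down and leaving it on the server concrete, here is a minimal sketch in Python.  It uses an in-memory SQLite database purely as a stand-in for a remote service; the `sales` table and its values are made up for illustration, and nothing here reflects how BigQuery itself is accessed.

```python
import sqlite3

# In-memory SQLite database standing in for a remote data service.
# The table name and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Approach 1: pull every row to the client, then aggregate locally.
# Memory use grows with the size of the table.
pulled = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
counts = {}
for region, amount in pulled:
    totals[region] = totals.get(region, 0.0) + amount
    counts[region] = counts.get(region, 0) + 1
local_means = {region: totals[region] / counts[region] for region in totals}

# Approach 2: let the database aggregate and return only the summary.
# Only one row per region crosses the wire.
query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
remote_means = dict(conn.execute(query).fetchall())

conn.close()
print(local_means)
print(remote_means)
```

Both approaches give the same per-region means, but the second only ever brings the summary back to the client, which is the behavior I would want against a very large dataset.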

Further, can BigQuery produce graphics?  If so, this might be a real shot at Business Intelligence tools like QlikView or Cognos that specialize in handling LARGE datasets.

The other day, I was working near Houston Street, teaching a class on QlikView (which itself could be a great post topic about data munging for statisticians).  On the last day of the class we decided to head to Bleecker Street for a pizza feast.

We got two pies from Keste (Pizza del Pappa and the Margherita), a large pie from John’s of Bleecker (half plain, half sausage and pepperoni) and a large cheese pie from Joe’s.

Steven Strogatz is writing a column for the New York Times where he discusses math, starting with basic concepts and working his way up to the complex and cerebral.

I, and a lot of people, love his column.  However, last week’s piece on probability was not received so well by the statistics community, particularly on Andy Gelman’s blog and Junk Charts’ sister blog, Numbers Rule Your World.

Pizza Girl, a pizza delivery girl who is a regular contributor on Slice, tallied up and analyzed the time she spends on various duties in her pizzeria.  This is just the first part in a series, but so far she determined that she spends 67% of her shift driving.

According to her pay schedule, she makes less money while driving ($4.95/hr) than she does while in the pizzeria ($7.50).
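As a rough back-of-the-envelope check, those two rates and the 67% figure imply a blended hourly wage.  The sketch below assumes the remaining 33% of the shift is paid at the in-store rate and ignores tips entirely; both are my assumptions, not part of her analysis, and the function name is made up for illustration.

```python
def blended_hourly_wage(driving_share, driving_rate, store_rate):
    """Time-weighted average of two hourly rates.

    driving_share is the fraction of the shift spent driving (0 to 1);
    the rest of the shift is assumed to be paid at store_rate.
    """
    if not 0.0 <= driving_share <= 1.0:
        raise ValueError("driving_share must be between 0 and 1")
    return driving_share * driving_rate + (1.0 - driving_share) * store_rate


# Figures from the post: 67% of the shift driving at $4.95/hr,
# the remainder (assumed) in the pizzeria at $7.50/hr.  Tips are excluded.
wage = blended_hourly_wage(0.67, 4.95, 7.50)
print(f"Blended wage before tips: ${wage:.2f}/hr")
```

Under those assumptions the blended rate works out to about $5.79 an hour before tips.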

The New York Times, in what seems like a continuing series on NYC transportation, has an article about a decline in subway ridership.  The article points out declines that were to be expected, such as in the Financial District or Midtown, as well as expected increases like those along the J, which shares a route with the M and Z, which are facing service cuts.  It will be interesting to see how these findings impact the expected service cuts.

Another area with expected results was a massive drop-off at the moribund Mets’ stop and a below-average drop at the World Champion Yankees’ stop.  However, the Mets, unlike the Yankees, have a convenient commuter rail stop.  Perhaps that explains the drop more than the team’s performance.

Slice recently reported that Fark user “Certainly You Jest” tabulated a list of the 25 most mentioned pizzerias.  Naturally, I decided to play with the numbers.  Rather than write up another formal paper, I did some quick ad hoc analysis for posting on this blog and I will skip some of the more technical aspects.

First, I augmented the data with the price of a typical plain pie that could feed two to four people and the pizzeria’s distance from New York City.  Adding the distance meant I had to remove the multi-state chains, like Monical’s, from the data.

While the number of times a pizzeria is mentioned is count data, it doesn’t quite fit a Poisson distribution, and the Poisson regression didn’t seem to be a good fit.  This makes sense since I have three predictors (distance from New York, price and their interaction).  You can see this in the two histograms below.
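One quick, informal way to see whether counts look Poisson is to compare the sample variance to the sample mean, since a Poisson distribution has the two equal.  The sketch below is not the analysis I ran; the `mentions` values are made up purely for illustration and are not the Fark data.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    A ratio near 1 is consistent with a Poisson distribution;
    a ratio well above 1 suggests overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean of counts is zero; ratio is undefined")
    return variance(counts) / m


# Hypothetical mention counts, NOT the real data from the post.
mentions = [2, 3, 3, 4, 5, 6, 8, 12, 19, 41]
ratio = dispersion_ratio(mentions)
print(f"variance / mean = {ratio:.2f}")
```

A ratio far above 1, as with these skewed made-up counts, is the kind of pattern that would point toward something other than a plain Poisson model, such as a negative binomial.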


This Thursday, April 8th, I’ll be giving two brief talks (5 to 10 minutes) about statistical methods at the New York R User Meetup.  The first will be applying multilevel models to World Health Organization data to study noncommunicable diseases.  The second, and probably more fun, will be a presentation of my pizza paper (pdf) that was featured on Slice.

I just filled out my Census form and I have to say it was fairly painless and simple.  The short form (pdf) really only asks about age, ethnicity and other residences.  If anyone has a long form (now called the American Community Survey), please let me know your experiences filling that out.

The question concerning residence can be a bit tricky these days with so many people having multiple residences, children who live on their own but visit home frequently, and couples who live together but also maintain separate residences.
