Statistics « Jared Lander

A new study, reported in the New York Times, tracked population movements in post-earthquake Haiti using cell phone data. The article grabbed my attention because one of the authors, Richard Garfield (with whom I have done numerous projects and who has his own Wikipedia entry!), had told me about this very study just a few months ago.

Over dinner in New York’s Little India he explained how the largest cell phone company in Haiti provided him with anonymized cell tower records. As many people are aware, cell phones (even those without GPS) report their locations back to cell towers at regular intervals. By tracking the daily position of the phones before and after the earthquake they were able to determine that 20% of Port-au-Prince’s population had left the capital within 19 days of the disaster.

They used plenty of solid math in the analysis and amazingly did it all without resorting to spatial statistics. They have some nice map-based visualizations but I’ve been meaning to get the data from Dr. Garfield so I can attempt something similar to the amazing work done by the NYC Data Mafia on the WikiLeaks Afghanistan data. Though I don’t promise anything nearly as good.

It is also worth noting that they did this at a fraction of the cost and time of an extensive UN survey. That survey only had about 2,500 respondents whereas the cell phone project incorporated around 1.9 million people without them spending valuable time with an interviewer.

Jared Lander is the Chief Data Scientist of Lander Analytics, a New York data science firm; Adjunct Professor at Columbia University; Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences; and author of R for Everyone.

While playing Words with Friends my randomly chosen opponent played “radiale” as her first word. Since that used up all of her tiles, she received a bonus on top of all the points the word itself got, resulting in a one-move score of 53 points! Rather than being impressed I was upset at the large deficit I would have to overcome.

To combat this I did what comes naturally: Write an R script to find the perfect word!

Needing to combine my seven letters with one of her letters, there were two routes I could take. The first would be, for each combination of my seven letters and one of hers, to find all 40,320 (8!) permutations and then hit dictionary.com to see whether each is a real word, for a total of 282,240 (8!*7) HTTP calls. That seemed a bit excessive and impractical so I moved on to the next idea.
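As a quick check on the size of that brute-force route, the counts can be reproduced in a few lines. This is a sketch in Python rather than the R used for the actual script, and the variable names are mine.

```python
from math import factorial

my_tiles = 7        # letters on my rack
board_letters = 7   # letters in her word, "radiale"

# Each candidate is my seven tiles plus one of her letters: eight letters,
# so 8! orderings per choice of board letter.
perms_per_choice = factorial(my_tiles + 1)

# One dictionary lookup per ordering, repeated for each of her seven letters.
total_lookups = perms_per_choice * board_letters

print(perms_per_choice, total_lookups)
```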

So, first thing I did was pull a list of common eight-letter words. Then for each combination of my letters and one of hers (only 7 iterations) I checked if those letters (in any order) matched the letters in any of the possible words. Once a match was found there was a check for the counts of the letters and if that passed then the word was recorded as a true match.
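The matching step described above can be sketched roughly as follows. This is Python rather than the original R script, which is not shown here, and the rack and the three-word list in the usage lines are made up for illustration because the post does not list the actual tiles or the word list used.

```python
from collections import Counter

def find_bingos(rack, board_word, word_list):
    """Return (word, board_letter) pairs that use the whole rack plus one board letter."""
    matches = []
    # Duplicate board letters (the two a's in "radiale") give identical
    # candidates, so each distinct letter is tried once.
    for board_letter in sorted(set(board_word)):
        letters = rack + board_letter
        letter_set = set(letters)
        letter_counts = Counter(letters)
        for word in word_list:
            if len(word) != len(letters):
                continue
            # First pass: the same distinct letters, in any order.
            if set(word) != letter_set:
                continue
            # Second pass: the same count of each letter.
            if Counter(word) == letter_counts:
                matches.append((word, board_letter))
    return matches

# Hypothetical rack and word list, chosen only so the example has a match.
demo = find_bingos("aacdeeh", "radiale", ["headrace", "radiated", "charades"])
print(demo)
```

Comparing letter counts avoids generating any permutations at all: each dictionary word is checked once per board letter, which is why this route is so much cheaper than the first.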

The algorithm took about 17 seconds to run and found me one possible word for my letters combined with one of hers: “headrace”, for 63 points! Perhaps I should have been able to figure that out on my own, but where would be the fun in that. Find the code after the break.

Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.
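The union of the old and new databases could look something like the sketch below. It uses Python's built-in sqlite3 module, and the `message` table with `address`, `date` and `text` columns is a simplified, hypothetical schema rather than the real layout of the iPhone backup; the rows are placeholders.

```python
import sqlite3

def merge_messages(conn):
    """Return de-duplicated rows from the old backup and the current database."""
    query = """
        SELECT address, date, text FROM old_db.message
        UNION
        SELECT address, date, text FROM main.message
        ORDER BY date
    """
    # UNION (unlike UNION ALL) drops rows that appear in both databases.
    return conn.execute(query).fetchall()

def build_demo():
    """Two in-memory databases standing in for the old backup and the new one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS old_db")
    for schema in ("main", "old_db"):
        conn.execute(
            f"CREATE TABLE {schema}.message (address TEXT, date INTEGER, text TEXT)"
        )
    # Placeholder rows, not real messages.
    conn.executemany(
        "INSERT INTO old_db.message VALUES (?, ?, ?)",
        [("+15550100", 1, "old only"), ("+15550101", 2, "in both")],
    )
    conn.executemany(
        "INSERT INTO main.message VALUES (?, ?, ?)",
        [("+15550101", 2, "in both"), ("+15550102", 3, "new only")],
    )
    return conn

rows = merge_messages(build_demo())
print(rows)
```

With real files, the old backup would be attached by its path instead of `:memory:`.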

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.

Texts by Month and Gender — Fig. 2: This graph shows my text messaging pattern over time for both men and women. Notice the crossover around August 2010.

More interesting is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis exclude her and family members because they would bias the gender effect. Being a good statistician I ran a Poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.

Coefficient Plot for Gender and Epoch — Fig. 3: Here the “Male” coefficient seems statistically insignificant but its direction makes sense so it is left in the model. The “Intercept” is interpreted as the texting rate with females before the breakup, “Epoch” is the increase with females after the breakup, “Intercept” plus “Male” is the rate with males before the breakup. “Epoch” combined with “Male:Epoch” is the change in rate for texts with males after the breakup.
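The caption's reading of the coefficients can be written out as a worked equation. The symbols below are my own notation, not the post's: $\beta_0$ is the “Intercept”, $\beta_M$ is “Male”, $\beta_E$ is “Epoch” and $\beta_{ME}$ is “Male:Epoch”, with Male and Epoch as 0/1 indicators. This is the standard interpretation of a Poisson regression with a log link, not a reproduction of the fitted model.

```latex
\log \lambda = \beta_0 + \beta_M\,\mathrm{Male} + \beta_E\,\mathrm{Epoch}
             + \beta_{ME}\,\mathrm{Male}\cdot\mathrm{Epoch}

\begin{aligned}
\lambda_{\text{female, before}} &= e^{\beta_0}, &
\lambda_{\text{female, after}}  &= e^{\beta_0 + \beta_E},\\
\lambda_{\text{male, before}}   &= e^{\beta_0 + \beta_M}, &
\lambda_{\text{male, after}}    &= e^{\beta_0 + \beta_M + \beta_E + \beta_{ME}}.
\end{aligned}

\frac{\lambda_{\text{female, after}}}{\lambda_{\text{female, before}}} = e^{\beta_E},
\qquad
\frac{\lambda_{\text{male, after}}}{\lambda_{\text{male, before}}} = e^{\beta_E + \beta_{ME}},
\qquad
\text{percent change} = 100\,(\text{ratio} - 1).
```

Under this reading, a 127% increase for females corresponds to $e^{\beta_E} \approx 2.27$.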

Further analysis and a how-to after the break.

For the past few weeks Time Out New York’s Dating columnist, Jamie Bufalino, has been fielding letters discussing the ratio of homosexual to heterosexual questions he answers. The readers suggested that disproportionate attention is paid to Gay and Lesbian issues compared to the Gay and Lesbian proportion of the general population.

Jamie rudely called his readers “ass-wipes” and repeatedly told them to “remove your head from your ass.” He also professed to have “no idea what the percentage is of gay/bi versus straight issues that end up in the column.”

One question and response:

Q I see statistics that show NYC to be 6 percent gay, lesbian and bi, and yet in “Get Naked” you feature letters from them almost to the exclusion of heteros. Why the preoccupation with them in your column? It doesn’t seem right or logical. As one of the other 94 percent, I am disappointed and offended weekly.

A All I can say is: You’ve got your head up your butt. Just in the past month or so, I’ve answered letters from a straight guy with a weird fetish that suddenly stopped delivering the jollies it used to, a straight guy who was juggling a woman from the Ukraine and a woman from Jersey, a woman who had an issue with sticking her finger up her boyfriend’s butt, a 19-year-old woman who was getting pressured to have sex with her boyfriend, and on and on. If, for some reason, you happen to be obsessing over the gay and bi questions and not acknowledging the straight ones, that’s your issue, not mine.

And another:

Q I always read your column to see if I can learn something and just for shits and giggles. The one thing that has always bothered me is your preoccupation with gay and bi problems. Gays and lesbians get their own special section of three to four pages!

A First of all, dude, you sound like one of those total ass-wipes who believes that gay people somehow have all these special privileges that straight people aren’t entitled to. Honestly, I have no idea what the percentage is of gay/bi versus straight issues that end up in the column, because it doesn’t matter. If you removed your head from your ass, you’d realize that so many sexual issues are universal and that you can learn something from all sorts of people who don’t fit into your specific demographic.

When confronted with the data he once again referred to a “head lodged up [a] rectum” and suggested the reader was “paranoid.”

Q As a statistician I am disappointed by your response to a question in the November 4 issue [TONY 788]. The reader wrote, “I see statistics that show NYC to be 6 percent gay, lesbian and bi, and yet in ‘Get Naked’ you feature letters from them almost to the exclusion of heteros. Why the preoccupation with them in your column? … As one of the other 94 percent, I am disappointed and offended weekly.” You responded by citing individual examples of heterosexual questions you’ve fielded, which is not a valid form of proof. I went through about ten months’ worth of “Get Naked” columns on the TONY website and found that approximately 19 percent of the questions were from gay (15 percent) or lesbian (4 percent) readers. Whether or not that percentage is representative of the general population is not my concern. I just feel that Jamie should have his data correct and not write, “You’ve got your head up your butt.”

A I seriously cannot believe I am still getting letters about this. Okay, Mr. Disappointed Statistician: If you don’t want to come off as someone who has his head lodged up his rectum, it would be an awesome idea not to leap to the defense of some jackass who claims I cater to homo letters “almost to the exclusion of heteros” and then point out that straight issues actually make up a full 81 percent of the subject matter here in “Get Naked.” What I want to know is, why are you even keeping score? Are you really that insecure about the amount of attention heterosexual sex gets in the media? If so, that’s both laughable and sad. This is the last time I’m addressing this, so here’s my final bit of advice to you (and your like-minded brethren): Stop being so paranoid.

Since Jamie is so rude to his readers and clearly doesn’t have any sense of the data, I thought I’d take a look at the numbers. Results after the break.

The Wall Street Journal is reporting that, even with all the concern around gerrymandering, the upcoming redistricting probably won’t have much effect on upcoming elections. Gary King is mentioned as having written a paper “that helped demonstrate the relative impotence of partisan redistricting” yet “he favors the efforts to create a statistical method that would replace it.” I personally am always for using math and hard numbers to solve any problem whenever possible.

The article also mentioned that at a “conference last year in Washington, D.C., researchers proposed alternatives.” David Epstein presented a paper at that conference that Andy Gelman and I worked on.

While the article quoted one of Dr. Gelman’s papers, it unfortunately did not mention him, or any of us, by name. However, the accompanying blog post did mention both Drs. Gelman and Epstein, with specific quotes from them and their work.

Today is World Statistics Day as declared by the United Nations. There are events all over the world including a mourning for the Canadian census. The official US event (pdf) is in Washington, DC but a bunch of New Yorkers are celebrating at the bit.ly hack.a.bit.

Drew Conway has some ideas how to celebrate.

Ban Ki-Moon’s (UN Secretary General) message (pdf) on World Statistics Day:

On this first World Statistics Day I encourage the international community to work with the United Nations to enable all countries to meet their statistical needs.

Last night, Harlan Harris and I gave a talk at the NY Predictive Analytics meetup. Despite the rain there was a good turnout and people seemed to both enjoy and benefit from the presentation.

As requested I have posted the presentation for all to see. Please feel free to contact me with any questions. The data and R code are also posted and we will post at least the presentation on the Meetup page. Everything is also available in one convenient package at GitHub.

Update: Harlan wrote up a great summary of the night.

Tonight I will be giving a talk with Harlan Harris at the Predictive Analytics and Machine Learning Meetup in New York. It is going to be an introduction to Multilevel Models with examples in R and from previous projects I have worked on.

Here are the details for the talk.

A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right. I’ve seen people on Twitter asking how to build this and there has been an option available using Andy Gelman’s coefplot() in the arm package. Not knowing this I built my own (as seen in this post about taste testing tomatoes) and they both suffered the same problems: long coefficient names often got cut off by the left margin of the graph, and the name of the variable was prepended to all the levels of a factor. One big difference between his and mine is that his does not include the Intercept by default. Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions. Now the levels of factors are displayed alone, without being prepended by the factor name. As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham which deals with the margins better than I do.
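The regular-expression fix can be illustrated with a small sketch. It is written in Python rather than the R of plotCoef(), whose code is not shown here, and the factor name `Variety` and its levels are hypothetical. R names a factor-level coefficient by pasting the factor name in front of the level, which is the prefix being stripped.

```python
import re

def strip_factor_prefix(coef_names, factor_names):
    """Remove a leading factor name from each coefficient name, keeping the level."""
    if not factor_names:
        return list(coef_names)
    # Longest names first, so "Variety2" is preferred over "Variety".
    ordered = sorted(factor_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookahead (?=.) leaves a name untouched when nothing would remain.
    pattern = re.compile(r"^(?:" + alternatives + r")(?=.)")
    return [pattern.sub("", name) for name in coef_names]

# Hypothetical coefficient names from a model with one factor, "Variety".
names = ["(Intercept)", "VarietyBrandywine", "VarietyRoma", "Price"]
print(strip_factor_prefix(names, ["Variety"]))
```

This sketch only strips a prefix at the start of the name, so interaction terms such as “Male:Epoch” would need extra handling.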

Both of these changes made for a vast improvement over what I had available before. Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file and is called plotCoef() and is very customizable, down to the color and line thickness. I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be. I sent the code to Dr. Gelman to hopefully be incorporated into his function which I’m sure gets used by a lot more people than mine will. Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed. Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander.

Jared Lander

Category Archives: Statistics

Cell Phone Tracking for Disaster Relief

How to Succeed in Scrabble

Texting Patterns

Time Out New York Doesn’t Get Data

NYC Data Mafia T-Shirts

A Take on Gerrymandering

World Statistics Day

Predictive Analytics Wrap Up (Update: Harlan’s Writeup)

Predictive Analytics Talk

Coefficient Plot