A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.
My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.
My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.
More interestingly, is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis excludes her and family members because they would bias the gender effect. Being a good statistician I ran a poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.
Further analysis and a how-to after the break.
Next I saw that my text messages–both incoming and outgoing–surge, unsurprisingly, toward the end of the week. I ran a regression using the day of the week as an unordered factor and saw that texts do trend up toward the end of the week, particularly Sunday which I can’t explain. Then again, it’s probably counting late night Saturday, after midnight, when I’m texting people while enjoying New York’s nightlife.
Update: As suspected, the uptick on Sunday is really due to late night Saturday texts as seen in the graph below. I didn’t run ay test on it, the graph was good enough for me.
And now for code and instructions
First, you have to locate the two necessary databases in your backup folder. These are in different places depending on your OS, so please see these instructions. The two databases you want are 3d0d7e5fb2ce288813306e4d4636395e047a3d28 (the messages) and 31bb7ba8914766d4ba40d6dfb6113c8b614be442 (your contacts). Save those databases to a new folder (you don’t want to mess with originals, just in case) and rename them “sms.db” and “AddressBook sqlitedb” respectively. I don’t know if you need to rename them, but I did and it worked. A complete list of databases is available here.
Next you must download and install the appropriate driver for SQLite from this site. Then you need to use that driver to make connections to both databases under the “File DSN” section of ODBC Data Source Administrator. Sorry, I don’t know where to do this on Mac or Linux.
Click “Add” and select SQLite3 ODBC Driver.
Click “Next” then give a name for the connection such as “sms.db.dsn” in the appropriate folder.
Click “Next” then “Finish” which brings you to following screen where you should click “Browse” then select the “sms.db” database. Do the same for the “AddressBook sqlitedb” database calling the connection “AddressBook sqlitedb.dsn”.
First up, load in the necessary libraries.
## Load needed libraries library(RODBC) # to read the database library(ggplot2) # to make the nice charts and facilitate table joins library(lmtest) # to test how well the models fit the data source("http://www.jaredlander.com/code/plotCoef.r") # grab the coefficient plot function source("http://www.jaredlander.com/code/overdispersion.r") # grab the overdispersion test function
Now we connect to the databases.
# Connect to messages DB DB <- odbcDriverConnect() # Connect to the Contacts DB DB2 <- odbcDriverConnect()
Each of those commands pulls up a dialog like the one below. First choose “sms.db.dsn” and then for the second dialog choose “AddressBook sqlitedb.dsn.” The order is important.
This next bunch of code cleans up the data. Sorry for the text wrapping.
Days <- data.frame(DayNum=1:7, DayOfWeek=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) Days$DayOfWeek <- as.character(Days$DayOfWeek) # change to character messages <- sqlQuery(DB,paste("SELECT * FROM message")) # grab messages data messages$Type <- ifelse(messages$flags == 2, 'Received', ifelse(messages$flags == 3, 'Sent', ifelse(messages$flags == 33, 'Message send failure', ifelse(messages$flags == 129, 'Deleted', ifelse(messages$flags == 35, 'Other', 'Other'))))) ## Denote message type messages$Date <- as.Date(floor(messages$date/60/60/24), origin="1970-01-01") ## Set dates messages$DayOfWeek <- weekdays(messages$Date) ## Find out what day of the week that day fell on messages$Year <- substr(messages$Date, 1, 4) # get the Year messages <- within(messages, Month <- droplevels(factor(strftime(Date, format = "%B"),levels = month.name))) # Clean up the data messages <- messages[messages$Type %in% c("Sent", "Received"), ] # only keep data on messages successfully sent or received messages <- join(x=messages, y=Days, by="DayOfWeek") # join the days of week information to the messages table members <- sqlQuery(DB,paste("SELECT * FROM group_member")) # This table is used as a link from the messages table to the contacts table members$TextLink <- members$address # Creates the key field to the contacts table members$Number <- members$address # Get the cleanest version of phone number available personNumbers <- sqlQuery(DB2,paste("SELECT * FROM ABMultiValue")) # The table that joins between the members table and the contacts table personNumbers$TextLink <- personNumbers$value # Creates the key field to the members table personNumbers$ContactLink <- personNumbers$record_id # Creates the key field to the contacts table</p> addresses <- sqlQuery(DB2, paste("SELECT * FROM ABPerson")) # The Contacts table addresses$ContactLink <- addresses$ROWID # Creates the key field to the personNumbers table addresses$AddressRow <- addresses$ROWID # For indexing reasons addresses$Name <- paste(addresses$First, addresses$Last, sep=" ") # Make a combined name
I have gone through and labelled my contacts male or female. If you don’t want to go through that work, you can skip the gender analysis or scrape a site like Baby Name Guesser. It’s fairly simple, you put the name of interest in the URL then use readLines() to get what the site guesses.
gender <- read.csv("C:\\Users\\Jared\\Documents\\iPhone SMS Backup\\Gender.csv", sep=",", header=T, stringsAsFactors=F)
Now join all those tables together.
## Join all of our tables master <- join(x=messages, y=members, by="group_id") master <- join(x=master, y=personNumbers, by="TextLink") master <- join(x=master, y=addresses, by="ContactLink") master <- join(x=master, y=gender, by="Name")
Build the data for plotting texts by month and year in Figure 1.
# Count per month, year, type textMonth <- aggregate(ROWID ~ Year + Month + Type, data=master, length) # Plot by month, year, type qplot(Month, ROWID, data=textMonth, group=Type, colour=Type, geom="line", ylab="# Texts", main="Texts by Month") + facet_grid(Year ~ .) + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 1
The following t-test (and aggregation) tests to see if texting increased after a certain date.
textByDay <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month, data=master, length) textByDay$Epoch <- ifelse(textByDay$Date >= "YYYY-MM-DD", 1, 0) t.test(ROWID ~ Epoch, data=textByDay[textByDay$Month != "December", ], alternastive="less")
Now to make the data ready to check the gender crossover, plotting it (Figure 2), and testing it (Figure 3).
# Count per month, year, gender textGenderMonth <- aggregate(ROWID ~ Year + Month + Gender, data=master[!master$Number %in% c("(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx"), ], length) textGenderMonth$Gender <- ifelse(textGenderMonth$Gender == "f", "Female", "Male") # Plot by day, year, type # Shows crossover point for texting girls vs guys qplot(Month, ROWID, data=textGenderMonth, group=Gender, colour=Gender, geom="line", ylab="# Texts") + facet_grid(Year ~ .) + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 2 textByDayGender <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month + Gender, data=master[!master$Number %in% c("(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx"), ], length) textByDayGender$Epoch <- ifelse(textByDayGender$Date >= "YYYY-MM-DD", 1, 0) textByDayGender$Gender <- ifelse(textByDayGender$Gender == "f", "Female", "Male") textByDayGender$Gender <- factor(textByDayGender$Gender) femaleLM <- lm(ROWID ~ Gender*Epoch, data = textByDayGender[textByDayGender$Month != "December", ]) femaleGLM <- glm(ROWID ~ Gender*Epoch, data = textByDayGender[textByDayGender$Month != "December", ], family="quasipoisson") overdispersionTest(femaleGLM) bptest(femaleGLM) resettest(femaleGLM) plotCoef(femaleGLM, Main="Coefficient Plot for Poisson Regression", YLab="Variable", XLab="Coefficient") # Figure 3
And now we check for the trend in texts over the course of a week (Figures 4 and 5).
textDayNoYear <- aggregate(ROWID ~ DayOfWeek + DayNum + Type, data=master, length) qplot(reorder(DayOfWeek, DayNum), ROWID, data=textDayNoYear, group=Type, colour=Type, geom="line", xlab="Day of Week", ylab="# Texts") + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 4 textByDay <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month, data=master, length) textByDay$Epoch <- ifelse(textByDay$Date >= "YYYY-MM-DD", 1, 0) dayOfWeekLM <- lm(ROWID ~ factor(DayNum) - 1, data=textByDay) plotCoef(dayOfWeekLM, Main="Average Texts by Day", YLab="Day", XLab="Texts") #Figure 5
And that’s the basic idea. I left out a lot of the nitty gritty of model checking and all the troubleshooting. I hope everyone enjoyed it and please feel free to message (or @jaredlander on Twitter) me with any questions.
Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.