This graphs shows received and sent texts by month.  Notice the spike in July 2010.
Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.

Fig. 2: This graph shows my text messaging pattern over time for both men and women. Notice the crossover around August 2010.

More interestingly, is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis excludes her and family members because they would bias the gender effect. Being a good statistician I ran a poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.

Fig. 3: Here the “Male” coefficient seems statistically insignificant but its direction makes sense so it is left in the model. The “Intercept” is interpreted as the texting rate with females before the breakup, “Epoch” is the increase with females after the breakup, “Intercept” plus “Male” is the rate with males before the breakup. “Epoch” combined with “Male:Epoch” is the change in rate for texts with males after the breakup.

Further analysis and a how-to after the break.

Next I saw that my text messages–both incoming and outgoing–surge, unsurprisingly, toward the end of the week. I ran a regression using the day of the week as an unordered factor and saw that texts do trend up toward the end of the week, particularly Sunday which I can’t explain. Then again, it’s probably counting late night Saturday, after midnight, when I’m texting people while enjoying New York’s nightlife.

Fig 4: This graph shows a clear upward trend toward the end of the week with a bump on Thursday as well.
Fig. 5: This coefficient plot shows that Sunday clearly stands out far and above the other days in terms of text messages sent and receive. Day 1 is Monday and Day 7 is Sunday. Putting the days in their natural order isn’t built into my coefficient plot yet.

Update:  As suspected, the uptick on Sunday is really due to late night Saturday texts as seen in the graph below.  I didn’t run ay test on it, the graph was good enough for me.

Figure 6: Now the lines represent calls in the AM or PM. PM Calls steadily increases toward the end of the week and AM calls jump dramatically on Saturday and Sunday. This is due to late night texting on Friday and Saturday when I am out well past midnight.

And now for code and instructions

First, you have to locate the two necessary databases in your backup folder. These are in different places depending on your OS, so please see these instructions. The two databases you want are 3d0d7e5fb2ce288813306e4d4636395e047a3d28 (the messages) and 31bb7ba8914766d4ba40d6dfb6113c8b614be442 (your contacts). Save those databases to a new folder (you don’t want to mess with originals, just in case) and rename them “sms.db” and “AddressBook sqlitedb” respectively. I don’t know if you need to rename them, but I did and it worked. A complete list of databases is available here.

Next you must download and install the appropriate driver for SQLite from this site. Then you need to use that driver to make connections to both databases under the “File DSN” section of ODBC Data Source Administrator. Sorry, I don’t know where to do this on Mac or Linux.

Click “Add” and select SQLite3 ODBC Driver.

Click “Next” then give a name for the connection such as “sms.db.dsn” in the appropriate folder.

Click “Next” then “Finish” which brings you to following screen where you should click “Browse” then select the “sms.db” database. Do the same for the “AddressBook sqlitedb” database calling the connection “AddressBook sqlitedb.dsn”.

As usual, all of my analysis was done in R and the graphs were made with ggplot.

First up, load in the necessary libraries.

## Load needed libraries
library(RODBC) # to read the database
library(ggplot2) # to make the nice charts and facilitate table joins
library(lmtest) # to test how well the models fit the data
source("https://www.jaredlander.com/code/plotCoef.r") # grab the coefficient plot function
source("https://www.jaredlander.com/code/overdispersion.r") # grab the overdispersion test function

Now we connect to the databases.

# Connect to messages DB
DB <- odbcDriverConnect()
# Connect to the Contacts DB
DB2 <- odbcDriverConnect()

Each of those commands pulls up a dialog like the one below. First choose “sms.db.dsn” and then for the second dialog choose “AddressBook sqlitedb.dsn.” The order is important.

This next bunch of code cleans up the data. Sorry for the text wrapping.

Days <- data.frame(DayNum=1:7, DayOfWeek=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
Days$DayOfWeek <- as.character(Days$DayOfWeek) # change to character
messages <- sqlQuery(DB,paste("SELECT * FROM message")) # grab messages data
messages$Type <- ifelse(messages$flags == 2, 'Received', ifelse(messages$flags == 3, 'Sent', ifelse(messages$flags == 33, 'Message send failure', ifelse(messages$flags == 129, 'Deleted', ifelse(messages$flags == 35, 'Other', 'Other'))))) ## Denote message type
messages$Date <- as.Date(floor(messages$date/60/60/24), origin="1970-01-01") ## Set dates
messages$DayOfWeek <- weekdays(messages$Date) ## Find out what day of the week that day fell on
messages$Year <- substr(messages$Date, 1, 4) # get the Year
messages <- within(messages, Month <- droplevels(factor(strftime(Date, format = "%B"),levels = month.name))) # Clean up the data
messages <- messages[messages$Type %in% c("Sent", "Received"), ] # only keep data on messages successfully sent or received
messages <- join(x=messages, y=Days, by="DayOfWeek") # join the days of week information to the messages table
members <- sqlQuery(DB,paste("SELECT * FROM group_member")) # This table is used as a link from the messages table to the contacts table
members$TextLink <- members$address # Creates the key field to the contacts table
members$Number <- members$address # Get the cleanest version of phone number available
personNumbers <- sqlQuery(DB2,paste("SELECT * FROM ABMultiValue")) # The table that joins between the members table and the contacts table
personNumbers$TextLink <- personNumbers$value # Creates the key field to the members table
personNumbers$ContactLink <- personNumbers$record_id # Creates the key field to the contacts table</p>
addresses <- sqlQuery(DB2, paste("SELECT * FROM ABPerson")) # The Contacts table
addresses$ContactLink <- addresses$ROWID # Creates the key field to the personNumbers table
addresses$AddressRow <- addresses$ROWID # For indexing reasons
addresses$Name <- paste(addresses$First, addresses$Last, sep=" ") # Make a combined name

I have gone through and labelled my contacts male or female. If you don’t want to go through that work, you can skip the gender analysis or scrape a site like Baby Name Guesser. It’s fairly simple, you put the name of interest in the URL then use readLines() to get what the site guesses.

gender <- read.csv("C:\\Users\\Jared\\Documents\\iPhone SMS Backup\\Gender.csv", sep=",", header=T, stringsAsFactors=F)

Now join all those tables together.

## Join all of our tables
master <- join(x=messages, y=members, by="group_id")
master <- join(x=master, y=personNumbers, by="TextLink")
master <- join(x=master, y=addresses, by="ContactLink")
master <- join(x=master, y=gender, by="Name")

Build the data for plotting texts by month and year in Figure 1.

 # Count per month, year, type
textMonth <- aggregate(ROWID ~ Year + Month + Type, data=master, length)
# Plot by month, year, type
qplot(Month, ROWID, data=textMonth, group=Type, colour=Type, geom="line", ylab="# Texts", main="Texts by Month") + facet_grid(Year ~ .) + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 1

The following t-test (and aggregation) tests to see if texting increased after a certain date.

textByDay <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month, data=master, length)
textByDay$Epoch <- ifelse(textByDay$Date >= "YYYY-MM-DD", 1, 0)
t.test(ROWID ~ Epoch, data=textByDay[textByDay$Month != "December", ], alternastive="less")

Now to make the data ready to check the gender crossover, plotting it (Figure 2), and testing it (Figure 3).

# Count per month, year, gender
textGenderMonth <- aggregate(ROWID ~ Year + Month + Gender, data=master[!master$Number %in% c("(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx"), ], length)
textGenderMonth$Gender <- ifelse(textGenderMonth$Gender == "f", "Female", "Male")
# Plot by day, year, type
# Shows crossover point for texting girls vs guys
qplot(Month, ROWID, data=textGenderMonth, group=Gender, colour=Gender, geom="line", ylab="# Texts") + facet_grid(Year ~ .) + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 2
textByDayGender <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month + Gender, data=master[!master$Number %in% c("(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx", "(xxx) xxx-xxxx"), ], length)
textByDayGender$Epoch <- ifelse(textByDayGender$Date >= "YYYY-MM-DD", 1, 0)
textByDayGender$Gender <- ifelse(textByDayGender$Gender == "f", "Female", "Male")
textByDayGender$Gender <- factor(textByDayGender$Gender)
femaleLM <- lm(ROWID ~ Gender*Epoch, data = textByDayGender[textByDayGender$Month != "December", ])
femaleGLM <- glm(ROWID ~ Gender*Epoch, data = textByDayGender[textByDayGender$Month != "December", ], family="quasipoisson")
overdispersionTest(femaleGLM)
bptest(femaleGLM)
resettest(femaleGLM)
plotCoef(femaleGLM, Main="Coefficient Plot for Poisson Regression", YLab="Variable", XLab="Coefficient") # Figure 3

And now we check for the trend in texts over the course of a week (Figures 4 and 5).

textDayNoYear <- aggregate(ROWID ~ DayOfWeek + DayNum + Type, data=master, length)
qplot(reorder(DayOfWeek, DayNum), ROWID, data=textDayNoYear, group=Type, colour=Type, geom="line", xlab="Day of Week", ylab="# Texts") + opts(axis.text.x=theme_text(angle=45, hjust=1)) # Figure 4
textByDay <- aggregate(ROWID ~ Date + DayOfWeek + DayNum + Year + Month, data=master, length)
textByDay$Epoch <- ifelse(textByDay$Date >= "YYYY-MM-DD", 1, 0)
dayOfWeekLM <- lm(ROWID ~ factor(DayNum) - 1, data=textByDay)
plotCoef(dayOfWeekLM, Main="Average Texts by Day", YLab="Day", XLab="Texts") #Figure 5

And that’s the basic idea. I left out a lot of the nitty gritty of model checking and all the troubleshooting. I hope everyone enjoyed it and please feel free to message (or @jaredlander on Twitter) me with any questions.

Related Posts



Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Washington DC R Conferences and author of R for Everyone.

Leave a Reply