Shoot, it's already 2015.

I graduated from college in 2014 (and I'm happy about no longer having to pay tuition), and a lot has changed since. Many of my friends have gone the industry route and joined tech companies. As a result, a large portion of my buddies moved out of our college town to tech centers such as San Francisco and Palo Alto.

Combining that fact with my startup work hours, I can tell you that my social life hit a lifetime low. I was so busy with work that I didn't even realize my lease was ending. As I scrambled to look for a new apartment, I called up an old friend to crash on his couch while I was homeless. I can honestly say that the 'startup homelessness' is overly romanticized and that the overall experience was miserable.

"In June 2010, I moved out of my apartment and I have been mostly homeless ever since" - Brian Chesky, CEO Co-Founder Airbnb

I've known my friend (let's call him Andre) since the start of college, and roomed with him for two years. But as I was about to make that call, I realized that I hadn't spoken to him in quite a while. The haunting realization that my relationships were dwindling came as I tallied up the number of friends I still kept in touch with (other than my co-founders).

Fast-forward a few weeks. I'm reading up on activity trackers (think Fitbit and Jawbone) and what they can do with my data. I learn that they can figure out when I'm sleeping (or if I'm sleeping less) with just the sensor data from a wristband. As a Statistics and Computer Science grad, I was ecstatic about this creative use of data. I've known that my data is used in places like advertisements, but this is the first time I've seen data used directly to improve my life. Then I thought: how come all of these giant corporations are the ones using my data for me, instead of me using it for myself?

Targeted ads

I spent some time thinking about what data I can use.

So I thought, why not my Facebook data? I'm not a big Twitter user, and I haven't quite caught up with Instagram or Pinterest. Facebook would cover practically all of my online social media data. A quick Google search revealed that you can in fact download your entire Facebook presence in one archive. (free tutorial on how here: Tutorial for the Analysis)

I went on Facebook and asked for my data dump.

They were careful with my data and asked me to authenticate one additional time (this in turn made me wonder if anyone else had access to this data). After a few minutes of waiting, I received a dump with an unzipped size of 49,500,160 bytes (47 MB, or about ten MP3 songs). That's my entire Internet presence packed into almost 400 million zeroes and ones. Unsurprisingly, close to 32 MB of the 47 MB is media (photos/videos) and 15 MB is other user-generated data. Digging deeper, I found that of the 15 MB, over 14 MB consisted of my messaging data.
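As a quick sanity check on those figures, the conversion fits in a few lines of Python. This is only a sketch, and it assumes that "MB" here means binary megabytes (1024 × 1024 bytes), which is what makes 49,500,160 bytes come out to roughly 47.

```python
# Size of the unzipped archive, as quoted in the post.
archive_bytes = 49_500_160

# Assumption: "MB" is a binary megabyte (1024 * 1024 bytes).
megabytes = archive_bytes / 1024**2

# Every byte is 8 bits, i.e. 8 "zeroes and ones".
bits = archive_bytes * 8

print(f"{megabytes:.1f} MB")  # 47.2 MB
print(f"{bits:,} bits")       # 396,001,280 bits
```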

100,000.

That's how many back-and-forth messages I had over the last 5 years. That's already 50+/day! But was it 50+/day consistently? Or did I send hundreds of messages one day and remain relatively quiet the next? Moreover, to whom was I talking? I know I wasn't too active on Facebook early on (I was one of the last waves of MySpace users #tomismyfriend), and I haven't been too active more recently.

To dig deeper, I opened up the message file and saw that the data was encoded in HTML. This makes it easier to explore via my browser, but slightly harder to systematically analyze. Upon inspecting the HTML dump that encoded my messages, I decided to parse the HTML via BeautifulSoup.
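To give an idea of what that parsing step can look like, here is a minimal sketch with BeautifulSoup. The markup in SAMPLE_HTML is made up for illustration: the tag names, class names and the data-participants attribute are all assumptions, not the actual layout of the Facebook archive, so the selectors would need to be adapted to the real file. parse_messages is likewise a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real archive's tags and class names may differ.
SAMPLE_HTML = """
<div class="thread" data-participants="Me, Andre">
  <div class="message">
    <span class="user">Andre</span>
    <span class="meta">2014-11-02 18:30</span>
    <p>Couch is free.</p>
  </div>
  <div class="message">
    <span class="user">Me</span>
    <span class="meta">2014-11-02 18:32</span>
    <p>On my way.</p>
  </div>
</div>
"""

def parse_messages(html):
    """Return one dict per message found in the (assumed) markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for thread in soup.find_all("div", class_="thread"):
        # Split the comma-separated participant list of this thread.
        participants = [
            name.strip() for name in thread["data-participants"].split(",")
        ]
        for msg in thread.find_all("div", class_="message"):
            rows.append({
                "participants": participants,
                "sender": msg.find("span", class_="user").get_text(strip=True),
                "sent_at": msg.find("span", class_="meta").get_text(strip=True),
                "text": msg.find("p").get_text(strip=True),
            })
    return rows

messages = parse_messages(SAMPLE_HTML)
for m in messages:
    print(m["sent_at"], m["sender"], "->", m["text"])
```

Flattening everything into a list of plain dicts keeps the later steps (counting, grouping by week) independent of the HTML.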

Top communicated friends.

Unsurprisingly, I recognized every single one of the top 5. But something was wrong; I haven't talked to some of these people in quite a long time (some in over a year). But what does this mean? Is this simply because of my overall decrease in Facebook use, or does it actually signal that my personal relationships are systematically deteriorating? Or perhaps these are just some exceptions?

To clarify the matter, I decided to build a time-series plot of the raw weekly message count for the top friends. To keep things consistent (since I also processed group messages), here are the rules I used:

  • I keep track of [to] and [from] counters for each person I interact with.
    • This means that for each person I've communicated with, I keep a unique pair of counters: the number of messages I sent [to] that person, and the number of messages I received [from] that same person.
  • For each message someone else sends within a group, I increment the [from] count for that person.
  • For each message I send, I increment every single group member's [to] count.
  • Every message thread with more than 4 participants was ignored (certain group messages went out to entire event invite lists, classes, etc.).
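The rules above can be sketched in plain Python as follows. This is one possible reading of them, not the original script: the ME label, the count_messages helper and the sample threads are invented for illustration, and the sketch assumes the participant count of a thread includes me.

```python
from collections import Counter

ME = "Me"             # hypothetical label for the archive owner
MAX_PARTICIPANTS = 4  # threads with more participants are ignored

def count_messages(threads):
    """threads: iterable of (participants, senders) pairs, where senders
    lists the author of each message in the thread, in order."""
    sent_to = Counter()        # messages I sent [to] each person
    received_from = Counter()  # messages I received [from] each person
    for participants, senders in threads:
        # Assumption: the participant count includes me.
        if len(participants) > MAX_PARTICIPANTS:
            continue
        others = [p for p in participants if p != ME]
        for sender in senders:
            if sender == ME:
                # My message counts once [to] every other member.
                sent_to.update(others)
            else:
                received_from[sender] += 1
    return sent_to, received_from

# Made-up threads for illustration.
threads = [
    (["Me", "Lisa"], ["Me", "Lisa", "Me"]),
    (["Me", "Lisa", "Matt"], ["Matt", "Me"]),
    (["Me", "A", "B", "C", "D"], ["A", "Me"]),  # 5 participants: skipped
]
sent_to, received_from = count_messages(threads)
print(sent_to)        # Lisa: 3, Matt: 1
print(received_from)  # Lisa: 1, Matt: 1
```

Note that under these rules a single message I send to a three-person group is counted twice, once per recipient, which is why group threads inflate the [to] side.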

Finally, I used pandas to wrangle the data and matplotlib to plot it (names are removed below).
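For the wrangling part, a weekly message count per friend can be built roughly like this. It is a sketch on made-up records, and the column names friend and sent_at are assumptions rather than anything from the archive itself.

```python
import pandas as pd

# Made-up records; in the real analysis these come from the parsed archive.
records = [
    {"friend": "A", "sent_at": "2014-11-03 09:00"},
    {"friend": "A", "sent_at": "2014-11-05 21:15"},
    {"friend": "B", "sent_at": "2014-11-04 12:00"},
    {"friend": "A", "sent_at": "2014-11-12 08:30"},
]

df = pd.DataFrame(records)
df["sent_at"] = pd.to_datetime(df["sent_at"])

# One row per week (weeks end on Sunday), one column per friend.
weekly = (
    df.groupby([pd.Grouper(key="sent_at", freq="W"), "friend"])
      .size()
      .unstack(fill_value=0)
)
print(weekly)
```

With matplotlib installed, calling weekly.plot() on the resulting table draws one line per friend, which is the kind of raw time-series chart described here.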

Fb from plot

Yup, that's kind of hard to read.

The chart shows that my communication was dominated by one person, whose share has faded significantly in more recent times.

To make it easier to read, I instead plotted the top 4 friends (combined [to] and [from]). As an extra measure for clarity, I ranked them each week (taking advantage of scipy.stats.rankdata()). For a last bit of added fanciness, I used plot.ly to create an interactive graph and deploy it to the cloud.
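The weekly ranking step might look something like the sketch below. The numbers are invented, and giving tied friends the same rank via method="min" is my assumption; the original analysis may have handled ties differently.

```python
import numpy as np
from scipy.stats import rankdata

friends = ["Lisa", "James", "Matt", "Harris"]

# Rows are weeks, columns are friends (combined [to] + [from]); made-up data.
weekly_totals = np.array([
    [40, 25, 10,  0],
    [12, 30, 10,  5],
    [ 3,  3, 11, 20],
])

# rankdata ranks ascending, so negate each row to make the
# most-messaged friend rank 1. Ties share the lowest rank.
weekly_ranks = np.vstack(
    [rankdata(-row, method="min") for row in weekly_totals]
)

for week, ranks in enumerate(weekly_ranks, start=1):
    print(f"week {week}:", dict(zip(friends, ranks.tolist())))
```

Ranking per week removes the effect of overall volume: a quiet week and a busy week both produce ranks 1 through 4, so the plot shows who mattered most relative to the others rather than how much I used Facebook.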

Before we try to interpret the graph, note that the names are munged for privacy reasons.

Try the analysis on your own Facebook data!

Some Observations

The rise and fall of Lisa and James.

That was the spring of 2013 leading into the summer of 2013. I had quit my job in January of that year and was feeling pretty burnt out. There are reports about losing and gaining friends, and this seems like a good example. Though I was pretty aware of what happened, it is still punishing to see it laid out in front of me. You don't always get to see something so intangible in such a quantified manner.

Matt remaining consistent.

Matt is someone I've known for a long time. We're no longer the closest friends, but we talk frequently and consistently (don't know why, but we do). I became close to Matt in late high school, and we've stayed friends ever since. What surprised me was my own lack of awareness of the fact that Matt has indeed been a consistently good friend. We're up to date on what's happening in our lives; I'm probably going to give him a call now, maybe let him know how our Demo Day went.

Emergence of Harris.

I met Harris via a common interest near the end of college. We hang out quite a bit now, so this is accurate. He left the country 2 days ago (as of writing this sentence) to pursue his startup abroad. This discovery will serve as a good reminder for me to keep in touch with him.

Friends jump img

Takeaways

Data is insanely powerful.

But I already knew that. Or did I? As an individual experiencing life one data point at a time, I don't get to see the seemingly insignificant changes that add up to a life-changing trend. I thought I was aware of my own personal data (weight over time, GPA over time, mile-run-time over time, bench-max over time, bank balance over time, you name it). Turns out I was missing the most important one: my relationships over time. The hard thing about relationships is that I didn't know of any quantifiable way to analyze them, until now. It's no coincidence that a simple analysis performed on a single dataset revealed meaningful insights.

Limits of data.

What? Limits? Yes, limits. A quick poke at the data revealed a lot about my social history, but it's still limited to Facebook. It doesn't include my texts, phone calls or real-life interactions. This explains why my co-founders (and longtime friends) aren't on the top list, despite the fact that I spend over 80% of my awake and functional time with those guys.

Journal?

I now regret not having kept a journal to analyze. If we can figure out what is happening (and might happen) in the stock market via sentiment analysis, then perhaps we can leverage journal entries to figure out the complex thing that is human relationships. Maybe it'll predict when I'll be depressed or stressed (I'm willing to bet the word "fundraising" is going to be closely tied to my stress level), and help me be more aware of my personal well-being.

Last Thoughts

All I did was build a plot that ranked my messaging behavior. Yet it revealed quite a lot about my own social behavior. With more data, I can see ways of building models that can reliably predict ups and downs in my personal life. Given how caught up we are with our lives and careers, maybe we all need something that reminds us to keep in touch with our human friends. Perhaps that'll be my next startup idea.

Tools Used

  • BeautifulSoup is an easy-to-use Python package that helps you parse HTML.
  • Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Matplotlib is a Python 2D plotting library which produces publication-quality figures.
  • SciPy is an open source library of scientific tools.
  • Plot.ly is an online analytics and data visualization tool.
