Project: Facebook Data Liberation — Part 2

Back in Part 1 I explained why I’m making this HTML5 application. In short, it is a web/mobile app that reads a running/fitness Facebook group feed, displays the complete list of posts in it, and scans the content to estimate how many miles the members say they ran.

In this part I’ll talk a little about the crude web page I put together for a first-pass visualization of the JSON-formatted Facebook feed.

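The wireframe for the app is three simple panes: a text area with an Add button on the left, a scrolling list of posts on the right, and a results pane underneath.

The HTML for this is short: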

<div style="float:left; width:48%">
  <textarea cols="80" rows="20" id="textarea"></textarea>
  <br />
  <button id="add" onclick="addPosts()">Add</button>
</div>
<div style="float:right; width:48%; height:500px; overflow:scroll" id="posts"></div>
<div style="clear:both" id="lowerpane"></div>

My plan is to build up an array of posts and make sure I can parse them properly. Since this is a quick-and-dirty proof of concept for a personal project, I don’t feel bad using a global variable called allPosts to store the data. To start the process I copy and paste the JSON representation of the group feed into the text area and then click the button to call addPosts():

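   function addPosts() {
      var posts = $.parseJSON($("#textarea").val());
      for (var i = 0; i < posts.data.length; i++) { // assumes the posts sit in the feed's "data" array
          allPosts[postIdx++] = posts.data[i]; // postIdx: global write index into allPosts
      }
      showTree();
      parseMiles();
   }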
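For reference, here is a minimal sketch of the same step without jQuery or the page, so it can be run on its own. It assumes the pasted feed is an object whose `data` property is an array of posts. The function name `appendPosts` and the sample posts are made up for illustration.

```javascript
// Sketch only: appendPosts and the sample feed are illustrative,
// not taken from the app itself.
function appendPosts(jsonText, allPosts) {
  var feed;
  try {
    feed = JSON.parse(jsonText);
  } catch (e) {
    // A partial copy-and-paste is the most likely failure, so report it
    // instead of letting the exception escape.
    return { added: 0, error: e.message };
  }
  // Assumption: posts live in a top-level "data" array.
  var posts = (feed && Array.isArray(feed.data)) ? feed.data : [];
  for (var i = 0; i < posts.length; i++) {
    allPosts.push(posts[i]);
  }
  return { added: posts.length, error: null };
}

var sampleFeed = JSON.stringify({
  data: [
    { id: "1", message: "I ran 4 miles today." },
    { id: "2", message: "4 done before work." }
  ]
});

var collected = [];
var outcome = appendPosts(sampleFeed, collected);
console.log(outcome.added, collected.length);

// A truncated paste is reported rather than thrown.
var broken = appendPosts('{"data": [', collected);
console.log(broken.added, broken.error !== null);
```

Using `push` instead of a separate index variable means the same array can be topped up by pasting several pages of the feed one after another.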

There are two other functions I call after populating the allPosts variable: showTree() and parseMiles(). showTree() is a lot of boring code that loops over the allPosts array and builds an HTML string with the data formatted as bulleted lists. When the string is complete, it becomes the contents of the “Tree of posts” div. While I was writing that function I kept an online JSON viewer open in another window so I could explore the data format. It’s a great tool to have handy, and it’s the only interesting thing I can say about that part of the project.
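As a rough illustration of that kind of loop (not the actual showTree() code), a recursive helper can turn any parsed JSON value into nested bulleted lists. The names `escapeHtml` and `toBulletedList` and the sample post are hypothetical.

```javascript
// Sketch only: escapeHtml and toBulletedList are hypothetical helpers,
// not the showTree() function from the app.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn any JSON value into nested <ul> markup: one <li> per key or index.
function toBulletedList(value) {
  if (value === null || typeof value !== "object") {
    return escapeHtml(value);
  }
  var html = "<ul>";
  var keys = Object.keys(value);
  for (var i = 0; i < keys.length; i++) {
    html += "<li>" + escapeHtml(keys[i]) + ": " +
            toBulletedList(value[keys[i]]) + "</li>";
  }
  return html + "</ul>";
}

var post = { message: "Ran 3 miles", from: { name: "Pat" } };
console.log(toBulletedList(post));
// <ul><li>message: Ran 3 miles</li><li>from: <ul><li>name: Pat</li></ul></li></ul>
```

Escaping the text matters here because post messages are user-written and may contain characters the browser would otherwise treat as markup.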

parseMiles() uses a regular expression to try to guess which posts are people reporting how many miles they ran that day. It took a bit of twiddling to get the regex to capture most of the target entries without too many false positives. I used RegexPal to test my regex quickly in an interactive setting. It is another time-saving tool for anyone writing a complex expression. You put sample text in one input box and type a regex in another, and the matches in the sample text are highlighted as you type.

A little experimentation convinced me that I wasn’t going to get a good success rate without building a real AI. I settled for a single regex that looks for a whole or decimal number followed by the word “mile”. It is very simple, but it did just as well as some more complex schemes I tried. I figure that the false positives will roughly balance out the posts that I missed. After a little cleanup for blog posting, the function looks like this:

   function parseMiles() {
     var sums = 0;
     var matches = "";
     var nonmatches = "";
     for (var i = 0; i < allPosts.length; i++) {
       var message = "" + allPosts[i].message;
       // a whole or decimal number followed by "mile"; the fraction is optional
       var m = message.match(/(\d+(?:\.\d+)?) mile/i);
       if (m) {
         matches += "<p>" + message + "</p>";
         sums += parseFloat(m[1]); // parseFloat keeps the decimal part
       } else {
         nonmatches += "<p>" + message + "</p>";
       }
     }
     // now show the results we built
     $("#lowerpane").html("<p>I found " + sums + " miles in the following posts:</p>" + matches + "<h2>non matches</h2><div style='color:blue'>" + nonmatches + "</div>");
   }
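To check the counting logic without loading the page, the matching can be pulled out of the DOM code into a small function. This is only a sketch: `MILE_PATTERN` and `sumMiles` are names invented here, and the pattern is written with an optional fractional part so that single-digit values such as "4 miles" match.

```javascript
// Sketch only: MILE_PATTERN and sumMiles are names invented for this example.
// The fractional part is optional, so "4 miles" and "3.5 miles" both match.
var MILE_PATTERN = /(\d+(?:\.\d+)?) mile/i;

function sumMiles(posts) {
  var total = 0;
  var matched = [];
  var unmatched = [];
  for (var i = 0; i < posts.length; i++) {
    // Assumption: some posts may have no message property at all.
    var message = posts[i].message || "";
    var m = message.match(MILE_PATTERN);
    if (m) {
      total += parseFloat(m[1]);
      matched.push(message);
    } else {
      unmatched.push(message);
    }
  }
  return { total: total, matched: matched, unmatched: unmatched };
}

var summary = sumMiles([
  { message: "I ran 4 miles today." },
  { message: "3.5 miles before breakfast" },
  { message: "Great job everyone!" },
  { id: "post without a message" }
]);
console.log(summary.total);            // 7.5
console.log(summary.unmatched.length); // 2
```

Returning the matched and unmatched messages separately keeps the HTML building in one place and makes it easy to eyeball what the pattern skipped.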

If you have never tried to extract data from free-form, human-written text, you probably won’t guess how difficult it can be to get even 95% right. People find all kinds of clever ways to say “I ran 4 miles today.” Here are some examples I came across:

  • I ran 4 miles today.
  • 4 done before work.
  • Got in a 4 mi run on the treadmill while the kids were napping.
  • Ran 2 miles listening to podcasts then finished another 2 listening to my running mix.
  • I wanted to run 6 miles but had to stop after 4 to get the kids at school.
  • Ran almost 4 on the beach.

Getting all of them right is practically impossible without writing code that would stand up as a doctoral dissertation.
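Running the examples above through a "number followed by mile" pattern shows where it falls short. This is a sketch rather than the app's code: the pattern is written so that single-digit values match, and the variable names are invented for the example.

```javascript
// Sketch only: a "number followed by mile" pattern applied to the
// example posts listed above.
var pattern = /(\d+(?:\.\d+)?) mile/i;

var examples = [
  "I ran 4 miles today.",
  "4 done before work.",
  "Got in a 4 mi run on the treadmill while the kids were napping.",
  "Ran 2 miles listening to podcasts then finished another 2 listening to my running mix.",
  "I wanted to run 6 miles but had to stop after 4 to get the kids at school.",
  "Ran almost 4 on the beach."
];

var counted = 0;
var missed = 0;
examples.forEach(function (text) {
  var m = text.match(pattern);
  if (m) {
    counted += parseFloat(m[1]);
    console.log("counted " + m[1] + ": " + text);
  } else {
    missed++;
    console.log("missed: " + text);
  }
});
console.log(counted + " miles counted, " + missed + " posts missed");
// 12 miles counted, 3 posts missed
```

Only the first post is counted correctly. The second, third, and sixth are missed entirely, the fourth is credited with 2 miles instead of 4, and the fifth with 6 instead of 4.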

So, at this point I have all the information I set out to get: the complete list of Facebook group posts and a guess at the number of miles that have been run. I could call it a day, but I’m kind of interested in the Facebook API now. I have also been brewing some ideas for a mobile app that would really leverage the Facebook group feed. I’ll cover them in Part 3.