Web Scraping in Javascript

Have you ever wanted to build a better version of Kayak? Or aggregate doctor information from various directories? Summarize websites? Pair stock prices with company news?

While these applications differ in what they do, they share one thing: the need for an automated way to collect information from the internet, i.e. a web crawler. Building one is hard. You need to allocate servers, develop software, and deal with performance, scaling, safeguards, etc. Not anymore.

 

Bobik (http://usebobik.com) is a cloud platform for scraping. Scraping. In the cloud. Imagine that. Without digressing into the usual benefits of using a cloud service, let me get straight to the gist of it. Bobik offers a REST API that lets you scrape the web in real time from any language.

I’ll give you a quick demonstration of what this service can do for you. In this example, we’ll gather information on all Italian restaurants in the trendy SOMA district of San Francisco. This example will be in Javascript, the common ground for most developers. The same logic can be easily reproduced in any other language.

At a glance, this is the result format we’ll be aiming to get (the “menu” object expands further into items and their prices):

Object
  address: "489 3rd St | Btwn Stillman & Bryant St"
  menu: Object
  menu_url: "http://sanfrancisco.menupages.com/restaurants/la-briciola/menu"
  name: "La Briciola"
  website: "http://sanfrancisco.menupages.com/restaurants/la-briciola/"

(FYI, La Briciola has great food and the owners are super hospitable!)

A short search on Google tells us that a site called MenuPages has good coverage of restaurants and, in particular, those in SF. Hence, we’ll rely on data from http://sanfrancisco.menupages.com/ (God bless them!).

To start, we need to create an account at http://usebobik.com. After getting through the usual email-password-terms-accept, we land here:

Ok, with the account set up, we can start scraping. The highlighted “Test API” button gives you an in-browser way to interact with the API (no coding needed). Let’s skip that for now and go straight to implementing our example. First, we’ll configure a few things. The minimum set of parameters you give to Bobik consists of the urls that you want to scrape and the queries that you want Bobik to run at every url.

Both can be pre-loaded so you don’t have to send them with every request and can instead refer to them by alias. In our case, the urls will be fairly dynamic and will depend on the search, but the queries will be static (since MenuPages did a great job presenting information in a standardized form). Thus, go to Manage Data and from there to Queries. Here, we’ll enter our queries, but not before doing some homework.

Open http://sanfrancisco.menupages.com/restaurants/all-areas/soma/italian/ and study it. The first thing to notice is that both the neighborhood and the cuisine are part of the url, which makes it easy to configure different lookups, should we decide to use a different area or cuisine type. Next, notice that restaurants are listed in a neat series of table rows. Also, each attribute has its own element (more or less). Ergo, let’s store the following 3 queries for restaurant names, websites, and addresses on Bobik under the alias “menupages”:


Similarly, we’ll need to find each restaurant’s menu given a restaurant. Menus are parked nearby — at http://sanfrancisco.menupages.com/restaurants/la-briciola/menu, where the restaurant is once again a configurable parameter. The menu queries will look like this (you can customize them further to extract menu sections, quantities, etc):

Well… splendid. Now it’s time to write a little bit of code (you didn’t think you would get away with pointing and clicking all the way, did you?). The code will be minimal: call Bobik and print the results. We could, of course, code directly against Bobik’s API, but we won’t. The guys at Bobik have just started offering an SDK which simplifies our life even further (before we know it, we’ll be coding by a mere act of dreaming, although not yet, so don’t relax too much). Without further ado, here goes the code.

The complete script is located at https://gist.github.com/2859254.

Instantiate Bobik client

// Bobik SDK is available at http://usebobik.com/sdk
var bobik = new Bobik("YOUR_AUTH_TOKEN");

Write the “business” logic. First, find the restaurants’ names, addresses and summary websites on MenuPages:

// Finds restaurant directory information (name, website, address, menu_url).
// Upon success, triggers find_menus().
function find_restaurants(neighborhood, cuisine) {
  console.log("Looking for " + cuisine + " restaurants in " + neighborhood + "...");
  var src_url = "http://sanfrancisco.menupages.com/restaurants/all-areas/" + neighborhood + "/" + cuisine;
  bobik.scrape({
      urls: [src_url],
      query_set:  "menupages"
    }, function (scraped_data) {
      if (!scraped_data) {
        console.log("Data is unavailable");
        return;
      }
      // Results are keyed by url; each entry maps a query name to an array of values.
      var listing = scraped_data[src_url];
      if (!listing || !listing['Name'] || listing['Name'].length == 0) {
        console.log("Did not find any restaurants");
        return;
      }
      var restaurants = group_restaurants(listing);
      console.log("Found " + restaurants.length + " restaurants");
      var print_as_they_become_available = true;
      if (print_as_they_become_available)
        find_menus_async(restaurants);
      else
        find_menus_sync(restaurants);
  })
}

// A helper function that takes a hash of restaurant names, addresses and websites,
// and turns them into an array of grouped restaurant attributes.
// Also, each restaurant is augmented with the menu url.
function group_restaurants(listing) {
  var names = listing['Name'];        // an array of names
  var addresses = listing['Address']; // an array of addresses
  var urls = listing['Url'];          // an array of urls (paths relative to the site root)
  var restaurants = [];
  for (var i = 0; i < names.length; i++) {
    var website = "http://sanfrancisco.menupages.com" + urls[i];
    // push this restaurant to the array of results
    restaurants.push({
      'name' : names[i],
      'address' : addresses[i],
      'website' : website,
      'menu_url' : website + "menu"
    });
  }
  return restaurants;
}
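
To see what the helper does to the data, here is a minimal, self-contained sketch of the same column-to-row regrouping. The sample values for La Briciola come from the result shown earlier; the name zip_columns is made up for this illustration, and the input shape (one array per query name, with site-relative urls) is inferred from the code above rather than from any Bobik documentation.

```javascript
// Turns {Name: [...], Address: [...], Url: [...]} into one object per restaurant.
function zip_columns(listing, base_url) {
  var rows = [];
  for (var i = 0; i < listing['Name'].length; i++) {
    var website = base_url + listing['Url'][i];
    rows.push({
      name: listing['Name'][i],
      address: listing['Address'][i],
      website: website,
      menu_url: website + "menu"
    });
  }
  return rows;
}

// Sample input with a single restaurant (values taken from the post).
var sample = {
  'Name': ["La Briciola"],
  'Address': ["489 3rd St | Btwn Stillman & Bryant St"],
  'Url': ["/restaurants/la-briciola/"]
};

var rows = zip_columns(sample, "http://sanfrancisco.menupages.com");
console.log(rows[0].menu_url);
// prints http://sanfrancisco.menupages.com/restaurants/la-briciola/menu
```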

Next, find the menu for each restaurant. First, here is a version where each restaurant is processed in parallel (this version is more user-friendly as it produces results as they are collected):

// Finds menus for all restaurants and adds those menus to the corresponding restaurant hashes.
// Upon completion, prints full restaurant information.
// This variant processes restaurants in parallel and prints them out as the information becomes available.
function find_menus_async(restaurants) {
  console.log("Looking for menus...");
  // forEach gives every callback its own restaurant and menu_url. With a plain
  // loop and var, all callbacks would share the values of the last iteration.
  restaurants.forEach(function (restaurant) {
    var menu_url = restaurant['menu_url'];
    bobik.scrape({
        urls: [menu_url], // send only one at a time (and don't wait for it to complete before sending the next)
        query_set:  "menu"
      }, function (scraped_data) {
        if (scraped_data)
          restaurant['menu'] = scraped_data[menu_url];
        console.log("Found restaurant:" + restaurant);
    });
  });
}
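
One thing to watch when firing requests from inside a loop: the callbacks run after the loop has finished, so a variable declared with var in the loop body is shared by all of them. Here is a minimal sketch of the pitfall and one way around it. The functions fake_scrape and flush are hypothetical stand-ins for an asynchronous API and are not part of Bobik's SDK; callbacks are queued and then run by hand to imitate responses arriving later.

```javascript
// Hypothetical stand-in for an asynchronous scrape call: it queues the
// callback instead of running it, the way a network request would.
var pending = [];
function fake_scrape(url, callback) {
  pending.push(function () { callback("menu for " + url); });
}
// Runs all queued callbacks, simulating the responses arriving later.
function flush() {
  while (pending.length > 0) pending.shift()();
}

var urls = ["/a/menu", "/b/menu", "/c/menu"];

// Pitfall: `url` is function-scoped, so every callback sees its final value.
var shared = [];
for (var i = 0; i < urls.length; i++) {
  var url = urls[i];
  fake_scrape(url, function (data) { shared.push(url + " -> " + data); });
}
flush();

// Fix: forEach calls a function per element, so each callback gets its own url.
var separate = [];
urls.forEach(function (own_url) {
  fake_scrape(own_url, function (data) { separate.push(own_url + " -> " + data); });
});
flush();

console.log(shared);   // every entry starts with "/c/menu"
console.log(separate); // one correctly paired entry per url
```

This is why each callback in find_menus_async needs its own restaurant and menu_url.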

Here is the output that the asynchronous version produces:

Looking for italian restaurants in soma...
Current progress for job 4fcc3c46192f3c023d000020: 0%
Current progress for job 4fcc3c46192f3c023d000020: 0%
Current progress for job 4fcc3c46192f3c023d000020: 100%
Found 19 restaurants
Looking for menus...
Current progress for job 4fcc3c51192f3c023d000024: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c53192f3c023d00002c: 0%
Current progress for job 4fcc3c53192f3c023d000030: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c54192f3c023d000034: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c56192f3c023d00003c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c57192f3c023d000040: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c58192f3c023d000044: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c59192f3c023d000048: 0%
Current progress for job 4fcc3c5a192f3c023d00004c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5a192f3c023d000050: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5b192f3c023d000054: 0%
Current progress for job 4fcc3c5c192f3c023d000058: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5d192f3c023d00005c: 0%
Current progress for job 4fcc3c5f192f3c023d000064: 0%
Current progress for job 4fcc3c60192f3c023d000068: 0%
Current progress for job 4fcc3c61192f3c023d00006c: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c5e192f3c023d000060: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c53192f3c023d00002c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c59192f3c023d000048: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5b192f3c023d000054: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5d192f3c023d00005c: 0%
Current progress for job 4fcc3c5f192f3c023d000064: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c60192f3c023d000068: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c61192f3c023d00006c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c5d192f3c023d00005c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Found restaurant:[object Object]
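
The [object Object] entries in the log are what JavaScript produces when an object is concatenated with a string. If you would rather see the actual fields, JSON.stringify is one option. A small sketch, using the La Briciola record from the beginning of this post (the empty menu is a placeholder, not real scraped data):

```javascript
var restaurant = {
  name: "La Briciola",
  address: "489 3rd St | Btwn Stillman & Bryant St",
  menu: {} // placeholder; the real menu object is filled in by the scrape
};

// String concatenation calls toString(), which hides the contents.
var terse = "Found restaurant:" + restaurant;

// JSON.stringify serializes the fields; the third argument sets the indentation.
var readable = "Found restaurant: " + JSON.stringify(restaurant, null, 2);

console.log(terse);    // prints Found restaurant:[object Object]
console.log(readable); // prints the record as indented JSON
```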

Alternatively, we might be interested in displaying results only when all of them are ready. While this style of processing is unnecessary for the case discussed in this post, there are times when you need all results to be available before acting upon them. A good example is collecting prices and displaying the best one out of all available (you wouldn’t want to adjust it every time a new result arrives).

function find_menus_sync(restaurants) {
  console.log("Looking for menus...");
  // Assemble a list of menu urls and a {url -> restaurant} map.
  // We need this map to match results (since they will be bucketed by url)
  var menu_urls = [];
  var url_to_restaurant = {};
  for (var i = 0; i < restaurants.length; i++) {
    var restaurant = restaurants[i];
    var menu_url = restaurant['menu_url'];
    menu_urls.push(menu_url);
    url_to_restaurant[menu_url] = restaurant;
  }

  bobik.scrape({
      urls: menu_urls,
      query_set:  "menu"
    }, function (scraped_data) {
      for (var url in scraped_data)
        url_to_restaurant[url]['menu'] = scraped_data[url];
      console.log(restaurants);
  })
}

For comparison, here is the output of the synchronous version. Although the log itself is shorter, the actual lookup takes more time.

Looking for italian restaurants in soma...
Current progress for job 4fcc3d5c192f3c0b6f000004: 0%
Current progress for job 4fcc3d5c192f3c0b6f000004: 0%
Current progress for job 4fcc3d5c192f3c0b6f000004: 100%
Found 19 restaurants
Looking for menus...
Current progress for job 4fcc3d67192f3c0b6f000008: 0%
Current progress for job 4fcc3d67192f3c0b6f000008: 21.052631578947366%
Current progress for job 4fcc3d67192f3c0b6f000008: 100%
[Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
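
If you want to keep the requests parallel but still act only once everything has arrived, a counter of outstanding requests is a common pattern. Below is a minimal sketch of the best-price case mentioned earlier. The function fake_price_lookup, the store names and the prices are all invented for illustration and have nothing to do with Bobik's API; callbacks are queued and run by hand to imitate responses arriving later.

```javascript
// Hypothetical asynchronous lookup: queues the callback instead of calling it.
var pending = [];
var made_up_prices = { "store-a": 19.99, "store-b": 17.49, "store-c": 21.00 };
function fake_price_lookup(store, callback) {
  pending.push(function () { callback(made_up_prices[store]); });
}

// Starts one lookup per store and calls done(prices) once, after the last reply.
function lookup_all(stores, done) {
  var prices = {};
  var remaining = stores.length;
  if (remaining === 0) { done(prices); return; }
  stores.forEach(function (store) {
    fake_price_lookup(store, function (price) {
      prices[store] = price;
      remaining--;
      if (remaining === 0) done(prices);
    });
  });
}

var best = null;
lookup_all(["store-a", "store-b", "store-c"], function (prices) {
  for (var store in prices) {
    if (best === null || prices[store] < best.price)
      best = { store: store, price: prices[store] };
  }
});

// Simulate the responses arriving in reverse order.
while (pending.length > 0) pending.pop()();

console.log(best); // the cheapest store, computed once after all replies
```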

Lastly, kick off the process!

find_restaurants('soma', 'italian')

Bingo!
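
Since the neighborhood and the cuisine are plain path segments of the listing url, running a different search only takes different arguments. Here is a minimal sketch of the url construction on its own. The name listing_url is made up for this illustration, and encodeURIComponent is my own precaution rather than something MenuPages or Bobik is known to require.

```javascript
var BASE = "http://sanfrancisco.menupages.com/restaurants/all-areas/";

// Builds the listing url for a neighborhood and a cuisine.
// encodeURIComponent keeps stray characters (spaces, slashes) from
// changing the structure of the path.
function listing_url(neighborhood, cuisine) {
  return BASE + encodeURIComponent(neighborhood) + "/" + encodeURIComponent(cuisine) + "/";
}

console.log(listing_url("soma", "italian"));
// prints http://sanfrancisco.menupages.com/restaurants/all-areas/soma/italian/
```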

This is only one of the myriad things you can achieve with Bobik. While I used only XPath queries in this example, Bobik’s definition of “query” extends to anything that can be run or applied to the target url. Thus, among other things, you can use any Javascript you wish to implement more complex page mining algorithms. More on that in my next post. Stay tuned!

About Eugene

A software entrepreneur from San Francisco relentlessly pursuing the Holy Grail of computer science for the better quality of life for humanity

3 Responses to Web Scraping in Javascript

  1. Karrie Hird says:

    Very efficiently written article. It will be beneficial to anyone who utilizes it, as well as yours truly :). Keep doing what you are doing – for sure i will check out more posts.

  2. Liz says:

    Your first link to usebobik.com is broken, by the way!
