Web Scraping in Java for Android

Background

American healthcare is a mess. While pundits are bickering about who should pay, we think a good start would be simply for people to pay less. You shouldn't pay a single dollar more when you can get the same product for less. It would be cool to write an app that finds this elusive "better deal" for the user. In this post, we'll do exactly that and show you how easy it is to write such an app or, more generally, any kind of comparison shopping app. Hopefully, this will inspire you to create something bigger and "make the world a better place".

Our app will work like so. The user enters a product of interest (e.g. Advil, a popular American drug used against everything except taxes) in a search box and hits "search". The app will then go online to 7 pre-selected drugstores, find pricing and other information about the product, collect it, and present it to the user.

Throughout this post, we'll assume the app is for the Android platform, although it can be rewritten for the platform of your choice with little effort (in fact, you could write the app in HTML/Javascript over PhoneGap and thereby make it available for all mobile platforms at once). Selecting Android simply lets us write the example in a single well-understood language, i.e. Java. We will skip the basic wiring code for text input, displaying results, and activities (which Android's own tutorials explain much better anyway) and focus on the gist of the endeavor: how to get price information from an Android client.

Choosing the Technology

Server-side Scraping – too onerous

At first, it's tempting to provide a server backend; after all, apps like Kayak and Expedia work this way. Good for them, but now estimate how much time and money it will take you to

  • set up a server,
  • develop the code to collect the data, and
  • keep the server running and scale it as the app gains in popularity.

All of a sudden, having a server backend doesn’t sound like a fantastic idea anymore.

Embedded Scraping – too slow

All right, what if we scrape the target websites directly from the phone, relying on something like JSoup? The code would run purely on the phone, with no server needed, and the app would get its data directly over the wireless network. This sounds great on paper but falls apart in practice. Page downloads often spike to 10 seconds or more when you're in a busy downtown area. Parsing and filtering the page takes less time but still introduces a noticeable delay of around 1-2 seconds. Add multiple pages to the mix and you're basically looking at a 30+ second search. Hardly an appealing user experience. So, what can we do?

Bobik to the Rescue

An elegant solution to this problem is to employ Bobik, a web service for scraping. Bobik uses powerful machinery to perform the work in parallel, supports dynamic websites (including those generated via Ajax), and lets us interact with it through a REST API.

Submitting a scraping request to Bobik means that we

  • make only one HTTP request (to Bobik) and
  • don't download the data we won't use (i.e. 90% of the page).

This means we can interact with Bobik from any language and without worrying about how much local CPU, memory, or network is required, since all the computational work is carried out by Bobik.

With the technology chosen, it's time to get our hands dirty. First, let's figure out what our data sources will be. Among the many options, two of the country's largest drugstore chains are CVS and Walgreens. Let's start with these two and add a few others to the mix later; these will be the drugstores we monitor in our app. To scrape efficiently, let's see how these stores present information on their websites. Once we know that structure, we can write code for it in our main scraping procedure.

Preparing to Gather the Data

Determining Search Urls

If you go to cvs.com and enter a search term (e.g. Advil), you'll get the following url: http://www.cvs.com/search/_/N-3mZ2k?pt=product&searchTerm=advil. Often, such urls have long tails that account for pagination, tracking, and other purposes irrelevant to us. That is not the case here, so we can proceed with the url "as is". Here is the full list of search urls:

http://www.cvs.com/search/_/N-3mZ2k?pt=product&searchTerm=" + encodedKeyword,
http://www.myotcstore.com/store/Search.aspx?SearchTerms=" + encodedKeyword,
http://www.familymeds.com/search/search-results.aspx?SearchTerm=" + encodedKeyword,
http://www.canadadrugs.com/search.php?keyword=" + encodedKeyword,
http://thebestonlinepharmacy.net/product.php?prod=" + encodedKeyword,
http://www.walgreens.com/search/results.jsp?Ntt=" + encodedKeyword,
http://www.drugstore.com/search/search_results.asp?N=0&Ntx=mode%2Bmatchallpartial&Ntk=All&Ntt=" + encodedKeyword
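
In code, this boils down to URL-encoding the keyword and appending it to each base url. Here is a minimal sketch of the getSearchUrls() helper that the scraping code below relies on, assuming java.net.URLEncoder (plus java.io.UnsupportedEncodingException, java.util.Arrays, and java.util.List); the version in the full listing linked at the end of this section may differ:

private List<String> getSearchUrls(String drug) throws UnsupportedEncodingException {
    // Encode the keyword once and splice it into every store's search url
    String encodedKeyword = URLEncoder.encode(drug, "UTF-8");
    return Arrays.asList(
        "http://www.cvs.com/search/_/N-3mZ2k?pt=product&searchTerm=" + encodedKeyword,
        "http://www.myotcstore.com/store/Search.aspx?SearchTerms=" + encodedKeyword,
        "http://www.familymeds.com/search/search-results.aspx?SearchTerm=" + encodedKeyword,
        "http://www.canadadrugs.com/search.php?keyword=" + encodedKeyword,
        "http://thebestonlinepharmacy.net/product.php?prod=" + encodedKeyword,
        "http://www.walgreens.com/search/results.jsp?Ntt=" + encodedKeyword,
        "http://www.drugstore.com/search/search_results.asp?N=0&Ntx=mode%2Bmatchallpartial&Ntk=All&Ntt=" + encodedKeyword
    );
}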

Building Queries

The next thing to explore is how to get to elements like the product image, price, description, etc. Bobik supports 4 query types: Javascript, jQuery, XPath, and CSS. While all of these are viable options, XPath is the fastest, so we'll express our queries in XPath.

Open the page source with your browser's inspector and study the underlying page structure. There is no "one size fits all" recipe for deriving XPath. A good rule of thumb is to think in terms of "if I were a developer of this website, how often and on what occasion would I change the structure of this element?". Ask yourself this question often enough and you'll notice that some elements are less reliable (more prone to change) than others. Thus, in composing your XPath, try to be as generic as possible while anchoring around key elements that you don't expect to change much in the future; for example, anchoring on a product container's class (something generic like //div[contains(@class, 'product')]//a) tends to survive redesigns better than a full path from the document root. Once you have figured out which queries you need, you can double-check them at http://usebobik.com/api/test before you start coding.

Repeating this approach for all drugstores yields a complete list you can inspect for yourself at

https://gist.github.com/3035521.

Compiling such a list takes anywhere from a few minutes to a few hours, depending on how fluent you are with web programming. The good thing is that, once compiled, the list changes only when a destination website changes its design (which usually happens only once or twice a year).

Thinking Ahead

Now that we have all queries in one place, it would be really nice not to hard-code those verbose strings, in case one of them has to change once the app is in the wild. Bobik allows you to store queries for later referencing. This has several benefits:

  1. You get to assign an English name to each query, making it more readable and portable.
  2. You can refer to an entire set by a simple alias rather than having to deal with the full array of queries.
  3. You can ship your code without having to worry about updating queries: you edit the queries via the web interface (without changing the alias) while your app keeps pulling them by alias. Since some of these queries will likely change over time, being able to modify them without touching the app is invaluable.

So, let’s store each set of queries at http://usebobik.com/manage under “cvs”, “walgreens”, and other respective aliases (the list shown at https://gist.github.com/3035521 is already categorized accordingly). In doing so, we’ll also rename each query to one of the following:

  • Title – name of the product
  • Price – price
  • Image – product image
  • Link – the purchase link to the original website
  • Details – any other information relevant to the listing
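
For each url it scrapes, Bobik hands your callback the matches for these queries as parallel arrays keyed by the query names above (all titles in one array, all prices in another, and so on), bucketed by search url. With made-up values, the results object looks roughly like this:

{
  "http://www.cvs.com/search/_/N-3mZ2k?pt=product&searchTerm=advil": {
    "Title":   ["Advil Tablets 200mg, 100 ct", "..."],
    "Price":   ["$9.79", "..."],
    "Image":   ["/images/advil-100ct.jpg", "..."],
    "Link":    ["/product/advil-100ct", "..."],
    "Details": ["100 ct", "..."]
  },
  "... other search urls ...": { }
}

The code below transposes these parallel arrays into one JSON object per product before cleaning them up.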

Writing the Code

The code to assemble the table of results involves several steps:

  1. Submit a single request with 7 urls to scrape.
  2. Wait until the results are ready (about 5-10 seconds).
  3. Perform a few post-download transformations (extend urls, remove garbage, sort by price).
  4. Display (print) the results.

The code is relatively straightforward, though it appears long due to the various post-download cleanup of data. In writing it, we used the Bobik SDK.

The crux of it is shown below.

/**
 * Searches on the web for various buying options for a given drug
 *
 * @param drug
 * @return An array of hashes containing some or all of the following elements:
 *  Title - product title
 *  Image - product image
 *  Price - generally a X.XX number, although there can be something as ugly as "$6.99\r\n2/$11.00 or 1/$5.99\r\n \r\nSavings: $1.00 (14%) on 1"
 *  Details - size, weight, and any additional information that could not be categorized easily
 */
public List<JSONObject> findAllOptions(String drug) throws Exception {
    // First, find options in the raw form, then clean them up (transpose, normalize) and return
    JSONObject request = new JSONObject();
    for (String url : getSearchUrls(drug))
        request.accumulate("urls", url);
    for (String query_set : new String[]{"cvs", "MyOTCStore", "drugstore.com", "FamilyMeds", "walgreens", "CanadaDrugs", "thebestonlinepharmacy"})
        request.accumulate("query_sets", query_set);
    request.put("ignore_robots_txt", true);

    final List<JSONObject> results = new ArrayList<JSONObject>();
    Job job = bobik.scrape(request, new JobListenerImpl() {

        @Override
        public void onSuccess(JSONObject jsonObject) {
            // Aggregate results across all search urls
            Iterator search_urls = jsonObject.keys();
            while (search_urls.hasNext()) {
                String search_url = (String)search_urls.next();
                String url_base = getUrlBase(search_url);
                try {
                    JSONObject results_parallel_arrays_of_attributes = jsonObject.getJSONObject(search_url);
                    if (results_parallel_arrays_of_attributes.getJSONArray("Price").length() == 0)
                        continue;   // no priced results from this source
                    List<JSONObject> results_from_this_url = BobikHelper.transpose(results_parallel_arrays_of_attributes);
                    // Perform some remaining cleanup
                    for (JSONObject r : results_from_this_url) {
                        // 1. Make urls absolute
                        for (String link_key : new String[]{"Image", "Link"}) {
                            try {
                                r.put(link_key, url_base + r.get(link_key));
                            } catch (JSONException e) {
                                // skip this attribute if Image or Link is missing
                            }
                        }
                        // 2. Extract price
                        r.put("Price", cleanPrice(r.getString("Price")));
                    }
                    results.addAll(results_from_this_url);
                } catch (JSONException e) {
                    e.printStackTrace();
                    // continue to the next store if this search url is broken
                }
            }
        }
    });
    // Feel free to remove this call if you'd rather show results as they become available
    job.waitForCompletion();
    return results;
}
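
The listing also calls a cleanPrice() helper that isn't shown above. A minimal sketch, which simply pulls the first X.XX figure out of the raw string (the version in the full listing below may be more thorough), could be:

private String cleanPrice(String rawPrice) {
    // Grab the first number of the form X.XX; fall back to the raw string if none is found
    java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\d+\\.\\d{2}").matcher(rawPrice);
    return m.find() ? m.group() : rawPrice;
}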

You can view the full listing at https://gist.github.com/3041523.
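
Step 3 above also mentions sorting by price, which the excerpt doesn't show. Once every Price has been normalized to a plain number by cleanPrice(), one way to sort the aggregated results (a sketch, not necessarily what the full listing does; assumes java.util.Collections and java.util.Comparator) is:

Collections.sort(results, new Comparator<JSONObject>() {
    public int compare(JSONObject a, JSONObject b) {
        return Double.compare(priceOf(a), priceOf(b));
    }
    // Results whose price could not be parsed sort last
    private double priceOf(JSONObject r) {
        try {
            return Double.parseDouble(r.getString("Price"));
        } catch (Exception e) {
            return Double.MAX_VALUE;
        }
    }
});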

Conclusion

The goal of this post was to show how painless it can be to create data-rich apps without running a backend. Outsourcing your data acquisition to a service like Bobik saves you a lot of time: you can focus on creating the end-user product instead of burying your head in developing and tuning the data acquisition channel.


Web Scraping in Javascript

Have you ever wanted to build a better version of Kayak? Or how about aggregating doctor information from various directories? Summarizing websites? Stock prices coupled with company news?

While these applications differ in what they do, they share one thing in common: the need for an automated way to collect information from the internet, i.e. a web crawler. Building one is hard. You need to allocate servers, develop software, and deal with performance, scaling, safeguards, etc. Not anymore.


Bobik (http://usebobik.com) is a cloud platform for scraping. Scraping. In the cloud. Imagine that. Without digressing into the usual benefits of using a cloud service, let me get straight to the gist of it: Bobik offers a REST API that lets you scrape the web in real time from any language.

I’ll give you a quick demonstration of what this service can do for you. In this example, we’ll gather information on all Italian restaurants in the trendy SOMA district of San Francisco. This example will be in Javascript, the common ground for most developers. The same logic can be easily reproduced in any other language.

At a glance, this is the result format we’ll be aiming to get (the “menu” object expands further into items and their prices):

Object
  address: "489 3rd St | Btwn Stillman & Bryant St"
  menu: Object
  menu_url: "http://sanfrancisco.menupages.com/restaurants/la-briciola/menu"
  name: "La Briciola"
  website: "http://sanfrancisco.menupages.com/restaurants/la-briciola/"

(FYI, La Briciola has great food and the owners are super hospitable!)

A short search on Google tells us that a site called MenuPages has good coverage of restaurants and, in particular, of those in SF. Hence, we'll rely on data grabbed from http://sanfrancisco.menupages.com/ (God bless them!).

To start, we need to create an account at http://usebobik.com. After getting through the usual email-password-terms-accept, we land on the main dashboard.

Ok, with the account set up, we can start scraping. The "Test API" button gives you an in-browser way to interact with the API (no coding needed). Let's skip that for now and go straight to implementing our example. First, we'll configure a few things. The minimum set of parameters you give to Bobik consists of the urls that you want to scrape and the queries that you want Bobik to run at every url.

Both can be pre-loaded so that you don't have to send them with every request and can instead refer to them by alias. In our case, the urls will be fairly dynamic and will depend on the search, but the queries will be static (since MenuPages did a great job of presenting information in a standardized form). Thus, go to Manage Data and from there to Queries. Here we'll enter our queries, but not before doing some homework. Open http://sanfrancisco.menupages.com/restaurants/all-areas/soma/italian/ and study it. The first thing to notice is that both the neighborhood and cuisine parameters are part of the url, which makes it easy to configure different lookups should we decide to use a different area or cuisine type. Next, notice that restaurants are listed in a neat series of table rows, and each attribute has its own element (more or less). Ergo, let's store the following 3 queries for restaurant names, websites, and addresses on Bobik under the alias "menupages":
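
The exact XPath depends on MenuPages' markup at the time, so treat the expressions below as purely illustrative placeholders; what matters for the code that follows is that the three queries are named Name, Url, and Address:

  • Name – e.g. //table//tr/th/a/text() (hypothetical)
  • Url – e.g. //table//tr/th/a/@href (hypothetical)
  • Address – e.g. //table//tr/td[last()]/text() (hypothetical)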


Similarly, we'll need to find each restaurant's menu. Menus are parked nearby, at http://sanfrancisco.menupages.com/restaurants/la-briciola/menu, where the restaurant is once again a configurable part of the url. The menu queries are set up the same way, under the alias "menu" (you can customize them further to extract menu sections, quantities, etc.).

Well… splendid. Now it's time to write a little bit of code (you didn't think you would get away with pointing and clicking all the way, did you?). The code will be minimal: call Bobik and print the results. We could, of course, code directly against Bobik's API, but we won't. The guys at Bobik have just started offering an SDK which simplifies our life even further (before we know it, we'll be coding by a mere act of dreaming, although not yet, so don't relax too much). Without further ado, here goes the code.

The complete script is located at https://gist.github.com/2859254.

Instantiate the Bobik client:

// Bobik SDK is available at http://usebobik.com/sdk
var bobik = new Bobik("YOUR_AUTH_TOKEN");

Write the “business” logic. First, find the restaurants’ names, addresses and summary websites on MenuPages:

// Finds restaurant directory information (name, website, address, menu_url).
// Upon success, triggers find_menus().
function find_restaurants(neighborhood, cuisine) {
  console.log("Looking for " + cuisine + " restaurants in " + neighborhood + "...");
  var src_url = "http://sanfrancisco.menupages.com/restaurants/all-areas/" + neighborhood + "/" + cuisine;
  bobik.scrape({
      urls: [src_url],
      query_set:  "menupages"
    }, function (scraped_data) {
      if (!scraped_data) {
        console.log("Data is unavailable");
        return;
      }
      var restaurants = scraped_data[src_url]
      if (!restaurants || restaurants.length == 0) {
        console.log("Did not find any restaurants");
        return;
      }
      restaurants = group_restaurants(restaurants);
      console.log("Found " + restaurants.length + " restaurants");
      var print_as_they_become_available = true;
      if (print_as_they_become_available)
        find_menus_async(restaurants);
      else
        find_menus_sync(restaurants);
  })
}

// A helper function that takes a hash of restaurant names, addresses and websites,
// and turns them into an array of grouped restaurant attributes.
// Also, each restaurant is augmented with the menu url.
function group_restaurants(restaurants) {
  var names = restaurants['Name'];        // an array of names
  var addresses = restaurants['Address']; // an array of addresses
  var urls = restaurants['Url'];          // an array of urls
  var grouped = [];
  for (var i=0; i<names.length; i++) {
    var website = "http://sanfrancisco.menupages.com" + urls[i];
    // push this restaurant to the array of results
    grouped.push({
      'name' : names[i],
      'address' : addresses[i],
      'website' : website,
      'menu_url' : website + "menu"
    })
  }
  return grouped;
}

Next, find the menu for each restaurant. First, here is a version where each restaurant is processed in parallel (this version is more user-friendly, as it produces results as they are collected).

// Finds menus for all restaurants and adds those menus to the corresponding restaurant hashes.
// Upon completion, prints full restaurant information.
// This variant processes restaurants in parallel and prints them out as the information becomes available.
function find_menus_async(restaurants) {
  console.log("Looking for menus...");
  for (var x in restaurants) {
    // Wrap each iteration in its own closure so the callback below captures this particular
    // restaurant and menu_url (otherwise every callback would see the last loop values by
    // the time its results arrive).
    (function (restaurant) {
      var menu_url = restaurant['menu_url'];
      bobik.scrape({
          urls: [menu_url], // send only one at a time (and don't wait for it to complete before sending the next)
          query_set:  "menu"
        }, function (scraped_data) {
          restaurant['menu'] = scraped_data[menu_url];
          console.log("Found restaurant:" + restaurant);
      })
    })(restaurants[x]);
  }
}

Here is the output that the asynchronous version produces:

Looking for italian restaurants in soma...
Current progress for job 4fcc3c46192f3c023d000020: 0%
Current progress for job 4fcc3c46192f3c023d000020: 0%
Current progress for job 4fcc3c46192f3c023d000020: 100%
Found 19 restaurants
Looking for menus...
Current progress for job 4fcc3c51192f3c023d000024: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c53192f3c023d00002c: 0%
Current progress for job 4fcc3c53192f3c023d000030: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c54192f3c023d000034: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c56192f3c023d00003c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c57192f3c023d000040: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c58192f3c023d000044: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c59192f3c023d000048: 0%
Current progress for job 4fcc3c5a192f3c023d00004c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5a192f3c023d000050: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5b192f3c023d000054: 0%
Current progress for job 4fcc3c5c192f3c023d000058: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5d192f3c023d00005c: 0%
Current progress for job 4fcc3c5f192f3c023d000064: 0%
Current progress for job 4fcc3c60192f3c023d000068: 0%
Current progress for job 4fcc3c61192f3c023d00006c: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c5e192f3c023d000060: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c53192f3c023d00002c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c59192f3c023d000048: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5b192f3c023d000054: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c5d192f3c023d00005c: 0%
Current progress for job 4fcc3c5f192f3c023d000064: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c60192f3c023d000068: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c61192f3c023d00006c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c5d192f3c023d00005c: 100%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Current progress for job 4fcc3c52192f3c023d000028: 0%
Current progress for job 4fcc3c55192f3c023d000038: 0%
Found restaurant:[object Object]
Current progress for job 4fcc3c52192f3c023d000028: 0%
Found restaurant:[object Object]

Alternatively, we might be interested in displaying results only when all of them are ready. While this style of processing is unnecessary for the case discussed in this post, there are times when you need all results to be available before acting on them. A good example is collecting prices: if you want to display the best price out of all available options, you wouldn't want to adjust it every time a new result arrives.

function find_menus_sync(restaurants) {
  console.log("Looking for menus...");
  // Assemble a list of menu urls and a {url -> restaurant} map.
  // We need this map to match results (since they will be bucketed by url)
  var menu_urls = new Array();
  var url_to_restaurant = {};
  for (var x in restaurants) {
    var restaurant = restaurants[x];
    var menu_url = restaurant['menu_url'];
    menu_urls.push(menu_url);
    url_to_restaurant[menu_url] = restaurant;
  }

  bobik.scrape({
      urls: menu_urls,
      query_set:  "menu"
    }, function (scraped_data) {
      for (var url in scraped_data)
        url_to_restaurant[url]['menu'] = scraped_data[url];
      console.log(restaurants);
  })
}

For comparison, here is the output of the synchronous version. Although the log itself is shorter, the actual lookup takes more time.

Looking for italian restaurants in soma...
Current progress for job 4fcc3d5c192f3c0b6f000004: 0%
Current progress for job 4fcc3d5c192f3c0b6f000004: 0%
Current progress for job 4fcc3d5c192f3c0b6f000004: 100%
Found 19 restaurants
Looking for menus...
Current progress for job 4fcc3d67192f3c0b6f000008: 0%
Current progress for job 4fcc3d67192f3c0b6f000008: 21.052631578947366%
Current progress for job 4fcc3d67192f3c0b6f000008: 100%
[Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]

Lastly, kick off the process!

find_restaurants('soma', 'italian')

Bingo!

This is only one of a myriad of things you can achieve with Bobik. While I used only XPath queries in this example, Bobik's definition of a "query" extends to anything that can be run against or applied to the target url. Thus, among other things, you can use any Javascript you wish to implement more complex page-mining algorithms. More on that in my next post. Stay tuned!


Enough with crawling!

How often do you hear of web crawling these days? Quite a lot, I bet. In a world where a gazillion pieces of information are organized into websites, crawling is an essential mechanism for acquiring that data. Crawling allows you to set a starting point and let machines explore outgoing links for further information while collecting useful data along the way. This is comparable to the early days of exploration: you'd send messengers and scouts in all directions and have them come back and report what lands they encountered, what strange people they met, and what exotic goods fascinated their minds. This was a fairly effective way of exploring because you knew your scouts would return when winter approached, or when they hit impassable terrain, or they would be killed along the way. In other words, there were always factors limiting the exploration process to a trackable, finite period of time.

A similar approach inspired crawling. Just like their human counterparts went from one village to another, web crawlers hop from one url to another in search of the information they have been tasked to collect. Yet one important premise was seemingly ignored: the finite size of the target domain. It is one thing to "unleash" your scouts knowing that they'll come back; it is a whole different beast to let them explore an ever-expanding universe that has no limits. What happens now is that crawlers end up hitting tons of web addresses without much trust in the information they gather, getting blocked or misled by target websites, and overall committing more time to exploring than to bringing the data back. To continue our scout metaphor, imagine those scouts settling in cities to "learn about the local culture", or coming to a fork in the road leading to two distant cities, where committing to one road puts the other city out of reach.


A shrewd tribe leader would not just send one scout and hope that he would gather all the information alone. The leader would likely give the scout additional resources to hire more scouts, or "fork", to use the computer science term. That could mitigate the issue of not having enough agents to cover the information domain, but it also introduced new problems: scouts could kill each other, or be killed, over money (in the software domain, think fighting over who uses the CPU), or they could take the money and wander off (think runaway processes), or they could end up duplicating each other's work, and so on and so forth.

Basically, the process would invariably become inefficient. In the human domain this was hard to study formally, as the time spans involved were proportional to human lifespans. In the computer world, on the other hand, these mistakes show up much earlier, simply because crawling programs hit their inevitable defects quite quickly. Alternatively, if they escape the most obvious issues, they end up accumulating a lot of information that the author does not need or cannot store. In other words, whenever an issue is hit, an overwhelming amount of time or resources is wasted. The software engineer has to go through many iterations to get the data they want, and the project gets cut sooner or later, when the finance people realize they are spending far more money than the information the company gets back is worth.

But, ¡no pasarán! A more sensible approach is readily available. It's called scraping. Instead of sending your crawlers "don't know where" to bring back "god knows what", you can increase the quality coefficient (scientifically defined as the usefulness of gathered information per unit of work) by doing some homework before the data collection commences. Rather than crawling everything that can be reached, pick the data sources you are interested in. Pick a finite set of sites or, maybe, a directory of sites. In both cases you know your domain well enough to allocate resources appropriately. You know the kind of queries you should issue (because those websites were put together into a directory by another human who spotted some similarity in their content, right?), so you can design your crawlers with a very specific goal in mind, allocate a predictable pool of resources, and plan for getting a certain type of data.

One could argue, of course, that a smartly designed crawler does exactly that. I will readily agree. I do not honestly believe that a shrewd software developer will build a "crawl everything" program (although those are the crawlers everyone wants, if you listen to business people). However, I quite often find that people and their organizations are interested in collecting all the data they can before they even know what to do with it. While there may be extra business in owning and selling that data (I know of one guy who collected domain ownership for all urls before that information became optionally private; that guy has a very real monopoly on this data now), for most people this is nothing but a distraction from the main focus of their business. Instead of being "greedy", they should spend some time upfront determining what to gather and from which data sources, and then make it a pure implementation task to gather that information, as opposed to dragging the "research" part of the task well into implementation.
