How often do you hear of web crawling these days? Quite a lot, I bet. In a world where a gazillion pieces of information are organized into websites, crawling is an essential mechanism for acquiring that data. Crawling allows you to set a starting point and let machines explore outgoing links for further information while collecting useful data along the way. This is comparable to the early days of exploration: you’d send messengers and scouts in all directions and have them come back and report what lands they encountered, what strange people they met and what exotic goods fascinated their minds. This was a fairly effective way of exploring, as you knew your scouts would stop when winter approached, when they hit impassable terrain, or when they got killed along the way. In other words, there were always some factors limiting the exploration process to a trackable, finite period of time.
A similar approach inspired crawling. Just as their human counterparts went from one village to another, web crawlers hop from one URL to another in search of the information they’ve been tasked to collect. Yet one important premise was seemingly ignored: the finite size of the target domain. It is one thing to “unleash” your scouts knowing that they’ll come back, but it is a whole different beast to let them explore an ever-expanding universe which has no limits. What happens now is that crawlers end up hitting tons of web addresses without much trust in the information they gather, often being blocked or misled by target websites, and overall committing more time to exploring than to bringing the data back. To continue our scout metaphor, imagine those scouts settling in cities to “learn about the local culture”, or coming to a fork in the road that leads to two distant cities, where committing to one makes the other unreachable.
A shrewd tribe leader would not just send one scout and hope that he would gather all the information alone. The leader would likely give the scout additional resources to hire more scouts, or to “fork”, to use the computer science term. That could mitigate the issue of not having enough agents to cover the information domain, but it would also introduce new problems: scouts could kill each other, or be killed, over the money (in the software domain, think of processes fighting over the CPU), or they could take the money and wander off (think runaway processes), or they could end up duplicating each other’s work, and so on and so forth.
Basically, the process would invariably become inefficient. In the people domain this was hard to study formally, as the time spans involved were proportional to human lifespans. In the computer world, on the other hand, these mistakes often show up much earlier, simply because crawling programs hit their inevitable defects quite quickly. Alternatively, if they escape the most obvious issues, they end up accumulating a lot of information that the author does not need or cannot store. In other words, whenever an issue is hit, an overwhelming amount of time or resources has already been wasted. The software engineer has to go through many iterations to get the data they want, and the project gets cut sooner or later when the finance people realize they are spending a lot more money than the information the company gets back is worth.
But, ¡no pasarán! A more sensible approach is readily available. It’s called scraping. Instead of sending your crawlers “don’t know where” to bring “god knows what”, you can increase the quality coefficient (scientifically defined as usefulness of gathered information per unit of work) by doing some homework before the data collection commences. Rather than crawling everything that can be reached, pick the data sources that you are interested in. Pick a finite set of sites or, maybe, a directory of sites. In both cases you know your domain well enough to allocate resources appropriately. You know the kind of queries that you should issue (because those websites were put together into a directory by another human who spotted some similarity in their content, right?), so you can design your crawlers with a very specific goal in mind, allocate a predictable pool of resources and plan for getting a certain type of data.
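To make the "finite set of sites, predictable pool of resources" idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names `scrape`, `allowed_hosts`, `max_pages` and `fetch_page` are hypothetical, and a real scraper would also honor robots.txt, rate limits and site-specific parsing. The point is that the host allowlist caps where the program may go, the fetch budget caps how much work it may do, and the `seen` set keeps it from repeating its own work.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Default fetcher (an assumption): plain HTTP GET decoded as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(seeds, allowed_hosts, max_pages, fetch=fetch_page):
    """Visit pages breadth-first, only on allowed hosts, within a fetch budget.

    Returns a dict mapping each successfully fetched URL to its HTML.
    """
    queue = deque(seeds)
    seen = set(seeds)          # URLs already queued, so nothing is fetched twice
    pages = {}
    attempts = 0               # every fetch attempt counts against the budget
    while queue and attempts < max_pages:
        url = queue.popleft()
        if urlparse(url).hostname not in allowed_hosts:
            continue           # a seed outside the chosen domain is ignored
        attempts += 1
        try:
            html = fetch(url)
        except OSError:        # urllib's URLError is a subclass of OSError
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href).split("#")[0]   # absolute URL, no fragment
            if link not in seen and urlparse(link).hostname in allowed_hosts:
                seen.add(link)
                queue.append(link)
    return pages
```

A call such as `scrape(["https://example.com/"], {"example.com"}, max_pages=100)` (with `example.com` standing in for whatever sources you picked) can never touch more than 100 URLs and never leaves the chosen host, which is exactly the kind of upfront bound the scouts never had. Passing your own `fetch` function makes it easy to try the logic against canned pages before pointing it at a real site.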
One could argue, of course, that a smartly designed crawler does exactly that. I will readily agree. I do not honestly believe that a shrewd software developer will build a “crawl everything” program (although those are the crawlers everyone wants, if you listen to business people). However, I do find quite often that people and their organizations are interested in collecting all the data they can before they even know what to do with it. While there may be extra business in owning and selling that data (I know of one guy who collected domain ownership for all URLs before that information became optionally private, and that guy has a very real monopoly on this data now), for most people this is nothing other than a distraction from the main focus of their business. Instead of being “greedy”, they should spend some time upfront determining what they gather and from which data sources, and then make gathering that information a pure implementation task, as opposed to dragging the “research” part of the task well into implementation.