October 21, 2005

The Hidden Web, and how to crawl it

Fact: general-purpose search engines, such as Google and Yahoo!, only index a small fraction of the information on the real web, about 1/500th.
Think of it this way: Google, considered by most people in the know to have the largest search database, has about eight billion pages in its index. Those eight billion pages seem like a lot until you consider that the Deep Web is estimated to be 500 times bigger than the searchable Web. Multiply 500 by the 8 billion in Google’s index… plus add in the fact that Google is only indexing a fraction of the searchable Web (around 250 billion pages are on the Web today)… and you’ll get a whole bunch of math that makes my head hurt. Suffice it to say that the Deep Web is worth looking into. [LifeHacker, citing About/WebSearch, which report some figures from BrightPlanet, a company specializing in KM solutions]
However, I don't agree with the common opinion that the hidden web is unreachable by search engines because it is voluntarily obscured, or protected by passwords or by the Robots Exclusion Standard.
The main reason is that general-purpose web crawlers (those used by mainstream search engines) can only read web pages as raw data and then follow the links found in the text. This behaviour leads to two main practical barriers:
  • Links generated by executable content: many links, especially the "navigational" links that are often the only gateway to content pages, are not explicitly written in the text, but are generated on the client at runtime, usually by a script (or a Flash or Java applet), often triggered by a specific action of the human user.
    While embedding a full client-engine simulator in the crawler would be feasible, although impractical (since executing arbitrary scripts just to see what comes out can be very resource-consuming, especially if you are examining billions of pages), the idea of a crawler able to "guess" the possible interactions of a user with the interactive features of the page is totally out of reach.
  • Links generated from user-provided input: when links are created dynamically, using information provided by a user who fills in an open-ended form, or issues a query using an open-ended search box, there is no way for a general-purpose crawler to enumerate all the possible alternatives. Some online databases, such as Wikipedia and the Internet Movie Database, solve this problem by providing static links for all the entries (such as Wikipedia's AllPages special page). In other cases, especially when dealing with content that is dynamically generated or changes very often (such as financial news services), the very approach of "indexing" content to enable search is inadequate.
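To make the two barriers concrete, here is a minimal sketch of what a crawler that reads pages as raw data can and cannot see. It uses only Python's standard html.parser module; the sample page, its URLs and the names StaticLinkExtractor and extract_static_links are made up for illustration and do not describe any real crawler.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Collects the href of every <a> tag written literally in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_static_links(html):
    parser = StaticLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one static link, one script-driven "link",
# and one open-ended search form.
SAMPLE_PAGE = """
<html><body>
<a href="/about.html">About</a>
<span onclick="window.location='/catalog/page2.html'">Next page</span>
<form action="/search" method="get"><input name="q"></form>
</body></html>
"""

print(extract_static_links(SAMPLE_PAGE))
```

Only /about.html is found. The catalog page is reachable only by simulating a click, and the pages behind /search depend on values of q that the crawler has no way to enumerate.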
During the past three years, I have spent some time developing (as a contractor) a domain-specific programming language for writing context-specific crawlers, and it occurred to me how difficult and complex the conceptually simple act of programmatically accessing a website has become. The problems described above can be solved locally (i.e. for each specific site) by modelling the site-specific patterns of interaction of a typical user, and replicating them using a scriptable client which generates a sequence of synthetic events.
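The idea of replaying a user's interaction as a sequence of synthetic events can be sketched roughly as follows. This is not the domain-specific language mentioned above, only a toy illustration: FakeSite is an in-memory stand-in for a real site, and the action names (fill, click, read) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSite:
    """Hypothetical site whose content is reachable only through a search form."""
    records: dict
    form_values: dict = field(default_factory=dict)
    page: str = "home"

    def fill(self, field_name, value):
        # Synthetic "typing" event: remember the value of a form field.
        self.form_values[field_name] = value

    def click(self, target):
        # Synthetic "click" event: submitting the search loads a result page.
        if target == "search":
            query = self.form_values.get("q", "")
            self.page = self.records.get(query, "no results")


def replay(site, script):
    """Run a site-specific script of events and collect the pages it reads."""
    collected = []
    for action, *args in script:
        if action == "fill":
            site.fill(*args)
        elif action == "click":
            site.click(*args)
        elif action == "read":
            collected.append(site.page)
        else:
            raise ValueError(f"unknown action: {action}")
    return collected


site = FakeSite(records={"acme": "Acme Corp quote page"})
script = [("fill", "q", "acme"), ("click", "search"), ("read",)]
print(replay(site, script))
```

The script is the site-specific part: it encodes what a typical user would do, while the replay loop stays the same for every site.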

And what is the long-term, global solution to these problems, if building indexes is less and less a viable approach?
The answer is always the same: migrating from a web of human-targeted documents and interactive applications to a web of (semantically annotated) data and (semantically annotated) web services. That would enable a novel approach to search based on a federation of context-aware information providers, coordinated by search engines that analyze user queries and route/delegate them to more specific web services.
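As a rough sketch of the routing idea, assume each information provider declares the topics it can answer, and the search engine simply delegates a query to every provider whose topics match. The Provider class, the route function and the keyword matching are all hypothetical simplifications; a real system would rely on shared semantic annotations rather than on word overlap.

```python
class Provider:
    """Hypothetical context-aware information provider."""

    def __init__(self, name, topics, answer):
        self.name = name
        self.topics = set(topics)
        self.answer = answer  # callable: query string -> list of results


def route(query, providers):
    """Delegate the query to every provider whose topics overlap its words."""
    words = set(query.lower().split())
    results = {}
    for provider in providers:
        if words & provider.topics:
            results[provider.name] = provider.answer(query)
    return results


movies = Provider("movies", {"film", "movie", "actor"},
                  lambda q: [f"movie database answer for {q!r}"])
finance = Provider("finance", {"stock", "quote"},
                   lambda q: [f"live market answer for {q!r}"])

print(route("latest stock quote", [movies, finance]))
```

Here the finance provider answers at query time, so nothing has to be indexed in advance, which is the point for fast-changing content.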
How, and when, this could become a reality across the industry is not clear right now: both the WS-* standards and the OWL/RDF specifications have so far failed to become widespread, probably because they appear too complex and not really useful; the Semantic Web is still a buzzword; the more economically sustainable "less is more" approach is still dominant (e.g. REST web services, XML-based languages, folksonomies).
However, the idea of programmatically accessible web sites is becoming more and more relevant: we have a lot of web services APIs, but no one cares to agree on the semantics, which is still hard-wired into the applications.

Also: Danny Ayers explores some alternatives to the Semantic Web.
