via Geeking with Greg by Greg Linden on 1/31/09

A googol of Googlers published a paper at VLDB 2008, "Google's Deep-Web Crawl" (PDF), that describes how Google pokes and prods at web forms to see if it can find things to submit in the form that yield interesting data from the underlying database.
An excerpt from the paper:
This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index.

Table 5 in the paper shows the effectiveness of the technique: they are able to retrieve a significant fraction of the records in small and normally hidden databases across the Web with 500 or fewer submissions to the form. The authors also say that "the impact on our search traffic is a significant validation of the value of Deep-Web content."
Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine's index.
We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box ... [and then] extract additional keywords from the resulting documents ... We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition.
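The iterative probing loop in that excerpt can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation: `fetch_results` and `extract_keywords` are hypothetical stand-ins (stubbed here with a toy corpus) for the real system's form submission over HTTP and keyword mining from result pages.

```python
def fetch_results(keyword):
    # Stub standing in for an HTTP form submission; pretend each
    # keyword's result page mentions a few related words.
    corpus = {
        "book": ["novel", "author"],
        "novel": ["fiction"],
        "author": [],
        "fiction": [],
    }
    return corpus.get(keyword, [])

def extract_keywords(results):
    # Stub standing in for mining candidate keywords out of result pages.
    return set(results)

def iterative_probe(seed_words, max_rounds=10):
    """Grow a candidate keyword set for a generic text box."""
    candidates = set(seed_words)
    frontier = set(seed_words)
    for _ in range(max_rounds):          # alternate stopping condition
        discovered = set()
        for word in frontier:
            discovered |= extract_keywords(fetch_results(word))
        new_words = discovered - candidates
        if not new_words:                # unable to extract further keywords
            break
        candidates |= new_words
        frontier = new_words
    return candidates
```

Starting from the seed set `["book"]`, the loop keeps probing only the newly discovered words each round until a round yields nothing new, which matches the stopping condition quoted above.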
A typed text box will produce reasonable result pages only with type-appropriate values. We use ... [sampling of] known values for popular types ... e.g. zip codes ... state abbreviations ... city ... date ... [and] price.
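The typed-text-box idea can be illustrated with a similarly hedged sketch: try a small sample of known values for each popular type and keep a type only if its probes all produce reasonable result pages. `submit_form` is a hypothetical stand-in (stubbed to behave like a zip-code search box) for a real submission-and-result check.

```python
# Small samples of known values for a few of the popular types the
# paper mentions (zip codes, state abbreviations, dates, prices).
SAMPLES = {
    "zip":   ["94043", "10011"],
    "state": ["CA", "NY"],
    "date":  ["2008-01-01", "2008-06-15"],
    "price": ["9.99", "150"],
}

def submit_form(value):
    # Stub: this particular form returns result pages only for
    # five-digit zip codes.
    return value.isdigit() and len(value) == 5

def guess_type(samples=SAMPLES):
    """Return the first type whose sampled values all yield results."""
    for type_name, values in samples.items():
        if all(submit_form(v) for v in values):
            return type_name
    return None
```

Against the stubbed zip-code form, only the zip samples succeed, so the box would be classified as a zip-code field; an untyped box would fall through to the generic keyword probing sketched earlier.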
Please see also my April 2008 post, "GoogleBot starts on the deep web".
While Google's model is certainly one way of approaching the problem, it fails to recognize that data in the Deep Web is contextually different from data on the surface web. Deep-Web data tends to be highly structured, and, more importantly, Google's current approach doesn't capture the contextual elements of that data. Ideally, when you do an author search in the deep web, you want only authors returned.