Introduction to Nutch, Part 2: Searching
by Tom White
02/16/2006
- Contents
- Running the Search Application
- Integrating Nutch Search
- Real-World Nutch Search
- Conclusion
- Resources
- Dedication
In
part one of this two-part series on Nutch, the
open-source Java search engine, we looked at how to crawl websites.
Recall that the Nutch crawler system produces three key data
structures:
- The WebDB, containing the web graph of pages and links.
- A set of segments, containing the raw data retrieved from the Web by the fetchers.
- The merged index, created by indexing and de-duplicating parsed data from the segments.
In this article, we turn to searching. The Nutch search
system uses the index and segments generated during the crawling
process to answer users' search queries. We shall see how to get
the Nutch search application up and running, and how to customize
and extend it for integration into an existing website. We'll also
look at how to re-crawl sites to keep your index up to date--a
requirement of all real-world search engines.
Running the Search Application
Without further ado, let's run a search using the results of the
crawl we did last time. Tomcat seems to be the most popular
servlet container for running Nutch, so let's assume you have it
installed (although there is some guidance
on the Nutch wiki for Resin).
The first step is to install the Nutch web app. There are some
reported problems with running Nutch (version 0.7.1) as a
non-root web app, so it is currently safest to install it as the
root web app. This is what the Nutch tutorial advises. If Tomcat's
web apps are in ~/tomcat/webapps/, then type the following in
the directory where you unpacked Nutch:
rm -rf ~/tomcat/webapps/ROOT*
cp nutch*.war ~/tomcat/webapps/ROOT.war
The second step is to ensure that the web app can find the index
and segments that we generated last time. Nutch looks for these in
the index and segments subdirectories of the directory defined in
the searcher.dir property. The default value for searcher.dir is
the current directory (.), which is where you started Tomcat.
While this may be convenient during development, often you don't
have so much control over the directory in which Tomcat starts up,
so you want to be explicit about where the index and segments may
be found. Recall from part one that Nutch's configuration files are
found in the conf subdirectory of the Nutch distribution.
For the web app, these files can be found in
WEB-INF/classes/. So we simply create a file called
nutch-site.xml in this directory (of the unpacked web app)
and set searcher.dir to be the crawl directory
containing the index and segments.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/Users/tom/Applications/nutch-0.7.1/crawl-tinysite</value>
  </property>
</nutch-conf>
After restarting Tomcat, enter the URL of the root web app in
your browser (in this example, I'm running Tomcat on port 80, but
the default is port 8080) and you should see the Nutch home page.
Do a search and you will get a page of search results like Figure
1.
Figure 1. Nutch search results for the query "animals"
The search results are displayed using the format used by all
mainstream search engines these days. The explain and
anchors links that are shown for each hit are unusual and
deserve further comment.
Score Explanation
Clicking the explain link for the page A hit brings up
the page shown in Figure 2. It shows some metadata for the page hit
(page A), and a score explanation. The score explanation is
a Lucene feature that shows all of the factors that contribute to the
calculated score for a particular hit. The formula for score
calculation is rather technical, so it is natural to ask why Nutch
promotes this explanation page when it is clearly unsuitable for the
average user.
Figure 2. Nutch's score explanation page for page A, matching the
query "animals"
One of Nutch's key selling points is its transparency. Its
ranking algorithms are open source, so anyone can see them. Nutch's
ability to "explain" its rankings online--via the explain
link--takes this one step further and allows an (expert) user to
see why one particular hit ranked above another for a given search.
In practice, this page is only really useful for diagnostic
purposes for people running a Nutch search engine, so there is no
need to expose it publicly, except perhaps for PR reasons.
Anchors
The anchors page (not illustrated here) provides a list
of the incoming anchor text for the pages that link to the page of
interest. In this case, the link to page A from page B had the
anchor text "A." Again, this is a feature for Nutch site
maintainers rather than the average user of the site.
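Both links are backed by NutchBean methods (the search API is covered in detail in the next section), so a site maintainer can pull the same diagnostics programmatically. Here is a minimal sketch, assuming the Nutch 0.7.x API and the bean, query, hit, and details variables from the SearchApp example later in this article; getExplanation() and getAnchors() are, as far as I can tell, the calls behind the explain and anchors pages, but check the API documentation for your release:
// Hedged sketch: the HTML score breakdown shown by the explain page
String explanation = bean.getExplanation(query, hit);
// The incoming anchor text shown by the anchors page
String[] anchors = bean.getAnchors(details);
for (int i = 0; i < anchors.length; i++) {
  System.out.println(anchors[i]);
}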
Integrating Nutch Search
While the Nutch web app is a great way to get started with
search, most projects using Nutch require the search function to be
more tightly integrated with their application. There are various
ways to achieve this, depending on the application. The two ways
we'll look at here are using the Nutch API and using the
OpenSearch API.
Using the Nutch API
If your application is written in Java, then it is worth
considering using Nutch's API directly to add a search capability.
Of course, the Nutch web app is written using the Nutch API, so you
may find it fruitful to use it as a starting point for your
application. If you take this approach, the files to take a look at
first are the JSPs in src/web/jsp in the Nutch
distribution.
To demonstrate Nutch's API, we'll write a minimal command-line
program to perform a search. We'll run the program using Nutch's
launcher, so for the search we did above, for the term "animals,"
we type:
bin/nutch org.tiling.nutch.intro.SearchApp animals
And the output is as follows.
'A' is for Alligator (http://keaton/tinysite/A.html)
<b> ... </b>Alligators' main prey are smaller <b>animals</b> that they can kill and<b> ... </b>
'C' is for Cow (http://keaton/tinysite/C.html)
<b> ... </b>leather and as draught <b>animals</b> (pulling carts, plows and<b> ... </b>
Here's the program that achieves this. To get it to run, the
compiled class is packaged in a .jar file, which is then placed in
Nutch's lib directory. See the Resources section to obtain the .jar file.
package org.tiling.nutch.intro;

import java.io.IOException;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class SearchApp {

  private static final int NUM_HITS = 10;

  public static void main(String[] args) throws IOException {

    if (args.length == 0) {
      String usage = "Usage: SearchApp query";
      System.err.println(usage);
      System.exit(-1);
    }

    // Opens the index read-only and reads the segment locations,
    // as configured by the searcher.dir property.
    NutchBean bean = new NutchBean();
    // Parse the command-line query string into a Nutch Query.
    Query query = Query.parse(args[0]);
    // Run the search, asking for the top NUM_HITS matches.
    Hits hits = bean.search(query, NUM_HITS);

    for (int i = 0; i < hits.getLength(); i++) {
      // A Hit holds only identifiers; fetch the stored fields.
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      String title = details.getValue("title");
      String url = details.getValue("url");
      // An HTML summary showing the query terms in context.
      String summary = bean.getSummary(details, query);

      System.out.print(title);
      System.out.print(" (");
      System.out.print(url);
      System.out.println(")");
      System.out.println("\t" + summary);
    }
  }
}
Although it's a short and simple program, Nutch is doing lots of
work for us, so we'll examine it in some detail. The central class
here is NutchBean--it orchestrates the search for us. Indeed, the
doc comment for NutchBean states that it provides
"One-stop shopping for search-related functionality."

Upon construction, the NutchBean object opens the
index it is searching against in read-only mode, and reads the set
of segment names and filesystem locations into memory. The index
and segments locations are configured in the same way as they were
for the web app: via the searcher.dir property.

Before we can perform the search, we parse the query string
given as the first parameter on the command line
(args[0]) into a Nutch Query object. The Query.parse()
method invokes Nutch's specialized
parser (org.apache.nutch.analysis.NutchAnalysis), which
is generated from a grammar using the JavaCC parser generator.
Although Nutch relies heavily on Lucene for its text indexing,
analysis, and searching capabilities, there are many places where
Nutch enhances or provides different implementations of core Lucene
functions. This is the case for Query, so be careful
not to confuse Lucene's org.apache.lucene.search.Query
with Nutch's org.apache.nutch.searcher.Query. The
types represent the same concept (a user's query), but they are not
type-compatible with one another.
With a Query object in hand, we can now ask the bean
to do the search for us. It does this by translating the Nutch Query
into an optimized Lucene Query,
then carrying out a regular Lucene search. Finally, a Nutch Hits
object is returned, which represents the top
matches for the query. This object only contains index and document
identifiers. To return useful information about each hit, we go
back to the bean to get a HitDetails object for each
hit we are interested in, which contains the data from the index.
We retrieve only the title and URL fields here, but there are more
fields available: the field names may be found using the
getField(int i) method of HitDetails.
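As an illustration, here is a hedged helper that could be dropped into the SearchApp class above to print every stored field for a hit. It assumes the Nutch 0.7.x API, where (to the best of my knowledge) HitDetails also exposes getLength() and getValue(int); verify against the API documentation for your release:
// Sketch: list all stored fields for a hit. Call from SearchApp's
// loop with the details object already fetched from the bean.
private static void dumpFields(HitDetails details) {
  for (int i = 0; i < details.getLength(); i++) {
    System.out.println("\t" + details.getField(i)
        + " = " + details.getValue(i));
  }
}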
The last piece of information that is displayed by the
application is a short HTML summary that shows the context of the
query terms in each matching document. The summary is constructed
by the bean's getSummary() method. The HitDetails
argument is used to find the segment and
document number for retrieving the document's parsed text, which is
then processed to find the first occurrence of any of the terms in
the Query argument. Note that the amount of context to
show in the summary--that is, the number of terms before and after
the matching query terms--and the maximum summary length are both
Nutch configuration properties
(searcher.summary.context and searcher.summary.length, respectively).
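To change either, you could override them in the same nutch-site.xml we created earlier. The values below are only illustrative; the defaults are defined in nutch-default.xml:
<property>
  <name>searcher.summary.context</name>
  <value>8</value>
</property>
<property>
  <name>searcher.summary.length</name>
  <value>40</value>
</property>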
That's the end of the example, but you may not be surprised to
learn that NutchBean provides access to more of the
data stored in the segments, such as cached content and fetch date.
Take a look at the API documentation for more details.
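As a taste, this hedged fragment (again assuming the Nutch 0.7.x NutchBean API and the bean and details variables from SearchApp; the method names should be checked against the API documentation for your release) pulls the cached content and fetch date for a hit:
// Raw cached content, as fetched from the web
byte[] cached = bean.getContent(details);
// When the page was fetched, as milliseconds since the epoch
long fetchDate = bean.getFetchDate(details);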
Using the OpenSearch API
OpenSearch is an
extension of RSS 2.0 for publishing search engine results, and was
developed by A9.com, the search engine
owned by Amazon.com. Nutch supports OpenSearch 1.0 out of the box.
The OpenSearch results for the search in Figure 1 can be accessed
by clicking on the RSS link in the bottom right-hand corner of the
page. This is the XML that is returned:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/"
xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/">
<channel>
<title>Nutch: animals</title>
<description>Nutch search results for query: animals</description>
<link>http://localhost/search.jsp?query=animals&amp;start=0&amp;hitsPerDup=2&amp;hitsPerPage=10</link>
<opensearch:totalResults>2</opensearch:totalResults>
<opensearch:startIndex>0</opensearch:startIndex>
<opensearch:itemsPerPage>10</opensearch:itemsPerPage>
<nutch:query>animals</nutch:query>
<item>
<title>'A' is for Alligator</title>
<description><b> ... </b>Alligators'
main prey are smaller <b>animals</b>
that they can kill and<b> ... </b></description>
<link>http://keaton/tinysite/A.html</link>
<nutch:site>keaton</nutch:site>
<nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=0</nutch:cache>
<nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=0&amp;query=animals</nutch:explain>
<nutch:docNo>0</nutch:docNo>
<nutch:segment>20051025121334</nutch:segment>
<nutch:digest>fb8b9f0792e449cda72a9670b4ce833a</nutch:digest>
<nutch:boost>1.3132616</nutch:boost>
</item>
<item>
<title>'C' is for Cow</title>
<description><b> ... </b>leather
and as draught <b>animals</b>
(pulling carts, plows and<b> ... </b></description>
<link>http://keaton/tinysite/C.html</link>
<nutch:site>keaton</nutch:site>
<nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=2</nutch:cache>
<nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=2&amp;query=animals</nutch:explain>
<nutch:docNo>1</nutch:docNo>
<nutch:segment>20051025121339</nutch:segment>
<nutch:digest>be7e0a5c7ad9d98dd3a518838afd5276</nutch:digest>
<nutch:boost>1.3132616</nutch:boost>
</item>
</channel>
</rss>
This document is an RSS 2.0 document, where each hit is
represented by an item element. Notice the two extra
namespaces, opensearch and nutch, which
allow search-specific data to be included in the RSS document. For
example, the opensearch:totalResults element tells you
the number of search results available (not just those returned in
this page). Nutch also defines its own extensions, allowing
consumers of this document to access page metadata or related
resources, such as the cached content of a page, via the URL in the
nutch:cache element.
Using OpenSearch to integrate Nutch is a great fit if your
front-end application is not written in Java. For example, you
could write a PHP front end to Nutch by writing a PHP search page
that calls the OpenSearch servlet and then parses the RSS response and
displays the results.
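The same approach works from any language with an HTTP client and an XML parser. Here is a rough sketch in Java using only the JDK; it assumes the OpenSearch servlet is mapped at /opensearch (as in the stock web.xml) and that Nutch is the root web app on localhost, so adjust the URL to your setup:
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchClient {
  public static void main(String[] args) throws Exception {
    // URL assumes a root deployment; adjust host, port, and path.
    URL url = new URL("http://localhost/opensearch?query=animals");
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(url.openStream());
    // Each hit is an RSS item; print its title and link.
    NodeList items = doc.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
      Element item = (Element) items.item(i);
      String title =
          item.getElementsByTagName("title").item(0).getTextContent();
      String link =
          item.getElementsByTagName("link").item(0).getTextContent();
      System.out.println(title + " (" + link + ")");
    }
  }
}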
Real-World Nutch Search
The examples we have looked at so far have been very simple in
order to demonstrate the concepts behind Nutch. In a real Nutch
setup, other considerations come into play. One of the most
frequently asked questions on the Nutch newsgroups concerns keeping
the index up to date. The rest of this article looks at how to
re-crawl pages to keep your search results fresh and relevant.
Re-Crawling
Unfortunately, re-crawling is not as simple as re-running the crawl
tool that we saw in part one. Recall that this
tool creates a pristine WebDB each time it is run, and starts
compiling lists of URLs to fetch from a small set of seed URLs. A
re-crawl starts with the WebDB structure from the previous crawl
and constructs the fetchlist from there. This is generally a good
idea, as most sites have a relatively static URL scheme. It is,
however, possible to filter out the transient portions of a site's
URL space that should not be crawled by editing the
conf/regex-urlfilter.txt configuration file. Don't be
confused by the similarity between conf/crawl-urlfilter.txt
and conf/regex-urlfilter.txt--while they both have the
same syntax, the former is used only by the crawl
tool, and the latter by all other tools.
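For example, the stock file already skips URLs containing characters that usually mark transient, session-style pages, and you could add patterns for your own site. The rules are applied top to bottom, and the first matching rule wins; the archive pattern below is purely illustrative:
# skip URLs containing certain characters, which usually indicate
# transient pages (session IDs, probable queries, and so on)
-[?*!@=]
# skip a date-stamped archive area (illustrative pattern)
-^http://keaton/tinysite/archive/.*
# accept anything else
+.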
The re-crawl amounts to running the generate/fetch/update cycle,
followed by index creation. To accomplish this, we employ the
lower-level Nutch tools to which the crawl
tool delegates. Here is a simple shell script to do it:
#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# ("bogus" is ignored, but needed due to a bug in the number of
# arguments expected)
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
To re-crawl the toy site we crawled in part one, we would run:
./recrawl crawl-tinysite 3
The script is practically identical to the crawl
tool except that it doesn't create a new WebDB or inject it with
seed URLs. Like crawl, the script takes an optional
second argument, depth, which controls the number of
iterations of the generate/fetch/update cycle to run (the default is
five). Here we have specified a depth of three. This allows us to
pick up new links that may have been created since the last
crawl.
The script supports a third argument, adddays, which is
useful for forcing pages to be retrieved even if they are not yet
due to be re-fetched. The page re-fetch interval in Nutch is
controlled by the configuration property
db.default.fetch.interval, and defaults to 30 days.
The adddays argument can be used to advance the clock for
fetchlist generation (but not for calculating the next fetch time),
thereby fetching pages early.
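For example, to force every page of the toy site to be due for re-fetching right away, pass an adddays value larger than the 30-day interval:
./recrawl crawl-tinysite 3 31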
Updating the Live Search Index
Even with the re-crawl script, we have a problem with updating
the live search index. As mentioned above, the NutchBean
class opens the index to search when it is
initialized. Since the Nutch web app caches the NutchBean
in the application servlet context, updates
to the index will never be picked up as long as the servlet
container is running.

This problem is recognized by the Nutch community, so it will
likely be fixed in an upcoming release (Nutch 0.7.1 was the stable
release at the time of writing). Until Nutch provides a way to do
it, you can work around the problem--possibly the simplest way is
to reload the Nutch web app after the re-crawl completes. More
sophisticated ways of solving the problem are
discussed on the newsgroups. These typically involve modifying
NutchBean and the search JSP to pick up changes to the
index.
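As a rough illustration of the simple workaround, you could append a reload step to the end of the recrawl script. This sketch assumes Tomcat's manager application is enabled and that Nutch is deployed as the root web app; the credentials are placeholders:
# Reload the root web app so a fresh NutchBean reopens the new index.
# (Hypothetical manager credentials; requires the Tomcat manager app.)
curl -u admin:secret "http://localhost:8080/manager/reload?path=/"
Note that reloading briefly interrupts search, so schedule the re-crawl and reload for a quiet period.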
Conclusion
In this two-article series, we introduced Nutch and discovered
how to crawl a small collection of websites and run a Nutch search
engine using the results of the crawl. We covered the basics of
Nutch, but there are many other aspects to explore, such as the
numerous plugins available
to customize your setup, the tools for maintaining the search index
(type bin/nutch
to get a list), or even whole-web
crawling and searching. Possibly the best thing about Nutch, though,
is its vibrant user
and developer
community, which is continually coming up with new ideas and ways
to do all things search-related.
Resources
- Download the code supporting this article.
- Part one of this series covers the Nutch crawler system. It also lists a number of useful resources.
Dedication
This article is for my elder daughter Emilia.
Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information.
Nutch plugin error
2007-10-19 05:07:43 telmo_friesen
After I added this following libraries:
nutch-0.7.2
lucene-core
lucene-misc
The next error occurred:
init:
deps-jar:
compile-single:
run-single:
071018 215310 10 parsing jar:file:/D:/nutch-0.7.2/nutch-0.7.2.zip!/nutch-default.xml
071018 215310 10 parsing jar:file:/D:/nutch-0.7.2/nutch-0.7.2.zip!/nutch-site.xml
071018 215310 10 opening merged index in D:\nutch-0.7.2\crawlteste\index
071018 215310 10 Plugins: directory not found: plugins
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.nutch.analysis.NutchAnalysis.compound(NutchAnalysis.java:262)
at org.apache.nutch.analysis.NutchAnalysis.parse(NutchAnalysis.java:115)
at org.apache.nutch.analysis.NutchAnalysis.parseQuery (NutchAnalysis.java:39)
at org.apache.nutch.searcher.Query.parse(Query.java:408)
at nutchaplication1.SearchApp.main(SearchApp.java:25)
Caused by: java.lang.RuntimeException: org.apache.nutch.searcher.QueryFilter not found.
at org.apache.nutch.searcher.QueryFilters.<clinit>(QueryFilters.java:47)
... 5 more
Java Result: 1
I have found the plug-in file in nutch directory, but where should I put it?
Can you help me? Thanks, Telmo.
compiling SearchApp
2007-07-23 21:45:20 kaimiddleton
Hi Tom: I tried to get the SearchApp to compile but I don't understand
what CLASSPATH is necessary. I'm using a very recent nightly build of
nutch. Even if I include every single jar file there is under the
NUTCH_HOME directory tree, plus NUTCH_HOME/src/java, I still get errors:
$ javac -cp $NUTCH_HOME/src/java:[all those jars] SearchApp
SearchApp.java:21: cannot find symbol
symbol : constructor NutchBean()
location: class org.apache.nutch.searcher.NutchBean
NutchBean bean = new NutchBean();
^
SearchApp.java:22: cannot find symbol
symbol : method parse(java.lang.String)
location: class org.apache.nutch.searcher.Query
Query query = Query.parse(args[0]);
^
SearchApp.java:32: incompatible types
found : org.apache.nutch.searcher.Summary
required: java.lang.String
bean.getSummary(details, query);
^
3 errors
I have a full stack trace posted here:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08869.html
Searchapp API example.
2006-08-28 21:10:50 spolanski
Hi Tom - could you provide an update to the Searchapp.java example you
have in your article for Nutch-0.8? There seem to be quite a few
differences that I can't seem to adapt to from 0.7.2 to 0.8 and your
example is very useful for me! Thanks, Sandy
Recrawl script version 0.8
2006-08-08 12:13:19 mholt
The script has been updated for Nutch 0.8 here:
http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
Version issue?
2006-07-01 05:47:53 avilex
Hello. I'm running the 0.8x version. I'm getting this:
root@mo:/home/nutch/nutch/trunk# bin/nutch org.tiling.nutch.intro.SearchApp test
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.nutch.searcher.NutchBean: method <init>()V not found
at org.tiling.nutch.intro.SearchApp.main(Unknown Source)
root@mo:/home/nutch/nutch/trunk#
Can you please tell me if this a version issue, or if there's something I need to do to get the NutchBean class to be found?
Several questions...
2006-06-02 09:02:52 ealex
Hello,
I have several questions to see how I will use Nutch solution:
1) Each time, we modify .conf files of Nutch, do we have to do
catalina/start for Tomcat, or can we modify Nutch code to avoid such
restart?
2) It seems that the Nutch crawler/searcher is dedicated to one
UserAgent. Imagine that we want to crawl with several user agent. With
the latest version it seems not to be possible
instead of changing a .conf file each time?
3) We want to have search by User Agent. It seems that there is no way to do that?
4) When getting the search result of the crawl (that means "animal" found in page "http://toto..."),
can we have the information of the first url that contains link http://toto?
5) In file crawl-urlfilter.txt, can we set other thing that a domain name: a complete url for example.
Thanks by advance.
About Making a class
2006-04-23 01:03:17 sap007
Hello
The article was really useful, i could not get any information, even on
nutch website, about integrating it in own application. But I don't
want to use it as a part of a jar file. I have kept it in my
web-inf/class directory. And when i run it it gives me an exception
"Plugin directory not found" i have the nutch jar file in lib folder.
please let me know how can i over come this problem. May be pasting
plug ins directory some where can solve the problem, i have tried
pasting it at many places but it doesn't work. I am using java Studio
Creator as my IDE.
Thnaks a lot
Index wasn't updated after runned recrawl script
2006-02-17 11:16:27 nutchnewbie
First of all, thank you for your instructional article.
I tried to update index using the script in your article, but I didn't get my index updated.
1. bin/nutch crawl urls -dir crawl-tinysite -depth 3
2. nutch search works fine.
3. I added a new link to tinysite/A.html
4. ./recrawl crawl.tinysite 3
after recrawl, the index wasn't updated. From the running output, it seemed that the fetcher didn't generate any new entries:
parsing file:nutch-0.7.1/conf/nutch-default.xml
parsing file:nutch-0.7.1/conf/nutch-site.xml
No FS indicated, using default:local
FetchListTool started
Overall processing: Sorted 0 entries in 0.0 seconds. <---
Overall processing: Sorted NaN entries/second <---
FetchListTool completed
Would you please tell me what caused the problem? Thank you.
Fan
Index wasn't updated after runned recrawl script
2006-02-18 10:08:28 tomwhite
Glad you enjoyed the article.
I think the problem you are experiencing is that A.html is not due to be re-fetched when you run recrawl so the fetcher doesn't generate any new entries.
The default re-fetch interval is 30 days. To test re-crawling without waiting this long you can use the adddays argument for the recrawl script, as mentioned in the article.
Once you have got the whole process working, you can arrange for recrawl to be run periodically (using cron, for example) without using the adddays argument.
Hope this helps.
Tom
Index wasn't updated after runned recrawl script
2006-07-05 12:24:29 juniorufpa
Hi Tom White!
I enjoyed your article a lot too. I'm learning Nutch with it.
I had the same problem when experienced this example. I created a page
with x links, that is my root url. After run the crawler I added 3
links to the root url and then I execute your script using the adddays
argument to be greater then the default re-fetch interval (in my case i
set it to 1) . However I don't get the result I was intend to. Could
you explain in more details how the adddays argument really works?
Another question. Could you suggest any addition to your script that
delete the old segments? After running it some times, many segments are
created and will grow a lot, wasting a lot of disk space.