Be Careful with RangeQuery in Lucene

Reminder: Lucene has many Query types

– TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.

Lucene ships with many Query implementations, which makes it very powerful for search. However, you should be careful when using queries such as RangeQuery, especially when your collection is very large.

As you may know, Lucene rewrites the original Query before running it, and some of the rewrite implementations can be inefficient. Let's look at a code snippet from RangeQuery first.

public RangeQuery(Term lowerTerm, Term upperTerm, boolean inclusive,
                  Collator collator)
{
    this(lowerTerm, upperTerm, inclusive);
    this.collator = collator;
}

public Query rewrite(IndexReader reader) throws IOException {
    BooleanQuery query = new BooleanQuery(true);
    String testField = getField();
    if (collator != null) {
        // With a collator, enumeration starts at the first term of the field.
        TermEnum enumerator = reader.terms(new Term(testField, ""));
        String lowerTermText = lowerTerm != null ? lowerTerm.text() : null;
        String upperTermText = upperTerm != null ? upperTerm.text() : null;
        try {
            do {
                Term term = enumerator.term();
                if (term != null && term.field() == testField) { // interned comparison
                    if ((lowerTermText == null
                            || (inclusive ? collator.compare(term.text(), lowerTermText) >= 0
                                          : collator.compare(term.text(), lowerTermText) > 0))
                        && (upperTermText == null
                            || (inclusive ? collator.compare(term.text(), upperTermText) <= 0
                                          : collator.compare(term.text(), upperTermText) < 0))) {
                        // Every term inside the range becomes one clause of the BooleanQuery.
                        addTermToQuery(term, query);
                    }
                }
            }
            while (enumerator.next());
        }
        finally {
            enumerator.close();
        }
    ……………
}

As we can see from the source code, a RangeQuery may be rewritten into thousands of TermQuery clauses, one for every term that falls inside the range. This makes the search inefficient, and it can even throw a TooManyClauses exception once the number of clauses exceeds the BooleanQuery limit (1024 by default). In addition, when a Collator is set, the rewrite method walks through every term of the field in the dictionary. This is another reason why RangeQuery can make the search operation slow.
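To make the expansion concrete, here is a minimal simulation in plain Java. It does not use Lucene at all: RangeRewriteSketch, rewrite, and dictionaryOf are hypothetical names, the dictionary is just a sorted set of strings, and an IllegalStateException stands in for Lucene's TooManyClauses. The only thing borrowed from Lucene is the default limit of 1024 clauses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified simulation of the rewrite step; none of these names are Lucene classes.
class RangeRewriteSketch {
    // Same value as the default BooleanQuery clause limit in Lucene.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Expand the inclusive range [lower, upper] into one clause per matching term.
    static List<String> rewrite(SortedSet<String> dictionary, String lower, String upper) {
        List<String> clauses = new ArrayList<>();
        for (String term : dictionary) { // visits every term, like the collator branch
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                if (clauses.size() >= MAX_CLAUSE_COUNT) {
                    throw new IllegalStateException(
                            "too many clauses, limit is " + MAX_CLAUSE_COUNT);
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    // Build a toy dictionary of zero-padded numbers: 000000, 000001, ...
    static SortedSet<String> dictionaryOf(int termCount) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < termCount; i++) {
            terms.add(String.format("%06d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = dictionaryOf(5000);
        // A narrow range stays under the limit.
        System.out.println(rewrite(dictionary, "000100", "000199").size());
        // A wide range needs more clauses than the limit allows.
        try {
            rewrite(dictionary, "000000", "004999");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The narrow range expands into 100 clauses, while the wide one would need 5000 and fails. The cost of the expansion depends on how many distinct terms fall inside the range, not on how many documents you actually want back.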

In contrast to RangeQuery, RangeFilter does this job faster. Although RangeFilter also traverses the term dictionary, it does not have to run an additional search over the expanded term clauses the way RangeQuery does.
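The filter approach can be sketched the same way, again in plain Java and without Lucene. RangeFilterSketch, rangeBits, and applyFilter are hypothetical names, and the postings map (term to document ids) is an assumption made for illustration. The idea is that the range is turned into a bit set once, and the hits of the main query are then intersected with it.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified simulation of a range filter; none of these names are Lucene classes.
class RangeFilterSketch {
    // Walk the terms in [lower, upper] and mark every matching document in a bit set.
    static BitSet rangeBits(SortedMap<String, int[]> postings,
                            String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            String term = entry.getKey();
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                for (int docId : entry.getValue()) {
                    bits.set(docId);
                }
            }
        }
        return bits;
    }

    // Keep only the hits of the main query that the filter allows.
    static BitSet applyFilter(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        // Toy index: a date term mapped to the ids of the documents containing it.
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("20080101", new int[] {0, 3});
        postings.put("20080215", new int[] {1});
        postings.put("20080320", new int[] {2, 4});

        BitSet filter = rangeBits(postings, "20080101", "20080229", 5);

        BitSet hits = new BitSet(5); // pretend the main query matched docs 1, 2 and 3
        hits.set(1);
        hits.set(2);
        hits.set(3);

        System.out.println(applyFilter(hits, filter)); // {1, 3}
    }
}
```

No clause list is built here, so there is no clause limit to hit, and the bit set can be reused across searches that share the same range.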

The RangeFilter implementation in Lucene does not consume much memory either. It keeps roughly one bit per document, which is only about 1.25 MB for a collection of 10M documents (10,000,000 bits / 8). For these reasons, I recommend using RangeFilter rather than RangeQuery.

In fact, ConstantScoreRangeQuery is a wrapper around RangeFilter that lets us run a range search as a query. ConstantScoreRangeQuery returns a constant score equal to its boost for every document in the range. It is a better choice than RangeQuery when we want to restrict the scope of the results rather than rank them partly by the RangeQuery score.
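The constant-score behaviour can be illustrated with a small sketch, again in plain Java rather than the Lucene API. ConstantScoreSketch and score are hypothetical names, and the bit set stands for the documents that a range filter let through.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified simulation of constant scoring; none of these names are Lucene classes.
class ConstantScoreSketch {
    // Every document inside the range gets the same score: the boost.
    static Map<Integer, Float> score(BitSet inRange, float boost) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        for (int doc = inRange.nextSetBit(0); doc >= 0; doc = inRange.nextSetBit(doc + 1)) {
            scores.put(doc, boost);
        }
        return scores;
    }

    public static void main(String[] args) {
        BitSet inRange = new BitSet();
        inRange.set(1);
        inRange.set(3);
        // Documents 1 and 3 both receive the boost as their score.
        System.out.println(score(inRange, 2.0f));
    }
}
```

Because every match gets the same score, the range only decides which documents are eligible. Any ranking has to come from the other clauses of the query.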

Note: FuzzyQuery, WildcardQuery, RangeQuery and PrefixQuery are implemented in much the same way, so be careful when using them as well.
