Thoughts on Lucene, Solr, Nutch and vertical search

Trie-based approximate autocomplete implementation with support for ranks and synonyms

Posted by Kelvin on 01 Jul 2009 at 02:30 am | Tagged as: programming

The problem of auto-completing user queries is a well-explored one.

For example,
Type less, find more: fast autocompletion search with a succinct index
http://stevedaskam.wordpress.com/2009/06/07/putting-autocomplete-data-structure-to-the-test/
http://suggesttree.sourceforge.net/
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html

However, there’s been little written about supporting synonyms and approximate matching for the purpose of autocompletion.

The approach for autocompletion I’ll be discussing in this article supports the following features:

- given a prefix, returns all words starting with that prefix
- approximate/fuzzy prefix matching based on k-errors / edit distance / Levenstein distance, with the operations: substitution, insert and delete (though adjacent transpositions would be trivial to add)
- support for ranks (i.e. a number associated with a word that affects ranking, like number of search results for a given phrase)
- support for synonyms

The search algorithm is independent of dictionary size, and dependent on k (edit distance) and length of prefix.

Data Structures

The data structure used is a trie. Words are broken into characters. Each character forms a node in a tree.

To support synonyms, a pointer is added to terminal nodes which points to the canonical word of the synonym ring.

To support ranks, a pointer is added to terminal nodes with the value of the rank. At search time, nodes are sorted first by edit distance, then by rank.

Implementation

At index-time, as mentioned above, a trie is built from input words/lexicon. No other preprocessing is done.

At search-time, dynamic programming (DP) is applied on a depth-first traversal of the trie, to collect all the “active sub-tries” of the trie which are less than k errors from the given prefix.

Where traditional DP uses a m x n matrix for its DP table where m and n are the 2 strings to be compared, we instantiate a single (prefix length + max k) x (prefix length) matrix for the entire trie, where prefix is the string for which we want to produce autocomplete results for.

Why (prefix length + max k) x (prefix length) ?

1. We don’t need to compare full strings because we’re only interested in collecting active sub-tries which satisfy the k-errors constraint.

2. For cases where the length of the word to be evaluated is greater than the length of the prefix, evaluations of the word should be performed up to prefix length + maximum k. This is to take into account the scenario where the first k characters of a word, when deleted, satisfy the edit distance constraint. e.g. prefix of hamm and word bahamm, with k=2.

Cutoff optimizations

Each level of the trie has a non-decreasing value of k (edit distance).

Therefore when k > maximum k, we can proceed to reject the entire sub-trie of that node. In other words, given the prefix hamm, and maximum k of 1, when we encounter hen which has k=2, we can discard any children of hen because their k-values will be >= 2.

Collecting autocomplete results

With a list of active sub-tries, we will then proceed to collect all terminal strings, sort them by edit distance and rank, and return the top n results.

Further explorations

1. Implementing adjacent transpositions
2. Implementing Ukkonen’s cutoff algorithm for DP (not to be confused with Ukkonen’s algorithm for constant-time creation of suffix trees)
3. Comparing performance of tries vs compact tries (where non-branching nodes are collapsed)

Trie-based approximate autocomplete implementation with support for ranks and synonyms

Thoughts on Lucene, Solr, Nutch and vertical search

Trie-based approximate autocomplete implementation with support for ranks and synonyms

Data Structures

Implementation

Cutoff optimizations

Collecting autocomplete results

Further explorations

0 Comments:

Popular Posts

IR、ML、NLP

Total Pageviews