Indri Structured Query Retrieval
1. Overview
The Indri structured query language brings structured queries to language modeling retrieval. Among other things, this query language enables the use of proximity operators (ordered and unordered windows) and field operators in a language modeling context.
The Indri structured query language can only be used with an Indri index (as built by IndriBuildIndex or BuildIndex with indexType=indri).
Pseudo-feedback is implemented via relevance model (RM1) expansion.
2. Applications
RetEval
See "Lemur Retrieval Applications" section.
IndriRunQuery
This application runs retrieval evaluation using the Indri structured query language with smoothing options. The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format; the command line uses dotted path notation. The top-level element in the parameter file is named parameters.
Retrieval Parameters:
- memory
- an integer value specifying the number of bytes to use for the query retrieval process. The value can include a scaling factor, given as a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
- index
- path to an Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line. This element can be specified multiple times to combine Repositories.
- server
- hostname of a host running an Indri server (Indrid). Specified as <server>hostname</server> in the parameter file and as -server=hostname on the command line. The hostname can include an optional port number to connect to, using the form hostname:portnum. This element can be specified multiple times to combine servers.
- count
- an integer value specifying the maximum number of results to return for a given query. Specified as <count>number</count> in the parameter file and as -count=number on the command line.
- rule
- specifies the smoothing rule (TermScoreFunction) to apply. Format of the rule is:
( key ":" value ) [ "," key ":" value ]*
Here's an example rule in command line format:
-rule=method:linear,collectionLambda:0.2,field:title
and in parameter file format:
<rule>method:linear,collectionLambda:0.2,field:title</rule>
This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.
If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.
Valid keys:
- method
- smoothing method (text)
- field
- field to apply this rule to
- operator
- type of item in query to apply to { term, window }
Valid methods:
- dirichlet
- (also 'd', 'dir') (default mu=2500)
- jelinek-mercer
- (also 'jm', 'linear') (default collectionLambda=0.4, documentLambda=0.0). collectionLambda can also be written as just "lambda"; either name works.
- twostage
- (also 'two-stage', 'two') (default mu=2500, lambda=0.4)
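The rule grammar, method aliases, and defaults listed above can be sketched as a small parser. This is only an illustration of the documented format, not Indri's own parsing code: the function name `parse_rule` and the returned dictionary layout are hypothetical, and the sketch assumes an explicit method key is always given.

```python
# Hypothetical helper that mirrors the documented rule format; not part of Indri.
METHOD_ALIASES = {
    "dirichlet": "dirichlet", "d": "dirichlet", "dir": "dirichlet",
    "jelinek-mercer": "jelinek-mercer", "jm": "jelinek-mercer",
    "linear": "jelinek-mercer",
    "twostage": "twostage", "two-stage": "twostage", "two": "twostage",
}

# Defaults as listed under "Valid methods".
DEFAULTS = {
    "dirichlet": {"mu": 2500.0},
    "jelinek-mercer": {"collectionLambda": 0.4, "documentLambda": 0.0},
    "twostage": {"mu": 2500.0, "lambda": 0.4},
}


def parse_rule(rule):
    """Parse 'key:value,key:value' into method, field, operator, and parameters."""
    pairs = {}
    for item in rule.split(","):
        key, sep, value = item.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"malformed rule item: {item!r}")
        pairs[key] = value

    if "method" not in pairs:
        # Assumption of this sketch: the method must be stated explicitly.
        raise ValueError("this sketch requires an explicit method key")
    name = pairs.pop("method").lower()
    if name not in METHOD_ALIASES:
        raise ValueError(f"unknown smoothing method: {name!r}")
    method = METHOD_ALIASES[name]

    # A missing field or operator means the rule applies to all of them.
    field = pairs.pop("field", None)
    operator = pairs.pop("operator", None)

    params = dict(DEFAULTS[method])
    for key, value in pairs.items():
        if method == "jelinek-mercer" and key == "lambda":
            key = "collectionLambda"  # documented alias
        params[key] = float(value)
    return {"method": method, "field": field, "operator": operator, "params": params}


rule = parse_rule("method:linear,collectionLambda:0.2,field:title")
print(rule["method"], rule["field"], rule["params"])
```

Returning `None` for an unspecified field or operator is one way to represent "all values are assumed"; a caller would treat `None` as a wildcard.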
- stopper
- a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This parameter is optional; the default is no stopping.
- queryOffset
- an integer value specifying one less than the starting query number, e.g. 150 when the first query is numbered 151, for TREC formatted output. Specified as <queryOffset>number</queryOffset> in the parameter file and as -queryOffset=number on the command line.
- runID
- a string specifying the id for a query run, used in TREC scorable output. Specified as <runID>someID</runID> in the parameter file and as -runID=someID on the command line.
- trecFormat
- the symbol true to produce TREC scorable output, otherwise the symbol false. Specified as <trecFormat>true</trecFormat> in the parameter file and as -trecFormat=true on the command line. Note that 0 can be used for false, and 1 can be used for true.
- fbDocs
- an integer specifying the number of documents to use for feedback. Specified as <fbDocs>number</fbDocs> in the parameter file and as -fbDocs=number on the command line.
- fbTerms
- an integer specifying the number of terms to use for feedback. Specified as <fbTerms>number</fbTerms> in the parameter file and as -fbTerms=number on the command line.
- fbMu
- a floating point value specifying the value of mu to use for feedback. [NB: document the feedback formulae]. Specified as <fbMu>number</fbMu> in the parameter file and as -fbMu=number on the command line.
- fbOrigWeight
- a floating point value in the range [0.0..1.0] specifying the weight for the original query in the expanded query. Specified as <fbOrigWeight>number</fbOrigWeight> in the parameter file and as -fbOrigWeight=number on the command line.
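The two ways of passing parameters described above (an XML file with a top-level parameters element, or dotted paths on the command line) can be generated from one list of settings. The sketch below uses Python's standard xml.etree.ElementTree; the helper names and the example values (count of 10, stopwords "the" and "of") are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Example settings as (name, value) pairs; a list allows repeated names such as index.
settings = [
    ("memory", "100M"),
    ("index", "/path/to/repository"),
    ("count", 10),
    ("rule", "method:linear,collectionLambda:0.2,field:title"),
    ("stopper", ["the", "of"]),
    ("trecFormat", "true"),
    ("runID", "someID"),
]


def build_parameter_file(pairs):
    """Return the settings as an XML string rooted at <parameters>."""
    root = ET.Element("parameters")
    for name, value in pairs:
        if name == "stopper":
            # stopper is a complex element holding one <word> per stopword
            stopper = ET.SubElement(root, "stopper")
            for word in value:
                ET.SubElement(stopper, "word").text = word
        else:
            ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def build_command_line(pairs):
    """Return the same settings as -name=value arguments in dotted path notation."""
    args = []
    for name, value in pairs:
        if name == "stopper":
            args.extend(f"-stopper.word={word}" for word in value)
        else:
            args.append(f"-{name}={value}")
    return args


print(build_parameter_file(settings))
print(" ".join(build_command_line(settings)))
```

Keeping the settings in a single list means the file form and the command line form cannot drift apart.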
3. Indri Structured Query Language
The structured query operators are classified as either belief, proximity, or field operators. Belief operators allow beliefs about terms, proximity expressions, and other complex expressions to be combined. The primary operators for most queries are #combine and #weight.
TERMS / PROXIMITY
Terms:
- term -- stemmed / normalized term
- "term" -- unstemmed / unnormalized term
- #base64( ... ) -- converts from base64 to ASCII, then stems and normalizes. Useful for including non-parsable terms in a query
- #base64quote( ... ) -- same as #base64 except that the ASCII term is unstemmed and unnormalized
Examples:
- dogs
- "NASA"
- #base64(Wyh1Lm4ucC5hLnIucy5hLmIubC5lLild) -- equivalent to query term [(u.n.p.a.r.s.a.b.l.e.)]
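A base64 term can be produced with any standard base64 encoder. The following sketch uses Python's base64 module; the helper name `base64_term` is hypothetical, and the sketch assumes the raw term is plain ASCII.

```python
import base64


def base64_term(raw, quote=False):
    """Wrap an ASCII term in #base64( ... ) or #base64quote( ... )."""
    encoded = base64.b64encode(raw.encode("ascii")).decode("ascii")
    operator = "#base64quote" if quote else "#base64"
    return f"{operator}({encoded})"


print(base64_term("[(u.n.p.a.r.s.a.b.l.e.)]"))
# prints #base64(Wyh1Lm4ucC5hLnIucy5hLmIubC5lLild)
```

Passing quote=True selects the unstemmed, unnormalized variant.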
Proximity terms:
- #odN( ... ) -- ordered window -- terms must appear ordered, with at most N-1 terms between each
- #N( ... ) -- same as #odN
- #uwN( ... ) -- unordered window -- all terms must appear within window of length N in any order
- #uw( ... ) -- unlimited unordered window -- all terms must appear within current context in any order
Examples:
- #1(white house) -- matches "white house" as an exact phrase
- #2(white house) -- matches "white * house" (where * is any word or null)
- #uw2(white house) -- matches "white house" and "house white"
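The window semantics above can be checked against a toy matcher over a tokenized text. This is a brute-force illustration of the stated rules, not Indri's implementation: it ignores stemming, works on exact token equality, and assumes at least one term is given. The function names are hypothetical.

```python
from itertools import product


def positions(tokens, term):
    """All positions at which term occurs in tokens."""
    return [i for i, token in enumerate(tokens) if token == term]


def matches_ordered(tokens, terms, n):
    """#odN: terms appear in order, each at most n positions after the previous one
    (that is, with at most n-1 other tokens in between)."""
    def extend(previous, remaining):
        if not remaining:
            return True
        return any(
            extend(p, remaining[1:])
            for p in positions(tokens, remaining[0])
            if previous < p <= previous + n
        )
    return any(extend(p, terms[1:]) for p in positions(tokens, terms[0]))


def matches_unordered(tokens, terms, n=None):
    """#uwN: all terms appear, in any order, inside a window of n tokens.
    With n=None the window is unlimited, as for #uw."""
    for combo in product(*(positions(tokens, t) for t in terms)):
        if len(set(combo)) < len(combo):
            continue  # one token cannot satisfy two query terms
        if n is None or max(combo) - min(combo) < n:
            return True
    return False


text = "the white marble house".split()
print(matches_ordered(text, ["white", "house"], 1))    # exact phrase required
print(matches_ordered(text, ["white", "house"], 2))    # one token may intervene
print(matches_unordered(text, ["house", "white"], 3))  # any order, window of 3
```

The unordered check compares the span between the first and last matched position with the window length, so a window of length 2 holds exactly two adjacent tokens.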
Synonyms:
- #syn( ... )
- { ... }
- < ... >
Examples:
- #syn( #1(united states) #1(united states of america) )
- {dog canine}
- <#1(light bulb) lightbulb>
"Any" operator:
- #any -- used to match extent types
Examples:
- #any:PERSON -- matches any occurrence of a PERSON extent
- #1(napoleon died in #any:DATE) -- matches exact phrases of the form: "napoleon died in <date>...</date>"
Field restriction / evaluation:
- expression.f1,...,fN(c1,...,cN) -- matches when the expression appears in field f1 AND f2 AND ... AND fN, and evaluates the expression using the language model defined by the concatenation of fields c1...cN within the document.
Examples:
- dog.title -- matches the term dog appearing in a title extent (uses document language model)
- #1(trevor strohman).person -- matches the phrase "trevor strohman" when it appears in a person extent (uses document language model)
- dog.(title) -- evaluates the term based on the title language model for the document
- #1(trevor strohman).person(header) -- builds a language model from all of the "header" text in the document and evaluates #1(trevor strohman).person in that context (matches only the exact phrase appearing within a person extent within the header context)
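The difference between restricting a match to a field (dog.title) and evaluating in a field context (dog.(title)) can be seen in a toy maximum-likelihood estimate. The sketch below is an illustration under simplifying assumptions: no smoothing, single terms only, and a hypothetical document representation in which extents are half-open (start, end) token ranges. It is not how Indri stores or scores documents.

```python
def field_positions(extents, field):
    """Token positions covered by any extent of the given field."""
    return {i for start, end in extents.get(field, []) for i in range(start, end)}


def ml_estimate(tokens, extents, term, restrict=None, context=None):
    """Unsmoothed estimate: matching occurrences divided by context length.

    restrict -- field the term must appear in (term.restrict)
    context  -- field whose text forms the language model (term.(context));
                the whole document is used when this is None
    """
    allowed = set(range(len(tokens)))
    if context is not None:
        allowed &= field_positions(extents, context)
    context_length = len(allowed)
    if restrict is not None:
        allowed &= field_positions(extents, restrict)
    count = sum(1 for i in allowed if tokens[i] == term)
    return count / context_length if context_length else 0.0


tokens = "dog training tips my dog likes walks".split()
extents = {"title": [(0, 3)]}  # the first three tokens form the title

print(ml_estimate(tokens, extents, "dog"))                    # dog
print(ml_estimate(tokens, extents, "dog", restrict="title"))  # dog.title
print(ml_estimate(tokens, extents, "dog", context="title"))   # dog.(title)
```

In this toy document, dog.title counts only the title occurrence but still divides by the full document length, while dog.(title) divides by the length of the title text.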
COMBINING BELIEFS
Belief operators:
- #sum
- #wsum
- #wand (weighted and)
- #or
- #combine
- #weight
- #max
- #not
- #band (boolean and)
Examples:
- #combine( <dog canine> training )
- #combine( #1(white house) <#1(president bush) #1(george bush)> )
- #weight( 1.0 #1(white house) 2.0 #1(easter egg hunt) )
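This page lists the belief operators without giving their formulas. The sketch below uses the definitions commonly given for inference-network retrieval (a geometric mean for #combine, a weight-normalized geometric mean for #weight, and the usual probabilistic #or, #not, and #max); treat these formulas as an assumption rather than something stated here. Beliefs are taken to be probabilities in (0, 1], and the function names are hypothetical.

```python
import math


def combine(beliefs):
    """#combine: geometric mean of the beliefs (assumed definition)."""
    return math.exp(sum(math.log(b) for b in beliefs) / len(beliefs))


def weight(weighted_beliefs):
    """#weight: geometric mean with weights normalized to sum to 1 (assumed)."""
    total = sum(w for w, _ in weighted_beliefs)
    return math.exp(sum(w * math.log(b) for w, b in weighted_beliefs) / total)


def belief_or(beliefs):
    """#or: probability that at least one belief holds (assumed)."""
    return 1.0 - math.prod(1.0 - b for b in beliefs)


def belief_not(belief):
    """#not: complement of the belief (assumed)."""
    return 1.0 - belief


def belief_max(beliefs):
    """#max: largest belief (assumed)."""
    return max(beliefs)


# Shaped like #weight( 1.0 #1(white house) 2.0 #1(easter egg hunt) ),
# with made-up beliefs for the two phrases.
print(weight([(1.0, 0.02), (2.0, 0.005)]))
print(combine([0.02, 0.005]))
```

Under these definitions, #combine is the special case of #weight in which every weight is equal.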
Extent retrieval:
- #beliefop[field]( query ) -- evaluates #beliefop( query ) for all extents of type "field" in the document and returns a score for each. The language model used to evaluate the query is formed from the text of the extent.
Example:
- #combine[sentence]( #1(napoleon died in #any:DATE ) ) -- returns a scored list of sentence extents that match the given query
FILTER OPERATORS
Filter operators:
- #filreq -- filter require
- #filrej -- filter reject
Examples:
- #filreq( sheep #combine(dolly cloning) ) -- only consider those documents matching the query "sheep" and rank them according to the query #combine(dolly cloning)
- #filrej( parton #combine(dolly cloning) ) -- only consider those documents NOT matching the query "parton" and rank them according to the query #combine(dolly cloning)
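The filter operators separate the question of which documents are considered from how they are ranked. A toy version over an in-memory collection makes that split explicit. Everything here is hypothetical: documents are plain token lists, "matching" means containing the term, and the scoring function simply counts query terms instead of using a language model.

```python
def contains(term):
    """Predicate: does the document contain the term?"""
    return lambda tokens: term in tokens


def term_count_score(query_terms):
    """Stand-in ranking function: number of query-term occurrences."""
    return lambda tokens: sum(tokens.count(t) for t in query_terms)


def rank(docs, score):
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(docs, key=lambda doc_id: (-score(docs[doc_id]), doc_id))


def filreq(docs, required, score):
    """#filreq: keep only documents matching the filter, then rank them."""
    kept = {d: tokens for d, tokens in docs.items() if required(tokens)}
    return rank(kept, score)


def filrej(docs, rejected, score):
    """#filrej: drop documents matching the filter, then rank the rest."""
    kept = {d: tokens for d, tokens in docs.items() if not rejected(tokens)}
    return rank(kept, score)


docs = {
    "d1": "dolly the sheep cloning".split(),
    "d2": "dolly parton concert".split(),
    "d3": "sheep farming".split(),
    "d4": "cloning news".split(),
}
score = term_count_score(["dolly", "cloning"])
print(filreq(docs, contains("sheep"), score))
print(filrej(docs, contains("parton"), score))
```

Note that a filtered-in document may still score poorly: d3 passes the "sheep" filter even though it contains neither ranking term.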
NUMERIC / DATE FIELD OPERATORS
General numeric operators:
- #less( F N ) -- matches numeric field extents of type F if value < N
- #greater( F N ) -- matches numeric field extents of type F if value > N
- #between( F N_low N_high ) -- matches numeric field extents of type F if N_low < value < N_high
- #equals( F N ) -- matches numeric field extents of type F if value == N
Date operators:
- #date:after( D ) -- matches numeric "date" extents if date is after D
- #date:before( D ) -- matches numeric "date" extents if date is before D
- #date:between( D_low, D_high ) -- matches numeric "date" extents if D_low < date < D_high
Acceptable date formats:
- 11 january 2004
- 11-JAN-04
- 11-JAN-2004
- January 11 2004
- 01/11/04 (MM/DD/YY)
- 01/11/2004 (MM/DD/YYYY)
Examples:
- #filreq(#less(READINGLEVEL 10) george washington) -- if each document in a collection contained a numeric tag that specified the reading level of the document, then this query will only retrieve documents that have a reading level below grade 10 and documents will be ranked according to the query "george washington".
- #combine( european history #date:between( 01/01/1800, 01/01/1900 ) ) -- such a query may be constructed to find information about 19th century european history, as this query will find pages that discuss "european history" and contain 19th century dates.
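The date formats listed above and the strict bounds of #date:between can be mimicked with Python's datetime module. This is a sketch of the documented formats only, not Indri's date parser; it assumes English month names (the default C locale), and the function names are hypothetical.

```python
from datetime import datetime

# One strptime pattern per documented format, in the order listed above.
DATE_FORMATS = [
    "%d %B %Y",   # 11 january 2004
    "%d-%b-%y",   # 11-JAN-04
    "%d-%b-%Y",   # 11-JAN-2004
    "%B %d %Y",   # January 11 2004
    "%m/%d/%y",   # 01/11/04 (MM/DD/YY)
    "%m/%d/%Y",   # 01/11/2004 (MM/DD/YYYY)
]


def parse_date(text):
    """Try each documented format in turn and return a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def date_between(value, low, high):
    """#date:between semantics: D_low < date < D_high, both bounds exclusive."""
    return parse_date(low) < parse_date(value) < parse_date(high)


print(parse_date("11-JAN-04"))
print(date_between("07/04/1850", "01/01/1800", "01/01/1900"))
```

Two-digit years follow Python's own pivot rule here (values 69 to 99 map to the 1900s, 00 to 68 to the 2000s), which may differ from the rule Indri applies.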
DOCUMENT PRIORS
Prior:
- #prior( NAME ) -- creates the document prior specified by the name given
Example:
- #combine(#prior(RECENT) global warming) -- we might create a prior named RECENT to be used to give greater weight to documents that were published more recently.