A search application can improve the quality of the experience it delivers by interpreting the intended meaning of the user queries. Once the meaning is guessed, the query can be rewritten to one that will satisfy the user better than the raw query. Vespa includes a query rewriting language which makes it easy to use query rewriting to understand and act upon the semantics of queries.
Some typical tasks that can be achieved using query rewriting include:
- Query focusing: Decide a field to search for a term
- Query enhancing: Add additional terms which improve the query
- Stopwords: Remove terms which hurt recall or precision
- Synonyms: Replace terms or phrases by others
Of course, all these techniques can be combined to create a really good search experience.
Query rewriting in Vespa is done by semantic rules or searchers. Semantic rules are written in a simple production rule language which operates on queries. For more complex rewriting logic which cannot be handled by simple rules, create a rewriting searcher which makes use of the query rewriting framework.
A simple semantic rule may look like this:
lotr -> lord of the rings;
This means that whenever the term lotr is encountered in a query, replace it by the terms lord of the rings.
Rules can also refer to conditions, and the produced terms can be a modified version of whatever is matched instead of a concrete term:
[brand] -> company:[brand];
[brand] :- sony, dell, ibm, hp;
This rule says that, whenever the condition named brand is matched, replace the matched term(s) by the same term(s) searching the company field. In addition, the brand condition is defined to match any of a list of brands. Note how -> means a replacing production rule, :- means a condition and , separates alternatives.
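The effect of this rule can be imitated outside Vespa to make the behaviour concrete. The following is a minimal Python sketch, not Vespa code: the function name focus_brands and the whitespace tokenisation are assumptions made for illustration only.

```python
# Illustrative only: mimics the effect of the two lines
#   [brand] -> company:[brand];
#   [brand] :- sony, dell, ibm, hp;
# This is not how Vespa evaluates rules internally.
BRAND = {"sony", "dell", "ibm", "hp"}  # the alternatives of the brand condition


def focus_brands(query: str) -> str:
    """Prefix every term matching the brand condition with the company field."""
    terms = query.split()  # assumption: terms are separated by whitespace
    return " ".join(f"company:{t}" if t.lower() in BRAND else t for t in terms)


print(focus_brands("sony laptop"))
```

Terms which do not match the condition, such as laptop here, are left untouched.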
It is also possible to group using parentheses, to list multiple terms which must be matched in sequence, and to write adding production rules using +> instead of ->. Terms are by default added using the query default (as if they were written in the search box), but it is also possible to force them to be AND, OR, NOT or RANK using +, ?, - and $ respectively.
Here is a more complex rule illustrating this:
[destination] (in, by, at, on) [place] +> $name:[destination];
This rule applies when the query contains a destination followed by a preposition and a place, and boosts documents whose name field matches that destination (the definitions of the destination and place conditions are not shown). This is achieved by adding a RANK term: a term which does not impact whether or not a document is matched, but which adds a relevance boost when the term matches.
The complete syntax of this language is found in the semantic rules reference.
Rules which are used together are collected in a rule base: a text file containing rules and conditions, with the file ending .sr (for semantic rules). Here is an example of a complete rule base:
# Replacements
lotr -> lord of the rings;
colour -> color;
audi -> skoda;

# Stopwords
[stopword] -> ; # (Replace them by nothing)
[stopword] :- and, or, the, be;

# Focus brands to the brand field. If we think the brand
# field has high quality data, we can replace. We use the same name
# for the condition and the field, but this is not necessary.
[brand] -> brand:[brand];
[brand] :- sony, dell, ibm, hp;

# Boost recognized categories
[category] +> $category:[category];
[category] :- laptop, digital camera, camera;
The rules in a rule base are evaluated in order from the top down. A rule will be matched as many times as possible before evaluation moves on to the next rule. So the query colour colour will be rewritten to color color before moving on to the next rule.
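The evaluation order can be sketched in a few lines of Python. This is a simplified model for illustration, not Vespa's implementation: it handles only plain replacing rules, assumes whitespace-separated terms, and does not try to re-match the terms a rule has just produced. The names apply_rule, rewrite and RULES are hypothetical.

```python
def apply_rule(terms, lhs, rhs):
    """Replace every occurrence of the term sequence lhs by rhs, left to right."""
    out, i = [], 0
    while i < len(terms):
        # Guard against an empty left-hand side, which would never advance.
        if lhs and terms[i:i + len(lhs)] == lhs:
            out.extend(rhs)
            i += len(lhs)
        else:
            out.append(terms[i])
            i += 1
    return out


def rewrite(query, rules):
    """Apply each rule to the whole query before moving on to the next rule."""
    terms = query.split()
    for lhs, rhs in rules:  # top-down order, as in the rule base file
        terms = apply_rule(terms, lhs.split(), rhs.split())
    return " ".join(terms)


# The three replacement rules from the example rule base.
RULES = [("lotr", "lord of the rings"), ("colour", "color"), ("audi", "skoda")]

print(rewrite("colour colour", RULES))
```

Both occurrences of colour are rewritten by the second rule before the third rule is considered.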
Configuring a rule base in your application
A rule base file is placed in the rules/ directory under the application package, and is named after the file, excluding the .sr ending. E.g. if we save the rules above to [my-application]/rules/example.sr, we will have a rule base available named example.
To make a rule base be used by default in queries, add @default on a separate line to the rule base. To deactivate the default rules, add rules.off to the query.
The rules can safely be updated at any time by running vespa-deploy prepare again. If there are errors in the rule bases, they will not be updated, and the errors will be reported on the command line.
To see what the rules are doing, add tracelevel.rules=[number] to the query. Values from 1 to 5 give increasingly detailed output.
Using multiple rule bases
It is possible to place multiple rule bases under the [my-application]/rules/ directory and choose between them in the query, and rule bases may also include each other. This is useful to organize larger sets of rules, to experiment with variants of the rule set in new bases which include the standard base, or to use different sets of rules for different use cases.
To include one rule base in another, add @include(rulebasename) on a separate line, where rulebasename is the file name (with or without the .sr ending). The result will be the same as if the included rule base were copied in at the location of the include line. If a condition is defined in both bases, the one from the including base will be used. It is also possible to refer to the same-named condition in an included rule base by using the @super directive as a condition. For example, this rule base adds some more categories to the category definition in the example rule base:
@include(example)
# Category becomes laptop, digital camera, camera, palmtop, phone
[category] :- @super, palmtop, phone;
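How condition definitions combine across an include can be modelled roughly as follows. This Python sketch is an illustration under simplifying assumptions: conditions are plain lists of alternatives, only one level of inclusion is handled, and the resolve function is hypothetical rather than part of Vespa.

```python
SUPER = "@super"  # stands for the same-named condition in the included base


def resolve(included: dict, including: dict) -> dict:
    """Combine condition definitions; the including base wins on name clashes."""
    merged = dict(included)
    for name, alternatives in including.items():
        resolved = []
        for alt in alternatives:
            if alt == SUPER:
                # Expand @super to the alternatives from the included base.
                resolved.extend(included.get(name, []))
            else:
                resolved.append(alt)
        merged[name] = resolved
    return merged


example = {
    "brand": ["sony", "dell", "ibm", "hp"],
    "category": ["laptop", "digital camera", "camera"],
}
extended = {"category": [SUPER, "palmtop", "phone"]}

print(resolve(example, extended)["category"])
```

Conditions which the including base does not redefine, such as brand here, are carried over unchanged.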
Multiple rule bases can be included, and included rule bases can themselves include other rule bases. All the rule bases included in the application package will be available when making queries. One of the rule bases can be made the default by adding @default on a separate line in the rule base. To use another rule base, add rules.rulebase=[rulebasename] to the query.
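The rule-related query parameters mentioned on this page can be assembled into a request URL as in the Python sketch below. Several details are assumptions, not taken from this page: the endpoint http://localhost:8080/search/, the query parameter name, the value true for rules.off, and the helper name search_url.

```python
from urllib.parse import urlencode


def search_url(query, rulebase=None, rules_off=False, trace_level=None,
               endpoint="http://localhost:8080/search/"):  # assumed endpoint
    """Build a search URL carrying the rule parameters described above."""
    params = {"query": query}  # assumed name of the query string parameter
    if rulebase is not None:
        params["rules.rulebase"] = rulebase  # choose a non-default rule base
    if rules_off:
        params["rules.off"] = "true"  # assumed value; turns the default rules off
    if trace_level is not None:
        params["tracelevel.rules"] = str(trace_level)  # 1-5, more detail when higher
    return endpoint + "?" + urlencode(params)


print(search_url("lotr dvd", rulebase="example", trace_level=3))
```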
Using a finite state automaton
When you want to match more than a few thousand words or phrases in conditions, it becomes suboptimal, and perhaps impractical, to keep them in a rule base. To help in such cases, it is possible to create a finite state automaton which contains the definitions of some (or all) of the conditions of a rule base. Finite state automata are very efficient at storing and looking up large lists of strings. An automaton is created from a text file which lists, one per line, the condition term to match and the condition name, separated by a tab (by default). The name of the condition can be followed by a semicolon and additional data, which will be ignored.
This automaton source file defines the same as the stopword and brand conditions in the example rule base:
and	stopword
or	stopword
be	stopword
the	stopword
sony	brand
dell	brand
ibm	brand; This text is ignored
hp	brand
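For large term lists, the source file is usually easier to generate than to write by hand. The following Python sketch produces the tab-separated format described above from an in-memory mapping. The function name automaton_source and the shape of the mapping are assumptions for illustration.

```python
def automaton_source(conditions: dict) -> str:
    """Render one 'term<TAB>condition name' line per term."""
    lines = []
    for name, terms in conditions.items():
        for term in terms:
            lines.append(f"{term}\t{name}")
    return "\n".join(lines) + "\n"


conditions = {
    "stopword": ["and", "or", "be", "the"],
    "brand": ["sony", "dell", "ibm", "hp"],
}

source = automaton_source(conditions)
print(source, end="")
```

The resulting text can be written to a file and then compiled into an automaton file.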
If this automaton is included in the example rule base, the two conditions can be removed from the rule base file. To use the automaton instead, the automaton source file must be compiled into an automaton file. The 'vespa-makefsa' executable is installed as part of Vespa:
$ vespa-makefsa -t sourcefile.txt targetfile.fsa
The target file is used from a rule base by adding @automata(automatonfile) on a separate line in the rule base file (the file path is relative to $VESPA_HOME). Automata files must be added to all QRS nodes manually.
Note that an automaton declaration is not carried over when a rule base is included, so a rule base which includes another rule base that uses an automaton must itself declare the same automaton (or an automaton containing any desired changes from the automaton of the included base).