Text Matching and Ranking

Refer to the ranking introduction for Vespa ranking. See the text search and text search through ML tutorials. Also relevant is the guide for Semantic Retrieval for Question Answering Applications.

Refer to Linguistics for details on tokenization, normalization, and stemming.

Matching

Text search is normally best run using a string field in index mode:

field album type string {
    indexing: summary | index
}
Below, find details on transformations to the text for text indexing and search. Use inspection tools to dump data from a content node, and query tracing to understand matching details.

The album field has index mode. For text fields, this enables the transformations below, and increases query recall. Use the quick start guide and stop after the feeding step. Make sure to feed all 5 albums with titles:

$ cat sample-apps/album-recommendation-selfhosted/src/test/resources/*.json | grep album
        "album": "A Head Full of Dreams",
        "album": "Hardwired...To Self-Destruct",
        "album": "Liebe ist für alle da",
        "album": "Love Is Here To Stay",
        "album": "When We All Fall Asleep, Where Do We Go?",
Flush and dump index data (file name can change for subsequent flushes):
$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
    /opt/vespa/bin/vespa-index-inspect dumpwords \
    --indexdir /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/index/index.flush.1 \
    --field album'

a	1
all	2
asleep	1
da	1
destruct	1
do	1
dream	1
fall	1
full	1
fur	1
go	1
hardwire	1
head	1
here	1
is	1
ist	1
lieb	1
love	1
of	1
self	1
stay	1
to	2
we	1
when	1
where	1
Observe the linguistic transformations applied to the data before it is indexed:
Tokenization: Hardwired...To → Hardwired To (terms are split on non-word characters)
Lowercasing: Head → head
Normalizing: für → fur
Stemming: dreams → dream
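The first three transformations can be approximated in a few lines of Python. This is only an illustrative sketch, not Vespa's linguistics implementation: the function names are hypothetical, splitting on `\W+` and dropping combining marks are assumptions about what tokenization and normalizing roughly do, and stemming is left out because it is language-dependent.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into terms on runs of non-word characters (an assumption)."""
    return [term for term in re.split(r"\W+", text) if term]


def normalize(term):
    """Remove accents by dropping combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Tokenize, lowercase and normalize; stemming is intentionally omitted."""
    return [normalize(term.lower()) for term in tokenize(text)]


if __name__ == "__main__":
    print(index_terms("Hardwired...To Self-Destruct"))
    print(index_terms("Liebe ist f\u00fcr alle da"))
```

Because stemming is skipped, the sketch yields "hardwired" and "liebe" where the index dump shows "hardwire" and "lieb".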
Then, change from index to attribute in src/main/application/schemas/music.sd:
    field album type string {
-       indexing: summary | index
+       indexing: summary | attribute
    }
Run the tutorial again using the new schema, this time dumping data from the attributes (snapshot name can change with flushes):
$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
  /opt/vespa/bin/vespa-attribute-inspect -p \
  /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album && \
  cat /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album.out'

doc 0: valueCount(1)
    0: []
doc 1: valueCount(1)
    0: [A Head Full of Dreams]
doc 2: valueCount(1)
    0: [Love Is Here To Stay]
doc 3: valueCount(1)
    0: [Hardwired...To Self-Destruct]
doc 4: valueCount(1)
    0: [Liebe ist für alle da]
doc 5: valueCount(1)
    0: [When We All Fall Asleep, Where Do We Go?]

The most important observation is that the strings are added as-is to attributes. Hence, values are matched in full, including whitespace. When searching attributes, both query terms and attribute data are lowercased before matching. Read more about the attribute word match mode.
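As a rough mental model of the behavior described above, attribute matching can be pictured as a comparison of the full value after lowercasing both sides. The function below is a hypothetical illustration of that rule, not Vespa code.

```python
def attribute_matches(query_term, attribute_value):
    """Match the full value, whitespace included, after lowercasing both sides."""
    return query_term.lower() == attribute_value.lower()


if __name__ == "__main__":
    stored = "A Head Full of Dreams"
    print(attribute_matches("a head full of dreams", stored))  # full value, other case
    print(attribute_matches("head", stored))                   # single term only
```

A single term such as "head" does not match here, which is the main difference from the index mode example.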

Query Trace

Adding tracelevel=2 gives insight when testing queries. The two queries below differ only in the case of the first letter of the query phrase and illustrate attribute lowercasing:

http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22Liebe+ist+f%C3%BCr+alle+da%22%3B&tracelevel=2

http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22liebe+ist+f%C3%BCr+alle+da%22%3B&tracelevel=2
Also try query tracing to see how query parsing changes with index and attribute indexing modes.
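The trace URLs above can also be generated rather than written by hand. The sketch below only builds the URL with the standard library; the helper name is hypothetical, while the endpoint, the rank profile name rank_albums and the YQL statement are taken from the example URLs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def trace_query_url(phrase, tracelevel=2, endpoint="http://localhost:8080/search/"):
    """Build a search URL that matches the album field and enables tracing."""
    yql = f'select * from sources * where album contains "{phrase}";'
    params = {"ranking": "rank_albums", "yql": yql, "tracelevel": tracelevel}
    # urlencode percent-encodes the YQL statement, including non-ASCII characters.
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    for phrase in ("Liebe ist f\u00fcr alle da", "liebe ist f\u00fcr alle da"):
        print(trace_query_url(phrase))
```

Note that urlencode encodes every space as "+", whereas the hand-written URLs mix "%20" and "+"; both forms decode to the same query.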

A prefix search will match the query term "hea" to documents with "head". However, prefix search is not supported in text matching using index mode - some alternatives:

Array of string attributes: Use an attribute with an array of strings in addition to the indexed string field. The array is there to hold multiple terms; if the field contains only one term, the array is not needed. The application must split the text into terms and add these to the array. This can be a viable approach for fields with few terms and semi-structured input. Example:
field company_name type array<string> {
    indexing: attribute
    match:    prefix
}
Adding "Goldman" and "Sachs" will match query terms like "Gold" and "Sach".
Streaming Search: Some applications, like personal search, search in a small set of documents. A content cluster in streaming mode has prefix search enabled. It is hence possible to query smaller streaming search content clusters and still have low query latency.
N-Gram indexing: This is most often used for languages that are not tokenized (example: many Asian languages). With the configuration below, "A Head Full of Dreams" is indexed as the grams listed after it:
field album type string {
    indexing: summary | index
    match {
        gram
        gram-size: 3
    }
}

a	1
ams	1
dre	1
ead	1
eam	1
ful	1
hea	1
of	1
rea	1
ull	1
This enables matching of any 3-character substring. This is generally not useful for text searching, other than possibly as an extra field holding these n-grams for increased recall.
Regular expressions: Regular expressions are supported in attributes. There are, however, no data structures that optimize query speed, so the expression is run over all attribute values. This is hence considered an experimental feature and is included here for completeness.
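Two of the alternatives above, the array of string attributes with prefix matching and N-gram indexing, can be sketched in Python. The helpers are hypothetical and only mimic the observable behavior: whitespace splitting stands in for whatever term splitting the application chooses, and keeping terms shorter than the gram size as-is is inferred from the "a" and "of" entries in the dump.

```python
def split_terms(text):
    """Split semi-structured text into terms for an array<string> attribute."""
    return text.split()


def prefix_matches(query_term, values):
    """True if any array value starts with the query term, ignoring case."""
    prefix = query_term.lower()
    return any(value.lower().startswith(prefix) for value in values)


def grams(text, gram_size=3):
    """Return the sorted set of lowercased n-grams for each term in the text."""
    result = set()
    for term in text.lower().split():
        if len(term) <= gram_size:
            result.add(term)  # short terms are kept whole
        else:
            last_start = len(term) - gram_size
            result.update(term[i:i + gram_size] for i in range(last_start + 1))
    return sorted(result)


if __name__ == "__main__":
    company = split_terms("Goldman Sachs")
    print(prefix_matches("Gold", company), prefix_matches("Sach", company))
    print(grams("A Head Full of Dreams"))
```

For "A Head Full of Dreams" the grams function produces the same ten entries as the index dump.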

Ranking

The default ranking is nativeRank in the first phase and no second phase re-ranking. The nativeRank is a feature which gives a reasonably good rank score, while being fast enough to be suitable for first phase ranking. See the native rank reference and native rank introduction for more information.

An alternative to nativeRank is using the BM25 rank feature.

If the expression is written manually, it might be most convenient to stick with using the fieldMatch(name) feature for each field. This feature combines the more basic fieldMatch features in a reasonable way. A good way to combine the fieldMatch scores of the fields is to use a weighted average. Another way is to use the fieldMatch(name).weight, fieldMatch(name).significance or fieldMatch(name).importance features, which take term weight, rareness, or both into account. A normalized score can then be produced by summing, over all fields, the product of this feature and any other normalized per-field score. In addition, some attribute value(s) must usually be included to determine the a priori quality of each document.

For example, assuming the title field is more important than the body field, create a ranking expression which gives more weight to that field. Vespa contains some built-in convenience support for this: weights can be set on the individual fields with weight: <number>, and the match feature can be used to get a weighted average of the fieldMatch scores of each field. The overall ranking expression might contain other ranking dimensions than just text match, like freshness, the quality of the document, or any other property of the document or query.
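The weighted average itself is simple arithmetic. The sketch below uses hypothetical per-field scores (0.8 for title, 0.5 for body) and hypothetical field weights (200 and 100) purely to show the calculation; it is not how Vespa evaluates the match feature internally.

```python
def weighted_average(scores, weights):
    """Average per-field scores, each weighted by its field weight."""
    total_weight = sum(weights[field] for field in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[field] * scores[field] for field in scores)
    return weighted_sum / total_weight


if __name__ == "__main__":
    field_scores = {"title": 0.8, "body": 0.5}   # hypothetical fieldMatch scores
    field_weights = {"title": 200, "body": 100}  # hypothetical field weights
    print(weighted_average(field_scores, field_weights))
```

With these numbers the result is (200 * 0.8 + 100 * 0.5) / 300 = 0.7, so the title score pulls the combined score towards itself.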

Weight, significance and connectedness

Modify the values of the match features from the query by sending weight, significance and connectedness with the query:

Weight

Set query term weight. Example: ... where (title contains ([{"weight":200}]"heads") AND title contains "tails") specifies that heads is twice as important for the final rank score as tails (the default weight is 100).

Weight is used in fieldMatch(name).weight, which can be multiplied with fieldMatch(name) to yield a weighted score for the field, and in fieldMatch(name).weightedOccurrence to get an occurrence score which is higher when the higher weighted terms occur most. Configure static field weights in the schema.
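When queries are assembled by an application, the weight annotation can be added with a small helper. The function below is a hypothetical sketch that reproduces the annotation syntax from the example above; it does not escape quotes inside terms.

```python
def contains(field, term, weight=None):
    """Build a YQL contains clause, optionally annotated with a term weight."""
    if weight is None:
        return f'{field} contains "{term}"'
    return f'{field} contains ([{{"weight":{weight}}}]"{term}")'


if __name__ == "__main__":
    heads = contains("title", "heads", weight=200)
    tails = contains("title", "tails")
    print(f"({heads} AND {tails})")
```

Leaving the weight out keeps the default weight of 100 for that term.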

Significance

How rare a particular term is in the corpus or the language. This is sometimes valuable information because if a document matches a rare word, it might mean the document is more important than one which matches a common word. Significance is calculated automatically by Vespa during indexing, but can also be overridden by setting the significance values on the query terms in a Searcher component. Significance is accessible in fieldMatch(name).significance, which can be used the same way as weight. Weight and significance are also averaged into fieldMatch(name).importance for convenience.

Connectedness

Signify the degree of connection between adjacent terms in the query. For example, the query new york newspaper should have a higher connectedness between the terms "new" and "york" than between "york" and "newspaper" to rank documents higher if they contain "new york" as a phrase. Term connectedness is taken into account by fieldMatch(name).proximity, which is also an important contribution to fieldMatch(name). Connectedness is a normalized value which is 0.1 by default. It must be set by a custom Searcher, looking up connectivity information from somewhere - there is no query syntax for it.
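The lookup step that such a Searcher performs can be illustrated independently of the Vespa API. In the sketch below the lookup table and its value of 0.9 for "new york" are hypothetical, and the default of 0.1 is the one stated above; applying the values to the query terms is left to the actual Searcher and is not shown.

```python
DEFAULT_CONNECTEDNESS = 0.1


def adjacent_connectedness(terms, known_pairs):
    """Pair each term with its successor and look up their connectedness."""
    return [
        ((left, right), known_pairs.get((left, right), DEFAULT_CONNECTEDNESS))
        for left, right in zip(terms, terms[1:])
    ]


if __name__ == "__main__":
    lookup = {("new", "york"): 0.9}  # hypothetical connectivity data
    for pair, value in adjacent_connectedness(["new", "york", "newspaper"], lookup):
        print(pair, value)
```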

Match Configuration Debug

Indexed and streaming search differ in execution, and so does the match configuration. Sometimes it is useful to inspect the generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument below:

$ vespa-configproxy-cmd

Note the difference for the artist and album fields:

field artist type string {
    indexing: summary | index
    match   : exact
}
field album type string {
    indexing: summary | index
}

$ vespa-get-config -n vespa.configdefinition.ilscripts \
    -i container/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor
maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "album"
ilscript[0].docfield[2] "year"
ilscript[0].docfield[3] "category_scores"
ilscript[0].docfield[4] "tags"
ilscript[0].content[0] "clear_state | guard { input artist | exact                          | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input album  | tokenize normalize stem:"BEST" | summary album | index album; }"
ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"
ilscript[0].content[4] "input tags | passthrough tags"

See examples in streaming search. In the first example below, the artist field uses default (i.e. tokenized string) matching, so arg1 is empty. The second example uses match: exact:

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 ""

$ vespa-get-config -n vespa.config.search.vsm.vsmfields -i music/search/cluster.music.music | egrep 'name|arg1'
fieldspec[0].name "artist"
fieldspec[0].arg1 "exact"
Use vespa-configproxy-cmd to find the value for the -i argument above.
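When checking many fields, the filtered vsmfields output can be turned into a field-to-match-mode mapping. The parser below is a hypothetical convenience sketch that assumes the line format shown in the two examples above and ignores everything else.

```python
import re

# Matches lines such as: fieldspec[0].arg1 "exact"
_FIELDSPEC_LINE = re.compile(r'^fieldspec\[(\d+)\]\.(\w+)\s+"(.*)"$')


def parse_fieldspecs(config_text):
    """Group fieldspec[N].key "value" lines by their index N."""
    specs = {}
    for line in config_text.splitlines():
        match = _FIELDSPEC_LINE.match(line.strip())
        if match:
            index, key, value = match.groups()
            specs.setdefault(int(index), {})[key] = value
    return specs


def match_modes(config_text):
    """Map each field name to its arg1 value; empty means default matching."""
    return {
        spec["name"]: spec.get("arg1", "")
        for spec in parse_fieldspecs(config_text).values()
        if "name" in spec
    }


if __name__ == "__main__":
    sample = 'fieldspec[0].name "artist"\nfieldspec[0].arg1 "exact"\n'
    print(match_modes(sample))
```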