Text Matching and Ranking

Refer to the ranking introduction for Vespa ranking, and review the different match modes. See the text search and text search through ML tutorials. Also relevant is the guide for Semantic Retrieval for Question Answering Applications. Finally, refer to Linguistics for details.

Matching

Text search is normally best run using a string field in index mode:

field album type string {
    indexing: summary | index
}

Below, find details on transformations to the text for text indexing and search. Use inspection tools to dump data from a content node, and query tracing to understand matching details.

The album field has index mode. For text fields, this enables the transformations below, and increases query recall. Use the quick start guide and stop after the feeding step. Make sure to feed all 5 albums with titles:

$ cat sample-apps/album-recommendation/src/test/resources/*.json | grep album
        "album": "A Head Full of Dreams",
        "album": "Hardwired...To Self-Destruct",
        "album": "Liebe ist für alle da",
        "album": "Love Is Here To Stay",
        "album": "When We All Fall Asleep, Where Do We Go?",

Flush and dump index data (file name can change for subsequent flushes):

$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
    /opt/vespa/bin/vespa-index-inspect dumpwords \
    --indexdir /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/index/index.flush.1 \
    --field album'

a	1
all	2
asleep	1
da	1
destruct	1
do	1
dream	1
fall	1
full	1
fur	1
go	1
hardwire	1
head	1
here	1
is	1
ist	1
lieb	1
love	1
of	1
self	1
stay	1
to	2
we	1
when	1
where	1

Observe the linguistic transformations applied to the data before it is indexed:

Transformation                   Type
Hardwired...To → Hardwired To    Tokenization - split terms on non-characters
Head → head                      Lowercasing
für → fur                        Normalizing
dreams → dream                   Stemming
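
The first three transformations can be approximated in a few lines of Python. This is only an illustrative sketch, not Vespa's linguistics implementation: the function names are made up for this example, tokenization is approximated with a regular expression, and stemming is left out because it depends on the language and the stemmer in use.

```python
import re
import unicodedata


def tokenize(text):
    """Split text into terms on runs of non-word characters (an approximation)."""
    return re.findall(r"\w+", text)


def normalize(term):
    """Strip accents by dropping combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Tokenize, lowercase and normalize; stemming is deliberately omitted."""
    return [normalize(term.lower()) for term in tokenize(text)]


print(index_terms("Hardwired...To Self-Destruct"))
print(index_terms("Liebe ist für alle da"))
```

Since stemming is omitted, the sketch produces hardwired and liebe where the index dump shows hardwire and lieb.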

Then, change from index to attribute in src/main/application/schemas/music.sd:

    field album type string {
-       indexing: summary | index
+       indexing: summary | attribute
    }

Run the tutorial again using the new schema, this time dumping data from the attributes (snapshot name can change with flushes):

$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
  /opt/vespa/bin/vespa-attribute-inspect -p \
  /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album && \
  cat /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album.out'

doc 0: valueCount(1)
    0: []
doc 1: valueCount(1)
    0: [A Head Full of Dreams]
doc 2: valueCount(1)
    0: [Love Is Here To Stay]
doc 3: valueCount(1)
    0: [Hardwired...To Self-Destruct]
doc 4: valueCount(1)
    0: [Liebe ist für alle da]
doc 5: valueCount(1)
    0: [When We All Fall Asleep, Where Do We Go?]

The most important observation is that the strings are added as-is to attributes. Hence, values are matched in full, including whitespace. When searching attributes, both query terms and attribute data are lowercased before matching. Read more about the attribute word match mode.
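
The matching behavior described here can be pictured as a comparison of whole, lowercased strings. A minimal sketch, assuming a hypothetical helper name and ignoring everything else Vespa does at query time:

```python
def attribute_word_match(query, value):
    """Whole-value match: both sides are lowercased, nothing is tokenized."""
    return query.lower() == value.lower()


value = "Liebe ist für alle da"

# The full title matches regardless of case.
print(attribute_word_match("liebe ist für alle da", value))

# A single word from the title does not match, since the value is not split into terms.
print(attribute_word_match("liebe", value))
```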

Query Trace

Adding trace.level=2 gives insight when testing queries - example attribute lowercasing:

http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22Liebe+ist+f%C3%BCr+alle+da%22&trace.level=2

http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22liebe+ist+f%C3%BCr+alle+da%22&trace.level=2

Also try query tracing to see how query parsing changes with index and attribute indexing modes.
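
Hand-encoding such URLs is error-prone. As a sketch, the request can be assembled with Python's standard library. The helper name search_url is an assumption for this example; the endpoint, rank profile and parameters are taken from the URLs above:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(yql, trace_level=2, endpoint="http://localhost:8080/search/"):
    """Build a query URL with percent-encoded YQL and a trace level."""
    params = {"ranking": "rank_albums", "yql": yql, "trace.level": trace_level}
    return endpoint + "?" + urlencode(params)


url = search_url('select * from sources * where album contains "Liebe ist für alle da"')
print(url)

# Decode the query string again to check that the YQL survived the encoding.
print(parse_qs(urlsplit(url).query)["yql"][0])
```

urlencode encodes spaces as + where the URLs above mostly use %20; both forms are valid in a query string.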

Prefix Match

A prefix search will match the query term "hea" to documents with the term "head" in a string field with indexing: attribute. Note that the query term must be annotated with {prefix: true} to enable prefix matching.

E.g. ... contains ([{prefix:true}]"hea")

The search-suggestions sample application uses prefix search, see README for a design discussion.

Array of string attributes

Use an attribute with an array of strings in addition to the index string field. The array is there to support more than one term - if the field only holds one term, the array is not needed. The application must split the text into terms and add these to the array. This can be a viable approach for fields with few terms and semi-structured input - example:

schema company {
    document company {
        field company_name_string type string {
            indexing: summary
        }
    }
    field company_name_array type array<string> {
        indexing: input company_name_string | trim | split " +" | attribute
    }
}

Feeding the company name "Goldman Sachs" adds "Goldman" and "Sachs" to the array, so prefix query terms like "Gold" and "Sach" will match.
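
The effect of trim | split " +" followed by a prefix search over the array elements can be sketched as follows. The function names are hypothetical, and the case-insensitive comparison is an assumption based on the attribute lowercasing described earlier:

```python
import re


def split_terms(text):
    """Mimic `trim | split " +"`: strip the ends, then split on runs of spaces."""
    return re.split(r" +", text.strip())


def prefix_match(prefix, values):
    """True if any array element starts with the prefix (both lowercased)."""
    return any(value.lower().startswith(prefix.lower()) for value in values)


terms = split_terms("  Goldman   Sachs ")
print(terms)
print(prefix_match("Gold", terms), prefix_match("Sach", terms))

# "man" is a substring of "Goldman" but not a prefix of any element.
print(prefix_match("man", terms))
```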

N-Gram Match

N-Gram indexing is most often used for languages that are not tokenized (example: many Asian languages). Example - this will index "A Head Full of Dreams" to:

field album type string {
    indexing: summary | index
    match {
        gram
        gram-size: 3
    }
}

a	1
ams	1
dre	1
ead	1
eam	1
ful	1
hea	1
of	1
rea	1
ull	1

This will enable matching of any 3-character substring. This is generally not useful for text searching, other than possibly for an extra field with these ngrams for increased recall.

Example - the album name translated to traditional Chinese, 满脑子的梦想, will index the album title to:

field album type string {
    indexing: {
        "zh-hant" | set_language;
        summary | index
    }
    match {
        gram
        gram-size: 2
    }
}

子的	1
梦想	1
满脑	1
的梦	1
脑子	1

This will enable matching of any 2-character substring, which makes more sense in traditional Chinese than in English.
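
The gram lists in both examples can be reproduced with a small sketch. It assumes that tokens no longer than the gram size are kept whole, which is inferred from "a" and "of" appearing in the dump above. The function name is made up for this example:

```python
import re


def grams(text, size):
    """Return the sorted, distinct grams of each lowercased token in text."""
    result = set()
    for token in re.findall(r"\w+", text.lower()):
        if len(token) <= size:
            # Assumption: tokens no longer than the gram size are kept as-is.
            result.add(token)
        else:
            for start in range(len(token) - size + 1):
                result.add(token[start:start + size])
    return sorted(result)


print(grams("A Head Full of Dreams", 3))
print(grams("满脑子的梦想", 2))
```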

Regular expression Match

Regular expressions are supported in attributes. However, there are no optimizing data structures for query speed: the expression is run over all attribute values.
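
To illustrate why this gets slower as the number of values grows, the evaluation can be pictured as a linear scan. This is a sketch only: Python's re module stands in for Vespa's regular expression engine, and the function name is hypothetical. The document ids and values are the ones from the attribute dump earlier on this page:

```python
import re


def regex_scan(pattern, values_by_doc):
    """Run the expression over every attribute value; the cost grows with the value count."""
    compiled = re.compile(pattern)
    return [doc for doc, value in values_by_doc.items() if compiled.search(value)]


albums = {
    1: "A Head Full of Dreams",
    2: "Love Is Here To Stay",
    3: "Hardwired...To Self-Destruct",
    4: "Liebe ist für alle da",
    5: "When We All Fall Asleep, Where Do We Go?",
}

print(regex_scan(r"^L", albums))
```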

Ranking

The default ranking uses nativeRank as the first-phase function, i.e. a function returning the value of the nativeRank rank feature, and has no second-phase.

A good, simple alternative to nativeRank for text ranking is the BM25 rank feature.
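
For intuition, the textbook Okapi BM25 score of a single query term can be sketched as below. This is the generic formula, not a copy of Vespa's implementation; the function names and the defaults k1=1.2 and b=0.75 are assumptions for this example, so check the rank feature reference for the exact definition and parameters.

```python
import math


def idf(doc_count, docs_with_term):
    """Inverse document frequency: rare terms get a higher value."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term(tf, field_len, avg_field_len, doc_count, docs_with_term, k1=1.2, b=0.75):
    """Score one query term: saturating in tf, penalizing longer-than-average fields."""
    length_norm = k1 * (1 - b + b * field_len / avg_field_len)
    return idf(doc_count, docs_with_term) * tf * (k1 + 1) / (tf + length_norm)


# Same term frequency, but the second term is much rarer in the corpus.
print(bm25_term(tf=2, field_len=5, avg_field_len=5, doc_count=1000, docs_with_term=500))
print(bm25_term(tf=2, field_len=5, avg_field_len=5, doc_count=1000, docs_with_term=5))
```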

If the expression is written manually, it might be most convenient to stick with using the fieldMatch(name) feature for each field. This feature combines the more basic fieldMatch features in a reasonable way. A good way to combine the fieldMatch scores of the fields is to use a weighted average. Another way is to combine the field match scores using the fieldMatch(name).weight/significance/importance features, which take term weight or rareness or both into account. These allow a normalized score to be produced by simply summing, for each field, the product of this feature and any other normalized per-field score. In addition, some attribute value(s) must usually be included to determine the a priori quality of each document.

For example, assuming the title field is more important than the body field, create a ranking expression which gives more weight to that field. Vespa contains some built-in convenience support for this - weights can be set in the individual fields by weight: <number>, and the match feature can be used to get a weighted average of the fieldMatch scores of each field. The overall ranking expression might contain other ranking dimensions than just text match, like freshness, the quality of the document, or any other property of the document or query.
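
The arithmetic of such a weighted average is simple. A sketch, with made-up scores and weights for the title and body fields (this only illustrates the calculation, it is not how Vespa computes the feature internally):

```python
def weighted_field_score(scores, weights):
    """Weighted average of normalized per-field scores."""
    total_weight = sum(weights[field] for field in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[field] * weights[field] for field in scores) / total_weight


# Hypothetical per-field match scores in [0, 1] and static field weights.
scores = {"title": 0.9, "body": 0.3}
weights = {"title": 200, "body": 100}

print(weighted_field_score(scores, weights))
```

With these numbers, the title score counts twice as much as the body score, so the result lands closer to 0.9 than a plain average would.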

Weight, significance and connectedness

Modify the values of the match features from the query by sending weight, significance and connectedness with the query:

Weight

Set query term weight. Example: ... where (title contains ({weight:200}"heads") AND title contains "tails") specifies that heads is twice as important for the final rank score as tails (the default weight is 100).

Weight is used in fieldMatch(name).weight, which can be multiplied with fieldMatch(name) to yield a weighted score for the field, and in fieldMatch(name).weightedOccurrence to get an occurrence score which is higher if higher weighted terms occur most. Configure static field weights in the schema.

Significance

Set query term significance - how rare a particular term is in the corpus or the language. This is sometimes valuable information because if a document matches a rare word, it might mean the document is more important than one which matches a common word. Significance is calculated automatically by Vespa during indexing, but can also be overridden by setting the significance values on the query terms in a Searcher component. Significance is accessible in fieldMatch(name).significance, which can be used the same way as weight. Weight and significance are also averaged into fieldMatch(name).importance for convenience.

Connectedness

Signify the degree of connection between adjacent terms in the query - set query term connectivity to another term. For example, the query new york newspaper should have a higher connectedness between the terms "new" and "york" than between "york" and "newspaper" to rank documents higher if they contain "new york" as a phrase. Term connectedness is taken into account by fieldMatch(name).proximity, which is also an important contribution to fieldMatch(name). Connectedness is a normalized value which is 0.1 by default. It must be set by a custom Searcher, looking up connectivity information from somewhere - there is no query syntax for it.

Match Configuration Debug

Sometimes it is useful to inspect generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument below:

$ vespa-configproxy-cmd

Note the difference for the artist and album fields:

field artist type string {
    indexing: summary | index
    match   : exact
}
field album type string {
    indexing: summary | index
}

$ vespa-get-config -n vespa.configdefinition.ilscripts \
    -i container/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor
maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "album"
ilscript[0].docfield[2] "year"
ilscript[0].docfield[3] "category_scores"
ilscript[0].docfield[4] "tags"
ilscript[0].content[0] "clear_state | guard { input artist | exact                          | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input album  | tokenize normalize stem:"BEST" | summary album | index album; }"
ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"
ilscript[0].content[4] "input tags | passthrough tags"