
Text Matching

Refer to the text ranking introduction for ranking, and review the different match modes. See the text search and text search through ML tutorials. Also relevant is the guide for Semantic Retrieval for Question Answering Applications. Finally, refer to Linguistics for details.

Query tracing is useful when debugging text matching.

The examples in this guide build on the quick start. Follow it, reconfigure the schema fields as described in each section, deploy, and stop after the feeding step.

Index and attribute

Text search is normally best run using a string field in index mode:

field album type string {
    indexing: summary | index
}

Below, find details on transformations to the text for text indexing and search. Use inspection tools to dump data from a content node, and query tracing to understand matching details.

The album field has index mode. For text fields, this enables the transformations below, and increases query recall.

Add another document summary to schemas/music.sd that contains an extra summary field using tokens with the source set to the proper index field:

@@ -27,6 +27,15 @@
+    document-summary my-debug-summary {
+        summary album type string { }
+        summary album_tokens type string {
+            source: album
+            tokens
+        }
+        from-disk
+    }
     fieldset default {
         fields: artist, album

Redeploy the application to enable the new document summary:

$ vespa deploy --wait 300

Show original content of album field:

$ vespa query "select * from music where true" summary=my-debug-summary | \
  jq -c '.root.children[].fields.album'
"Liebe ist für alle da"
"A Head Full of Dreams"
"Hardwired...To Self-Destruct"
"When We All Fall Asleep, Where Do We Go?"
"Love Is Here To Stay"

Show tokens used for indexing the album field:

$ vespa query "select * from music where true" summary=my-debug-summary | \
  jq -c '.root.children[].fields.album_tokens'

Observe the linguistic transformations applied to the data before it is indexed:

Transformation                 Type
Hardwired...To → hardwire to   Tokenization - split terms on non-word characters, here "..."
Head → head                    Lowercasing
für → fur                      Normalizing
dreams → dream                 Stemming
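
The four transformations in the table can be approximated in a few lines of Python. This is a rough sketch for intuition only, not Vespa's linguistics implementation: the helper names are hypothetical, accent removal stands in for normalization, and the toy stemmer only strips a plural "s", so it leaves "hardwired" unstemmed where Vespa produces "hardwire".

```python
import re
import unicodedata


def normalize(token: str) -> str:
    # Decompose accented characters, then drop the combining marks (ü -> u).
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(token: str) -> str:
    # Toy stand-in for a real stemmer: strip a single plural "s" only.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def index_tokens(text: str) -> list[str]:
    # Tokenize on non-word characters, then lowercase, normalize and stem.
    words = re.findall(r"\w+", text)
    return [toy_stem(normalize(word.lower())) for word in words]
```

With these assumptions, index_tokens("A Head Full of Dreams") returns ['a', 'head', 'full', 'of', 'dream'].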

Then, change from index to attribute in schemas/music.sd (and remove all bm25 settings):

@@ -13,8 +13,7 @@
         field album type string {
-            indexing: summary | index
-            index: enable-bm25
+            indexing: summary | attribute
         field year type int {
@@ -51,7 +50,7 @@
             query(user_profile) tensor<float>(cat{})
         first-phase {
-            expression: bm25(album) + 0.25 * sum(query(user_profile) * attribute(category_scores))
+            expression: 0.25 * sum(query(user_profile) * attribute(category_scores))

Run the tutorial again using the new schema and stop after feeding. Show the tokens now used for the album field:

$ vespa query "select * from music where true" summary=my-debug-summary | \
  jq -c '.root.children[].fields.album_tokens'
["a head full of dreams"]
["love is here to stay"]
["when we all fall asleep, where do we go?"]
["liebe ist für alle da"]
["hardwired...to self-destruct"]

The most important observation is that the strings are added as-is to attributes. Hence, values are matched in full, including whitespace. When searching attributes, both query terms and attribute data are lowercased before matching unless the match setting for the field has been set to cased. Read more about the attribute word match mode.
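
As an illustration of the matching rule just described, the sketch below compares a query against a full attribute value. The function name is hypothetical, and Python's str.lower() is only an approximation of the lowercasing Vespa applies.

```python
def attribute_match(query: str, value: str, cased: bool = False) -> bool:
    # The whole value must match, whitespace included.
    if not cased:
        # Uncased (the default in this sketch): lowercase both sides first.
        query, value = query.lower(), value.lower()
    return query == value
```

Here attribute_match("a head full of dreams", "A Head Full of Dreams") is True, while a single word such as "head" does not match the full value.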

Prefix match

Use the prefix annotation to match string prefixes in attributes of type string:

field album type string {
    indexing: summary | attribute
}

$ vespa query 'select * from music where album contains ({prefix: true}"a hea")'

The search-suggestions sample application uses prefix search; see its README for a design discussion.

To prefix-match individual terms in a string, use an attribute with array of strings in addition to the index string field, e.g.:

schema company {
    document company {
        field company_name_string type string {
            indexing: index | summary
        }
    }
    field company_name_array type array<string> {
        indexing: input company_name_string | trim | split " +" | attribute | summary
    }
}

Use the indexing language to split the string, as shown above. With "Goldman" and "Sachs" stored as separate array elements, prefix queries like "Gold" and "Sach" will match.
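
The effect of the trim and split steps, and of prefix matching against the resulting array, can be sketched in Python. The function names are hypothetical, and the lowercasing assumes the default uncased attribute matching.

```python
import re


def split_terms(name: str) -> list[str]:
    # Approximates `trim | split " +"`: strip outer whitespace,
    # then split on runs of one or more spaces.
    return re.split(r" +", name.strip())


def prefix_matches(query: str, terms: list[str]) -> bool:
    # A document matches if any array element starts with the query term.
    q = query.lower()
    return any(term.lower().startswith(q) for term in terms)
```

For example, split_terms("Goldman Sachs") gives ['Goldman', 'Sachs'], and both "Gold" and "Sach" are prefixes of one of the elements.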

Fuzzy match

Use fuzzy matching to match string attributes within a configurable edit distance. Field configuration and query:

field album type string {
    indexing: summary | attribute
    attribute: fast-search
}

$ vespa query 'select * from music where album contains ({maxEditDistance: 1}fuzzy("A Head Full of Dreems"))'

Fuzzy matching works well for misspelled queries. See the reference for the use of prefixLength and fast-search.

Character normalization is not performed for fuzzy matches.
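
To make the edit distance concrete, here is a minimal Levenshtein sketch in Python. It is not Vespa's implementation; the function names are hypothetical, and lowercasing both sides is an assumption that mirrors uncased attribute matching.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, value: str, max_edit_distance: int) -> bool:
    # Match when the full strings are within the allowed number of edits.
    return edit_distance(query.lower(), value.lower()) <= max_edit_distance
```

"A Head Full of Dreems" differs from "A Head Full of Dreams" by one substitution, so it matches with max_edit_distance set to 1.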

Regular expression match

Regular expressions are supported in attributes. There are, however, no data structures that optimize query speed: the expression is run over all attribute values.

field album type string {
    indexing: summary | attribute
}

Example, matching from start of string:

$ vespa query 'select * from music where album matches "^a head fu[l]+ of dreams"'

A substring search:

$ vespa query 'select * from music where album matches "head"'

Character normalization is not performed for regular expression matches.
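
The two queries above behave like an unanchored regular expression search over every value. The Python sketch below mimics that scan; the function name is hypothetical, Python's re syntax is not guaranteed to be identical to Vespa's, and case-insensitive matching is an assumption based on the default uncased attribute.

```python
import re


def regex_scan(pattern: str, values: list[str]) -> list[str]:
    # No index is used: the expression is evaluated against every value.
    rx = re.compile(pattern, re.IGNORECASE)
    return [value for value in values if rx.search(value)]


albums = [
    "Liebe ist für alle da",
    "A Head Full of Dreams",
    "Hardwired...To Self-Destruct",
    "When We All Fall Asleep, Where Do We Go?",
    "Love Is Here To Stay",
]
```

Both regex_scan("^a head fu[l]+ of dreams", albums) and regex_scan("head", albums) return only "A Head Full of Dreams".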

N-Gram match

N-gram matching is most often used for languages that are not tokenized (for example, many Asian languages). With the configuration below, "A Head Full of Dreams" is indexed as the following grams:

field album type string {
    indexing: summary | index
    match {
        gram
        gram-size: 3
    }
}

a	1
ams	1
dre	1
ead	1
eam	1
ful	1
hea	1
of	1
rea	1
ull	1

This enables matching of any 3-character substring. This is generally not useful for text search, other than possibly as an extra field holding these n-grams for increased recall.

Example: with the configuration below, the album name translated to traditional Chinese, 滿腦子的夢想, is indexed as the following grams:

field album type string {
    indexing: {
        "zh-hant" | set_language;
        summary | index
    }
    match {
        gram
        gram-size: 2
    }
}

子的	1
夢想	1
滿腦	1
的夢	1
腦子	1

This enables matching of any 2-character substring, which makes more sense in traditional Chinese than in English.
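
The gram lists in both examples can be reproduced with a sliding window over each lowercased word. This Python sketch is an approximation for intuition, with a hypothetical function name; it assumes words no longer than the gram size are kept whole, which is consistent with "a" and "of" in the English list.

```python
import re


def grams(text: str, size: int) -> list[str]:
    out = set()
    for word in re.findall(r"\w+", text.lower()):
        if len(word) <= size:
            # Short words are kept as a single gram.
            out.add(word)
        else:
            # Slide a window of `size` characters over the word.
            for i in range(len(word) - size + 1):
                out.add(word[i:i + size])
    return sorted(out)
```

grams("A Head Full of Dreams", 3) returns ['a', 'ams', 'dre', 'ead', 'eam', 'ful', 'hea', 'of', 'rea', 'ull'], and grams("滿腦子的夢想", 2) yields the five two-character grams listed above (the sort order may differ).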

Example use case

What is the best way to index short, word-length documents, like the names of all locations/towns in the world, such that they:

  • Are robust to misspellings in user queries, e.g.: "Amsterdam" --> "amstredam"
  • Are cross-lingual for search, e.g.: "America" --> "美國"

To make this multilingual, use an array<string> field to store all the alternatives. One can also translate to a single canonical language at query time and index only that language, but in cases with very short documents, opt for doing it at indexing time.

Alternatives for matching with spell checking:

  1. Make the field an attribute and use fuzzy matching.
  2. Make the field an index with gram matching.
  3. Since there is an array of alternatives anyway, add all the misspellings that should match to it. Consider using a weighted set instead, to weight them by closeness to the original.

Alternative 3 gives the cheapest queries and exact control over misspelled matching, at the cost of a larger index, more work for the developer, and more complicated adjustment of spell correction. Alternative 1 is the most expensive, but maybe also the most convenient; there are currently no rank signals giving the match quality. Alternative 2 is in between, and will probably work best when incorporating ranking signals that use proximity (such as nativeRank, but not bm25).
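
As a sketch of alternative 3, the Python below generates adjacent-character transpositions of a name and assigns them a lower weight than the exact spelling, in the shape of a weighted set. Everything here is illustrative: the function names and the weights 100 and 50 are arbitrary assumptions, and a real application would likely cover more kinds of misspellings.

```python
def transposition_variants(word: str) -> set[str]:
    # Swap each pair of adjacent, different characters once.
    variants = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return variants


def weighted_alternatives(word: str, exact_weight: int = 100,
                          variant_weight: int = 50) -> dict[str, int]:
    # Map each spelling to a weight; the exact spelling gets the highest.
    base = word.lower()
    alternatives = {v: variant_weight for v in transposition_variants(base)}
    alternatives[base] = exact_weight
    return alternatives
```

weighted_alternatives("Amsterdam") contains "amstredam" with weight 50 and "amsterdam" with weight 100.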

Read Simplify Search with Multilingual Embedding Models for semantic matching and ranking.

Query Trace

Adding trace.level=2 gives insight when testing queries. Example: attribute lowercasing (observe that queries with "Liebe" and "liebe" give the same result):

$ vespa config set target local
$ vespa query 'select * from music where album contains "Liebe ist für alle da"' \
  ranking=rank_albums \
  trace.level=2

Also try query tracing to see how query parsing changes with index and attribute indexing modes.
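
The same traced query can be sent to the query API over HTTP. The Python sketch below only builds the request URL; the function name is hypothetical, and the endpoint http://localhost:8080/search/ is an assumption based on a local quick start deployment.

```python
from urllib.parse import urlencode


def build_query_url(yql: str, ranking: str, trace_level: int,
                    endpoint: str = "http://localhost:8080/search/") -> str:
    # Query API parameters are passed as URL-encoded request parameters.
    params = {"yql": yql, "ranking": ranking, "trace.level": trace_level}
    return endpoint + "?" + urlencode(params)


url = build_query_url(
    'select * from music where album contains "Liebe ist für alle da"',
    ranking="rank_albums",
    trace_level=2,
)
```

Fetching the resulting URL with any HTTP client should return the same result and trace as the vespa query command above.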

Appendix: Match Configuration Debugging

Inspect generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument used below:

$ docker exec vespa sh -c vespa-configproxy-cmd | grep IndexingProcessor

  vespa.configdefinition.ilscripts,default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor, ...

Start over, deploy with the indexing settings below and feed data. Note the difference for the artist (with exact matching) and album fields:

field artist type string {
    indexing: summary | index
    match: exact
}
field album type string {
    indexing: summary | index
}

$ docker exec vespa sh -c 'vespa-get-config \
  -n vespa.configdefinition.ilscripts \
  -i default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor'

maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "album"
ilscript[0].docfield[2] "year"
ilscript[0].docfield[3] "category_scores"
ilscript[0].content[0] "clear_state | guard { input artist | exact | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input album | tokenize normalize stem:"BEST" | summary album | index album; }"
ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"
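
To compare fields side by side, the indexing steps can be pulled out of the output with a small script. This is a convenience sketch, not a Vespa tool: the function name and the regular expression are assumptions that fit the output format shown above.

```python
import re

# Captures the field name and the pipeline that follows `input <field> |`.
CONTENT = re.compile(r"guard \{ input (\w+) \| (.*?); \}")


def field_pipelines(config_text: str) -> dict[str, list[str]]:
    pipelines = {}
    for line in config_text.splitlines():
        match = CONTENT.search(line)
        if match:
            field, steps = match.group(1), match.group(2)
            # Split the pipeline into its individual steps.
            pipelines[field] = [step.strip() for step in steps.split(" | ")]
    return pipelines
```

For the output above, the first step for artist is exact, while the first step for album is the tokenize normalize stem expression.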