Refer to the text ranking introduction for ranking, and review the different match modes. See the text search and text search through ML tutorials. Also relevant is the guide for Semantic Retrieval for Question Answering Applications. Finally, refer to Linguistics for details.
Query tracing is useful when debugging text matching.
The examples in this guide are based on the quick start. Follow the quick start, reconfigure the schema field as shown, deploy, and stop after the feeding step.
Text search is normally best run using a string field in index mode:
field album type string { indexing: summary | index }
Below, find details on transformations to the text for text indexing and search. Use inspection tools to dump data from a content node, and query tracing to understand matching details.
The album field has index mode. For text fields, this enables the transformations below, and increases query recall.
    $ jq .fields.album < sample-apps/album-recommendation/ext/documents.jsonl
    "A Head Full of Dreams"
    "Hardwired...To Self-Destruct"
    "Liebe ist für alle da"
    "Love Is Here To Stay"
    "When We All Fall Asleep, Where Do We Go?"
Flush and dump index data (file name can change for subsequent flushes):
    $ docker exec vespa sh -c 'vespa-proton-cmd --local triggerFlush && \
        vespa-index-inspect dumpwords \
        --indexdir /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/index/index.flush.1 \
        --field album'
    a 1
    all 2
    asleep 1
    da 1
    destruct 1
    do 1
    dream 1
    fall 1
    full 1
    fur 1
    go 1
    hardwire 1
    head 1
    here 1
    is 1
    ist 1
    lieb 1
    love 1
    of 1
    self 1
    stay 1
    to 2
    we 1
    when 1
    where 1

Observe the linguistic transformations applied to the data before it is indexed:
Transformation | Type |
---|---|
Hardwired...To → hardwire to | Tokenization - split terms on non-word characters, here "..." |
Head → head | Lowercasing |
für → fur | Normalizing |
dreams → dream | Stemming |
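The first three transformations in the table can be approximated in a few lines of Python. This is only an illustrative sketch, not Vespa's linguistics implementation: the function names `strip_accents` and `analyze` are hypothetical, and stemming is deliberately left out, so the output keeps "hardwired" and "liebe" where the index dump shows "hardwire" and "lieb".

```python
import re
import unicodedata


def strip_accents(token: str) -> str:
    """Decompose characters and drop combining marks, e.g. 'für' -> 'fur'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list[str]:
    """Tokenize on non-word characters, lowercase, then normalize accents.

    Stemming is NOT applied here; a real linguistics library would do that
    as a further step.
    """
    # Compose first so that an accented letter is a single word character.
    composed = unicodedata.normalize("NFC", text)
    tokens = re.findall(r"\w+", composed)  # "..." and "-" act as separators
    return [strip_accents(token.lower()) for token in tokens]


print(analyze("Hardwired...To Self-Destruct"))
print(analyze("Liebe ist für alle da"))
```

Comparing this output with the `dumpwords` listing makes it easier to see which differences come from stemming alone.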
Then, change from index to attribute in schemas/music.sd (and remove all bm25 settings):
     field album type string {
    -    indexing: summary | index
    -    index: enable-bm25
    +    indexing: summary | attribute
     }
     field year type int {
    @@ -42,7 +41,7 @@
         query(user_profile) tensor(cat{})
     }
     first-phase {
    -    expression: bm25(album) + 0.25 * sum(query(user_profile) * attribute(category_scores))
    +    expression: 0.25 * sum(query(user_profile) * attribute(category_scores))
     }

Run the tutorial again using the new schema, and stop after feeding. Find the snapshot number (here 12) and inspect data from the attributes:
    $ docker exec vespa sh -c \
        'ls /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album'
    meta-info.txt
    snapshot-12

    $ docker exec vespa sh -c \
        'cd /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-12 && \
        vespa-proton-cmd --local triggerFlush && \
        vespa-attribute-inspect -p album && \
        cat album.out'
    doc 0: valueCount(1)
        0: []
    doc 1: valueCount(1)
        0: [A Head Full of Dreams]
    doc 2: valueCount(1)
        0: [Love Is Here To Stay]
    doc 3: valueCount(1)
        0: [Hardwired...To Self-Destruct]
    doc 4: valueCount(1)
        0: [Liebe ist für alle da]
    doc 5: valueCount(1)
        0: [When We All Fall Asleep, Where Do We Go?]
The most important observation is that the strings are added as-is to attributes. Hence, values are matched in full, including whitespace. When searching attributes, both query terms and attribute data are lowercased before matching. Read more about the attribute word match mode.
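To make the contrast with index mode concrete, the following Python sketch models the behavior described above: the whole stored value is compared against the whole query string after lowercasing both. The helper `attribute_matches` is a hypothetical name used for illustration only, not part of any Vespa API.

```python
def attribute_matches(stored_value: str, query_term: str) -> bool:
    """Whole-value comparison after lowercasing both sides."""
    return stored_value.lower() == query_term.lower()


albums = ["A Head Full of Dreams", "Liebe ist für alle da"]

# The full string matches regardless of case.
print([a for a in albums if attribute_matches(a, "liebe ist für alle da")])

# A single word does not match, because the value is not tokenized.
print([a for a in albums if attribute_matches(a, "head")])
```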
Use the prefix annotation to match string prefixes in attributes of type string:
field album type string { indexing: summary | attribute }
$ vespa query 'select * from music where album contains ({prefix: true}"a hea")'
The search-suggestions sample application uses prefix search, see README for a design discussion.
To prefix-match individual terms in a string, use an attribute with array of strings in addition to the index string field, e.g.:
    schema company {
        document company {
            field company_name_string type string {
                indexing: index | summary
            }
        }
        field company_name_array type array<string> {
            indexing: input company_name_string | trim | split " +" | attribute | summary
        }
    }
Use the indexing-language to split the string, as shown above. Adding "Goldman" and "Sachs" will match query terms like "Gold" and "Sach".
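The difference between prefix-matching the full string and prefix-matching each term can be sketched in Python. This is an approximation for illustration: `split_terms` mimics the `trim | split " +"` expression with a regular expression, and all function names are hypothetical.

```python
import re


def split_terms(name: str) -> list[str]:
    """Mimic `trim | split " +"`: strip the ends, split on runs of spaces."""
    return re.split(r" +", name.strip())


def prefix_match(value: str, prefix: str) -> bool:
    """Case-insensitive prefix test against one attribute value."""
    return value.lower().startswith(prefix.lower())


def any_term_prefix_match(terms: list[str], prefix: str) -> bool:
    """An array attribute matches if any of its elements matches."""
    return any(prefix_match(term, prefix) for term in terms)


name = "Goldman Sachs"
terms = split_terms(name)

print(terms)
print(prefix_match(name, "Sach"))            # prefix of the whole string only
print(any_term_prefix_match(terms, "Sach"))  # prefix of any individual term
```

With a single string attribute only prefixes of the whole value match, which is why the array field is added alongside it.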
Use fuzzy matching to match in string attributes with a configurable edit distance. Field configuration:

    field album type string {
        indexing: summary | attribute
        attribute: fast-search
    }
$ vespa query 'select * from music where album contains ({maxEditDistance: 1}fuzzy("A Head Full of Dreems"))'
Fuzzy matching is great for misspellings. See use of prefixLength and fast-search in the reference.
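As a rough model of what an edit-distance limit means, the sketch below computes a plain Levenshtein distance and accepts values within the limit. It assumes Levenshtein is the relevant metric and that both sides are lowercased first; the names `edit_distance` and `fuzzy_matches` are hypothetical, and `prefixLength` is not modeled.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(stored_value: str, query: str, max_edit_distance: int) -> bool:
    """Accept the value if it is within the allowed number of edits."""
    return edit_distance(stored_value.lower(), query.lower()) <= max_edit_distance


print(fuzzy_matches("A Head Full of Dreams", "A Head Full of Dreems", 1))
print(fuzzy_matches("A Head Full of Dreams", "A Head Full of Drums", 1))
```

The misspelling "Dreems" is one substitution away from "Dreams", so it is accepted with a limit of 1, while "Drums" needs two edits.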
Regular expressions are supported in attributes. There are, however, no data structures that optimize query speed: the expression is run over all attribute values.
field album type string { indexing: summary | attribute }
Example, matching from start of string:
$ vespa query 'select * from music where album matches "^a head fu[l]+ of dreams"'
A substring search:
$ vespa query 'select * from music where album matches "head"'
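The cost model described above, a linear scan over every value, looks like this in a Python sketch. It uses Python's `re` module rather than Vespa's regular expression engine, and it assumes case-insensitive matching because the lowercase example pattern is meant to match the stored title; `regex_scan` is a hypothetical helper name.

```python
import re


def regex_scan(values: list[str], pattern: str) -> list[str]:
    """Run the expression over every value; there is no index to narrow the scan."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [value for value in values if compiled.search(value)]


albums = [
    "A Head Full of Dreams",
    "Hardwired...To Self-Destruct",
    "Liebe ist für alle da",
    "Love Is Here To Stay",
    "When We All Fall Asleep, Where Do We Go?",
]

print(regex_scan(albums, r"^a head fu[l]+ of dreams"))  # anchored at the start
print(regex_scan(albums, r"head"))                      # substring anywhere
```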
N-gram matching is most often used for languages that are not tokenized (for example, many Asian languages). Example - this will index "A Head Full of Dreams" to:

    field album type string {
        indexing: summary | index
        match {
            gram
            gram-size: 3
        }
    }

    a 1
    ams 1
    dre 1
    ead 1
    eam 1
    ful 1
    hea 1
    of 1
    rea 1
    ull 1

This enables matching of any 3-character substring. This is generally not useful for text searching, other than possibly as an extra field of n-grams for increased recall.
Example - the album name translated to traditional Chinese, 满脑子的梦想, is indexed as:

    field album type string {
        indexing: { "zh-hant" | set_language; summary | index }
        match {
            gram
            gram-size: 2
        }
    }

    子的 1
    梦想 1
    满脑 1
    的梦 1
    脑子 1

This enables matching of any 2-character substring, which makes more sense in traditional Chinese than in English.
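The two word lists above can be reproduced with a small Python sketch. It is inferred from the dumps rather than taken from Vespa's implementation: it assumes the text is lowercased and split on non-word characters first, that grams never span token boundaries, and that tokens no longer than the gram size are kept whole. The function name `ngrams` is hypothetical.

```python
import re


def ngrams(text: str, gram_size: int) -> list[str]:
    """Sliding character windows of length gram_size within each token."""
    grams: list[str] = []
    for token in re.findall(r"\w+", text.lower()):
        if len(token) <= gram_size:
            grams.append(token)  # short tokens such as "a" and "of" stay whole
        else:
            grams.extend(
                token[i:i + gram_size]
                for i in range(len(token) - gram_size + 1)
            )
    return grams


print(sorted(set(ngrams("A Head Full of Dreams", 3))))
print(sorted(set(ngrams("满脑子的梦想", 2))))
```

Because the Chinese title contains no separators, it is treated as a single token and every adjacent character pair becomes a gram.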
What is the best way to index short word-length documents, like names of all locations/towns in the world, such that they:
To make this multilingual, use an array<string> field to store all the alternatives. One can also translate to a single canonical language at query time and use that language for indexing, but with very short documents it is better to do this at indexing time.
Alternatives for matching with spell checking:
Alternative 3 gives the cheapest queries and exact control over misspelled matching, at the cost of a larger index, more work for the developer, and more complicated adjustment of spell correction. Alternative 1 is the most expensive, but maybe also the most convenient. There are currently no rank signals giving you the match quality. Alternative 2 is in between, and will probably work best when incorporating ranking signals that use proximity (e.g. nativeRank, but not bm25).
Read Simplify Search with Multilingual Embedding Models for semantic matching and ranking.
Adding trace.level=2 gives insight when testing queries - example attribute lowercasing (observe that queries with "Liebe" and "liebe" give the same result):
    $ vespa config set target local
    $ vespa query 'select * from music where album contains "Liebe ist für alle da"' \
        ranking=rank_albums \
        trace.level=2
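When testing from a script rather than the CLI, the same parameters can be sent as an HTTP query. The sketch below only builds the request URL and does not contact a server. It assumes a local container answering on `http://localhost:8080` with the `/search/` path; both are assumptions to adjust for the actual deployment, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(endpoint: str, params: dict[str, str]) -> str:
    """Percent-encode the query parameters onto the search path."""
    return f"{endpoint.rstrip('/')}/search/?{urlencode(params)}"


url = build_query_url(
    "http://localhost:8080",  # assumed local endpoint
    {
        "yql": 'select * from music where album contains "Liebe ist für alle da"',
        "ranking": "rank_albums",
        "trace.level": "2",
    },
)

print(url)
# Decode again to confirm the parameters survive encoding unchanged.
print(parse_qs(urlsplit(url).query)["trace.level"])
```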
Also try query tracing to see how query parsing changes with index and attribute indexing modes.
Inspect generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument used below:
    $ docker exec vespa sh -c vespa-configproxy-cmd | grep IndexingProcessor
    vespa.configdefinition.ilscripts,default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor, ...
Start over, deploy with the indexing settings below and feed data. Note the difference for the artist (with exact matching) and album fields:
    field artist type string {
        indexing: summary | index
        match : exact
    }
    field album type string {
        indexing: summary | index
    }

    $ docker exec vespa sh -c 'vespa-get-config \
        -n vespa.configdefinition.ilscripts \
        -i default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor'
    maxtermoccurrences 100
    fieldmatchmaxlength 1000000
    ilscript[0].doctype "music"
    ilscript[0].docfield[0] "artist"
    ilscript[0].docfield[1] "album"
    ilscript[0].docfield[2] "year"
    ilscript[0].docfield[3] "category_scores"
    ilscript[0].content[0] "clear_state | guard { input artist | exact | summary artist | index artist; }"
    ilscript[0].content[1] "clear_state | guard { input album | tokenize normalize stem:"BEST" | summary album | index album; }"
    ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
    ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"