This guide demonstrates tokenization, linguistic processing and matching over string fields in Vespa. The guide features examples based on the quick start.
Refer to the ranking introduction for ranking, and review the different match modes that Vespa supports per field. See the text search and text search through ML tutorials. Finally, refer to linguistics for linguistic processing in Vespa.
Using query tracing is useful when debugging text matching.
Vespa string fields can have a mix of settings specified per field, such as the indexing and match modes.
Free-text search is normally solved using a string field in index mode:
field album type string {
    indexing: summary | index
}
Below, find details on transformations to the text for text indexing and search using the quick start sample application as an example.
The album field has index mode. For text fields, this enables transformations of the string field to increase query recall.
The following is useful for dumping the resulting text tokens after indexing, to understand the transformations. Combined with query tracing, this helps explain why a document field does or does not match a query.
Add another document summary to schemas/music.sd that contains an extra summary field using tokens with the source set to the proper index field:
document-summary my-debug-summary {
    summary album {}
    summary album_tokens {
        source: album
        tokens
    }
    from-disk
}

fieldset default {
    fields: artist, album
}
Redeploy the application to enable the new document summary:
$ vespa deploy --wait 300
Show original content of album field:
$ vespa query "select * from music where true" summary=my-debug-summary | \ jq -c '.root.children[].fields.album' "Liebe ist für alle da" "A Head Full of Dreams" "Hardwired...To Self-Destruct" "When We All Fall Asleep, Where Do We Go?" "Love Is Here To Stay"
Show tokens used for indexing the album field:
$ vespa query "select * from music where true" summary=my-debug-summary | \ jq -c '.root.children[].fields.album_tokens' ["lieb","ist","fur","all","da"] ["a","head","full","of","dream"] ["hardwire","to","self","destruct"] ["when","we","all","fall","asleep","where","do","we","go"] ["love","is","here","to","stay"]
Observe the linguistic transformations applied to the data before it is indexed:
| Transformation | Type |
|---|---|
| Hardwired...To → hardwire to | Tokenization - terms are split on non-letter characters, here "..." |
| Head → head | Lowercasing |
| für → fur | Normalizing |
| dreams → dream | Stemming |
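These transformations apply to query terms as well. As an illustration (assuming the quick start application is still deployed with the index-mode schema above), a stemmed or normalized query form should still match:

$ vespa query 'select * from music where album contains "dream"'

$ vespa query 'select * from music where album contains "fur"'

The first query matches "A Head Full of Dreams" because both the indexed tokens and the query term are stemmed to "dream"; the second matches "Liebe ist für alle da" because "für" is normalized to "fur" at indexing time.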
Then, change from index to attribute in schemas/music.sd (and remove all bm25 settings):
@@ -13,8 +13,7 @@
     }

     field album type string {
-        indexing: summary | index
-        index: enable-bm25
+        indexing: summary | attribute
     }

     field year type int {
@@ -51,7 +50,7 @@
         query(user_profile) tensor<float>(cat{})
     }

     first-phase {
-        expression: bm25(album) + 0.25 * sum(query(user_profile) * attribute(category_scores))
+        expression: 0.25 * sum(query(user_profile) * attribute(category_scores))
     }
 }
Run the tutorial again using the new schema, stopping after feeding. Then show the tokens used for indexing the album field:
$ vespa query "select * from music where true" summary=my-debug-summary | \ jq -c '.root.children[].fields.album_tokens' ["a head full of dreams"] ["love is here to stay"] ["when we all fall asleep, where do we go?"] ["liebe ist für alle da"] ["hardwired...to self-destruct"]
The most important observation is that for attributes the strings are stored as-is, and matching considers the full value, including whitespace (no tokenization). The only transformation is lowercasing: both query terms and attribute data are lowercased before matching, unless the match setting for the field is set to cased. Read more about the attribute word match mode.
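To illustrate (assuming the attribute-mode schema above is deployed and the data re-fed), matching the full album value gives a hit, while a single word from it does not:

# Hit - the full value matches after lowercasing
$ vespa query 'select * from music where album contains "A Head Full of Dreams"'

# No hit - a single term does not match the full attribute value
$ vespa query 'select * from music where album contains "dreams"'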
Use the prefix annotation to match string prefixes in attributes of type string:
field album type string {
    indexing: summary | attribute
}
Note that regular index fields do not support prefix matching.
$ vespa query 'select * from music where album contains ({prefix: true}"a hea")'
The search-suggestions sample application uses prefix search, see README for a design discussion.
To prefix-match individual terms in a string, use an attribute with an array of strings in addition to the index string field, e.g.:
schema company {
    document company {
        field company_name_string type string {
            indexing: index | summary
        }
    }
    field company_name_array type array<string> {
        indexing: input company_name_string | trim | split " +" | attribute | summary
    }
}
Use the indexing language to split the string, as shown above. The array then holds "Goldman" and "Sachs" as separate values, so query terms like "Gold" and "Sach" can prefix-match them.
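For example, a prefix query against such a field could look like the following (using the hypothetical company schema and field names sketched above, assuming documents have been fed):

$ vespa query 'select * from company where company_name_array contains ({prefix: true}"Gold")'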
Use fuzzy matching to match in string attributes with configurable edit distance. Field configuration:
field album type string {
    indexing: summary | attribute
    attribute: fast-search
}
$ vespa query 'select * from music where album contains ({maxEditDistance: 1}fuzzy("A Head Full of Dreems"))'
Fuzzy matching is great for misspellings. See use of prefixLength and fast-search in the reference.
Character normalization is not performed for fuzzy matches.
By default, fuzzy matches the full string value against the query. For use cases such as type-ahead search this means a user query such as "Ahead Full" will fail to match the document string "A Head Full of Dreams", both when using fuzzy matching (too many characters missing) and when using regular, non-fuzzy prefix matching (the prefixes do not match exactly).
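For example, with the attribute field above, both of these queries should come up empty:

# No hit - too many edits for whole-string fuzzy matching
$ vespa query 'select * from music where album contains ({maxEditDistance: 1}fuzzy("Ahead Full"))'

# No hit - no album string starts with "ahead full"
$ vespa query 'select * from music where album contains ({prefix: true}"Ahead Full")'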
Adding prefix:true enables fuzzy prefix semantics: if a string has a prefix that can match the query string within the specified maximum number of edits, it is considered a match.
$ vespa query 'select * from music where album contains ({maxEditDistance: 1, prefix: true}fuzzy("Ahead Full"))'
This query will match strings such as "A Head Full of Dreams", "A Head Full of Clouds", "Ahead Full Steam" etc.
Exact prefix locking (prefixLength:n) can be used alongside fuzzy prefix matching to constrain the candidate set to strings whose prefix exactly matches the first n characters of the query. Fuzzy prefix matching then applies to the remainder (suffix) of the candidate string. This greatly speeds up dictionary scans, since only a subset of the dictionary needs to be considered.
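As a sketch (again assuming the attribute field above), the following locks the first 7 characters, "a head ", and applies fuzzy matching with at most one edit to the rest, which should still match "A Head Full of Dreams":

$ vespa query 'select * from music where album contains ({maxEditDistance: 1, prefix: true, prefixLength: 7}fuzzy("A Head Full of Dreems"))'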
{maxEditDistance:2,prefix:true}fuzzy("XY")
will
end up matching every document, since all possible strings can have their prefix transformed to "XY"
with at most 2 edits. This is the case for all fuzzy prefix queries where the length of the query string is equal to,
or lower than, maxEditDistance
. This should be taken into consideration when constructing queries
based on user input.
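With the sample data, a query like this should therefore return all five albums:

$ vespa query 'select * from music where album contains ({maxEditDistance: 2, prefix: true}fuzzy("xy"))'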
Regular expressions are supported for attributes. There are, however, no optimizing data structures for query speed; the expression is evaluated over all attribute values.
field album type string {
    indexing: summary | attribute
}
Example, matching from start of string:
$ vespa query 'select * from music where album matches "^a head fu[l]+ of dreams"'
A substring search:
$ vespa query 'select * from music where album matches "head"'
Character normalization is not performed for regular expression matches.
This is most often used for languages that do not use whitespace to separate words (example: many East Asian languages). Example, using this field configuration:

field album type string {
    indexing: summary | index
    match {
        gram
        gram-size: 3
    }
}

"A Head Full of Dreams" is then indexed as:

a 1
ams 1
dre 1
ead 1
eam 1
ful 1
hea 1
of 1
rea 1
ull 1
This enables matching of any 3-character (or longer) substring of a term. This is generally not useful for text search, other than possibly as an extra field with these n-grams for increased recall.
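For instance, after redeploying with the gram configuration above and re-feeding, a 3-character fragment of a word should be enough to match:

# Matches "A Head Full of Dreams" via the "rea" gram from "dreams"
$ vespa query 'select * from music where album contains "rea"'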
Example - the album name translated to traditional Chinese, 滿腦子的夢想, will index the album title to:
field album type string {
    indexing: {
        "zh-hant" | set_language;
        summary | index
    }
    match {
        gram
        gram-size: 2
    }
}

子的 1
夢想 1
滿腦 1
的夢 1
腦子 1
This enables matching of any 2-character substring, which makes more sense for traditional Chinese than for English.
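As a sketch, assuming the Chinese title has been fed with this configuration, a two-character fragment should match (the language parameter makes query-term processing explicit):

$ vespa query 'select * from music where album contains "夢想"' language=zh-hant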
What is the best way to index short, word-length documents, like the names of all locations/towns in the world, such that they:

- can be matched with some tolerance for misspellings
- can be matched in multiple languages
To make this multilingual, use an array<string> field to store all the alternatives. One can also translate to a canonical single language used in the index at query time, but with very short documents it is better to do this at indexing time.
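A minimal sketch of such a field (hypothetical field name; values are matched as whole, lowercased strings like any string attribute, and the prefix and fuzzy annotations shown earlier apply):

# Holds all language variants of a name in one multi-value attribute
field name_variants type array<string> {
    indexing: summary | attribute
    attribute: fast-search
}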
Alternatives for matching with spell checking:

1. Fuzzy matching
2. N-gram matching
3. Expanding the misspellings at indexing time, indexing the variants
Option 3 gives the cheapest queries and exact control over misspelled matching, but a larger index, more work for the developer, and adjusting spell correction becomes more complicated. Option 1 is the most expensive, but maybe also the most convenient; there are currently no rank signals giving you the match quality. Option 2 is in between, and will probably work best when combined with ranking signals that use proximity (such as nativeRank, but not bm25).
Read Simplify Search with Multilingual Embedding Models for semantic matching and ranking.
Adding trace.level=2 gives insight when testing queries - for example, attribute lowercasing (observe that queries with "Liebe" and "liebe" give the same result):
$ vespa config set target local

$ vespa query 'select * from music where album contains "Liebe ist für alle da"' \
    ranking=rank_albums \
    trace.level=2
Also try query tracing to see how query parsing changes with index and attribute indexing modes.
Inspect generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument used below:
$ docker exec vespa sh -c vespa-configproxy-cmd | grep IndexingProcessor
vespa.configdefinition.ilscripts,default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor, ...
Start over, deploy with the indexing settings below and feed data. Note the difference for the artist (with exact matching) and album fields:
field artist type string {
    indexing: summary | index
    match : exact
}

field album type string {
    indexing: summary | index
}

$ docker exec vespa sh -c 'vespa-get-config \
    -n vespa.configdefinition.ilscripts \
    -i default/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor'
maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "album"
ilscript[0].docfield[2] "year"
ilscript[0].docfield[3] "category_scores"
ilscript[0].content[0] "clear_state | guard { input artist | exact | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input album | tokenize normalize stem:"BEST" | summary album | index album; }"
ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"