Refer to the ranking introduction for Vespa ranking, and review the different match modes. See the text search and text search through ML tutorials. Also relevant is the guide for Semantic Retrieval for Question Answering Applications. Finally, refer to Linguistics for details.
Text search is normally best run using a string field in index mode:
field album type string {
    indexing: summary | index
}
Below, find details on transformations to the text for text indexing and search. Use inspection tools to dump data from a content node, and query tracing to understand matching details.
The album field has index mode. For text fields, this enables the transformations below, and increases query recall. Use the quick start guide and stop after the feeding step. Make sure to feed all 5 albums with titles:
$ cat sample-apps/album-recommendation/src/test/resources/*.json | grep album
"album": "A Head Full of Dreams",
"album": "Hardwired...To Self-Destruct",
"album": "Liebe ist für alle da",
"album": "Love Is Here To Stay",
"album": "When We All Fall Asleep, Where Do We Go?",
Flush and dump index data (file name can change for subsequent flushes):
$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
    /opt/vespa/bin/vespa-index-inspect dumpwords \
    --indexdir /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/index/index.flush.1 \
    --field album'
a 1
all 2
asleep 1
da 1
destruct 1
do 1
dream 1
fall 1
full 1
fur 1
go 1
hardwire 1
head 1
here 1
is 1
ist 1
lieb 1
love 1
of 1
self 1
stay 1
to 2
we 1
when 1
where 1

Observe the linguistic transformations applied to the data before it is indexed:
Transformation | Type |
---|---|
Hardwired...To → Hardwired To | Tokenization - split terms on non-word characters |
Head → head | Lowercasing |
für → fur | Normalizing |
dreams → dream | Stemming |
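The first three rows of the table can be approximated in a few lines of Python. This is only an illustrative sketch, not Vespa's linguistics implementation: the function name `index_terms` is made up here, accent removal stands in for normalization, and stemming is deliberately left out because it depends on a language-specific stemmer.

```python
import re
import unicodedata


def index_terms(text: str) -> list[str]:
    """Approximate tokenization, lowercasing and normalization (no stemming)."""
    # Tokenization: keep runs of word characters, dropping punctuation and whitespace.
    tokens = re.findall(r"\w+", text)
    terms = []
    for token in tokens:
        # Lowercasing.
        lowered = token.lower()
        # Normalizing: decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", lowered)
        terms.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return terms


print(index_terms("Hardwired...To Self-Destruct"))
print(index_terms("Liebe ist für alle da"))
```

Because stemming is skipped, the sketch yields hardwired and liebe where the index dump shows hardwire and lieb.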
Then, change from index to attribute in src/main/application/schemas/music.sd:
field album type string {
-    indexing: summary | index
+    indexing: summary | attribute
}
Run the tutorial again using the new schema, this time dumping data from the attributes (snapshot name can change with flushes):
$ docker exec vespa bash -c '/opt/vespa/bin/vespa-proton-cmd --local triggerFlush && \
    /opt/vespa/bin/vespa-attribute-inspect -p \
    /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album && \
    cat /opt/vespa/var/db/vespa/search/cluster.music/n0/documents/music/0.ready/attribute/album/snapshot-10/album.out'
doc 0: valueCount(1)
    0: []
doc 1: valueCount(1)
    0: [A Head Full of Dreams]
doc 2: valueCount(1)
    0: [Love Is Here To Stay]
doc 3: valueCount(1)
    0: [Hardwired...To Self-Destruct]
doc 4: valueCount(1)
    0: [Liebe ist für alle da]
doc 5: valueCount(1)
    0: [When We All Fall Asleep, Where Do We Go?]
The most important observation is that the strings are added as-is to attributes. Hence, values are matched in full, including whitespace. When searching attributes, both query terms and attribute data are lowercased before matching. Read more about the attribute word match mode.
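The matching rule described above (whole value, lowercased on both sides) can be illustrated with a small Python sketch. The function `attribute_match` is a hypothetical helper written for this illustration; it models only the behavior stated in the text, not Vespa's actual attribute search code.

```python
def attribute_match(query_term: str, attribute_value: str) -> bool:
    """Compare the full attribute value with the query term after lowercasing both."""
    return query_term.lower() == attribute_value.lower()


stored = "Liebe ist für alle da"

# The full value matches regardless of case.
print(attribute_match("liebe ist für alle da", stored))
# A single word does not match, because the value is not tokenized.
print(attribute_match("liebe", stored))
```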
Adding trace.level=2 gives insight when testing queries - example attribute lowercasing:
http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22Liebe+ist+f%C3%BCr+alle+da%22&trace.level=2
http://localhost:8080/search/?ranking=rank_albums&yql=select%20%2A%20from%20sources%20%2A%20where%20album%20contains%20%22liebe+ist+f%C3%BCr+alle+da%22&trace.level=2
Also try query tracing to see how query parsing changes with index and attribute indexing modes.
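Hand-encoding such URLs is error-prone, so a small helper can build them. The sketch below uses only the Python standard library; the function name `search_url` and its defaults are assumptions for this example, while the endpoint, the `rank_albums` profile and the `trace.level` parameter are taken from the URLs above. Note that `urlencode` writes spaces as `+` rather than `%20`, which is equivalent in a query string.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(yql: str, trace_level: int = 2,
               endpoint: str = "http://localhost:8080/search/") -> str:
    """Build a query URL with tracing enabled."""
    params = {"ranking": "rank_albums", "yql": yql, "trace.level": trace_level}
    return f"{endpoint}?{urlencode(params)}"


yql = 'select * from sources * where album contains "Liebe ist für alle da"'
url = search_url(yql)
print(url)

# Round-trip check: decoding the URL gives back the original YQL string.
print(parse_qs(urlsplit(url).query)["yql"][0] == yql)
```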
A prefix search will match the query term "hea" to documents with the term "head" in a string field with indexing: attribute.
Note that the field matching must be specified with the annotation {prefix: true} to enable prefix matching.
E.g. ... contains ([{prefix:true}]"hea")
The search-suggestions sample application uses prefix search, see README for a design discussion.
Use an attribute with an array of strings in addition to the index string field. The array supports multiple terms - if the field holds only one term, the array is not needed. The application must split the text into terms and add these to the array. This can be a viable approach for fields with few terms and semi-structured input - example:

schema company {
    document company {
        field company_name_string type string {
            indexing: summary
        }
    }
    field company_name_array type array<string> {
        indexing: input company_name_string | trim | split " +" | attribute
    }
}

Adding "Goldman" and "Sachs" to the array lets prefix query terms like "Gold" and "Sach" match.
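To make the splitting step concrete, the following Python sketch mimics what the `trim | split " +"` part of the indexing statement does to the input, and then checks prefix terms against each array element. Both function names are hypothetical, and Python's `strip` is only an approximation of the `trim` expression.

```python
import re


def split_terms(company_name: str) -> list[str]:
    """Mimic `trim | split " +"`: strip outer whitespace, split on runs of spaces."""
    trimmed = company_name.strip()
    return re.split(r" +", trimmed) if trimmed else []


def prefix_matches(prefix: str, terms: list[str]) -> bool:
    """True if any array element starts with the prefix, ignoring case."""
    lowered = prefix.lower()
    return any(term.lower().startswith(lowered) for term in terms)


terms = split_terms("  Goldman   Sachs ")
print(terms)
print(prefix_matches("Gold", terms), prefix_matches("Sach", terms))
```

A term such as "man" does not match in this model, since it is not a prefix of any element.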
N-gram matching is most often used in languages that are not tokenized (example: many Asian languages). Example - this will index "A Head Full of Dreams" to:

field album type string {
    indexing: summary | index
    match {
        gram
        gram-size: 3
    }
}

a 1
ams 1
dre 1
ead 1
eam 1
ful 1
hea 1
of 1
rea 1
ull 1

This enables matching of any 3-character substring. This is generally not useful for text search, other than possibly as an extra field with these n-grams for increased recall.
Example - the album name translated to traditional Chinese, 满脑子的梦想, will index the album title to:
field album type string {
    indexing: {
        "zh-hant" | set_language;
        summary | index
    }
    match {
        gram
        gram-size: 2
    }
}

子的 1
梦想 1
满脑 1
的梦 1
脑子 1
This enables matching of any 2-character substring, which makes more sense in traditional Chinese than in English.
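The gram lists in both examples can be reproduced with a short Python sketch. It assumes, based only on the dumps above, that text is lowercased and split into tokens first, that each token is cut into overlapping grams of `gram_size` characters, and that tokens no longer than `gram_size` are kept whole. The function name `grams` is hypothetical.

```python
import re


def grams(text: str, gram_size: int) -> list[str]:
    """Split text into tokens, then each token into overlapping character n-grams."""
    result = []
    for token in re.findall(r"\w+", text.lower()):
        if len(token) <= gram_size:
            # Short tokens such as "a" and "of" are kept as they are.
            result.append(token)
        else:
            result.extend(token[i:i + gram_size]
                          for i in range(len(token) - gram_size + 1))
    return result


print(sorted(grams("A Head Full of Dreams", 3)))
print(sorted(grams("满脑子的梦想", 2)))
```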
Regular expressions are supported in attributes. There are, however, no optimizing data structures for query speed: the expression is run over all attribute values.
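The cost model is easy to picture as a linear scan. The Python sketch below evaluates a pattern against every stored value, so the work grows with the number of values. The function name `regex_scan` is hypothetical, and the use of partial, case-insensitive matching is an assumption made for this illustration rather than a statement about Vespa's exact semantics.

```python
import re


def regex_scan(pattern: str, values: list[str]) -> list[int]:
    """Return the positions of all values the pattern matches, checking each one."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [i for i, value in enumerate(values) if compiled.search(value)]


albums = [
    "A Head Full of Dreams",
    "Love Is Here To Stay",
    "Hardwired...To Self-Destruct",
    "Liebe ist für alle da",
    "When We All Fall Asleep, Where Do We Go?",
]

print(regex_scan(r"^l", albums))
print(regex_scan(r"\?$", albums))
```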
The default ranking is the first-phase function nativeRank,
which returns the value of the nativeRank rank feature, with no second-phase.
A good, simple alternative to nativeRank for text ranking is the BM25 rank feature.
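For readers unfamiliar with BM25, the sketch below implements the textbook Okapi BM25 formula in Python. It is not Vespa's implementation: the function name, the argument layout and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen as common defaults. It illustrates the two effects that matter for text ranking: rarer terms contribute more, and term frequency is dampened by document length.

```python
import math


def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a list of query terms."""
    # Longer-than-average documents get a larger normalization factor.
    length_norm = 1 - b + b * len(doc_terms) / avg_len
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        n = doc_freq.get(term, 0)
        # Inverse document frequency: terms found in few documents score higher.
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


doc = ["love", "is", "here", "to", "stay"]
doc_freq = {"love": 1, "to": 2}  # number of documents containing each term
print(bm25(["love"], doc, doc_freq, num_docs=5, avg_len=5.0))
print(bm25(["to"], doc, doc_freq, num_docs=5, avg_len=5.0))
```

With the same term frequency and document length, the rarer term "love" gets the higher score.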
If the expression is written manually, it might be most convenient to
stick with using the fieldMatch(name) feature for each field.
This feature combines the more basic fieldMatch features in a reasonable way.
A good way to combine the fieldMatch score of each field is to use a weighted average as explained above.
Another way is to combine the field match scores
using the fieldMatch(name).weight/significance/importance features,
which take term weight or rareness or both into account
and allow a normalized score to be produced by simply summing the product of this feature
and any other normalized per-field score for each field.
In addition, some attribute value(s) must usually be included
to determine the a priori quality of each document.
For example, assuming the title field is more important than the body field,
create a ranking expression which gives more weight to that field, as in the example above.
Vespa contains some built-in convenience support for this -
weights can be set in the individual fields by weight: <number>
and the feature match can be used to get a weighted average
of the fieldMatch scores of each field.
The overall ranking expression might contain other ranking dimensions than just text match,
like freshness, the quality of the document, or any other property of the document or query.
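The arithmetic behind a weighted average of per-field scores is sketched below in Python. This only illustrates the calculation: the function name, the field names and the weights 200 and 100 are hypothetical, and the per-field scores are assumed to be normalized to the range 0 to 1.

```python
def weighted_field_score(field_scores: dict[str, float],
                         field_weights: dict[str, float]) -> float:
    """Weighted average of normalized per-field scores."""
    total_weight = sum(field_weights[name] for name in field_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(field_scores[name] * field_weights[name]
                       for name in field_scores)
    return weighted_sum / total_weight


# The title counts twice as much as the body in this hypothetical setup.
scores = {"title": 0.8, "body": 0.4}
weights = {"title": 200, "body": 100}
print(weighted_field_score(scores, weights))
```

Because the result stays in the same range as the inputs, it can be combined with other normalized dimensions such as freshness or document quality.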
Modify the values of the match features from the query by sending weight, significance and connectedness with the query:
Term property | Description |
---|---|
Weight | Set query term weight. |
Significance | Set query term significance - how rare a particular term is in the corpus or the language. This is sometimes valuable information because if a document matches a rare word, it might mean the document is more important than one which matches a common word. Significance is calculated automatically by Vespa during indexing, but can also be overridden by setting the significance values on the query terms in a Searcher component. |
Connectedness | Signify the degree of connection between adjacent terms in the query - set query term connectivity to another term. |
Sometimes it is useful to inspect generated configuration to understand or validate the match configuration. Run this to find the value of the -i argument below:
$ vespa-configproxy-cmd
Note the difference for the artist and album fields:
field artist type string {
    indexing: summary | index
    match: exact
}
field album type string {
    indexing: summary | index
}

$ vespa-get-config -n vespa.configdefinition.ilscripts \
    -i container/docprocchains/chain/indexing/component/com.yahoo.docprocs.indexing.IndexingProcessor

maxtermoccurrences 100
fieldmatchmaxlength 1000000
ilscript[0].doctype "music"
ilscript[0].docfield[0] "artist"
ilscript[0].docfield[1] "album"
ilscript[0].docfield[2] "year"
ilscript[0].docfield[3] "category_scores"
ilscript[0].docfield[4] "tags"
ilscript[0].content[0] "clear_state | guard { input artist | exact | summary artist | index artist; }"
ilscript[0].content[1] "clear_state | guard { input album | tokenize normalize stem:"BEST" | summary album | index album; }"
ilscript[0].content[2] "clear_state | guard { input year | summary year | attribute year; }"
ilscript[0].content[3] "clear_state | guard { input category_scores | summary category_scores | attribute category_scores; }"
ilscript[0].content[4] "input tags | passthrough tags"