Stemming

Stemming means translating a word to its base form (singular for nouns, infinitive for verbs) using a stemmer. Stemming increases recall when searching, because the searcher is usually interested in documents containing the query words regardless of the word form used. Stemming in Vespa is symmetric, i.e. words are converted to stems both when indexing and when searching.

For example, when text is indexed, the stemmer converts the plural noun reports to report, and the latter is stored in the index. Likewise, before searching, reports in the query is stemmed to report. As another example, am, are and was are all stemmed to be, both in queries and in indexes.
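The symmetry described above can be illustrated with a minimal Python sketch. It is not Vespa code: the stem table TOY_STEMS and the functions toy_stem, build_index and search are hypothetical names invented for this illustration, and a real stemmer derives stems from linguistic rules or dictionaries rather than a hand-written table.

```python
# Hypothetical stem table for illustration only; not how Vespa stems words.
TOY_STEMS = {
    "reports": "report",
    "reported": "report",
    "am": "be",
    "are": "be",
    "was": "be",
}


def toy_stem(token):
    """Return the stem of a token, or the lowercased token if it is unknown."""
    token = token.lower()
    return TOY_STEMS.get(token, token)


def build_index(docs):
    """Map each stem to the set of document ids containing any form of it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem query terms the same way as indexed text; return docs matching all terms."""
    postings = [index.get(toy_stem(token), set()) for token in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "quarterly reports", 2: "she reported it", 3: "weather report"}
index = build_index(docs)
hits = search(index, "reports")  # matches all three documents
```

Because the same toy_stem function is applied on both sides, a query for reports finds documents that contain report, reports or reported.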

When bolding is enabled, all forms of the query term will be bolded. For example, when searching for reports, the forms report, reported and reports will all be bolded.

Theory

From a matching point of view, stemming takes all possible token strings and maps them into equivalence classes. In the example above, the tokens { "report", "reports", "reported" } form one equivalence class. To represent the class, the linguistics library should pick the shortest element in the class. At query time, the text typed by the user is tokenized, and each token is mapped to its most likely equivalence class, again represented by the shortest element of that class.
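As a rough sketch of the idea, the following Python snippet maps every token to the shortest member of its equivalence class. The function build_representatives and the hand-written classes are assumptions made for this illustration; the alphabetical tie-break between equally short members is also an assumption, and none of this reflects how the linguistics library actually builds its classes.

```python
def build_representatives(classes):
    """Map every token to the shortest member of its equivalence class."""
    representative = {}
    for cls in classes:
        # Shortest element wins; ties are broken alphabetically (an assumption).
        shortest = min(cls, key=lambda token: (len(token), token))
        for token in cls:
            representative[token] = shortest
    return representative


# Hand-written equivalence classes taken from the examples in this section.
classes = [
    {"report", "reports", "reported"},
    {"number", "numbers", "numbered", "numbering"},
]
rep = build_representatives(classes)
```

Two tokens match when they have the same representative, so rep["reported"] and rep["reports"] both resolve to report.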

While the theory sounds simple, in practice it is not always possible to determine which equivalence class a token belongs to. A typical example is the string "number". In most cases we would guess this to mean a numerical entity of some kind, and the equivalence class would be { "number", "numbers" }. It could also be a verb, with a different equivalence class such as { "number", "numbered", "numbering" }. These are closely related, and in practice they will be merged, giving a slightly larger equivalence class { "number", "numbers", "numbered", "numbering" }. However, in a sentence such as "my legs keep getting number every day", the token "number" clearly does not have the semantics of a numerical entity, and should be in the equivalence class { "numb", "number", "numbest", "numbness" } instead. Blindly assigning "number" to the equivalence class of "numb" is not right either, since the "more numb" meaning is much less likely than the "numerical entity" meaning.

The approach currently taken by the low-level linguistics library often leads to problems in "number"-like cases such as the one described above. To give better recall in these cases, Vespa implements a "multiple" stemming option.
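The effect on recall can be sketched in Python under one stated assumption: that "multiple" stemming keeps every candidate stem of a token instead of only one. The table CANDIDATE_STEMS and the functions stems_single, stems_multiple and matches are hypothetical and only model the "number" example; they are not Vespa's implementation.

```python
# Hypothetical candidate stems per token, most likely stem first.
CANDIDATE_STEMS = {
    "number": ["number", "numb"],
    "numbers": ["number"],
    "numb": ["numb"],
    "numbness": ["numb"],
}


def stems_single(token):
    """Keep only the most likely stem of a token."""
    return CANDIDATE_STEMS.get(token, [token])[:1]


def stems_multiple(token):
    """Keep every candidate stem of a token (assumed behavior of 'multiple')."""
    return CANDIDATE_STEMS.get(token, [token])


def matches(query_token, doc_token, stems):
    """Two tokens match when they share at least one stem."""
    return bool(set(stems(query_token)) & set(stems(doc_token)))


single_hit = matches("numb", "number", stems_single)      # no shared stem
multiple_hit = matches("numb", "number", stems_multiple)  # shares the stem "numb"
```

With a single stem, a query for numb misses a document containing number in the "more numb" sense; keeping all candidate stems recovers the match, at the cost of also matching the numerical sense.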

Languages

Stemming is currently available for English; other languages need user contributions.

Configuration

By default, Vespa stems all words to their shortest form. Refer to the stemming reference for other stemming types. To change the type, add:

stemming: [stemming-type]

Stemming can be set for a field, for a fieldset, or as a default for all fields. For example, to disable stemming for the field title:

field title type string {
   indexing: summary | index
   stemming: none
}