Linguistics in Vespa

Vespa uses a linguistics module to process text in queries and documents during indexing and searching. The goal of linguistic processing is to increase recall (how many documents are matched) without hurting precision (the relevance of the documents matched) too much. It consists of operations such as:

  • tokenizing text into chunks of known types such as words and punctuation
  • normalizing accents
  • finding the base form of words (stemming or lemmatization)
These operations can be turned on or off per field in the search definition.

The default linguistics module is OpenNlp.

Creating a custom linguistics implementation

A linguistics component is an implementation of com.yahoo.language.Linguistics. Refer to the com.yahoo.language.simple.SimpleLinguistics implementation (which can be subclassed for convenience).

SimpleLinguistics provides support for English stemming only. Either load the com.yahoo.language.simple.SimpleLinguistics module, or provide another linguistics module.

The linguistics implementation must be configured as a component in container clusters doing linguistics processing.

As document processing for indexing is by default done by an autogenerated container cluster which cannot be configured, specify a container cluster for indexing explicitly.

This example shows how to configure SimpleLinguistics for linguistics using the same cluster for both query and indexing processing (if using different clusters, add the same linguistics component to all of them):

<services>

  <container version="1.0" id="mycontainer">
    <component id="com.yahoo.language.simple.SimpleLinguistics"/>
    <document-processing/>
    <search/>
    <nodes ...>
  </container>

  <content version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="mydocument" mode="index"/>
      <document-processing cluster="mycontainer"/>
    </documents>
    <nodes ...>
  </content>

</services>
If the linguistics component of a live system is changed, recall can be reduced until all documents are re-written. This is because documents will still be stored with tokens generated by the previous linguistics module.

Language handling

Vespa does not know the language of a document. Instead, the following applies:

  1. The indexing processor is instructed on a per-field level what language to use when calling the underlying linguistics library
  2. The query processor is instructed on a per-query level what language to use
If no language is explicitly set in a document or a query, Vespa will run its configured language detector on the available text (the full content of a document field, or the full query= parameter value).

A document that contains the exact same word as a query might not be recalled if the language of the document field is detected differently from that of the query. This can occur unless the query explicitly declares a language.

Indexing with language

The indexing process run by Vespa is a sequential execution of the indexing script of every field in the input document. At any point, the script may choose to set the language state of the processor using set_language. Example:

search book {
    document book {
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index
        }
    }
}
This indicates that every document in the input is expected to specify its own language.

Because indexing scripts are executed in the order they are given in the search definition, and because the language state is never reset during the processing of a single document, all indexed string fields following the language field will be processed under the rules of that language.

The only thing that changes due to language is the output from normalize and tokenize. Now, because indexing: index implies tokenize for string fields, the field title is affected.

If either normalize or tokenize is invoked prior to set_language, the language detector is run on the input string.

The net result of this is that by calling set_language inside a document, the terms that end up in a tokenized index are changed. This means that at query-time, one must apply the same language settings before tokenizing the query terms to be able to match what was stored in the index. This also means that a single index may simultaneously contain terms of multiple languages.
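The effect of mismatched language settings can be illustrated with a small sketch. It does not use any Vespa API: the two stemmers, the language code "xx" and the whitespace tokenizer are hypothetical stand-ins chosen only to show why the same language must be applied on both sides.

```python
# Toy stemmers keyed by language code. These are illustrative stand-ins,
# not the stemmers used by any real linguistics module.
def stem_en(token):
    # Crude English rule: strip a trailing plural "s" from longer words.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def stem_identity(token):
    # Stand-in for a language whose rules leave this token unchanged.
    return token


STEMMERS = {"en": stem_en, "xx": stem_identity}  # "xx" is a made-up code


def process(text, language):
    """Split on whitespace and stem each token under the given language."""
    stem = STEMMERS[language]
    return [stem(token) for token in text.split()]


# Index time: the document declares English, so "reports" is stored as "report".
index_terms = set(process("Annual reports", "en"))

# Query time with the same language: the term is transformed identically.
same_language_hit = any(t in index_terms for t in process("reports", "en"))

# Query time with another language: "reports" is left as-is and misses.
other_language_hit = any(t in index_terms for t in process("reports", "xx"))

print(sorted(index_terms), same_language_hit, other_language_hit)
```

The document and the query contain the exact same word, yet only the query processed under the same language finds it.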

Even if a document contains a string field used as input for the set_language indexing expression, there is no automation in storing this language in an index. To filter by language at some point, save this field as an attribute.

Querying with language

The content of an indexed string field is hence language-agnostic. One must therefore apply a symmetric tokenization on the query terms in order to match the content of that field.

The query parser subscribes to configuration that tells it which fields are indexed strings, and every query term that targets such a field is run through appropriate tokenization. The language query parameter is what controls the language state of these calls.

Because an index may simultaneously contain terms in any number of languages, one can have stemmed variants of one language match the stemmed variants of another. To work around this, store the language of a document in a separate attribute, and apply a filter against that attribute at query-time.
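As a rough sketch of this workaround, the snippet below builds a search request that sets the language parameter and also filters on a language attribute. The host and port, the helper name build_search_url, and the assumption that the language field from the earlier example is stored as an attribute are all illustrative; the term is interpolated without escaping, which a real client would need to handle.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, term, language):
    """Build a search URL that sets the query language and filters on it."""
    # Assumes a string attribute named "language" exists in the schema.
    yql = (
        'select * from sources * where title contains "%s" '
        'and language contains "%s"' % (term, language)
    )
    # "language" controls how the query terms are tokenized and stemmed;
    # the YQL filter restricts matches to documents tagged with that language.
    params = {"yql": yql, "language": language}
    return endpoint + "/search/?" + urlencode(params)


# Hypothetical local endpoint, for illustration only.
url = build_search_url("http://localhost:8080", "reports", "en")
print(url)
```

Setting the parameter and the filter together keeps stemmed terms of one language from matching documents written in another.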

If no language parameter is given, the language detector is called to process the query string. The detector is likely to be confused by field names and query syntax, but it is a best-effort approach. This matches the language resolution of the index pipeline.

By default, there is no knowledge anywhere that captures what languages are used to generate the content of an index. The language parameter only affects the transformation of query terms that hit tokenized indexes.

Tokenization

Tokenization removes any non-word characters, and splits the string into tokens on each word boundary. In addition, CJK tokens are split using a segmentation algorithm. The resulting tokens are then searchable in the index.

Tokenization is not supported in streaming search.

To index strings as-is (that is, avoid tokenization), use indexing-rewrite: none.

Also see N-gram matching.
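The basic behavior described in this section (dropping non-word characters and splitting on word boundaries) can be approximated with a regular expression. This is only a sketch under simplifying assumptions: it is not the tokenizer of any linguistics module, and it makes no attempt at CJK segmentation, so a run of CJK characters comes out as a single token.

```python
import re


def tokenize(text):
    """Drop non-word characters and split on word boundaries."""
    # \w+ matches maximal runs of letters, digits and underscores, so
    # punctuation and whitespace never end up inside a token.
    return re.findall(r"\w+", text)


tokens = tokenize("Hello, world! Stemming & normalization: part 2.")
print(tokens)
```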

Normalization

Normalization preserves case. Normalizing causes accents and similar decorations, which are often misspelled, to be normalized the same way in both documents and queries.

Normalization is not supported in streaming search.

Normalizations:

à ⇒ a     ç ⇒ c    ð ⇒ d     ù ⇒ u
â ⇒ a     è ⇒ e    ñ ⇒ n     ú ⇒ u
á ⇒ a     é ⇒ e    ò ⇒ o     û ⇒ u
ã ⇒ a     ê ⇒ e    ó ⇒ o     ü ⇒ ue
ä ⇒ ae    ë ⇒ e    ô ⇒ o     ý ⇒ y
å ⇒ aa    ì ⇒ i    õ ⇒ o     ÿ ⇒ y
æ ⇒ ae    í ⇒ i    ö ⇒ oe    ß ⇒ ss
          î ⇒ i    ø ⇒ oe    þ ⇒ th
          ï ⇒ i
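The table above can be read as a character-by-character lookup. The sketch below copies the listed mappings into a dictionary and applies them to a string. It is an illustration of the table only, not the normalizer used by any linguistics module; characters that are not listed (including uppercase accented letters) are left unchanged here.

```python
# Mappings copied from the normalization table above.
NORMALIZATIONS = {
    "à": "a", "â": "a", "á": "a", "ã": "a", "ä": "ae", "å": "aa", "æ": "ae",
    "ç": "c", "è": "e", "é": "e", "ê": "e", "ë": "e",
    "ì": "i", "í": "i", "î": "i", "ï": "i",
    "ð": "d", "ñ": "n",
    "ò": "o", "ó": "o", "ô": "o", "õ": "o", "ö": "oe", "ø": "oe",
    "ù": "u", "ú": "u", "û": "u", "ü": "ue",
    "ý": "y", "ÿ": "y", "ß": "ss", "þ": "th",
}


def normalize(text):
    """Replace each listed character; leave everything else untouched."""
    return "".join(NORMALIZATIONS.get(ch, ch) for ch in text)


print(normalize("crème brûlée"))
print(normalize("blåbærsyltetøy"))
```

Applying the same function to document text and query text means that a query typed without accents still matches an accented document term, and vice versa.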

Stemming

Stemming means translating a word to its base form (singular form for nouns, infinitive for verbs) using a stemmer. Stemming increases search recall, because the searcher is usually interested in documents containing the query words regardless of the word form used. Stemming in Vespa is symmetric, i.e. words are converted to stems both when indexing and when searching.

For example, when text is indexed, the stemmer converts the noun reports (plural) to report, and the latter is stored in the index. Likewise, before searching, reports is stemmed to report. Another example is that am, are and was are stemmed to be in both queries and indexes.

When bolding is enabled, all forms of the query term will be bolded. For example, when searching for reports, report, reported and reports will all be bolded.

Theory

From a matching point of view, stemming takes all possible token strings and maps them into equivalence classes. So in the example above, the tokens { report, reports, reported } form an equivalence class. To represent the class, the linguistics library should pick the best element in the class. At query time, the text typed by a user will be tokenized, and then each token should be mapped to the most likely equivalence class, again represented by the shortest element that belongs to the class.
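The equivalence-class view can be sketched in a few lines. The classes below are written by hand from the examples in this section, and the representative is chosen as the shortest member, as described above; a real stemmer derives its classes from language rules or dictionaries rather than from a fixed table.

```python
# Hand-written equivalence classes, for illustration only.
EQUIVALENCE_CLASSES = [
    {"report", "reports", "reported"},
    {"number", "numbers", "numbered", "numbering"},
]


def representative(members):
    """Pick the shortest member; break ties alphabetically for determinism."""
    return min(members, key=lambda word: (len(word), word))


# Map every known token to the representative of its class.
STEM_OF = {
    token: representative(members)
    for members in EQUIVALENCE_CLASSES
    for token in members
}


def stem(token):
    # Tokens outside every known class represent themselves.
    return STEM_OF.get(token, token)


# Index side: store the representative of each document token.
indexed = {stem(t) for t in ["quarterly", "reports"]}

# Query side: map the query token the same way before looking it up.
matches = stem("reported") in indexed
print(sorted(indexed), matches)
```

Because both sides are reduced to the same representative, a query for reported finds a document that only contains reports.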

While the theory sounds pretty simple, in practice it is not always possible to figure out which equivalence class a token should belong to. A typical example is the string number. In most cases we would guess this to mean a numerical entity of some kind, and the equivalence class would be { number, numbers } - but it could also be a verb, with a different equivalence class { number, numbered, numbering }. These are of course closely related, and in practice they will be merged, so we'll have a slightly larger equivalence class { number, numbers, numbered, numbering } and be happy with that. However, in a sentence such as my legs keep getting number every day, the number token clearly does not have the semantics of a numerical entity, but should be in the equivalence class { numb, number, numbest, numbness } instead. But blindly assigning number to the equivalence class numb is clearly not right, since the more numb meaning is much less likely than the numerical entity meaning.

The approach currently taken by the low-level linguistics library will often lead to problems in the number-like cases as described above. To give better recall, Vespa has implemented a multiple stemming option.
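To make the recall argument concrete, here is a toy comparison of keeping one stem per token versus keeping several. The mechanics of the multiple stemming option are not described here, so this is an assumption about how such an option could behave, using a hand-written table of candidate stems for the number example.

```python
# Candidate stems per token, most likely first. "number" is ambiguous,
# as discussed above. The table is hand-written for illustration.
CANDIDATE_STEMS = {
    "number": ["number", "numb"],
    "numbers": ["number"],
    "numbered": ["number"],
    "numbering": ["number"],
    "numb": ["numb"],
    "numbest": ["numb"],
    "numbness": ["numb"],
}


def best_stem(token):
    """Single stemming: keep only the first (most likely) candidate."""
    return CANDIDATE_STEMS.get(token, [token])[0]


def all_stems(token):
    """Multiple stemming (assumed behavior): keep every candidate."""
    return set(CANDIDATE_STEMS.get(token, [token]))


document = "my legs keep getting number every day".split()

single_index = {best_stem(t) for t in document}
multi_index = set().union(*(all_stems(t) for t in document))

# A query for "numbness" stems to "numb".
single_hit = best_stem("numbness") in single_index       # only "number" stored
multi_hit = bool(all_stems("numbness") & multi_index)     # "numb" stored too
print(single_hit, multi_hit)
```

In this sketch the extra stem costs some precision (a query for numbness can now match documents about numerical entities) in exchange for not missing the less likely reading.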

Configuration

By default, all words are stemmed to their best form. Refer to the stemming reference for other stemming types. To change type, add:

stemming: [stemming-type]
Stemming can be set either for a field, a fieldset or as a default for all fields. Example: Disable stemming for the field title:
field title type string {
    indexing: summary | index
    stemming: none
}

Note: Stemming is not applicable to streaming search.