Linguistics in Vespa

Vespa uses a linguistics module to process text in queries and documents during indexing and searching. The goal of linguistic processing is to increase recall (how many documents are matched) without hurting precision (the relevance of the matched documents) too much. It consists of operations such as tokenizing text into chunks of known types (words, punctuation), normalizing accents, and finding the base form of words (stemming or lemmatization). These operations can be turned on or off per field in a search definition.
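
For example, these operations can be toggled per field in a search definition. A sketch (the search and field names are illustrative; check the schema reference for the exact settings supported by your version):

    search example {
        document example {
            field title type string {
                indexing: summary | index
                stemming: none       # disable stemming for this field
                normalizing: none    # disable accent normalization
            }
        }
    }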

The default linguistics implementation, SimpleLinguistics, provides support for English stemming only. To support additional languages, use OpenNlp linguistics instead by loading that module, or provide your own linguistics implementation.

Configuring a linguistics implementation

The linguistics implementation must be configured as a component in container clusters doing linguistics processing.

As document processing for indexing is by default done by an autogenerated container cluster which cannot be configured, specify a container cluster for indexing explicitly.

This example shows how to configure OpenNlp for linguistics using the same cluster for both query and indexing processing (if using different clusters, add the same linguistics component to all of them):


  <container version="1.0" id="mycontainer">
    <component id="com.yahoo.language.opennlp.OpenNlpLinguistics"/>
    <search/>
    <document-processing/>
    <nodes .../>
  </container>

  <content version="1.0" id="mycontent">
    <documents>
      <document type="mydocument" mode="index"/>
      <document-processing cluster="mycontainer"/>
    </documents>
    <nodes .../>
  </content>

Note that if you change the linguistics component of a live system, you may experience reduced recall until all documents are re-indexed, because stored documents will still contain tokens generated by the previous linguistics module.

Creating a custom linguistics implementation

A linguistics component is an implementation of the com.yahoo.language.Linguistics interface. Refer to the SimpleLinguistics implementation (which you can subclass for convenience).
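
As a sketch of such a component (the class and method names are from Vespa's com.yahoo.language API as understood here; verify them against the version you build against, and MyStemmer is a hypothetical custom stemmer), one can subclass SimpleLinguistics and override only the operations that need to change:

    package com.example.linguistics;

    import com.yahoo.language.process.Stemmer;
    import com.yahoo.language.simple.SimpleLinguistics;

    // Hypothetical component: every operation falls back to SimpleLinguistics
    // except the stemmer, which is replaced by a custom implementation.
    public class MyLinguistics extends SimpleLinguistics {

        @Override
        public Stemmer getStemmer() {
            return new MyStemmer(); // hypothetical custom Stemmer implementation
        }
    }

The component is then configured in the container clusters doing linguistics processing using its full class name, as described in the configuration section above.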

Language handling

This section describes how language settings are applied in Vespa. This covers both the set_language indexing expression and the language query parameter.

The single most important thing to note about language handling in Vespa is that Vespa does not know the language of a document. Instead, 1) the indexing processor is instructed on a per-field level what language to use when calling the underlying linguistics library, and 2) the query processor is instructed on a per-query level what language to use. If no language is explicitly set in a document or a query, Vespa will run its configured language detector on the available text (the full content of a document field, or the full query= parameter value).

A document that contains the exact same word as a query might not be recallable if the language of the document field is detected differently from the query. Unless the query has explicitly declared a language, this has a high probability of occurring.
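
To illustrate this failure mode with a toy model (this is not Vespa's actual tokenizer or stemmer; the functions below are deliberately simplistic illustrations), suppose a document field is stemmed under English rules while the same word in a query is detected as a language that is not stemmed:

```python
def stem_english(term: str) -> str:
    """Toy English stemmer: strips a plural 's' (illustration only)."""
    return term[:-1] if term.endswith("s") else term

def tokenize(text: str, language: str) -> list[str]:
    """Toy tokenizer: lowercases, splits on whitespace, stems only English."""
    terms = text.lower().split()
    return [stem_english(t) for t in terms] if language == "en" else terms

# Document field indexed under detected language "en": "cars" is stored as "car".
index_terms = tokenize("Cars for sale", "en")

# The same word in a query detected as, say, German is left unstemmed ...
query_terms = tokenize("cars", "de")

# ... so the exact same word no longer matches what is in the index.
print(any(t in index_terms for t in query_terms))  # False: "cars" != "car"
```

Declaring the language explicitly on the query removes this asymmetry, since both sides then tokenize under the same rules.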

Indexing with language

The indexing process run by Vespa is nothing more than the sequential execution of the indexing script of every field in the input document. At any point, the script may choose to set the language state of the processor using set_language. Example:

search book {
    document book {
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index
        }
    }
}

This indicates that every document in the input is expected to carry its own language.

Because indexing scripts are executed in the order they are given in the search definition, and because the language state is never reset during the processing of a single document, all indexed string fields following the language field will be processed under the rules of that language.

The only thing that changes due to language is the output from normalize and tokenize. Now, because indexing: index implies tokenize for string fields, the field title is affected.

If either normalize or tokenize is invoked prior to set_language, the language detector is run on the input string.
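
For example (field names illustrative), if a tokenized field is declared before the field that invokes set_language, the detector decides the language used for it:

    search book {
        document book {
            field title type string {
                indexing: index          # tokenized before any set_language,
            }                            # so the language is detected from this text
            field language type string {
                indexing: set_language   # affects only fields declared after this
            }
        }
    }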

The net result of this is that by calling set_language inside a document, you change the terms that end up in a tokenized index. This means that at query-time, you need to apply the same language settings before tokenizing the query terms to be able to match what was stored in the index. This also means that a single index may simultaneously contain terms of multiple languages.

Even if a document contains a string field used as input to the set_language indexing expression, that language is not automatically stored in any index. If you wish to filter by language at some point, you must explicitly save this field as an attribute.
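
One possible sketch of this, assuming the indexing language allows set_language to be combined with attribute in a single statement (verify against the indexing language reference for your Vespa version):

    field language type string {
        indexing: set_language | attribute
    }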

Querying with language

Now that we understand that the content of an indexed string field is language-agnostic, it should be clear that one must apply symmetric tokenization to the query terms in order to match the content of that field. This is exactly what Vespa's query parser does for you.

The query parser subscribes to a configuration file that tells it what fields are indexed strings, and every query term that targets such a field is run through the appropriate tokenization. The language query parameter is what controls the language state of these calls.
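
For example (the field name and query term are illustrative), a request can declare its language explicitly:

    /search/?query=title:bøker&language=no

Here the Norwegian language code instructs the query parser to tokenize and stem the term under Norwegian rules instead of invoking the language detector.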

Because an index may simultaneously contain terms in any number of languages, you might have stemmed variants of one language match the stemmed variants of another. If you need to work around this, you must store the language of a document in a separate attribute, and apply a filter against that attribute at query-time.
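
A sketch of such a filter, assuming the document's language was stored in a language attribute as described in the indexing section above (all names illustrative):

    /search/?query=title:chips+language:en&language=en

The language:en term restricts matches to documents whose stored attribute is en, so stemmed variants originating from other languages are filtered out.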

If no language parameter is given, the language detector is called to process the query string. The detector is likely to be confused by field names and query syntax, but it is a best-effort approach. This matches the language resolution of the index pipeline.

By default, there is no knowledge anywhere that captures what languages are used to generate the content of an index. The language parameter only affects the transformation of query terms that hit tokenized indexes.