Linguistics in Vespa

Vespa uses a linguistics module to process text in queries and documents during indexing and searching. The goal of linguistic processing is to increase recall (how many documents are matched) without hurting precision (the relevance of the documents matched) too much. It consists of such operations as tokenizing text into chunks of known types such as words and punctuation, normalizing accents and finding the base form of words (stemming or lemmatization). These operations can be turned on or off per field in a search definition.

Vespa comes with a reasonable linguistics implementation out of the box. If you want to provide your own, this document explains what you need to do.

Provide custom linguistics implementation

To use a custom linguistics implementation, create an implementation in the application and configure it as a component in all container clusters doing linguistics processing.

Implement custom linguistics

Create an implementation of the com.yahoo.language.Linguistics interface. Refer to com.yahoo.language.simple.SimpleLinguistics for an example implementation.

Configure custom linguistics

To use the custom linguistics implementation, add it as a regular component to all container clusters that do either query or document processing. Because document processing for indexing is by default done by an autogenerated container cluster which cannot be configured, specify a container cluster for indexing explicitly.

Below is an example with query and indexing processing in the same cluster (if using different clusters, make sure to add the same linguistics component to all of them):

<services>

  <container version="1.0" id="mycontainer">
    <component id="my.linguistics.implementing.Class"/>
    <document-processing/>
    <search/>
    <nodes ...>
  </container>

  <content version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="mydocument" mode="index"/>
      <document-processing cluster="mycontainer"/>
    </documents>
    <nodes ...>
  </content>

</services>

Language handling

This section describes how language settings are applied in Vespa. It covers both the set_language indexing expression and the language query parameter.

The single most important thing to note about language handling in Vespa is that Vespa does not know the language of a document. Instead, 1) the indexing processor is instructed on a per-field level what language to use when calling the underlying linguistics library, and 2) the query processor is instructed on a per-query level what language to use. If no language is explicitly set in a document or a query, Vespa will run its configured language detector on the available text (the full content of a document field, or the full query= parameter value).

A document that contains the exact same word as a query might not be recallable if the language of the document field is detected differently from the query. Unless the query has explicitly declared a language, this has a high probability of occurring.

Indexing with language

The indexing process run by Vespa is nothing more than the sequential execution of the indexing script of every field in the input document. At any point, the script may choose to set the language state of the processor using set_language. Example:

search book {
    document book {
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index
        }
    }
}
This indicates that every document in the input is expected to specify its own language in the language field.

Because indexing scripts are executed in the order they are given in the search definition, and because the language state is never reset during the processing of a single document, all indexed string fields following the language field will be processed under the rules of that language.

The only thing that changes due to language is the output from normalize and tokenize. Now, because indexing: index implies tokenize for string fields, the field title is affected.

If either normalize or tokenize is invoked prior to set_language, the language detector is run on the input string.
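
The ordering rules above can be illustrated with a small toy model. This is only a sketch and not Vespa's implementation: the class IndexingLanguageState, the process method, and the detect stub (which always guesses "en") are hypothetical names introduced for illustration. It also assumes that a detected language applies only to the field it was detected on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of per-document language state; not Vespa's implementation. */
public class IndexingLanguageState {

    /** Hypothetical stand-in for a language detector: always guesses "en". */
    static String detect(String text) {
        return "en";
    }

    /**
     * Walks the fields of one document in declaration order. The field named
     * by languageField sets the language state (models set_language). Every
     * other field is processed under the current state, or under a detected
     * language if no state has been set yet. Returns the language used for
     * each processed field.
     */
    static Map<String, String> process(LinkedHashMap<String, String> doc, String languageField) {
        String language = null; // created per document, never reset while processing it
        Map<String, String> used = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            if (field.getKey().equals(languageField)) {
                language = field.getValue();
            } else {
                String effective = (language != null) ? language : detect(field.getValue());
                used.put(field.getKey(), effective);
            }
        }
        return used;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> doc = new LinkedHashMap<>();
        doc.put("subtitle", "Le Petit Prince"); // declared before the language field
        doc.put("language", "fr");
        doc.put("title", "Le Petit Prince");    // declared after the language field
        System.out.println(process(doc, "language"));
    }
}
```

In this model the subtitle field falls back to the detector because it precedes the language field, while title is processed as "fr". Declaring the language field first avoids that fallback.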

The net result of this is that by calling set_language inside a document, you change the terms that end up in a tokenized index. This means that at query-time, you need to apply the same language settings before tokenizing the query terms to be able to match what was stored in the index. This also means that a single index may simultaneously contain terms of multiple languages.
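
To see why the same language must be applied on both sides, consider the following toy sketch. It is not Vespa's linguistics: the class SymmetricTokenization and its stem, terms and matches methods are hypothetical, and the "stemmer" only strips a trailing "s" for English.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of language-dependent term matching; not Vespa's linguistics. */
public class SymmetricTokenization {

    /** Hypothetical stemmer: strips a plural "s" for English only. */
    static String stem(String language, String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (language.equals("en") && w.length() > 3 && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Splits on whitespace and stems each token under the given language. */
    static Set<String> terms(String language, String text) {
        Set<String> out = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(stem(language, token));
            }
        }
        return out;
    }

    /** True if any query term equals a term stored for the document. */
    static boolean matches(Set<String> indexed, Set<String> query) {
        for (String term : query) {
            if (indexed.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Indexed as English: the stored terms are "used" and "car".
        Set<String> indexed = terms("en", "Used cars");
        // Same language at query time: "cars" becomes "car" and matches.
        System.out.println(matches(indexed, terms("en", "cars")));
        // Different language at query time: "cars" stays "cars" and misses.
        System.out.println(matches(indexed, terms("de", "cars")));
    }
}
```

The document literally contains the word "cars", yet the second query does not recall it, because the stored term and the query term were produced under different language rules.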

Even if a document contains a string field used as input for the set_language indexing expression, the language is not automatically stored in an index. If you wish to filter by language at some point, you must explicitly store this field as an attribute.

Querying with language

Now that we understand that the content of an indexed string field is language-agnostic, it should be clear that one must apply a symmetric tokenization on the query terms in order to match the content of that field. And this is exactly what Vespa's query parser does for you.

The query parser subscribes to a configuration file that tells it what fields are indexed strings, and every query term that targets such a field is run through appropriate tokenization. The language query parameter is what controls the language state of these calls.

Because an index may simultaneously contain terms in any number of languages, you might have stemmed variants of one language match the stemmed variants of another. If you need to work around this, you must store the language of a document in a separate attribute, and apply a filter against that attribute at query-time.
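
The workaround can be sketched with a toy model. The class LanguageFilter, the Doc type and the search method are hypothetical and only mimic the idea of combining a term match with a filter on a stored language value; they are not Vespa APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of filtering hits by a stored language value; not a Vespa API. */
public class LanguageFilter {

    /** Hypothetical document: an id, a stored language, and one stemmed index term. */
    static final class Doc {
        final String id;
        final String language;
        final String term;

        Doc(String id, String language, String term) {
            this.id = id;
            this.language = language;
            this.term = term;
        }
    }

    /**
     * Returns the ids of documents whose term equals the query term. If
     * languageFilter is non-null, only documents stored with that language
     * are returned.
     */
    static List<String> search(List<Doc> docs, String term, String languageFilter) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean termMatch = d.term.equals(term);
            boolean languageMatch = languageFilter == null || d.language.equals(languageFilter);
            if (termMatch && languageMatch) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("en-1", "en", "car"),  // English "cars", stemmed to "car"
                new Doc("fr-1", "fr", "car")); // French word "car", left unchanged
        System.out.println(search(docs, "car", null)); // both documents match
        System.out.println(search(docs, "car", "en")); // only the English document
    }
}
```

Without the filter, the stemmed English term also hits the French document. With the filter on the stored language, only the intended document is returned.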

If no language parameter is given, the language detector is called to process the query string. The detector is likely to be confused by field names and query syntax, but it is a best-effort approach. This matches the language resolution of the index pipeline.

By default, there is no knowledge anywhere that captures what languages are used to generate the content of an index. The language parameter only affects the transformation of query terms that hit tokenized indexes.