OpenNLP Linguistics

The default Vespa linguistics implementation uses Apache OpenNLP. OpenNLP language detection is also used by default, even if you use a different linguistics implementation. See Language handling for more information. The OpenNLP language detector supports 103 languages.

OpenNLP language detection

The OpenNLP language detector returns a prediction with a confidence, and the confidence typically increases with more input. The threshold for using the prediction can be configured as a number, typically from 1.0 (wild guess) to 6.0 (confident guess), with 2.0 as the default:

  <container id="..." version="1.0">
    ...
    <config name="ai.vespa.opennlp.open-nlp">
      <detectConfidenceThreshold>4.2</detectConfidenceThreshold>
    </config>
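
The detector can also be invoked programmatically, e.g. from a custom component. The following is a minimal sketch, not taken from this page, assuming the com.yahoo.language API (OpenNlpLinguistics, Detector.detect and Detection.getLanguage):

  import com.yahoo.language.Language;
  import com.yahoo.language.Linguistics;
  import com.yahoo.language.opennlp.OpenNlpLinguistics;

  public class DetectionSketch {

      public static void main(String[] args) {
          // The OpenNLP-backed implementation of the Linguistics interface
          Linguistics linguistics = new OpenNlpLinguistics();

          // Detect the most likely language of a text snippet;
          // confidence typically increases with longer input.
          // null means no language hint.
          Language language = linguistics.getDetector()
                                         .detect("Hasta la vista, nos vemos mañana", null)
                                         .getLanguage();

          System.out.println(language);  // expected to print SPANISH
      }
  }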

Default languages

OpenNLP tokenization and stemming support these languages:

  • Arabic (ar)
  • Catalan (ca)
  • Danish (da)
  • Dutch (nl)
  • English (en)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hungarian (hu)
  • Indonesian (id)
  • Irish (ga)
  • Italian (it)
  • Norwegian (no)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Spanish (es)
  • Swedish (sv)
  • Turkish (tr)

Other languages fall back to English (en).

English uses a simpler stemmer (kStem) by default, which produces fewer stems and therefore lower recall. To use OpenNLP (Snowball) stemming for English as well, add this config to your <container> element(s):

  <container id="..." version="1.0">
    ...
    <config name="ai.vespa.opennlp.open-nlp">
      <snowballStemmingForEnglish>true</snowballStemmingForEnglish>
    </config>

See Tokens, OpenNLP models, and text matching for examples and how to experiment with linguistics.

If you need support for more languages, consider replacing the default OpenNLP-based linguistics integration with the Lucene Linguistics implementation.

Chinese

The default linguistics implementation does not segment Chinese text into tokens, but segmentation can be enabled with this config:

  <container id="..." version="1.0">
    ...
    <config name="ai.vespa.opennlp.open-nlp">
      <cjk>true</cjk>
      <createCjkGrams>true</createCjkGrams>
    </config>

The createCjkGrams option adds substrings of segments longer than 2 characters, which may increase recall.

Tokenization

Tokenization removes non-word characters and splits the string into tokens at word boundaries. In addition, CJK tokens are split using a segmentation algorithm. The resulting tokens are then searchable in the index.
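
To see which tokens a string produces, the tokenizer can be called directly. The following is a minimal sketch, assuming the com.yahoo.language.process.Tokenizer API (tokenize taking the input, a language, a stem mode and an accent-removal flag) and the OpenNlpLinguistics implementation:

  import com.yahoo.language.Language;
  import com.yahoo.language.Linguistics;
  import com.yahoo.language.opennlp.OpenNlpLinguistics;
  import com.yahoo.language.process.StemMode;
  import com.yahoo.language.process.Token;

  public class TokenizeSketch {

      public static void main(String[] args) {
          Linguistics linguistics = new OpenNlpLinguistics();

          // Tokenize an English string: the string is split into tokens at word
          // boundaries, and word tokens are stemmed/normalized as specified by
          // the parameters (StemMode.BEST, remove accents = true)
          Iterable<Token> tokens = linguistics.getTokenizer()
                  .tokenize("Vespa reports good results!", Language.ENGLISH, StemMode.BEST, true);

          for (Token token : tokens)
              if (token.isIndexable())  // skip whitespace and punctuation tokens
                  System.out.println(token.getOrig() + " -> " + token.getTokenString());
      }
  }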

Also see N-gram matching.

Normalization

An example normalization is à ⇒ a. Normalization ensures that accents and similar decorations, which are often misspelled, are handled the same way in both documents and queries.

Vespa uses java.text.Normalizer to normalize text; see SimpleTransformer.java. Normalization preserves case.
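
For illustration only (Vespa's own transformer is SimpleTransformer.java, referenced above), the following sketch shows the general technique with java.text.Normalizer: decompose accented characters, then strip the combining marks:

  import java.text.Normalizer;

  public class NormalizeSketch {

      public static void main(String[] args) {
          String input = "très Àpropos";

          // Decompose accented characters (e.g. à becomes 'a' + a combining grave accent),
          // then remove the combining marks, leaving only the base characters.
          // Case is preserved.
          String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
          String normalized = decomposed.replaceAll("\\p{M}", "");

          System.out.println(normalized);  // prints "tres Apropos"
      }
  }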

Refer to the nfkc query term annotation. Also see the YQL accentDrop annotation.

Stemming

Stemming means translating a word to its base form (singular form for nouns, infinitive for verbs), using a stemmer. Stemming increases search recall, because the searcher is usually interested in documents containing the query words regardless of the word form used. Stemming in Vespa is symmetric, i.e. words are converted to stems both when indexing and when searching.

For example, when text is indexed, the stemmer converts the noun reports (plural) to report, and the latter is stored in the index. Likewise, before searching, reports is stemmed to report. Another example: am, are and was are all stemmed to be, both in queries and in the index.
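
To inspect which stems a word gets, the linguistics Stemmer can be called directly. This is a minimal sketch, assuming the com.yahoo.language.process.Stemmer interface with stem(input, stemMode, language); the exact output depends on the configured stemmer (kStem or Snowball for English, see above):

  import com.yahoo.language.Language;
  import com.yahoo.language.Linguistics;
  import com.yahoo.language.opennlp.OpenNlpLinguistics;
  import com.yahoo.language.process.StemMode;

  public class StemSketch {

      public static void main(String[] args) {
          Linguistics linguistics = new OpenNlpLinguistics();

          // Stem a word to its base form; "reports" is expected to stem to "report"
          System.out.println(linguistics.getStemmer().stem("reports", StemMode.BEST, Language.ENGLISH));
      }
  }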

When bolding is enabled, all forms of the query term will be bolded. E.g., when searching for reports, report, reported and reports will all be bolded.

See the stem query term annotation.

Theory

From a matching point of view, stemming takes all possible token strings and maps them into equivalence classes. In the example above, the set of tokens { report, reports, reported } forms an equivalence class. To represent the class, the linguistics library should pick the best element of the class. At query time, the text typed by the user is tokenized, and each token is then mapped to the most likely equivalence class, again represented by its best element.

While the theory sounds pretty simple, in practice it is not always possible to figure out which equivalence class a token should belong to. A typical example is the string number. In most cases we would guess this to mean a numerical entity of some kind, and the equivalence class would be { number, numbers } - but it could also be a verb, with a different equivalence class { number, numbered, numbering }. These are of course closely related, and in practice they will be merged, so we'll have a slightly larger equivalence class { number, numbers, numbered, numbering } and be happy with that. However, in a sentence such as my legs keep getting number every day, the number token clearly does not have the semantics of a numerical entity, but should be in the equivalence class { numb, number, numbest, numbness } instead. But blindly assigning number to the equivalence class numb is clearly not right, since the more numb meaning is much less likely than the numerical entity meaning.

The approach currently taken by the low-level linguistics library will often lead to problems in the number-like cases as described above. To give better recall, Vespa has implemented a multiple stemming option.

Configuration

By default, all words are stemmed to their best form. Refer to the stemming reference for other stemming types. To change type, add:

stemming: [stemming-type]

Stemming can be set either for a field, for a fieldset, or as a default for all fields. Example: Disable stemming for the field title:

field title type string {
    indexing: summary | index
    stemming: none
}

See the andSegmenting query annotation for how to control re-segmenting when stemming.