Lucene Linguistics

Lucene Linguistics is a custom linguistics implementation built on top of the Apache Lucene library. It provides a Lucene analyzer to handle text processing per language, with an optional variation per stemming mode.

Check the sample apps to get started.

Crash course in Lucene text analysis

Lucene text analysis is the process of converting text into searchable tokens. It consists of a series of components applied to the text in order:

  • CharFilters: transform the text before it is tokenized, while providing corrected character offsets to account for these modifications.
  • Tokenizers: responsible for breaking up incoming text into tokens.
  • TokenFilters: responsible for modifying tokens that have been created by the Tokenizer.

A specific configuration of the above components is wrapped into an Analyzer object.

The text analysis works as follows:
  1. All char filters are applied in the specified order on the entire text string.
  2. The tokenizer splits the filtered text into tokens.
  3. All token filters are applied in the specified order on each token.
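
To make the order concrete, here is a minimal, self-contained sketch of such a pipeline built directly with Lucene's CustomAnalyzer builder (this is plain Lucene, not a Lucene Linguistics API; htmlStrip, standard, and lowercase are standard Lucene SPI names chosen for illustration):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.custom.CustomAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class AnalysisDemo {
      public static void main(String[] args) throws Exception {
          // Char filters -> tokenizer -> token filters, applied in this order.
          Analyzer analyzer = CustomAnalyzer.builder()
                  .addCharFilter("htmlStrip")   // 1. char filters transform the raw text
                  .withTokenizer("standard")    // 2. the tokenizer splits the text into tokens
                  .addTokenFilter("lowercase")  // 3. token filters modify each token
                  .build();

          // Run the analyzer and print the resulting tokens: "hello", then "lucene".
          try (TokenStream stream = analyzer.tokenStream("field", "<b>Hello</b> Lucene!")) {
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              stream.reset();
              while (stream.incrementToken()) {
                  System.out.println(term);
              }
              stream.end();
          }
      }
  }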

Default language analysis

Lucene Linguistics out-of-the-box exposes the analysis components provided by the lucene-core and the lucene-analysis-common libraries. Other libraries with Lucene text analysis components (e.g. analysis-kuromoji) can be added to the application package as a Maven dependency.
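
For example, a dependency on the kuromoji components could be declared in the application's pom.xml like this (the version is a placeholder and should match the Lucene version on the classpath):

  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analysis-kuromoji</artifactId>
    <version>${lucene.version}</version>
  </dependency>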

Lucene Linguistics out-of-the-box provides analyzers for 40 languages:

  • Arabic
  • Armenian
  • Basque
  • Bengali
  • Bulgarian
  • Catalan
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Japanese
  • Korean
  • Kurdish
  • Latvian
  • Lithuanian
  • Nepali
  • Norwegian
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Spanish
  • Swedish
  • Tamil
  • Telugu
  • Thai
  • Turkish

The Lucene StandardAnalyzer is used for languages that have neither a custom nor a default analyzer.

Linguistics key

A linguistics key identifies a configuration of text analysis. It can be made of two parts, separated by a semicolon, though you can omit one or the other. The two parts are:

  • a profile, e.g. profile=whitespaceLowercase;
  • a language key, e.g. language=en (or just en when the key contains only a language).

The language key, in turn, has 2 parts: a mandatory language code and an optional stemming mode. The format is LANGUAGE_CODE[/STEM_MODE]. There are 5 stemming modes: NONE, DEFAULT, ALL, SHORTEST, BEST (they can be specified in the field schema).

Examples of linguistics keys:

  • profile=whitespaceLowercase: a profile that applies to all languages. You can bind it to different fields by specifying their linguistics profiles in the schema, as shown in the sketch after this list.
  • profile=whitespaceLowercase;language=en: a profile that applies to the English language. You'd still bind it to fields via their linguistics profiles in the schema, but it will only be applied to English texts (either at indexing or query time).
  • en: applies to all English texts where no profile is specified (in the schema or in the query).
  • en/BEST: English with the BEST stemming mode. Like the previous example, but only applies when stemming is set to BEST.
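
For illustration, binding a profile to a field in the schema could look like this (the field name is hypothetical; the same linguistics block appears in the n-gram example later in this document):

  field description type string {
    indexing: summary | index
    linguistics {
      profile: whitespaceLowercase
    }
  }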

Customizing text analysis

Lucene Linguistics provides multiple ways to customize text analysis per language:

  • LuceneLinguistics component configuration in the services.xml
  • ComponentsRegistry

LuceneLinguistics component configuration

In services.xml it is possible to construct an analyzer, from any text analysis components available on the classpath, by providing configuration for the LuceneLinguistics component. Example for the English language:

  <component id="linguistics"
             class="com.yahoo.language.lucene.LuceneLinguistics"
             bundle="your-bundle-name">
    <config name="com.yahoo.language.lucene.lucene-analysis">
      <configDir>lucene-linguistics</configDir>
      <analysis>
        <item key="profile=standardStopStem;language=en">
          <tokenizer>
            <name>standard</name>
          </tokenizer>
          <tokenFilters>
            <item>
              <name>stop</name>
              <conf>
                <item key="words">en/stopwords.txt</item>
                <item key="ignoreCase">true</item>
              </conf>
            </item>
            <item>
              <name>englishMinimalStem</name>
            </item>
          </tokenFilters>
        </item>
      </analysis>
    </config>
  </component>

Notes:

  • The item key value, profile=standardStopStem;language=en, is a linguistics key.
  • name values are the SPI names of the text analysis components. You'll typically find them in the Lucene analysis JavaDocs. For example, the name stop, along with its other options, can be found in the StopFilterFactory JavaDoc.
  • The en/stopwords.txt file must be placed in your application package under the lucene-linguistics directory, which is referenced by the configDir option.
  • If configDir is not provided, the files must be on the classpath.
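
For reference, a sketch of the matching application package layout (the schemas directory is shown only for context):

  .
  ├── services.xml
  ├── schemas/
  └── lucene-linguistics/        <- referenced by configDir
      └── en/
          └── stopwords.txt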

Components registry

The ComponentsRegistry mechanism can be used to set a Lucene Analyzer for a language.

<component
    id="en"
    class="org.apache.lucene.analysis.core.SimpleAnalyzer"
    bundle="your-bundle-name" />

Where:

  • id must be a linguistics key;
  • class is the implementation class that extends the Lucene Analyzer class;
  • bundle is the name of the application package as specified in the pom.xml (or any bundle added to your components dir that contains the class).

For this to work, the class must provide a public constructor that takes no arguments.

In case your analyzer class needs some initialization, you must wrap the analyzer in a class that implements the Provider<Analyzer> interface.
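
A minimal sketch of such a wrapper (the class name, stopword set, and use of EnglishAnalyzer are illustrative; Provider here is com.yahoo.container.di.componentgraph.Provider):

  import com.yahoo.container.di.componentgraph.Provider;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.en.EnglishAnalyzer;

  import java.util.List;

  // Hypothetical wrapper: constructs an analyzer that needs arguments,
  // which the components registry cannot instantiate directly.
  public class EnglishAnalyzerProvider implements Provider<Analyzer> {

      private final Analyzer analyzer =
              new EnglishAnalyzer(new CharArraySet(List.of("the", "a", "an"), true));

      @Override
      public Analyzer get() {
          return analyzer;
      }

      @Override
      public void deconstruct() {
          analyzer.close();  // Analyzer is Closeable
      }
  }

It is then registered in services.xml exactly like the SimpleAnalyzer example above, with a linguistics key as id and the provider class as class.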

Custom text analysis components

The text analysis components are loaded via the Java Service Provider Interface (SPI).

To use an external library that is properly prepared, it is enough to add the library to the application package as a Maven dependency.

In case you need to create a custom component the steps are:

  1. Implement a component in a Java class
  2. Register the component class in the matching SPI registry file on the classpath, e.g. META-INF/services/org.apache.lucene.analysis.TokenFilterFactory for a custom token filter (see the sketch below).
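
A minimal sketch of step 1, a custom token filter factory (all names are illustrative; it simply delegates to Lucene's LowerCaseFilter):

  import java.util.Map;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenFilterFactory;
  import org.apache.lucene.analysis.TokenStream;

  public class MyLowercaseFilterFactory extends TokenFilterFactory {

      // The SPI name: referenced as <name>myLowercase</name> in the analysis config.
      public static final String NAME = "myLowercase";

      public MyLowercaseFilterFactory(Map<String, String> args) {
          super(args);
          if (!args.isEmpty()) {
              throw new IllegalArgumentException("Unknown parameters: " + args);
          }
      }

      // Recent Lucene versions load factories with Java's ServiceLoader, which
      // requires a no-arg constructor; it is not used for actual construction.
      public MyLowercaseFilterFactory() {
          throw new UnsupportedOperationException("Use the Map constructor");
      }

      @Override
      public TokenStream create(TokenStream input) {
          return new LowerCaseFilter(input);
      }
  }

For step 2, the registry file lists fully qualified class names, one per line (the package is hypothetical):

  # META-INF/services/org.apache.lucene.analysis.TokenFilterFactory
  com.example.MyLowercaseFilterFactory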

Language Detection

Lucene Linguistics doesn't provide language detection. This means that for both feeding and searching you should provide a language parameter.
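
For illustration, a sketch of both sides (the language query parameter is an alias for model.locale in the Vespa query API, and set_language is the standard indexing expression; the field name is hypothetical):

At query time, pass the language with the request:

  /search/?query=dogs&language=en

At feed time, set it in the schema, assuming the document carries a language field:

  field language type string {
    indexing: set_language
  }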

Indexing all stems

Some analyzers expand the input text into multiple tokens at the same position, for example those based on the NGramTokenFilter. Here's a sample analyzer configuration:

<item key="profile=ngram;language=en">
  <tokenizer>
    <name>whitespace</name>
  </tokenizer>
  <tokenFilters>
    <item>
      <name>nGram</name>
      <conf>
        <item key="minGramSize">2</item>
        <item key="maxGramSize">2</item>
      </conf>
    </item>
  </tokenFilters>
</item>

This will take a text like dog and produce do and og as tokens (the original dog is also kept if the filter's preserveOriginal option is set). However, Vespa only takes the first token (do) and writes it to the index, ignoring the other "stems". As a result, a search for og will not match documents that contain dog, which defeats the whole point of using letter n-grams.

To index all stems, you can use the stemming parameter in the schema definition of your field:

field title_grams type string {
  indexing: summary | index
  linguistics {
    profile: ngram
  }
  stemming: multiple
}

Now, Vespa will index all stems, and a search for og will match documents that contain dog.