Lucene Linguistics

Lucene Linguistics is a custom linguistics implementation built on top of the Apache Lucene library. It lets you provide a Lucene analyzer to handle text processing for a language, with an optional variation per stemming mode.

Check sample apps to get started.

Crash course on Lucene text analysis

Lucene text analysis is the process of converting text into searchable tokens. It consists of a series of components applied to the text in order. The components are:

  • CharFilters: transform the text before it is tokenized, while providing corrected character offsets to account for these modifications.
  • Tokenizers: responsible for breaking up incoming text into tokens.
  • TokenFilters: responsible for modifying tokens that have been created by the Tokenizer.

A specific configuration of the above components is wrapped into an Analyzer object.

The text analysis works as follows:
  1. All char filters are applied, in the specified order, to the entire text string.
  2. The tokenizer splits the filtered text into tokens.
  3. Token filters are applied, in the specified order, to each token.

Default language analysis

Out of the box, Lucene Linguistics exposes the analysis components provided by the lucene-core and lucene-analysis-common libraries. Other libraries with Lucene text analysis components (e.g. analysis-kuromoji) can be added to the application package as a Maven dependency.
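As an illustration, a dependency on the Kuromoji analysis module might be declared in the application's pom.xml roughly as follows. This is a sketch: the coordinates follow the artifact naming Lucene uses since version 9, and the lucene.version property is a hypothetical placeholder that should match the Lucene version used by your Vespa release.

```xml
<!-- Sketch of a pom.xml dependency; lucene.version is a hypothetical property. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-kuromoji</artifactId>
  <version>${lucene.version}</version>
</dependency>
```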

Lucene Linguistics provides preconfigured analyzers for 40 languages out of the box:

  • Arabic
  • Armenian
  • Basque
  • Bengali
  • Bulgarian
  • Catalan
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Japanese
  • Korean
  • Kurdish
  • Latvian
  • Lithuanian
  • Nepali
  • Norwegian
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Spanish
  • Swedish
  • Tamil
  • Telugu
  • Thai
  • Turkish

The Lucene StandardAnalyzer is used for languages that have neither a custom nor a default analyzer.

Linguistics key

Linguistics keys identify a configuration of text analysis. A key has 2 parts: a mandatory language code and an optional stemming mode. The format is LANGUAGE_CODE[/STEM_MODE]. There are 5 stemming modes: NONE, DEFAULT, ALL, SHORTEST, BEST (they can be specified in the field schema).

Examples of linguistics keys:

  • en: English language.
  • en/BEST: English language with the BEST stemming mode.

Customizing text analysis

Lucene Linguistics provides multiple ways to customize the text analysis per language:

  • LuceneLinguistics component configuration in the services.xml
  • ComponentsRegistry

LuceneLinguistics component configuration

In services.xml, an analyzer can be constructed from any text analysis components available on the classpath by providing configuration for the LuceneLinguistics component. Example for the English language:

  <component id="linguistics"
             class="com.yahoo.language.lucene.LuceneLinguistics"
             bundle="your-bundle-name">
    <config name="com.yahoo.language.lucene.lucene-analysis">
      <configDir>lucene-linguistics</configDir>
      <analysis>
        <item key="en">
          <tokenizer>
            <name>standard</name>
          </tokenizer>
          <tokenFilters>
            <item>
              <name>stop</name>
              <conf>
                <item key="words">en/stopwords.txt</item>
                <item key="ignoreCase">true</item>
              </conf>
            </item>
            <item>
              <name>englishMinimalStem</name>
            </item>
          </tokenFilters>
        </item>
      </analysis>
    </config>
  </component>

Notes:

  • The key of the item under analysis (here key="en") is a linguistics key.
  • The en/stopwords.txt file must be placed in your application package under the lucene-linguistics directory.
  • If configDir is not provided, the files must be on the classpath.
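The three analysis stages (char filters, tokenizer, token filters) can be combined in a single entry, and a key with a stemming mode restricts the entry to that mode. The fragment below is a sketch under assumptions: it assumes the config accepts a charFilters list shaped like tokenFilters, which the example above does not show, and it uses htmlStrip and lowercase, the SPI names of Lucene's HTMLStripCharFilterFactory and LowerCaseFilterFactory.

```xml
<!-- Sketch: goes inside <analysis> of the LuceneLinguistics config. -->
<!-- The key en/BEST applies only to English with the BEST stemming mode. -->
<item key="en/BEST">
  <!-- Stage 1: char filters run on the whole text (charFilters is an assumed key). -->
  <charFilters>
    <item>
      <name>htmlStrip</name>
    </item>
  </charFilters>
  <!-- Stage 2: the tokenizer splits the filtered text into tokens. -->
  <tokenizer>
    <name>standard</name>
  </tokenizer>
  <!-- Stage 3: token filters run on each token, in the listed order. -->
  <tokenFilters>
    <item>
      <name>lowercase</name>
    </item>
    <item>
      <name>englishMinimalStem</name>
    </item>
  </tokenFilters>
</item>
```

Listing lowercase before the stemmer matters because token filters are applied in order, so the stemmer sees lowercased tokens.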

Components registry

The ComponentsRegistry mechanism can be used to set a Lucene Analyzer for a language.

<component
    id="en"
    class="org.apache.lucene.analysis.core.SimpleAnalyzer"
    bundle="your-bundle-name" />

Where:

  • id must be a linguistics key;
  • class is the implementation class, which must extend the Analyzer class;
  • bundle is the name of the application package as specified in the pom.xml (or can be any bundle added to your VAP components dir that contains the class).

For this to work, the class must provide only a constructor without arguments.

If your analyzer class needs some initialization, you must wrap the analyzer in a class that implements Provider<Analyzer>.
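Such a provider would presumably be registered the same way as a plain analyzer, with the linguistics key as the component id. In the sketch below, com.example.EnglishAnalyzerProvider is a hypothetical class that implements Provider<Analyzer> and builds the analyzer in its constructor; it is not part of Lucene Linguistics.

```xml
<!-- Sketch: com.example.EnglishAnalyzerProvider is a hypothetical Provider<Analyzer>. -->
<component
    id="en"
    class="com.example.EnglishAnalyzerProvider"
    bundle="your-bundle-name" />
```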

Custom text analysis components

The text analysis components are loaded via the Java Service Provider Interface (SPI).

To use an external library that already registers its components this way, it is enough to add the library to the application package as a Maven dependency.

If you need to create custom components, the steps are:

  1. Implement the component in a Java class.
  2. Register the component class in the matching service file on the classpath, e.g. META-INF/services/org.apache.lucene.analysis.TokenFilterFactory for a custom token filter.
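Once registered, a custom component can be referenced by its SPI name like any built-in one. The fragment below is a sketch: myCustom is a hypothetical name, assumed to be the name under which your TokenFilterFactory is exposed (recent Lucene versions read it from the factory's NAME constant).

```xml
<!-- Sketch: goes inside <analysis>; myCustom is a hypothetical SPI name. -->
<item key="en">
  <tokenizer>
    <name>standard</name>
  </tokenizer>
  <tokenFilters>
    <item>
      <!-- Resolved through the TokenFilterFactory service file on the classpath. -->
      <name>myCustom</name>
    </item>
  </tokenFilters>
</item>
```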

Language Detection

Lucene Linguistics doesn't provide language detection. This means that for both feeding and searching you should provide a language parameter.