Lucene Linguistics is a custom linguistics implementation on top of the Apache Lucene library. It lets you provide a Lucene analyzer to handle text processing for a language, with an optional variation per stemming mode.
Check sample apps to get started.
Lucene text analysis is the process of converting text into searchable tokens. It consists of a series of components applied to the text in order. The components are:
A specific configuration of the above components is wrapped into an Analyzer object. The text analysis works as follows:
Out of the box, Lucene Linguistics exposes the analysis components provided by the lucene-core and the lucene-analysis-common libraries. Other libraries with Lucene text analysis components (e.g. analysis-kuromoji) can be added to the application package as a Maven dependency.
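As a minimal sketch, adding such a library could look like the following entry in the dependencies section of the application package's pom.xml. The artifact id lucene-analysis-kuromoji is the name used for this module in Lucene 9 and later, and lucene.version is a hypothetical property that you would define yourself so that it matches the Lucene version your Lucene Linguistics bundle is built against.

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <!-- Assumption: artifact name as published for Lucene 9 and later -->
  <artifactId>lucene-analysis-kuromoji</artifactId>
  <!-- Assumption: lucene.version is a property defined in your own pom.xml -->
  <version>${lucene.version}</version>
</dependency>
```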
Out of the box, Lucene Linguistics provides configured analyzers for 40 languages:
The Lucene StandardAnalyzer is used for languages that have neither a custom nor a default analyzer.
Linguistics keys identify a configuration of text analysis.
A key has 2 parts: a mandatory language code and an optional stemming mode.
The format is
There are 5 stemming modes: NONE, DEFAULT, ALL, SHORTEST, BEST (they can be specified in the field schema).
Examples of linguistics keys:
en: English language.
en/BEST: English language with the
Lucene Linguistics provides multiple ways to customize the text analysis per language:
In the LuceneLinguistics component configuration in services.xml, an analyzer can be constructed out of all text analysis components (that are available on the classpath) by providing configuration for the
Example for the English language:
<component id="linguistics"
           class="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>lucene-linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>stop</name>
            <conf>
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>
            </conf>
          </item>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
The item key="en" value is a linguistics key.
The en/stopwords.txt file must be placed in your application package under the
If configDir is not provided, the files must be on the classpath.
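Because each item key is a linguistics key, the same structure can presumably carry a separate analyzer per stemming mode. The fragment below is a sketch under that assumption: it defines one analyzer for the key en and another for en/BEST. The lowercase and porterStem token filters are illustrative choices only, and are assumed to be available on the classpath through lucene-analysis-common.

```xml
<component id="linguistics"
           class="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <analysis>
      <!-- Analyzer for the linguistics key "en" (language code only) -->
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>lowercase</name>
          </item>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
      <!-- Analyzer for the linguistics key "en/BEST" (language code plus stemming mode) -->
      <item key="en/BEST">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>lowercase</name>
          </item>
          <!-- Illustrative choice: a more aggressive stemmer for the BEST mode -->
          <item>
            <name>porterStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
```

No stop word file is referenced here, so this sketch does not need a configDir.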
The ComponentsRegistry mechanism can be used to set a Lucene Analyzer for a language.
<component id="en" class="org.apache.lucene.analysis.core.SimpleAnalyzer" bundle="your-bundle-name" />
id must be a linguistics key;
class is the implementation class that extends the `Analyzer` class;
bundle is the name of the application package as specified in the pom.xml (or can be any bundle added to your VAP components dir that contains the class).
For this to work, the class must provide only a constructor without arguments.
In case your analyzer class needs some initialization you must wrap the analyzer into a class
that implements the
The text analysis components are loaded via the Java Service Provider Interface (SPI).
To use an external library that is properly prepared, it is enough to add the library to the application package as a Maven dependency.
In case you need to create custom components, the steps are:
META-INF/services/org.apache.lucene.analysis.TokenFilterFactory file that is on the classpath.
Lucene Linguistics doesn't provide language detection. This means that for both feeding and searching you should provide a language parameter.