Lucene Linguistics is a custom linguistics implementation built on top of the Apache Lucene library. It lets you provide a Lucene analyzer that handles text processing for a language, optionally with a different analyzer per stemming mode.
Check sample apps to get started.
Lucene text analysis is the process of converting text into searchable tokens. It consists of a series of components applied to the text in order. The components are character filters (CharFilters), a tokenizer, and token filters (TokenFilters).
A specific configuration of the above components is wrapped in an Analyzer object.
The text analysis works as follows: character filters transform the raw text, the tokenizer splits the result into tokens, and the token filters are then applied to the tokens in order.
Out of the box, Lucene Linguistics exposes the analysis components provided by the lucene-core and lucene-analysis-common libraries. Other libraries with Lucene text analysis components (e.g. analysis-kuromoji) can be added to the application package as a Maven dependency.
Out of the box, Lucene Linguistics provides preconfigured analyzers for 40 languages.
The Lucene StandardAnalyzer is used for languages that have neither a custom nor a default analyzer.
Linguistics keys identify a configuration of text analysis.
A key has two parts: a mandatory language code and an optional stemming mode.
The format is LANGUAGE_CODE[/STEM_MODE].
There are five stemming modes: NONE, DEFAULT, ALL, SHORTEST, and BEST (they can be specified in the field schema).
Examples of linguistics keys:
en: English language.
en/BEST: English language with the BEST stemming mode.
Lucene Linguistics provides multiple ways to customize the text analysis per language:
LuceneLinguistics component configuration in services.xml
ComponentsRegistry
In services.xml, an analyzer can be constructed from any of the text analysis components available on the classpath by providing configuration for the LuceneLinguistics component.
Example for the English language:
<component id="linguistics" class="com.yahoo.language.lucene.LuceneLinguistics" bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>lucene-linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>stop</name>
            <conf>
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>
            </conf>
          </item>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
Notes:
The key value in item key="en" is a linguistics key.
The en/stopwords.txt file must be placed in your application package under the lucene-linguistics directory.
If configDir is not provided, the files must be on the classpath.
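Because each item key is a linguistics key, the same configuration can hold a variation per stemming mode. The fragment below is a minimal sketch of that idea, not a recommended setup: the choice of the lowercase and porterStem token filters is only illustrative, and it assumes these filters are available on the classpath under those names.

```xml
<component id="linguistics" class="com.yahoo.language.lucene.LuceneLinguistics" bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <analysis>
      <!-- Key without a stemming mode: English language -->
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
      <!-- Key with a stemming mode: English language with the BEST stemming mode.
           The filters chosen here are an assumption made for illustration. -->
      <item key="en/BEST">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>lowercase</name>
          </item>
          <item>
            <name>porterStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
```

No configDir is set in this sketch because none of the listed filters reads an external file.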
The ComponentsRegistry mechanism can be used to set a Lucene Analyzer for a language.
<component id="en" class="org.apache.lucene.analysis.core.SimpleAnalyzer" bundle="your-bundle-name" />
Where:
id must be a linguistics key;
class is the implementation class that extends the `Analyzer` class;
bundle is the name of the application package as specified in pom.xml (or any bundle added to your VAP components dir that contains the class).
For this to work, the class must provide only a constructor without arguments.
If your analyzer class needs some initialization, you must wrap the analyzer in a class that implements Provider<Analyzer>.
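The two registration styles described above might look as follows in services.xml. This is a sketch under stated assumptions: com.example.MyAnalyzerProvider is a hypothetical class name standing in for your own Provider<Analyzer> implementation, the language codes are arbitrary, and the bundle named must actually contain the referenced classes.

```xml
<!-- Analyzer with a no-argument constructor, registered for English with the BEST stemming mode -->
<component id="en/BEST" class="org.apache.lucene.analysis.en.EnglishAnalyzer" bundle="your-bundle-name" />

<!-- Hypothetical wrapper class: assumed to implement Provider<Analyzer>
     and to build an analyzer that needs initialization -->
<component id="fr" class="com.example.MyAnalyzerProvider" bundle="your-bundle-name" />
```

In both cases the id is the linguistics key that selects when the analyzer applies.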
The text analysis components are loaded via the Java Service Provider Interface (SPI).
To use an external library that already registers its components this way, it is enough to add the library to the application package as a Maven dependency.
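As an illustration, the analysis-kuromoji library mentioned earlier could be added to the application package's pom.xml roughly like this. The artifact name follows the naming used by recent Lucene releases, and the lucene.version property is an assumption: it is a placeholder you would define yourself, and it should match the Lucene version that Lucene Linguistics is built against.

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-kuromoji</artifactId>
  <!-- lucene.version is an assumed, user-defined Maven property -->
  <version>${lucene.version}</version>
</dependency>
```

Once the library is on the classpath, its components can be referenced by name in the LuceneLinguistics configuration, in the same way as the built-in ones.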
If you need to create a custom component, implement it and register the implementation class in the matching SPI file, e.g. a META-INF/services/org.apache.lucene.analysis.TokenFilterFactory file that is on the classpath.
Lucene Linguistics doesn't provide language detection. This means that for both feeding and searching you should provide a language parameter.