Lucene Linguistics is a custom linguistics implementation based on the
Apache Lucene library.
It provides a Lucene analyzer to handle text processing for a language
with an optional variation per stemming mode.
Lucene text analysis is the process of converting text into searchable tokens.
This text analysis consists of a series of components applied to the text in order:
CharFilters: transform the text before it is tokenized, while providing corrected character offsets to account for these modifications.
Tokenizers: responsible for breaking up incoming text into tokens.
TokenFilters: responsible for modifying tokens that have been created by the Tokenizer.
A specific configuration of the above components is wrapped into an
Analyzer object.
The text analysis works as follows:
All char filters are applied, in the specified order, to the entire text string.
The tokenizer splits the resulting text into tokens.
Token filters are applied, in the specified order, to each token.
Default language analysis
Lucene Linguistics out-of-the-box exposes the analysis components provided
by the lucene-core
and the
lucene-analysis-common
libraries.
Other libraries with Lucene text analysis components
(e.g. analysis-kuromoji)
can be added to the application package as a Maven dependency.
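As a minimal sketch, a pom.xml dependency on the Kuromoji (Japanese) analysis library could look like the fragment below. The lucene.version property is a hypothetical placeholder; it is assumed here that the version should match the Lucene version your Vespa release is built against.

```xml
<!-- pom.xml fragment (sketch). ${lucene.version} is a hypothetical property:
     set it to the Lucene version used by your Vespa release. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-kuromoji</artifactId>
  <version>${lucene.version}</version>
</dependency>
```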
Lucene Linguistics out-of-the-box provides analyzers for 40 languages:
Arabic
Armenian
Basque
Bengali
Bulgarian
Catalan
Chinese
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
German
Greek
Hindi
Hungarian
Indonesian
Irish
Italian
Japanese
Korean
Kurdish
Latvian
Lithuanian
Nepali
Norwegian
Persian
Portuguese
Romanian
Russian
Serbian
Spanish
Swedish
Tamil
Telugu
Thai
Turkish
The Lucene StandardAnalyzer is used for languages that have neither a custom nor a default analyzer.
Linguistics key
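A linguistics key identifies a configuration of text analysis. It can be made of two parts,
separated by a semicolon, though you can omit one or the other. The two parts are:
A profile, written as profile=NAME.
A language key, written as language=LANGUAGE_KEY when combined with a profile, or on its own
as in the examples below.
The language key, in turn, has 2 parts: a mandatory
language code and an optional stemming mode.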
The format is LANGUAGE_CODE[/STEM_MODE].
There are 5 stemming modes: NONE, DEFAULT, ALL, SHORTEST, BEST (they can be specified in the field schema).
Examples of linguistics keys:
profile=whitespaceLowercase: a profile that applies to all languages. You can bind it to
different fields by specifying their linguistics profiles in the schema.
profile=whitespaceLowercase;language=en: a profile that applies to the English language.
You'd still bind it to fields via their linguistics profiles in the schema,
but it will only be applied to the English texts (either at indexing or query time).
en: English language: applies to all English texts where no profile is specified
(in the schema or in the query).
en/BEST: English language with the BEST stemming mode. Like the previous example,
but only applies when stemming is set to BEST.
Lucene Linguistics provides multiple ways to customize text analysis per language:
LuceneLinguistics component configuration in the services.xml
ComponentsRegistry
LuceneLinguistics component configuration
In services.xml it is possible to construct an analyzer from any of the text analysis components
available on the classpath, by providing configuration for the LuceneLinguistics component.
Example for the English language:
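The following is a minimal sketch of such a configuration, not a definitive setup. It assumes the lucene-analysis config definition of the LuceneLinguistics component, a standard tokenizer followed by a stop-word filter and an English stemmer, and a hypothetical bundle name my-vespa-app; adjust the filters to your own needs.

```xml
<!-- services.xml fragment (sketch). "my-vespa-app" is a hypothetical bundle name. -->
<component id="linguistics"
           class="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="my-vespa-app">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <!-- directory in the application package that holds files such as en/stopwords.txt -->
    <configDir>lucene-linguistics</configDir>
    <analysis>
      <!-- the item key is the linguistics key: here, the English language -->
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>stop</name>
            <conf>
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>
            </conf>
          </item>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
```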
The en/stopwords.txt file must be placed in your application package under
the lucene-linguistics directory, which is referenced by the configDir option.
If configDir is not provided, the files must be on the classpath.
Components registry
The ComponentsRegistry
mechanism can be used to set a Lucene Analyzer for a language.
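As a sketch, registering Lucene's SimpleAnalyzer for English could look like the fragment below. It assumes that the component id carries the language key, and my-vespa-app is a hypothetical bundle name.

```xml
<!-- services.xml fragment (sketch): the id "en" is assumed to select the language;
     "my-vespa-app" is a hypothetical bundle name. -->
<component id="en"
           class="org.apache.lucene.analysis.core.SimpleAnalyzer"
           bundle="my-vespa-app" />
```

In this declaration: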
class is the implementation class that extends the `Analyzer` class;
bundle is the name of the application package as specified in the pom.xml
(or any bundle added to your components dir that contains the class).
For this to work, the class must provide only a constructor without arguments.
If your analyzer class needs some initialization, you must wrap the analyzer in a class
that implements the Provider<Analyzer> interface.
Custom text analysis components
The text analysis components are loaded via the Java Service Provider Interface (SPI).
To use an external library that is properly prepared it is enough to add the
library to the application package as a Maven dependency.
If you need to create a custom component, the steps are:
Implement the component in a Java class.
Register the component class in the matching SPI file on the classpath, e.g. for a custom token filter
META-INF/services/org.apache.lucene.analysis.TokenFilterFactory.
Language Detection
Lucene Linguistics doesn't provide language detection.
This means that for both feeding and searching you should provide a
language parameter.
Indexing all stems
Some analyzers expand the input text into multiple tokens on the same position.
For example, those based on the NGramTokenFilter.
Here's a sample analyzer configuration:
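A minimal sketch of an n-gram profile, placed in the analysis section of the LuceneLinguistics component configuration, might look like this. The profile name ngram matches the schema example further down; the gram sizes are assumptions chosen so that a three-letter word yields two-letter grams.

```xml
<!-- fragment of the LuceneLinguistics component config (sketch) -->
<analysis>
  <item key="profile=ngram">
    <tokenizer>
      <name>standard</name>
    </tokenizer>
    <tokenFilters>
      <item>
        <name>nGram</name>
        <conf>
          <!-- assumed sizes: emit 2-character grams only -->
          <item key="minGramSize">2</item>
          <item key="maxGramSize">2</item>
        </conf>
      </item>
    </tokenFilters>
  </item>
</analysis>
```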
This will take a text like dog and produce do and og as tokens, plus (by default) the original dog.
However, Vespa only takes the first token (do) and writes it to the index, ignoring the other "stems". As a result,
a search for og will not match documents that contain dog, which is the whole point of using letter n-grams.
To index all stems, you can use the stemming parameter in the schema definition of your field:
field title_grams type string {
    indexing: summary | index
    linguistics {
        profile: ngram
    }
    stemming: multiple
}
Now, Vespa will index all stems, and a search for og will match documents that contain dog.
Note:
Queries look for all stems by default (regardless of the schema configuration). For example, a search for
dog would expand to do and og as well, looking for all three terms.
Recipe: autocomplete with edge n-grams
Edge n-grams turn a term into its growing prefixes — coffee becomes
c, co, cof, coff, coffe, coffee —
which is a common way to implement prefix matching and search-as-you-type. Reach for Lucene Linguistics edge n-grams
when you also need text normalization that the built-in
gram match does not provide, for example
mapping & to and so that "Barnes & Noble" also matches a search for
and. If you do not need normalization, prefer the built-in
gram matching (see the
search-as-you-type
sample app) — it grams the query and the document consistently and avoids the pitfalls below.
Getting edge n-grams right with Lucene Linguistics requires three pieces to line up. The two warnings below
are the mistakes that are easy to make and hard to spot.
Generate the grams with edgeNGram as a token filter, not a tokenizer.
Set stemming: multiple on the field, so that every gram is written to the index.
Use a separate, non-gramming profile for the query string (bound in the schema with
profile.index / profile.search).
Indexing: generate the grams
Warning:
Do not combine a length-changing
CharFilter
(for example a patternReplace that rewrites & to and) with an n-gram
tokenizer such as edgeNGram. The CharFilter rewrites character offsets, which makes the
n-gram tokenizer collapse several grams onto the same token position; only the first of those is then indexed, so grams
go silently missing. Generate n-grams with a token filter (after a regular tokenizer) instead, as shown
below.
Put the n-gram generation last in the tokenFilters chain, after a normal tokenizer and the
normalization filters. The standard tokenizer splits on whitespace and punctuation so each word is
grammed on its own; use the keyword tokenizer instead if you want prefixes of the whole field value.
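A sketch of such an indexing profile, under stated assumptions, is shown here. The profile name autocomplete matches the schema snippet that follows; the patternReplace char filter that rewrites & to and, the lowercase filter, and the maxGramSize of 10 are illustrative choices, not requirements.

```xml
<!-- fragment of the LuceneLinguistics component config (sketch) -->
<analysis>
  <item key="profile=autocomplete">
    <charFilters>
      <item>
        <name>patternReplace</name>
        <conf>
          <!-- "&" has to be escaped as "&amp;" inside XML -->
          <item key="pattern">&amp;</item>
          <item key="replacement">and</item>
        </conf>
      </item>
    </charFilters>
    <!-- a regular tokenizer, so the char filter's offset changes do not affect gram positions -->
    <tokenizer>
      <name>standard</name>
    </tokenizer>
    <tokenFilters>
      <item>
        <name>lowercase</name>
      </item>
      <!-- gram generation comes last in the chain -->
      <item>
        <name>edgeNGram</name>
        <conf>
          <item key="minGramSize">2</item>
          <item key="maxGramSize">10</item>
        </conf>
      </item>
    </tokenFilters>
  </item>
</analysis>
```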
The matching field must set stemming: multiple, otherwise only the first (shortest) gram of each word
is written to the index:
field name type string {
    indexing: index | summary
    linguistics {
        profile: autocomplete
    }
    stemming: multiple
}
Note:
Verify what gets indexed with a
tokens document summary. In that output a
nested array represents several tokens that share one position (the grams of one word) — that is expected,
not a bug.
Keep minGramSize at 2 or higher unless you really need single-character prefixes: a
minGramSize of 1 makes every term emit a one-character gram, so those posting lists grow to
cover a large fraction of the corpus.
Querying: do not gram the query string
Warning:
Do not apply the n-gram analyzer to the query string. The query term is itself
expanded into its grams — coffee becomes WORD_ALTERNATIVES [co, cof, coff, coffe, coffee],
which is an OR — so the query matches any document containing any of those grams. A
search for coffee would then also match "cottage" through the shared gram co, and the
smaller the minGramSize the worse this gets. Gram at indexing time only, and analyze the query with a
non-gramming profile.
Define a second profile that is identical to the indexing one but without the edgeNGram token
filter, so a query term stays a single token:
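As a sketch, the query-side profile could look like the fragment below. It assumes an indexing profile built from a patternReplace char filter, the standard tokenizer and a lowercase filter, and simply leaves out edgeNGram; keep whatever normalization your own indexing profile uses so both sides normalize text the same way.

```xml
<!-- fragment of the LuceneLinguistics component config (sketch) -->
<analysis>
  <item key="profile=autocomplete_query">
    <charFilters>
      <item>
        <name>patternReplace</name>
        <conf>
          <!-- "&" has to be escaped as "&amp;" inside XML -->
          <item key="pattern">&amp;</item>
          <item key="replacement">and</item>
        </conf>
      </item>
    </charFilters>
    <tokenizer>
      <name>standard</name>
    </tokenizer>
    <tokenFilters>
      <!-- same normalization as at indexing time, but no edgeNGram filter -->
      <item>
        <name>lowercase</name>
      </item>
    </tokenFilters>
  </item>
</analysis>
```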
Bind it as the query-side profile on the field by expanding profile into index and
search (see
Different processing for query strings). This
keeps the asymmetry in the schema, so ordinary queries do the right thing without any per-query annotation:
field name type string {
    indexing: index | summary
    linguistics {
        profile {
            index: autocomplete
            search: autocomplete_query
        }
    }
    stemming: multiple
}
A search for coffee now stays the single token coffee and matches only terms that start with
coffee, while the document side still holds every gram: