Vespa uses a linguistics module to process text in queries and documents during indexing and searching.
The goal of linguistic processing is to increase recall (how many documents are matched)
without hurting precision (the relevance of the documents matched) too much.
It consists of operations such as:
tokenizing text into chunks of known types such as words and punctuation.
normalizing accents.
finding the base form of words (stemming or lemmatization).
Linguistic processing is run both when writing documents and when querying.
It is applied to string fields
with the index indexing mode. Overview:
When writing documents, string fields with indexing: index are by default processed.
A field's language will configure this processing.
A document's fields can have the language set explicitly;
if not, it is detected.
The field's content is processed (e.g., tokenized, normalized, stemmed, etc.),
and the resulting terms are added to the index.
Note:
The language for the field is not persisted on the content node,
just the processed terms themselves.
A query is also processed in a similar fashion, typically through the same
linguistics profile as the field content,
producing the same terms from the same text.
The language of query strings is detected unless specified using
model.locale
or annotations like language.
Note:
This is a common source of query problems -
it is hard to detect language precisely from short strings.
The processed query is evaluated on the content nodes,
and will only work as expected if both documents and queries produce the same terms.
These operations can be turned on or off per field in the schema.
See implicitTransforms
for how to enable/disable transforms per query term.
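For example, a sketch of a field that turns off stemming and accent normalization (the field name title is only for illustration - see the schema reference for all options):
field title type string {
    indexing: summary | index
    stemming: none
    normalizing: none
}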
Linguistics implementations
Vespa comes with two linguistics variants out of the box:
OpenNLP and Lucene.
Check out the respective pages for more information on how to configure them.
You can also implement a custom Linguistics component.
The default linguistics variant is OpenNLP, but for the
rest of this page we'll go through common options, such as language handling, that are shared by
all implementations.
Note:
Linguistics implementations only control how text is tokenized,
including positional information. These tokens are stored in the same way in the underlying
index. For example, if you use Lucene linguistics, Vespa does not store information such as
positions in Lucene segment files. Storage is the same as with OpenNLP;
only the resulting tokens might differ.
Language handling
Vespa does not know the language of a document - the following applies:
The indexing processor is instructed on a per-field level what language to
use when calling the underlying linguistics library
The query processor is instructed on a per-query level what language to use
If no language is explicitly set in a document or a query,
Vespa will run its configured language detector (by default, OpenNLP language detection)
on the available text (the full content of a document field, or the full query= parameter value).
Unless the query explicitly declares a language,
a document containing the exact same word as a query might not be recallable,
because the language of the document field may be detected
differently from the language of the query.
Indexing with language
The indexing process run by Vespa is a sequential execution
of the indexing scripts of each field in the schema, in the declared order.
At any point, the script may set the language that will be used for indexing statements for subsequent fields,
using set_language.
Example:
schema doc {
document doc {
field language type string {
indexing: set_language
}
field title type string {
indexing: index
}
}
}
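With a schema like this, a document can set the language explicitly when fed - a minimal sketch, assuming a hypothetical namespace and document id:
{
    "put": "id:mynamespace:doc::1",
    "fields": {
        "language": "de",
        "title": "Eine kleine Nachtmusik"
    }
}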
If a language has not been set when tokenization of a field is run, the language is determined by
language detection.
If all documents have the same language, the language can be hardcoded in the schema in this way:
schema doc {
field language type string {
indexing: "en" | set_language
}
document doc {
...
If the same document contains fields in multiple languages, set_language can be invoked multiple times, e.g.:
schema doc {
document doc {
field language_title1 type string {
indexing: set_language
}
field title1 type string {
indexing: index
}
field language_title2 type string {
indexing: set_language
}
field title2 type string {
indexing: index
}
}
}
Or, if the language is fixed per field, use multiple indexing statements in each field:
schema doc {
document doc {
field my_english_field type string {
indexing {
"en" | set_language;
index;
}
}
field my_spanish_field type string {
indexing {
"es" | set_language;
index;
}
}
}
}
Field language detection
When indexing a document, if a field has unknown language (i.e. not set using set_language),
language detection is run on the field's content.
This means language detection is per field, not per document.
See query language detection for details on detection confidence;
fields with little text will default to English.
Querying with language
The content of an indexed string field is language-agnostic.
One must therefore apply a compatible tokenization on the query terms (e.g., stemming for the same language)
in order to match the content of that field.
The query parser subscribes to configuration that tells it which fields are indexed strings,
and every query term that targets such a field is run through the appropriate tokenization.
The language query parameter
controls the language state of these calls.
Because an index may simultaneously contain terms in any number of languages,
one can have stemmed variants of one language match the stemmed variants of another.
To work around this, store the language of a document in a separate attribute,
and apply a filter against that attribute at query-time.
By default, there is no knowledge anywhere that captures what
languages are used to generate the content of an index.
The language parameter only affects the transformation of query terms that hit tokenized indexes.
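A minimal sketch of the filter approach mentioned above, assuming a hypothetical doc_language attribute that is fed with each document:
field doc_language type string {
    # declared inside the document block
    indexing: summary | attribute
}
A query can then restrict matching to one language, e.g.:
select * from doc where title contains "dog" and doc_language contains "en"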
Query language detection
If no language parameter is used,
and the query terms are not annotated with a language,
the language detector is called to process the query string.
Queries are normally short; as a consequence, the detection confidence is low. Example:
$ vespa query "select * from music where userInput(@text)" \
tracelevel=3 text='Eine kleine Nachtmusik' | grep 'Stemming with language'
"message": "Stemming with language=ENGLISH"
$ vespa query "select * from music where userInput(@text)" \
tracelevel=3 text='Eine kleine Nachtmusik schnell' | grep 'Stemming with language'
"message": "Stemming with language=GERMAN"
See #24265 for details -
in short, with the current 0.02 confidence cutoff, queries with 3 terms or fewer will default to English.
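When the query language is known, set it explicitly instead of relying on detection, e.g. using the language parameter (model.locale works the same way) - the trace should then report GERMAN regardless of query length:
$ vespa query "select * from music where userInput(@text)" \
  tracelevel=3 text='Eine kleine Nachtmusik' language=de | grep 'Stemming with language'
  "message": "Stemming with language=GERMAN"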
Multiple languages
Vespa supports having documents in multiple languages in the same schema, but does not out-of-the-box
support cross-lingual retrieval (e.g., search using English and retrieve relevant documents written in German).
This is because the language of a query is determined
by the language of the query string and only one transformation can take place.
Approaches to overcome this limitation include:
Use semantic retrieval using a multilingual text embedding model (see blog post)
which has been trained on a multilingual corpus and can be used to retrieve documents in multiple languages.
Stem and tokenize the query using the relevant languages,
build a query tree using weakAnd or the
or operator,
and use equiv per stem variant.
This is most easily done in a custom Searcher, as mentioned in
#12154.
Example:
language=fr: machine learning => machin learn
language=en: machine learning => machine learn
Using weakAnd here as an example, as that technique is already mentioned in #12154:
we retrieve using all possible stems/base forms with weakAnd,
and use the rank operator
to pass in the original query form, so that ranking can rank literal (original) matches higher.
The benefit of equiv is that it allows multiple term variants to share the same position,
so that proximity ranking is not broken by this approach.
Note language=en there. It is optional: if it's not set,
the profile will be used for all languages. But you can have different
definitions for different languages on the same profile (e.g., different stemming).
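A minimal YQL sketch of the weakAnd/equiv/rank technique described above, assuming a title field and the English/French stems from the example (in practice the query tree is built programmatically in the custom Searcher):
select * from doc where rank(
    weakAnd(
        title contains equiv("machine", "machin"),
        title contains "learn"
    ),
    title contains phrase("machine", "learning")
)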
Different processing for query strings
For some use cases, you may want to process the query string differently than the document content.
Synonyms are a good example. If you expand dog to dog,puppy at query time,
it will match either term in the document anyway - no need to expand it at write-time.
To do this, you'd define a different profile for the query string. Like:
<itemkey="profile=whitespaceLowercaseSynonyms;language=en"><tokenizer><name>whitespace</name></tokenizer><tokenFilters><item><name>lowercase</name></item><item><name>synonymGraph</name><conf><!--
Synonyms file should contain something like:
dog,puppy
--><itemkey="synonyms">en/synonyms.txt</item></conf></item></tokenFilters></item>
Then, in the schema, expand profile to profile.index and profile.search:
field title type string {
indexing: summary | index
linguistics {
profile {
index: whitespaceLowercase
search: whitespaceLowercaseSynonyms
}
}
}
At this point, a query where title contains 'dog' will match a document whose title contains 'puppy'.
Overriding profile for query strings
At query time, you can force Vespa to use a specific profile to process the query string via grammar.profile.
This works with userInput() or text() operators.
For example, to use the whitespaceLowercase profile for the query string:
where title contains ({grammar.profile: 'whitespaceLowercase'}text('dog'))
Equivalent expression via userInput():
where {defaultIndex:'title', grammar.profile: 'whitespaceLowercase', grammar: 'linguistics'}userInput('dog')
Note:
You should use grammar=linguistics (like in the example above) with grammar.profile to ensure that there is no
additional processing (e.g., tokenization) besides what is already defined in the profile.
Troubleshooting linguistics processing
If your documents don't match as expected, there are two ways to get more information.
First, you can get the tokenized text for a field by using tokens
in the document summary.
For example, to get the original text and tokens for the title field:
document-summary debug-text-tokens {
summary title {}
summary title_tokens {
source: title
tokens
}
from-disk
}
Then, at query time, you can also get the tokens of the query string
by increasing the trace level:
{"yql":"select * from sources * where title contains \"dog\"","presentation.summary":"debug-text-tokens","model.locale":"en","trace.level":2}