Significance is a measure of how rare a term is in a collection of documents. Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the". Significance is often calculated as the inverse document frequency (IDF):
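\[IDF(t, N) = \log\left(\frac{N}{n_t}\right)\]
where \(N\) is the number of documents in the collection and \(n_t\) is the number of documents containing the term \(t\).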
Variations of IDF are used in bm25 and nativeRank.
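As a rough worked example, assume a collection of \(N = 1000\) documents in which "and" occurs in 500 documents and "car" in 100. These counts are illustrative, and the natural logarithm is assumed because the base is not specified above:
\[IDF(\text{and}, 1000) = \log\left(\frac{1000}{500}\right) = \log 2 \approx 0.69\]
\[IDF(\text{car}, 1000) = \log\left(\frac{1000}{100}\right) = \log 10 \approx 2.30\]
The rarer term gets the higher weight.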
A significance model provides the data necessary to calculate IDF, i.e. \(n_t\) for each term and \(N\) for the document collection. We distinguish between local and global significance models. A local model is node-specific and a global model is shared across nodes.
For string fields indexed with bm25 or nativeRank, Vespa creates a local significance model on each content node. Each node uses its own local model for the queries it processes.
Different nodes can have different significance values for the same term. In large collections, this difference is usually small and doesn’t affect ranking quality.
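To make the per-node difference concrete, here is a minimal sketch that computes plain IDF from two made-up sets of node statistics. The class name, node names, and counts are all hypothetical, and plain natural-log IDF stands in for whatever variation the rank feature actually applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of local significance models; not Vespa code. */
final class LocalSignificanceDemo {

    /** Plain IDF, log(N / n_t), using the natural logarithm (an assumption). */
    static double idf(long documentCount, long documentFrequency) {
        if (documentFrequency <= 0 || documentFrequency > documentCount) {
            throw new IllegalArgumentException("need 0 < n_t <= N");
        }
        return Math.log((double) documentCount / documentFrequency);
    }

    public static void main(String[] args) {
        // Made-up per-node statistics for the same term: {N, n_t}.
        Map<String, long[]> nodes = new LinkedHashMap<>();
        nodes.put("node-0", new long[] {1_000_000, 1_200});
        nodes.put("node-1", new long[] {1_000_500, 1_150});

        // Each node derives its own value from its own local counts.
        for (Map.Entry<String, long[]> node : nodes.entrySet()) {
            long[] stats = node.getValue();
            System.out.printf("%s: idf=%.4f%n", node.getKey(), idf(stats[0], stats[1]));
        }
    }
}
```

With counts this large the two values differ only in the second decimal, which matches the observation that the difference is usually small in large collections.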
One issue with the local models is that ranking is non-deterministic in the following cases:
Another issue is that local significance models are not available in streaming search: inverted indexes are not constructed, so IDF values can't be extracted. All significance values are therefore set to 1, the default value for unknown terms, and the missing significance values may degrade ranking quality.
A global significance model addresses these issues.
In a global significance model, significance values are shared across nodes and don’t change when new documents are added. There are two ways to provide a global model:
Document frequency and document count can be specified in YQL, e.g.:
select * from example where content contains ({documentFrequency: {frequency: 13, count: 101}}"colors")
Alternatively, significance values can be specified in YQL directly and used instead of computed IDF values, e.g.:
select * from example where content contains ({significance:0.9}"neurotransmitter")
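When queries are assembled by a client, the two annotation forms above can be generated with small helpers. The following is only a sketch: the class and method names are hypothetical, and the escaping rule (backslash-escaping quotes and backslashes inside the term) is an assumption about YQL string literals rather than something stated here.

```java
/** Hypothetical helpers that format the YQL annotations shown above. */
final class YqlSignificanceAnnotations {

    /** Wraps the term in double quotes, escaping backslashes and quotes (assumed escaping rule). */
    static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Annotates a term with its document frequency and the document count. */
    static String withDocumentFrequency(String term, long frequency, long count) {
        return "({documentFrequency: {frequency: " + frequency + ", count: " + count + "}}"
                + quote(term) + ")";
    }

    /** Annotates a term with an explicit significance value. */
    static String withSignificance(String term, double significance) {
        return "({significance:" + significance + "}" + quote(term) + ")";
    }

    public static void main(String[] args) {
        String prefix = "select * from example where content contains ";
        System.out.println(prefix + withDocumentFrequency("colors", 13, 101));
        System.out.println(prefix + withSignificance("neurotransmitter", 0.9));
    }
}
```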
Document frequency and significance values can also be set in a custom searcher:
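private void setDocumentFrequency(WordItem item, long frequency, long numDocuments) {
    // Attach n_t (frequency) and N (numDocuments) to the query term itself;
    // getWord() returns the term as a plain String, which has no such setter.
    item.setDocumentFrequency(new DocumentFrequency(frequency, numDocuments));
}

private void setSignificance(WordItem item, float significance) {
    // Set the significance directly; it is used instead of a computed IDF value.
    item.setSignificance(significance);
}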
The significance element in services.xml specifies one or more models:
<container version="1.0">
    <search>
        <significance>
            <model model-id="significance-en-wikipedia-v1"/>
            <model url="https://some/uri/mymodel.multilingual.json" />
            <model path="models/mymodel.no.json.zst" />
        </significance>
    </search>
</container>
Vespa Cloud users have access to pre-built models, identified by model-id. In addition, all users can specify their own models by providing a url to an external resource or a path to a model file within the application package.
Vespa provides a command line tool to generate model files from documents.
The order in which the models are specified determines the model precedence, see model resolution for details.
In addition to adding models in services.xml, the significance feature must be enabled in the rank-profile section of the schema, e.g.:
schema example {
    document example {
        field content type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    rank-profile default {
        significance {
            use-model: true
        }
    }
}
The model will be applied to all query terms except those that already have significance values from the query.
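The precedence described above can be sketched as a small lookup: a value supplied with the query wins, and the model is consulted only when the query supplies none. Everything in this block (the class, the `Model` record, the `resolve` method) is a hypothetical illustration rather than Vespa's implementation, and plain natural-log IDF is assumed for the model-derived value. A term the model does not know yields an empty result here, leaving the fallback to the caller.

```java
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical sketch of query-over-model precedence; not Vespa code. */
final class SignificanceResolution {

    /** Stand-in for a loaded model: document count N and per-term n_t. */
    record Model(long documentCount, Map<String, Long> documentFrequencies) {}

    static OptionalDouble resolve(String term, OptionalDouble fromQuery, Model model) {
        // A significance value that came with the query is kept as-is.
        if (fromQuery.isPresent()) {
            return fromQuery;
        }
        // Otherwise fall back to the model, if it knows the term.
        Long documentFrequency = model.documentFrequencies().get(term);
        if (documentFrequency == null || documentFrequency <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Math.log((double) model.documentCount() / documentFrequency));
    }

    public static void main(String[] args) {
        Model model = new Model(1000, Map.of("car", 100L));
        System.out.println(resolve("car", OptionalDouble.of(0.9), model)); // query value kept
        System.out.println(resolve("car", OptionalDouble.empty(), model)); // derived from model
        System.out.println(resolve("zebra", OptionalDouble.empty(), model)); // unknown to model
    }
}
```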
Specifying significance models in services.xml is available in Vespa as of version 8.426.8.
The significance model file is a JSON file that contains term document frequencies and document count for one or more languages, e.g.
{
    "version": 1,
    "id": "wikipedia",
    "description": "Some optional description",
    "languages": {
        "en": {
            "description": "Some optional description for English model",
            "document-count": 1000,
            "document-frequencies": {
                "and": 500,
                "car": 100,
                ...
            }
        },
        "no": {
            "description": "Some optional description for Norwegian model",
            "document-count": 800,
            "document-frequencies": {
                "bil": 80,
                "og": 400,
                ...
            }
        }
    }
}
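To show how the fields relate, the structure above can be mirrored with a few Java records and a per-language lookup. This is an in-memory sketch only: the record and method names are hypothetical, no JSON parsing is done, and only the terms listed in the example are included.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory mirror of the model file layout; not Vespa code. */
final class SignificanceModelFile {

    record LanguageModel(String description, long documentCount, Map<String, Long> documentFrequencies) {}

    record ModelFile(int version, String id, String description, Map<String, LanguageModel> languages) {}

    /** The two numbers needed to compute IDF for one term. */
    record TermStats(long documentFrequency, long documentCount) {}

    /** Returns n_t and N for a term in a language, or empty if either is missing from the file. */
    static Optional<TermStats> lookup(ModelFile file, String language, String term) {
        LanguageModel languageModel = file.languages().get(language);
        if (languageModel == null) {
            return Optional.empty();
        }
        Long documentFrequency = languageModel.documentFrequencies().get(term);
        if (documentFrequency == null) {
            return Optional.empty();
        }
        return Optional.of(new TermStats(documentFrequency, languageModel.documentCount()));
    }

    /** The listed entries from the example file; elided terms are not filled in. */
    static ModelFile example() {
        return new ModelFile(1, "wikipedia", "Some optional description", Map.of(
                "en", new LanguageModel("Some optional description for English model", 1000,
                        Map.of("and", 500L, "car", 100L)),
                "no", new LanguageModel("Some optional description for Norwegian model", 800,
                        Map.of("bil", 80L, "og", 400L))));
    }

    public static void main(String[] args) {
        System.out.println(lookup(example(), "no", "bil"));
        System.out.println(lookup(example(), "en", "bil"));
    }
}
```

Note that the same term can have different statistics, or none at all, depending on the language section it is looked up in.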
A significance model file can be compressed with zstandard when included in the application package or made available via a URL.
Vespa provides a CLI tool for generating model files from Vespa documents. It uses the same linguistic module as query processing to extract tokens and their document frequencies.
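Conceptually, generating a model means counting, for each token, the number of documents it appears in. The sketch below illustrates that idea only and is not the CLI tool: the class is hypothetical, and a naive lowercase-and-split tokenizer stands in for Vespa's linguistic module.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of document-frequency counting; not the Vespa CLI tool. */
final class DocumentFrequencyCounter {

    /** Naive tokenizer (an assumption): lowercase, split on anything that is not a letter or digit. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Counts each term once per document, however often it occurs in that document. */
    static Map<String, Long> documentFrequencies(List<String> documents) {
        Map<String, Long> frequencies = new TreeMap<>();
        for (String document : documents) {
            for (String term : tokens(document)) {
                frequencies.merge(term, 1L, Long::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("The car and the bike", "A car", "and");
        System.out.println("document-count: " + documents.size());
        System.out.println("document-frequencies: " + documentFrequencies(documents));
    }
}
```

Using a set per document is what makes this a document frequency rather than a raw term count: "the" occurs twice in the first document but is counted once.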
Model resolution selects a model from the models specified in services.xml based on the language of the query. The language can be either explicitly tagged or implicitly detected.
The resolution logic is as follows: