Significance Model

Significance is a measure of how rare a term is in a collection of documents. Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the". Significance is often calculated as the inverse document frequency (IDF):

\[IDF(t, N) = \log\left(\frac{N}{n_t}\right)\]

where:

  • \(N\) is the total number of documents in the collection
  • \(n_t\) is the number of documents containing the term \(t\)

Variations of IDF are used in bm25 and nativeRank.
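
As a worked example of the formula above, the following minimal sketch computes plain IDF. It assumes the natural logarithm, since the formula does not state a base, and it is not Vespa's internal implementation (bm25 and nativeRank use their own variations). The class and method names are illustrative.

```java
// Illustrative only: plain IDF as defined above, not Vespa's ranking code.
final class IdfExample {

    /** Returns log(N / n_t), using the natural logarithm (an assumed base). */
    static double idf(long totalDocuments, long documentsWithTerm) {
        if (documentsWithTerm <= 0 || documentsWithTerm > totalDocuments) {
            throw new IllegalArgumentException("expected 0 < n_t <= N");
        }
        return Math.log((double) totalDocuments / documentsWithTerm);
    }

    public static void main(String[] args) {
        // A rare term (10 of 1,000,000 documents) scores far higher than a common one.
        System.out.println("rare:   " + idf(1_000_000, 10));
        System.out.println("common: " + idf(1_000_000, 900_000));
    }
}
```

A term that occurs in every document gets an IDF of 0 under this definition, which is why very common terms contribute little to ranking.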

A significance model provides the data necessary to calculate IDF, i.e. \(n_t\) for each term and \(N\) for the document collection. We distinguish between local and global significance models. A local model is node-specific, whereas a global model is shared across nodes.

Local significance model

For string fields indexed with bm25 or nativeRank, Vespa creates a local significance model on each content node. Each node uses its own local model for the queries it processes.

Different nodes can have different significance values for the same term. In large collections, this difference is usually small and doesn’t affect ranking quality.

One issue with the local models is that ranking is non-deterministic in the following cases:

  1. When new documents are added, local models on affected content nodes are updated.
  2. When the content cluster redistributes documents across nodes, e.g. when nodes are added or removed for scaling or failure recovery, the models change on the nodes involved.
  3. When using grouped distribution, queries can return different results depending on which group processes them.
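
The cases above can be illustrated with a small sketch. The per-node document counts and frequencies below are made-up numbers, not measurements from a real cluster, and the plain natural-log IDF is an assumption rather than the exact value Vespa computes.

```java
// Illustrative only: hypothetical per-node statistics for a single term.
final class LocalModelDrift {

    /** Plain IDF, log(N / n_t), with the natural logarithm (assumed base). */
    static double idf(long totalDocuments, long documentsWithTerm) {
        return Math.log((double) totalDocuments / documentsWithTerm);
    }

    public static void main(String[] args) {
        // The same term on two content nodes holding different slices of the corpus.
        double nodeA = idf(500_000, 120);
        double nodeB = idf(480_000, 95);
        System.out.println("node A: " + nodeA + ", node B: " + nodeB);

        // Feeding 1,000 documents to node A, 40 of which contain the term, shifts its value.
        double nodeAAfterFeed = idf(501_000, 160);
        System.out.println("node A after feed: " + nodeAAfterFeed);
    }
}
```

The same query term therefore gets a slightly different weight depending on which node, and which point in time, evaluates it.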

Another issue is that local significance models are not available in streaming search: no inverted indexes are constructed, so IDF values can't be extracted. All significance values are set to 1, which is the default value for unknown terms. The lack of significance values may degrade ranking quality.

A global significance model addresses these issues.

Global significance model

In a global significance model, significance values are shared across nodes and don’t change when new documents are added. There are three ways to provide a global model:

  1. Include significance values in a query.
  2. Set significance values in a searcher.
  3. Specify models in services.xml.

Significance values in a query

Document frequency and document count can be specified in YQL, e.g.:

select * from example where content contains ({documentFrequency: {frequency: 13, count: 101}}"colors")

Alternatively, significance values can be specified in YQL directly and used instead of computed IDF values, e.g.:

select * from example where content contains ({significance:0.9}"neurotransmitter")
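
When queries are generated by client code, the annotations above can be assembled programmatically. The following is a minimal sketch: the class and method names are hypothetical, and it assumes the term contains no double quotes or backslashes that would need escaping. It only builds strings and does not send a request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helper for building annotated YQL; not part of any Vespa client library.
final class SignificanceYql {

    /** Annotates the term with its document frequency (n_t) and the document count (N). */
    static String withDocumentFrequency(String source, String field, String term,
                                        long frequency, long count) {
        return "select * from " + source + " where " + field + " contains "
                + "({documentFrequency: {frequency: " + frequency + ", count: " + count + "}}"
                + "\"" + term + "\")";
    }

    /** Annotates the term with a precomputed significance value. */
    static String withSignificance(String source, String field, String term, double significance) {
        return "select * from " + source + " where " + field + " contains "
                + "({significance:" + significance + "}\"" + term + "\")";
    }

    /** URL-encodes the statement as the value of the yql request parameter. */
    static String asQueryParameter(String yql) {
        return "yql=" + URLEncoder.encode(yql, StandardCharsets.UTF_8);
    }
}
```

For instance, `withDocumentFrequency("example", "content", "colors", 13, 101)` reproduces the first statement above.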

Significance values in a searcher

Document frequency and significance values can also be set in a custom searcher:

// Attach the document frequency (n_t) and document count (N) to a query term.
private void setDocumentFrequency(WordItem item, long frequency, long numDocuments) {
    item.setDocumentFrequency(new DocumentFrequency(frequency, numDocuments));
}

// Attach a precomputed significance value, used instead of a computed IDF value.
private void setSignificance(WordItem item, float significance) {
    item.setSignificance(significance);
}

Significance models in services.xml

The significance element in services.xml specifies one or more models:

<container version="1.0">
    <search>
        <significance>
            <model model-id="significance-en-wikipedia-v1"/>
            <model url="https://some/uri/mymodel.multilingual.json" />
            <model path="models/mymodel.no.json.zst" />
        </significance>
    </search>
</container>

Vespa Cloud users have access to pre-built models, identified by model-id. In addition, all users can specify their own models by providing a url to an external resource or a path to a model file within the application package. Vespa provides a command line tool to generate model files from documents. The order in which the models are specified determines the model precedence, see model resolution for details.

In addition to adding models in services.xml, the significance feature must be enabled in the rank-profile section of the schema, e.g.

schema example {
    document example {
        field content type string {
            indexing: index | summary
            index: enable-bm25
        }
    }

    rank-profile default {
        significance {
            use-model: true
        }
    }
}

The model will be applied to all query terms except those that already have significance values from the query.

Specifying significance models in services.xml is available in Vespa as of version 8.426.8.

Significance model file

The significance model file is a JSON file that contains term document frequencies and document count for one or more languages, e.g.

{
  "version": 1,
  "id": "wikipedia",
  "description": "Some optional description",
  "languages": {
    "en": {
      "description": "Some optional description for English model",
      "document-count": 1000,
      "document-frequencies": {
        "and": 500,
        "car": 100,
        ...
      }
    },
    "no": {
      "description": "Some optional description for Norwegian model",
      "document-count": 800,
      "document-frequencies": {
        "bil": 80,
        "og": 400,
        ...
      }
    }
  }
}

A significance model file can be compressed with zstandard when included in the application package or made available via a URL.

Vespa provides a CLI tool for generating model files from Vespa documents. It uses the same linguistics module as query processing to extract tokens and their document frequencies.
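
To make the contents of a model file concrete, the following sketch shows how document-count and document-frequencies relate to a set of documents. It is not the Vespa CLI tool: it assumes lowercasing and whitespace tokenization in place of Vespa's linguistics module, and the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: naive tokenization, not Vespa's linguistics processing.
final class DocumentFrequencyCounter {

    /** Counts, for each term, the number of documents that contain it at least once. */
    static Map<String, Long> documentFrequencies(List<String> documents) {
        Map<String, Long> frequencies = new TreeMap<>();
        for (String document : documents) {
            // Collect distinct terms first so repeats within one document count once.
            Set<String> distinctTerms = new HashSet<>();
            for (String token : document.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) distinctTerms.add(token);
            }
            for (String term : distinctTerms) {
                frequencies.merge(term, 1L, Long::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("the car and the bike", "car and truck", "bike");
        System.out.println("document-count: " + documents.size());
        System.out.println("document-frequencies: " + documentFrequencies(documents));
    }
}
```

Note that "the" occurs twice in the first document but still contributes only 1 to its document frequency, because the model counts documents, not occurrences.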

Model resolution

Model resolution selects a model from the models specified in services.xml based on the language of the query. The language can be either explicitly tagged or implicitly detected.

The resolution logic is as follows:

  • When language is explicitly tagged
    • Select the last specified model that has the tagged language. Fail if none are available.
    • If the language is tagged as “un” (unknown), select the model for “un” first, fall back to “en” (English). Fail if none are available.
  • When language is implicitly detected
    • Select the last specified model with the detected language. If not available, try “un” and then “en” languages. Fail if none are available.
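
The resolution rules above can be sketched as follows. This is a simplified model of the logic, not Vespa's implementation: each configured model is represented only by the set of language codes it contains, the class and method names are hypothetical, and it assumes that the fallback steps also pick the last specified model with the fallback language.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative only: models are reduced to their language sets, in services.xml order.
final class ModelResolution {

    /** Returns the index of the last specified model that contains the language, if any. */
    static Optional<Integer> lastWith(List<Set<String>> models, String language) {
        for (int i = models.size() - 1; i >= 0; i--) {
            if (models.get(i).contains(language)) return Optional.of(i);
        }
        return Optional.empty();
    }

    /** Resolves a model index for the query language, or fails if none is available. */
    static int resolve(List<Set<String>> models, String language, boolean explicitlyTagged) {
        List<String> candidates;
        if (language.equals("un")) {
            candidates = List.of("un", "en");           // unknown: try "un", then English
        } else if (explicitlyTagged) {
            candidates = List.of(language);             // explicit tag: no fallback
        } else {
            candidates = List.of(language, "un", "en"); // detected: fall back to "un", "en"
        }
        for (String candidate : candidates) {
            Optional<Integer> match = lastWith(models, candidate);
            if (match.isPresent()) return match.get();
        }
        throw new IllegalArgumentException("No significance model for language '" + language + "'");
    }
}
```

With three models covering {en}, {en, no} and {no}, in that order, an explicitly tagged "en" query resolves to the second model, while an explicitly tagged "de" query fails because explicit tags have no fallback.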