• [+] expand all

Getting started with ranking

Learn how ranking works in Vespa by using the open query API of vespa-documentation-search. In this article, find a set of queries invoking different rank-profiles, which is the ranking definition.

Ranking is the user-defined computation that scores documents to a query, here configured in doc.sd, also see schema documentation. This schema has a set of (contrived) ranking functions, to help learn Vespa ranking.

Ranking using document features only

Let's start with something simple: Irrespective of the query, score all documents by the number of in-links to it. That is, for any query, return the documents with most in-links first in the result set (these queries are clickable!):

yql=select * from doc where true&ranking=inlinks

The score, named relevance in query results, is the size of the inlinks attribute array in the document, as configured in the expression:

rank-profile inlinks {
    first-phase {
        expression: attribute(inlinks).count
    }
    summary-features {
        attribute(inlinks).count
    }
}

Count the number of entries in inlinks in the result and compare with relevance - it will be the same. Observe that the ranking expression does not use any features from the query, it only uses attribute(inlinks).count, which is a document feature.

Observing values used in ranking

When developing ranking expressions, it is useful to observe the input values. Output the input values using summary-features. In this experiment, we will use another rank function, still counting in-links but scoring older documents lower:

$$ num\_inlinks * {decay\_const}^{doc\_age\_seconds/3600} $$

Notes:

yql=select * from doc where true&ranking=inlinks_age

rank-profile inlinks_age {
    first-phase {
        expression: rank_score
    }
    summary-features {
        attribute(inlinks).count
        attribute(last_updated)
        now
        doc_age_seconds
        age_decay
        num_inlinks
        rank_score
    }
    constants {
        decay_const: 0.9
    }
    function doc_age_seconds() {
        expression: now - attribute(last_updated)
    }
    function age_decay() {
        expression: pow(decay_const, doc_age_seconds/3600)
    }
    function num_inlinks() {
        expression: attribute(inlinks).count
    }
    function rank_score() {
        expression: num_inlinks * age_decay
    }
}

In the query results, here we observe a document with 27 in-links, 9703 seconds old, get at relevance at 20.32 (the age of documents will vary with query time):

"relevance": 20.325190122213748,
...
"summaryfeatures": {
    "attribute(inlinks).count": 27.0,
    "attribute(last_updated)": 1.615971522E9,
    "now": 1.615981225E9,
    "rankingExpression(age_decay)": 0.7527848193412499,
    "rankingExpression(doc_age_seconds)": 9703.0,
    "rankingExpression(num_inlinks)": 27.0,
    "rankingExpression(rank_score)": 20.325190122213748,
}

Using summary-features makes it easy to validate and develop the ranking expression.

Ranking with query features

Let's assume we want to find similar documents, and we define document similarity as having the same number of words. From most perspectives, this is a poor similarity function, better functions are described later.

The documents have a term_count field - so let's add an input.query() for term count:

yql=select * from doc where true;&ranking=term_count_similarity&input.query(q_term_count)=1000

$$ 1 - \frac{fabs(attribute(term\_count) - query(q\_term\_count))}{1 + attribute(term\_count) + query(q\_term\_count)} $$

rank-profile term_count_similarity {
    first-phase {
        expression {
            1 -
            fabs(    attribute(term_count) - query(q_term_count) ) /
                (1 + attribute(term_count) + query(q_term_count) )
        }
    }
    summary-features {
        attribute(term_count)
        query(q_term_count)
    }
}

This rank function will score documents [0-1>, closer to 1 is more similar:

"relevance": 0.9985029940119761,
...    
"summaryfeatures": {
    "attribute(term_count)": 1003.0,
    "query(q_term_count)": 1000.0,
}

The key learning here is how to transfer ranking features in the query, using input.query(). Use different names for more query features.

Ranking with a query tensor

Another similarity function can be overlap in in-links. We will map the inlinks weightedset into a tensor, query with a tensor of same type and create a scalar using a tensor product as the rank score. We use a mapped query tensor, where the document name is the address in the tensor, using a value of 1 for each in-link:

{
    {links:/en/query-profiles.html}:1,
    {links:/en/page-templates.html}:1,
    {links:/en/overview.html}:1
}

As the in-link data is represented in a weightedset, we use the tensorFromWeightedSet rank feature to transform it into a tensor named links:

rank-profile inlink_similarity  {
    inputs {
        query(links) tensor<float>(links{})
    }
    first-phase {
        expression: sum(tensorFromWeightedSet(attribute(inlinks), links) * query(links))
    }
    summary-features {
        query(links)
        tensorFromWeightedSet(attribute(inlinks), links)
    }
}

yql=select * from doc where true&ranking=inlink_similarity&input.query(links)={ {links:/en/query-profiles.html}:1, {links:/en/page-templates.html}:1, {links:/en/overview.html}:1 }

Inspect relevance and summary-features:

"relevance": 2.0
...
"summaryfeatures": {
  "query(links)": {
    "type": "tensor<float>(links{})",
    "cells": [
      { "address": { "links": "/en/query-profiles.html" }, "value": 1 },
      { "address": { "links": "/en/page-templates.html" }, "value": 1 },
      { "address": { "links": "/en/overview.html" },       "value": 1 } ]
  },
  "tensorFromWeightedSet(attribute(inlinks),links)": {
    "type": "tensor(links{})",
    "cells": [
      { "address": { "links": "/en/page-templates.html" },             "value": 1 },
      { "address": { "links": "/en/jdisc/container-components.html" }, "value": 1 },
      { "address": { "links": "/en/query-profiles.html" },             "value": 1 } ]
  }
}

Here, the tensors have one dimension, so they are vectors - the sum of the tensor product is hence the doc product. As all values are 1, all products are 1 and the sum is 2:

document query value
/en/jdisc/container-components.html0
/en/overview.html0
/en/page-templates.html/en/page-templates.html1
/en/query-profiles.html/en/query-profiles.html1

Change values in the query tensor to see difference in rank score, setting different weights for links.

Summary: The problem of comparing two lists of links is transformed into a numerical problem of multiplying two occurrence vectors, summing co-occurrences and ranking by this sum:

sum(tensorFromWeightedSet(attribute(inlinks), links) * query(links))

Notes:

  • Query tensors can grow large. Applications will normally create the tensor in code using a Searcher, also see example.
  • Here the document tensor is created from a weighted set - a better way would be to store this in a tensor in the document to avoid the transformation.

Retrieval and ranking

So far in this guide, we have run the ranking function over all documents. This is a valid use case for many applications. However, ranking documents is generally CPU-expensive, optimizing by reducing the candidate set will increase performance. Example query using text matching, dumping calculated rank features:

yql=select * from doc where title contains "document"&ranking.listFeatures

See the long list of rank features calculated per result. However, the query filters on documents with "document" in the title, so the features are only calculated for the small set of matching documents.

Running a filter like this is document retrieval. Another good example is web search - the user query terms are used to retrieve the candidate set cheaply (from billions of documents), then one or more ranking functions are applied to the much smaller candidate set to generate the ranked top-ten. Another way to look at it is:

  • In the retrieval (recall) phase, find all relevant documents
  • In the ranking phase, show only relevant documents.

Still, the candidate set after retrieval can be big, a query can hit all documents. Ranking all candidates is not possible in many applications.

Splitting the ranking into two phases is another optimization - use an inexpensive ranking expression to sort out the least promising candidates before spending most resources on the highest ranking candidates. In short, use increasingly more power per document as the candidate set shrinks:

Retrieval and ranking]

Let's try the same query again, with a two-phase rank-profile that also does an explicit rank score cutoff:

yql=select * from doc where title contains "attribute"&ranking=inlinks_twophase

rank-profile inlinks_twophase inherits inlinks_age {
    first-phase {
        keep-rank-count       : 50
        rank-score-drop-limit : 10
        expression            : num_inlinks
    }
    second-phase {
        expression            : rank_score
    }
}

Note how using rank-profile inherits is a smart way to define functions once, then use in multiple rank-profiles. Read more about schema inheritance. Here, num_inlinks and rank_score are defined in a rank profile we used earlier:

    function num_inlinks() {
        expression: attribute(inlinks).count
    }

In the results, observe that no document has a rankingExpression(num_inlinks) less than or equal to 10.0, meaning all such documents were purged in the first ranking phase due to the rank-score-drop-limit. Normally, the rank-score-drop-limit is not used, as the keep-rank-count is most important. Read more in the reference.

For a dynamic limit, pass a ranking feature like query(threshold) and use an if statement to check if the score is above the threshold or not - if below, assign -1 (something lower than the rank-score-drop-limit) and have it dropped. Read more in ranking expressions.

Two-phased ranking is a performance optimization - this guide is about functionality, so the rest of the examples will only be using one ranking phase. Read more in first-phase.

Retrieval: AND, OR, weakAnd

This guide will not go deep in query operators in the retrieval phase, see query-api for details.

Consider a query like "vespa documents about ranking and retrieval". A query AND-ing these terms hits less than 3% of the document corpus, missing some of the documents about ranking and retrieval:

yql=select * from doc where (default contains "vespa" AND default contains "documents" AND default contains "about" AND default contains "ranking" AND default contains "and" AND default contains "retrieval")

Alternatively, OR-ing the terms hits more than 95% of the documents, unable to filter out irrelevant documents in the retrieval phase:

yql=select * from doc where (default contains "vespa" OR default contains "documents" OR default contains "about" OR default contains "ranking" OR default contains "and" OR default contains "retrieval")

Using a "weak AND" can address the problems of too few (AND) or too many (OR) hits in the retrieval phase. Think of it as an optimized OR, where the least relevant candidates are discarded from further evaluation. To find the least relevant candidates, a simple scoring function is used:

rank_score = sum_n(term(n).significance * term(n).weight)

As the point of weakAnd is to early discard the worst candidates, totalCount is an approximation:

yql=select * from doc where {scoreThreshold: 0, targetHits: 10}weakAnd( default contains "vespa", default contains "documents", default contains "about", default contains "ranking", default contains "and", default contains "retrieval")

Note that this blurs the distinction between filtering (retrieval) and ranking a little - here the weakAnd does both filtering and ranking to optimize the number of candidates for the later rank phases. The default rank-profile is used:

rank-profile documentation inherits default {
    inputs {
        query(titleWeight): 2.0
        query(contentsWeight): 1.0
    }
    first-phase {
        expression: query(titleWeight) * bm25(title) + query(contentsWeight) * bm25(content)
    }
}

Observe we are here using text matching rank features, which fits well with weakAnd's scoring function that also uses text matching features.

Read more in using weakAnd with Vespa.

Next steps