Learning to Rank with Vespa

This document describes how learning to rank or machine learned ranking (MLR) could be implemented with Vespa. In this document we cover the following topics:

  • Choosing and designing ranking features
  • Gathering features as computed by Vespa
  • Training a ranking model outside of Vespa using XGBoost with a GBDT model
  • Representing the trained model in Vespa
  • Evaluating the model in Vespa

A few commonly used datasets and resources on MLR:

Choosing features

Vespa's rank feature set contains a large set of low-level features, as well as some higher-level features. Depending on the algorithm, it can be a good idea to leave out the un-normalized features, so that learning power is not spent on learning to normalize them and on discovering that they carry the same information as some of the normalized features. In various learning to rank challenges one is given a fixed set of features; for example, in the Microsoft Learning to Rank Datasets (MSLR) each (document, query) pair is represented by a 136-dimensional feature vector. A few samples of features used in the MSLR dataset:

  • Feature 1 : query term matches in body
  • Feature 132 : QualityScore

With the Vespa ranking framework you get to choose which features to use, and you can design your own features on top of Vespa's rich set of built-in rank features and its tensor framework for ranking.

Custom features

When using machine learned ranking, we are searching a function space which is much more limited than the space of functions supported by ranking expressions. We can increase the space of functions available to MLR because the primitive features used in MLR training do not need to be primitive features in Vespa - they can just as well be ranking expression snippets. If there are certain mathematical combinations of features believed to be useful in an application, these can be calculated from the actual primitive features of Vespa and given to the training as primitives. Such primitives can then be replaced textually by the corresponding rank expression snippet, before the learned expression is deployed on Vespa.
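
The textual replacement step described above can be sketched in Python. This is a minimal illustration, not a Vespa tool: the primitive names (titleScore, ctr) and the learned expression are hypothetical, and the helper expand_primitives is introduced only for this example.

```python
import re

def expand_primitives(expression, snippets):
    """Replace each training-time primitive name with its rank expression snippet."""
    # Match whole identifiers only, so 'ctr' does not match inside a longer name.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in snippets) + r")\b")
    # Wrap each snippet in parentheses to preserve operator precedence.
    return pattern.sub(lambda m: "(" + snippets[m.group(1)] + ")", expression)

# Hypothetical primitives and a hypothetical learned expression.
snippets = {
    "titleScore": "fieldMatch(title).completeness * fieldMatch(title).earliness",
    "ctr": "attribute(ctr)",
}
learned = "0.7 * titleScore + 0.3 * ctr"
expanded = expand_primitives(learned, snippets)
print(expanded)
```

The substitution is done in a single pass over the learned expression, so names that occur inside an inserted snippet are not expanded again.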

Vespa supports expression functions. Functions having zero arguments can be used as summary- and rank-features. Some examples of user defined features as functions:

rank-profile training inherits default {
    function titlematch() {
      expression: fieldMatch(title).completeness * pow(0 - fieldMatch(title).earliness, 2)
    }
    # isNan returns 1 when the attribute has no value; fall back to 0 in that case
    function qualityScore() {
      expression: if(isNan(attribute(qualityScore)) == 1, 0, attribute(qualityScore))
    }
    function realtimeCTR() {
      expression: if(isNan(attribute(ctr)) == 1, 0, attribute(ctr))
    }
    # Dot product between the query tensor and the document tensor
    function userInterestScore() {
      expression: sum(query(tensor)*attribute(tensor_attribute))
    }
}

In the examples above we have designed a title text matching feature using the built-in fieldMatch(name) feature. There is also a quality score feature that reads a quality score from a document attribute (a document-only feature), a realtimeCTR feature read from the ctr attribute in the same way, and a tensor dot product computed between a query tensor and a document tensor.

Collecting Features for Training

After deciding which features to use, we need Vespa to compute these features over the set of documents we have judgments for.

"All" ranking features can be included in the results for each document by adding ranking.listFeatures to the query. Since the set of computable features is in general infinite, "all" features really means a large default set. If more rank features than those in the default set are wanted, they can be added to the set in the rank profile:

rank-features {
    # additional rank features to dump, one per line
}

It is also possible to take full control over which features are dumped by adding

ignore-default-rank-features

to the rank profile. This makes the explicitly listed rank features the only ones dumped when requesting rankfeatures in the query.

A simplified rank profile with only a few custom feature functions is shown below. We name this profile training, and it inherits the default first-phase expression of Vespa, which is nativeRank:

rank-profile training inherits default {
  function averageTermSignificance() {
    ...
  }
  function maxTermSignificance() {
    ...
  }
  function averageTitleTermMatch() {
    ...
  }
  function averageDescriptionTermMatch() {
    ...
  }
  rank-features {
    ...
  }
}

This example has four custom rank features (functions) and six native Vespa rank features. It has no document-only or query-only features, which a real model would probably have.

When you have a training set containing judgments for certain documents, it is useful to select those documents in the query by adding a filter term matching the document id, without impacting the values of any rank features. To do this, use the recall parameter to query for the id of a document for which a judgment exists.

Below is an example query API request for the sample query 'hotels close to san jose airport' and the document with id 8444. The document id is a custom field defined in the schema file. The request uses the ranking profile training defined above:

{
  "yql" : "select rankfeatures from sources * where ([{\"grammar\": \"any\"}]userInput(@userQuery));",
  "userQuery": "hotels close to san jose airport",
  "recall": "+id:8444",
  "hits": 1,
  "format": "json",
  "timeout": "5s",
  "ranking" : {
    "profile" : "training",
    "listFeatures": true
  }
}

We limit the recall to the document id with the recall parameter, and we select only the rankfeatures field. We also use grammar 'any' instead of the default 'and' mode when parsing the user input query, because with 'and' the document is not recalled unless all query terms are found in it. Vespa will return a response like this per hit:

{
  "fields": {
    "rankfeatures": {
      "nativeFieldMatch(brand)": 0,
      "nativeFieldMatch(description)": 0.19769285365597042,
      "nativeFieldMatch(title)": 0.12901910388045487,
      "nativeProximity(brand)": 0,
      "nativeProximity(description)": 0.0002378785403068704,
      "nativeProximity(title)": 0.03705999262914951,
      "nativeRank(brand)": 0,
      "nativeRank(description)": 0.08788992146268762,
      "nativeRank(title)": 0.06145960090566322,
      "rankingExpression(averageDescriptionTermMatch)": 0.875,
      "rankingExpression(averageTermSignificance)": 0.19415280564476844,
      "rankingExpression(averageTitleTermMatch)": 0.125,
      "rankingExpression(maxTermSignificance)": 0.6454215031943953
    }
  },
  "id": "index:mycontent/0/cfcd2084e4a5fe79265e37ad",
  "relevance": 0.1,
  "source": "mycontent"
}

Now we have one feature sample, but we need to repeat the above routine for every document for which we have labels or judgments.
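
Repeating the request for every judged pair can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the query API is reachable at an endpoint such as http://localhost:8080/search/ (an assumption, adjust to your deployment), that hits are returned under root.children in the JSON response, and that judgments are available as (query, document id, label) tuples. The helper names are hypothetical.

```python
import json
import urllib.request

def build_query(user_query, doc_id, profile="training"):
    """Build the JSON body of a feature-collection request for one (query, document) pair."""
    return {
        "yql": 'select rankfeatures from sources * where ([{"grammar": "any"}]userInput(@userQuery));',
        "userQuery": user_query,
        "recall": "+id:%s" % doc_id,
        "hits": 1,
        "format": "json",
        "timeout": "5s",
        "ranking": {"profile": profile, "listFeatures": True},
    }

def extract_rankfeatures(response):
    """Return the rankfeatures of the first hit, or None if the document was not recalled."""
    hits = response.get("root", {}).get("children", [])
    if not hits:
        return None
    return hits[0]["fields"]["rankfeatures"]

def fetch_rankfeatures(endpoint, user_query, doc_id):
    """POST one request to the (assumed) query endpoint and return the hit's rank features."""
    body = json.dumps(build_query(user_query, doc_id)).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return extract_rankfeatures(json.load(reply))

def collect(endpoint, judgments):
    """Yield (label, rankfeatures) for every judged pair that Vespa returns."""
    for user_query, doc_id, label in judgments:
        features = fetch_rankfeatures(endpoint, user_query, doc_id)
        if features is not None:
            yield label, features

# Hypothetical judgment list: (query, document id, relevance label).
judgments = [("hotels close to san jose airport", 8444, 3.0)]
```

Calling list(collect("http://localhost:8080/search/", judgments)) against a running instance would then produce one labeled feature sample per judged pair.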

Training the Model

We now need to convert the feature data into a format read by MLR frameworks; in this example we use XGBoost. We need to map the feature names defined in the ranking profile above into a format that XGBoost understands. First we define a feature map, feature-map.txt:

0   nativeFieldMatch(brand) q
1   nativeFieldMatch(description) q
2   nativeFieldMatch(title) q
3   nativeProximity(brand) q
4   nativeProximity(description) q
5   nativeProximity(title) q
6   nativeRank(brand) q
7   nativeRank(description) q
8   nativeRank(title) q
9   rankingExpression(averageDescriptionTermMatch) q
10  rankingExpression(averageTermSignificance) q
11  rankingExpression(averageTitleTermMatch) q
12  rankingExpression(maxTermSignificance) q

The map defines the feature index (e.g. 0), the feature name, and the feature type, where q marks a quantitative feature. Using the map, the scraped rankfeatures and the label (relevance judgment), we can produce a training data set. XGBoost accepts the LibSVM text vector representation. Given a relevance judgment of 3.0 for the sample document above and our feature map definition, the text vector becomes:

3.0 0:0 1:0.19769285365597042 2:0.12901910388045487 3:0 4:0.0002378785403068704 5:0.03705999262914951 6:0 7:0.08788992146268762 8:0.06145960090566322 9:0.875 10:0.19415280564476844 11:0.125 12:0.6454215031943953

Note that all feature indices are present, as Vespa currently does not support the missing split condition of XGBoost; see GitHub issue 9646.
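
The conversion from a rankfeatures object to a LibSVM line can be sketched as follows. The helper names are hypothetical, and the two-feature map is a shortened illustration rather than the full map above. The sketch raises an error when a mapped feature is absent, so that every index is always written.

```python
def parse_feature_map(text):
    """Map feature name -> index from lines of the form '<index> <name> <type>'."""
    index_by_name = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            index_by_name[parts[1]] = int(parts[0])
    return index_by_name

def to_libsvm_line(label, rankfeatures, index_by_name):
    """Format one training vector; every mapped feature must be present."""
    missing = [name for name in index_by_name if name not in rankfeatures]
    if missing:
        raise ValueError("missing features: %s" % ", ".join(missing))
    pairs = sorted((index, rankfeatures[name]) for name, index in index_by_name.items())
    return " ".join([str(label)] + ["%d:%s" % (index, value) for index, value in pairs])

# Shortened, illustrative feature map and feature values.
feature_map = """0 nativeRank(title) q
1 rankingExpression(averageTitleTermMatch) q"""
rankfeatures = {"nativeRank(title)": 0.5, "rankingExpression(averageTitleTermMatch)": 0.125}
line = to_libsvm_line(3.0, rankfeatures, parse_feature_map(feature_map))
print(line)  # 3.0 0:0.5 1:0.125
```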

With a sufficiently large set of vectors we can train a model. In this example we use the XGBoost Python API. We have split the set of vectors into a training set, train.txt, and a test set, test.txt, and our feature map is saved in feature-map.txt. Evaluation methodology (cross-validation, training set size, test set size, etc.) and XGBoost parameter tuning are out of scope for this document. Sample training snippet:

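import xgboost as xgb

# Load the LibSVM-formatted vectors. Recent XGBoost releases may require the
# format to be stated explicitly, as in 'train.txt?format=libsvm'.
dtrain = xgb.DMatrix('train.txt')
dtest = xgb.DMatrix('test.txt')

# Shallow regression trees trained on squared error, with base_score set to 0
param = {'base_score':0,'max_depth':3, 'eta':0.5, 'objective':'reg:squarederror', 'eval_metric':'rmse'}

# Report RMSE on both sets after every boosting round
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 10
bst = xgb.train(param, dtrain, num_round, evals=watchlist)

# Dump the trees as JSON, using the feature map to write Vespa feature names
bst.dump_model("trained-model.json",fmap='feature-map.txt', with_stats=False, dump_format='json')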

We dump the trained model into trained-model.json and now the next step is to deploy this model file.
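
To see what the dumped file contains, the trees can be evaluated by hand. The sketch below assumes the layout of XGBoost's JSON dump (nested nodes with split, split_condition, yes, no, children and leaf keys) and the convention that the yes branch is taken when the feature value is below the split condition. The example tree is hand-written for illustration, not the output of a real training run.

```python
import json

def evaluate_tree(node, features):
    """Walk one tree of an XGBoost JSON dump and return the leaf value reached."""
    while "leaf" not in node:
        # The 'yes' branch is taken when the feature value is below the split condition.
        below = features[node["split"]] < node["split_condition"]
        target = node["yes"] if below else node["no"]
        node = next(child for child in node["children"] if child["nodeid"] == target)
    return node["leaf"]

def evaluate_model(trees, features, base_score=0.0):
    """Sum the leaf values of all trees on top of base_score (squared-error regression)."""
    return base_score + sum(evaluate_tree(tree, features) for tree in trees)

# Hand-written single-tree dump, for illustration only.
example_dump = json.loads("""
[{"nodeid": 0, "depth": 0, "split": "nativeRank(title)", "split_condition": 0.05,
  "yes": 1, "no": 2, "missing": 1,
  "children": [{"nodeid": 1, "leaf": 0.4}, {"nodeid": 2, "leaf": 1.6}]}]
""")
score = evaluate_model(example_dump, {"nativeRank(title)": 0.0614})
print(score)  # 1.6
```

With the real file, the list of trees would be loaded with json.load from trained-model.json, and the feature dictionary would be the rankfeatures object of a hit.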

Representing the model in Vespa

The XGBoost model above can be represented in Vespa by adding a new rank profile which inherits from the training rank profile, so that all the custom functions defined in training are available during evaluation:

rank-profile evaluation inherits training {
  first-phase {
    expression: xgboost("trained-model.json")
  }
}

After deploying the model we can search using it by choosing the rank profile in the search request: ranking.profile=evaluation. The profile above evaluates the trained model for all matching documents, which might be computationally expensive. With phased ranking we can instead use a simplified expression in the first phase over all matching documents and the trained model in a second-phase expression. In this example the first phase uses the two most important features (as learned by XGBoost, obtained with the get_fscore API):

rank-profile phased-evaluation inherits training {
  first-phase {
    expression: (180/225)*nativeRank(title) + (45/225)*nativeFieldMatch(description)
  }
  second-phase {
    expression: xgboost("trained-model.json")
  }
}
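
One plausible way to arrive at first-phase weights like 180/225 and 45/225 is to normalize the importance scores of the top features. The sketch below assumes the scores come from a dictionary of the kind get_fscore returns (feature name to score); the numbers are hypothetical and chosen only to reproduce the weights above, and the helper name is introduced for this example.

```python
def normalized_weights(fscores, top_n=2):
    """Keep the top_n features by score and scale their scores to sum to 1."""
    top = sorted(fscores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    total = sum(score for _, score in top)
    return {name: score / total for name, score in top}

# Hypothetical importance scores.
fscores = {
    "nativeRank(title)": 180,
    "nativeFieldMatch(description)": 45,
    "nativeProximity(title)": 12,
}
weights = normalized_weights(fscores)
expression = " + ".join("%s*%s" % (weight, name) for name, weight in weights.items())
print(expression)  # 0.8*nativeRank(title) + 0.2*nativeFieldMatch(description)
```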


  • Some native features might be computationally expensive, and requesting a single sub-feature such as fieldMatch(name).earliness also requires full computation of all the other sub-features of fieldMatch(name), for example fieldMatch(name).proximity. This is in general true for all native rank features of Vespa.
  • When collecting feature data from Vespa, it is important that the index holds a representative corpus and is not limited to the documents there are judgments for. This ensures that corpus-dependent features like term(n).significance are comparable with those of the production-size corpus. Vespa allows overriding any rank feature, including term(n).significance, using ranking.features.