Multivalue Query Operators

This article is a follow up to the ranking introduction. Some use cases in this guide are better solved using tensors.

dotProduct and wand

It might make sense to start out using dotProduct and later switch to wand if the performance gain outweighs the reduction in flexibility and correctness. Also note that benchmarking should be involved in such a switch, to quantify the possible gain in performance (which might be negative).

wand (aka Parallel Wand) is a search operator that can be used for efficient top-k retrieval. It implements the Weak AND/Weighted AND algorithm as described by Broder et al in Efficient query evaluation using a two-level retrieval process. See using Wand with Vespa for details.

wand, dotProduct and weightedSet model very similar operations in the search core. They have recall behavior similar to or, but only return a limited number of documents.

dotProduct is the brute force equivalent to wand. They are both used to search for documents where weighted tokens in a field matches a subset of weighted tokens in the query. The raw scores produced by dotProduct are equivalent to those produced by wand. In some simple cases dotProduct is preferable to wand.

The difference is that wand will perform local optimizations in order to retrieve the top-k results that would be returned by dotProduct. This optimization will only yield correct results if the overall ranking is equal to the score produced by the dotProduct itself.

Cases where dotProduct might be better / more correct than wand:

  • Might be more efficient with few tokens or few total results
  • Works better with arbitrary rank expressions and compound queries
  • Works better with grouping
  • Scales better when partitioning the problem space

Use cases for a Wand operator arise when you have many query terms that you would like to see matched with OR semantics, but you are only interested in a small number of top hits for the query. With the normal OR operator the cost rises very quickly with the number of terms, so five terms inside OR may already be "many": a simple rule of thumb is that the evaluation cost is proportional to the number of hits the OR yields. A Wand operator, in contrast, has a target for the number of hits you want it to produce, and may throw away hits that are not good enough once it has produced enough hits.

If you are only interested in the best 100 hits (or less) from an OR, a Wand operator should be a good match. The basic idea is that after seeing enough hits, the Wand algorithm will require that a document match more terms (or more relevant terms) in the query. The operator may go all the way to requiring match on all terms if that gives many hits. Note that since the behavior is adaptive you will always get at least the target number of hits, usually many more. Example use cases:

  • Make recommendations based on user history
  • Find similar documents
  • Find extra content for a web page that is relevant to the text on the page
Note that in the cases above you would extract a large number of words from user history, from a document, or from some text you are going to display, and then make a query with all those words as weak criteria, looking for documents that contain as many of those words as possible, optionally with extra weight on those words that are considered more important.

dotProduct example

Refer to the dotProduct reference. dotProduct calculates the dot product of a weighted set in the query and a weighted set in a field - and stores the result in raw scores, which is used in ranking expressions.

Use a weighted set field (use attribute with fast-search for higher performance) in the document to store the tokens:

field features type weightedset<string> {
    indexing: summary | attribute
    attribute: fast-search
}
The query needs to be prepared by a custom searcher or sent using YQL. The code below shows the relevant part. If using multiple dot products in the same query it is a good idea to label them. This enables us to use individual dot product scores when ranking results later.
Item makeDotProduct(String label, String field, Map<String, Integer> token_map) {
    DotProductItem item = new DotProductItem(field);
    item.setLabel(label);
    for (Map.Entry<String, Integer> entry : token_map.entrySet()) {
        item.addToken(entry.getKey(), entry.getValue());
    }
    return item;
}
dotProduct produces raw scores that can be used in a ranking expression. The simplest approach is to use the sum of all raw scores for the field containing the tokens:
rank-profile default {
    first-phase {
        expression: rawScore(features)
    }
}
For better control, label each dot product in the query and use their scores individually:
rank-profile default {
    first-phase {
        expression: itemRawScore(dp1) + itemRawScore(dp2)
    }
}

weightedSet example

Refer to the weightedSet reference. The use cases for weightedSet are for limiting the search result to documents with specific properties that can have a large number of distinct values, like:

  • We know who the user is, and want to restrict to documents written by one of the user's friends
  • We have the topic area the user is interested in, and want to restrict to the top-ranked authors for this topic
  • We have recorded the documents that have been clicked by users in the last 10 minutes, and want to search only in these
Note that in most actual use cases, the field we are searching in is some sort of user ID, topic ID, group ID, or document ID and can often be modeled as a number - usually in a field of type long (or array<long> if multiple values are needed). If a string field is used, it will usually also be some sort of ID; if you have data in a string field intended for searching with WeightedSetItem, then using match: word for that field is recommended.

The decision to use a WeightedSetItem must be taken by application-specific logic. This must be in the form of a Container plugin where the query object can be manipulated as follows:

  • Decide if WeightedSetItem should be used
  • Create a new WeightedSetItem for the field you want to use as filter
  • Find the tokens and optionally weights to insert into the set
  • Combine new WeightedSetItem with the original query by using an AndItem
A simple code example adding a hardcoded filter containing 10 tokens:
private Result hardCoded(Query query, Execution execution) {
    WeightedSetItem filter = new WeightedSetItem("author");
    filter.addToken("magazine1", 2);
    filter.addToken("magazine2", 2);
    filter.addToken("magazine3", 2);
    filter.addToken("tv", 3);
    filter.addToken("tabloid1", 1);
    filter.addToken("tabloid2", 1);
    filter.addToken("tabloid3", 1);
    filter.addToken("tabloid4", 1);
    filter.addToken("tabloid5", 1);
    filter.addToken("tabloid6", 1);
    QueryTree tree = query.getModel().getQueryTree();
    Item oldroot = tree.getRoot();
    AndItem newtop = new AndItem();
    newtop.addItem(oldroot);
    newtop.addItem(filter);
    tree.setRoot(newtop);
    query.trace("FriendFilterSearcher added hardcoded filter: ", true, 2);
    return execution.search(query);
}
The biggest challenge here is finding the tokens to insert; normally the incoming search request URL should not contain all the tokens directly. For example, the search request could contain the user id, and a lookup (in a database or a Vespa index) would fetch the friends list.

Since the tokens are inserted directly into the query without going through the Search Container query parsing and query handling, they won't be subject to transforms such as lowercasing, stemming, or phrase generation. This means that if the field is a string field you'll need to insert lowercased tokens only, and only single tokens in the index can be matched.

For more examples on how the code might look there is container javadoc available.

Raw scores and query item labeling

Vespa ranking is flexible and relatively decoupled from document matching. The output from the matching pipeline typically indicates how the different words in the query matches a specific document and lets the ranking framework figure out how this translates to match quality.

However, some of the more complex match operators will produce scores directly, rather than expose underlying match information. A good example is the wand operator. During ranking, a wand will look like a single word that has no detailed match information, but rather a numeric score attached to it. This is called a raw score, and can be included in ranking expressions using the rawScore feature.

The rawScore feature takes a field name as parameter and gives the sum of all raw scores produced by the query for that field. If more fine-grained control is needed (the query contains multiple operators producing raw scores for the same field, but we want to handle those scores separately in the ranking expression), the itemRawScore feature may be used. This feature takes a query item label as parameter and gives the raw score produced by that item only.

Query item labeling is a generic mechanism that can be used to attach symbolic names to query items. A query item is labeled by using the setLabel method on a query item in the search container query API.