Multivalue Query Operators

This article is a follow-up to the ranking introduction. Some use cases in this guide are better solved using tensors.

dotProduct and wand

wand (aka Parallel Wand) is a search operator that can be used for efficient top-k retrieval. It implements the Weak AND/Weighted AND algorithm as described by Broder et al. in "Efficient query evaluation using a two-level retrieval process". See using Wand with Vespa for details.

dotProduct is the brute-force equivalent of wand. Both are used to search for documents where weighted tokens in a field match a subset of weighted tokens in the query. The raw scores produced by dotProduct are equivalent to those produced by wand.

The difference is that wand performs local optimizations in order to retrieve the top-k (targetHits) results that a maximum inner product search with dotProduct would return. Which of the two is more cost-efficient is hard to predict, as it depends on the size of the vocabulary (features) and:

  • Number of query terms and their weight distribution
  • Number of document terms and their weight distribution

It is easy to compare the two approaches: run benchmarks using each and compare latency and total number of hits. If, on average, the total number of hits approaches the total number of documents matching the other filters in the query, it is cheaper to use a tensor dot product.
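To make the top-k semantics concrete, the following is a minimal sketch of the brute-force side of the comparison: score every document with a dot product and keep only the targetHits best. The class name `BruteForceTopK` and the in-memory document map are assumptions made for illustration. This is not how Vespa implements either operator; wand aims to return the same hits while skipping documents that cannot reach the top-k.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustration only: brute-force top-k over dot product scores (hypothetical helper).
final class BruteForceTopK {

    // Returns the IDs of the targetHits best-scoring matching documents, best first.
    static List<String> topK(Map<String, Integer> query,
                             Map<String, Map<String, Integer>> docs,
                             int targetHits) {
        Comparator<Map.Entry<String, Long>> byScore = Map.Entry.comparingByValue();
        // Min-heap: the weakest of the hits kept so far sits at the head.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<String, Map<String, Integer>> doc : docs.entrySet()) {
            long score = 0;
            boolean matched = false;
            for (Map.Entry<String, Integer> term : query.entrySet()) {
                Integer weight = doc.getValue().get(term.getKey());
                if (weight != null) {
                    matched = true;
                    score += (long) term.getValue() * weight;
                }
            }
            if (!matched) continue; // no shared token, so the document is not a hit
            heap.add(Map.entry(doc.getKey(), score));
            if (heap.size() > targetHits) heap.poll(); // evict the current weakest hit
        }
        List<Map.Entry<String, Long>> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Long> hit : best) ids.add(hit.getKey());
        return ids;
    }
}
```

Every document is scored here, which is why the brute-force approach becomes attractive only when most documents would end up as hits anyway.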

dotProduct example

Refer to the dotProduct reference. dotProduct calculates the dot product of a weighted set in the query and a weighted set in a field, and stores the result as a raw score, which can be used in ranking expressions.
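As a worked illustration of the arithmetic, the sketch below computes the dot product of two weighted sets represented as plain Java maps. The class `WeightedSetMath` is hypothetical and only mirrors the calculation described above; it is not Vespa's implementation.

```java
import java.util.Map;

// Illustration only: the score a dot product over two weighted sets yields.
final class WeightedSetMath {

    // Sum of queryWeight * fieldWeight over the tokens present in both sets.
    static long dotProduct(Map<String, Integer> query, Map<String, Integer> field) {
        long sum = 0;
        for (Map.Entry<String, Integer> entry : query.entrySet()) {
            Integer fieldWeight = field.get(entry.getKey());
            if (fieldWeight != null) {
                sum += (long) entry.getValue() * fieldWeight;
            }
        }
        return sum;
    }
}
```

For a query set {a:2, b:3, c:5} and a field set {a:10, c:1, d:7}, only a and c are shared, so the result is 2*10 + 5*1 = 25.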

Use a weighted set field (use attribute with fast-search for higher performance) in the document to store the tokens:

field features type weightedset<string> {
    indexing: summary | attribute
    attribute: fast-search
}

The query needs to be prepared by a custom searcher or sent using YQL. The code below shows the relevant part. If using multiple dot products in the same query it is a good idea to label them. This enables us to use individual dot product scores when ranking results later.

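import com.yahoo.prelude.query.DotProductItem;
import com.yahoo.prelude.query.Item;
import java.util.Map;

// Builds a labeled dotProduct item over the given field from token -> weight pairs.
Item makeDotProduct(String label, String field, Map<String, Integer> tokenMap) {
    DotProductItem item = new DotProductItem(field);
    item.setLabel(label);  // the label is what itemRawScore(label) refers to in ranking
    for (Map.Entry<String, Integer> entry : tokenMap.entrySet()) {
        item.addToken(entry.getKey(), entry.getValue());
    }
    return item;
}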
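When the query is sent using YQL instead, the same labeled item has to be expressed as a query string. The helper below is a sketch that assembles such a clause with plain string handling. It assumes the YQL `dotProduct(field, {"token":weight, ...})` operator and the `label` annotation in the `({label:"..."}operator(...))` form; check both against the YQL reference for your Vespa version. The class name `DotProductYql` and its escaping rule are hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Sketch: builds a labeled dotProduct clause for use in a YQL where-expression.
final class DotProductYql {

    // Produces e.g. ({label:"dp1"}dotProduct(features, {"a":1,"b":2}))
    static String dotProduct(String label, String field, Map<String, Integer> tokens) {
        StringJoiner weights = new StringJoiner(",", "{", "}");
        // A TreeMap gives a stable token order, which keeps the output predictable.
        for (Map.Entry<String, Integer> entry : new TreeMap<>(tokens).entrySet()) {
            weights.add("\"" + escape(entry.getKey()) + "\":" + entry.getValue());
        }
        return "({label:\"" + escape(label) + "\"}dotProduct(" + field + ", " + weights + "))";
    }

    // Assumption: backslash escaping of quotes and backslashes inside string literals.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Building the clause from a map keeps token handling in one place, which helps when several dot products with different labels are combined in one query.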

dotProduct produces raw scores that can be used in a ranking expression. The simplest approach is to use the sum of all raw scores for the field containing the tokens:

rank-profile default {
    first-phase {
        expression: rawScore(features)
    }
}

For better control, label each dot product in the query and use their scores individually:

rank-profile default {
    first-phase {
        expression: itemRawScore(dp1) + itemRawScore(dp2)
    }
}

weightedSet example

Refer to the weightedSet reference. weightedSet is used to limit the search result to documents with specific properties that can have a large number of distinct values, for example:

  • We know who the user is, and want to restrict to documents written by one of the user's friends
  • We have the topic area the user is interested in, and want to restrict to the top-ranked authors for this topic
  • We have recorded the documents that have been clicked by users in the last 10 minutes, and want to search only in these

Using a weightedSet is more performant than a big OR expression:

select * from data where category = 'cat1' OR category = 'cat2'..

See multi-lookup set filtering for details.

Note that in most actual use cases, the field we are searching in is some sort of user ID, topic ID, group ID, or document ID and can often be modeled as a number - usually in a field of type long (or array<long> if multiple values are needed). If a string field is used, it will usually also be some sort of ID; if you have data in a string field intended for searching with WeightedSetItem, then using match: word for that field is recommended.

The decision to use a WeightedSetItem must be taken by application-specific logic. This must be in the form of a Container plugin where the query object can be manipulated as follows:

  • Decide if WeightedSetItem should be used
  • Create a new WeightedSetItem for the field you want to use as filter
  • Find the tokens and optionally weights to insert into the set
  • Combine new WeightedSetItem with the original query by using an AndItem

A simple code example adding a hardcoded filter containing 10 tokens:

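import com.yahoo.prelude.query.AndItem;
import com.yahoo.prelude.query.Item;
import com.yahoo.prelude.query.WeightedSetItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.query.QueryTree;
import com.yahoo.search.searchchain.Execution;

private Result hardCoded(Query query, Execution execution) {
    // Filter on the "author" field: token plus weight for each allowed value
    WeightedSetItem filter = new WeightedSetItem("author");
    filter.addToken("magazine1", 2);
    filter.addToken("magazine2", 2);
    filter.addToken("magazine3", 2);
    filter.addToken("tv", 3);
    filter.addToken("tabloid1", 1);
    filter.addToken("tabloid2", 1);
    filter.addToken("tabloid3", 1);
    filter.addToken("tabloid4", 1);
    filter.addToken("tabloid5", 1);
    filter.addToken("tabloid6", 1);
    // Combine the original query with the filter: AND(original root, filter)
    QueryTree tree = query.getModel().getQueryTree();
    Item oldroot = tree.getRoot();
    AndItem newtop = new AndItem();
    newtop.addItem(oldroot);
    newtop.addItem(filter);
    tree.setRoot(newtop);
    // Trace the rewritten query at trace level 2
    query.trace("FriendFilterSearcher added hardcoded filter: ", true, 2);
    return execution.search(query);
}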

The biggest challenge here is finding the tokens to insert; normally the incoming search request URL should not contain all the tokens directly. For example, the search request could contain the user id, and a lookup (in a database or a Vespa index) would fetch the friends list.
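A minimal sketch of such a lookup is shown below. `FriendLookup` and `InMemoryFriendLookup` are hypothetical names introduced only for illustration; a real implementation might query a database or a Vespa index instead of an in-memory map.

```java
import java.util.List;
import java.util.Map;

// Hypothetical abstraction: resolves a user ID to the tokens to filter on.
interface FriendLookup {
    List<String> friendsOf(String userId);
}

// In-memory stand-in for a database or index lookup.
final class InMemoryFriendLookup implements FriendLookup {
    private final Map<String, List<String>> friendsByUser;

    InMemoryFriendLookup(Map<String, List<String>> friendsByUser) {
        this.friendsByUser = Map.copyOf(friendsByUser);
    }

    @Override
    public List<String> friendsOf(String userId) {
        // Unknown users get an empty list, so no tokens are added for them.
        return friendsByUser.getOrDefault(userId, List.of());
    }
}
```

In a searcher, each returned ID would be passed to `addToken` on the WeightedSetItem in place of the hardcoded tokens. If the list is empty, the application has to decide whether to skip the filter or return no hits.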

Since the tokens are inserted directly into the query without going through the Search Container query parsing and query handling, they won't be subject to transforms such as lowercasing, stemming, or phrase generation. This means that if the field is a string field you'll need to insert lowercased tokens only, and only single tokens in the index can be matched.
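Because no query-side transforms are applied, the tokens can be normalized in application code before they are added. The sketch below lowercases and deduplicates tokens for a string field. `TokenNormalizer` is a hypothetical helper, and using `Locale.ROOT` is an assumption; it may not match the indexing-side lowercasing exactly for all characters.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch: prepares raw tokens for insertion into a WeightedSetItem on a string field.
final class TokenNormalizer {

    // Trims, lowercases, and deduplicates tokens while preserving first-seen order.
    static Set<String> normalize(List<String> rawTokens) {
        Set<String> normalized = new LinkedHashSet<>();
        for (String raw : rawTokens) {
            String token = raw.trim().toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                normalized.add(token);
            }
        }
        return normalized;
    }
}
```

Numeric ID fields do not need this step, which is one more reason to model the filter field as a number where possible.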

For more examples of how the code might look, see the container javadoc.

Raw scores and query item labeling

Vespa ranking is flexible and relatively decoupled from document matching. The output from the matching pipeline typically indicates how the different words in the query match a specific document and lets the ranking framework figure out how this translates to match quality.

However, some of the more complex match operators will produce scores directly, rather than expose underlying match information. A good example is the wand operator. During ranking, a wand will look like a single word that has no detailed match information, but rather a numeric score attached to it. This is called a raw score, and can be included in ranking expressions using the rawScore feature.

The rawScore feature takes a field name as parameter and gives the sum of all raw scores produced by the query for that field. If more fine-grained control is needed (the query contains multiple operators producing raw scores for the same field, but we want to handle those scores separately in the ranking expression), the itemRawScore feature may be used. This feature takes a query item label as parameter and gives the raw score produced by that item only.
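The relationship between the two features can be modeled with a few lines of plain Java. This is a toy model of the semantics described above, not Vespa code; `RawScoreModel` and `ScoredItem` are hypothetical names, and returning 0 for an unknown label is an assumption of the model.

```java
import java.util.List;

// Toy model of the rawScore and itemRawScore semantics for a single document.
final class RawScoreModel {

    // One score-producing query item: its label, the field it searches, its raw score.
    static final class ScoredItem {
        final String label;
        final String field;
        final double rawScore;

        ScoredItem(String label, String field, double rawScore) {
            this.label = label;
            this.field = field;
            this.rawScore = rawScore;
        }
    }

    // rawScore(field): sum of the raw scores of all items searching that field.
    static double rawScore(List<ScoredItem> items, String field) {
        double sum = 0;
        for (ScoredItem item : items) {
            if (item.field.equals(field)) sum += item.rawScore;
        }
        return sum;
    }

    // itemRawScore(label): raw score of the single item carrying that label.
    // Returning 0 for an unknown label is an assumption of this model.
    static double itemRawScore(List<ScoredItem> items, String label) {
        for (ScoredItem item : items) {
            if (label.equals(item.label)) return item.rawScore;
        }
        return 0;
    }
}
```

With two items labeled dp1 (score 12) and dp2 (score 30) on the field features, rawScore(features) is 42 while itemRawScore(dp2) is 30, so the ranking expression can weight the two items differently.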

Query item labeling is a generic mechanism that can be used to attach symbolic names to query items. A query item is labeled by using the setLabel method on a query item in the search container query API.