Weighted Set Query Term

WeightedSetItem is a query term available in the container that searches for a collection of tokens with individual weights. Its semantics are similar to EQUIV, since it acts as a single term in the query. However, because it is restricted to a collection of weighted tokens, the back end can apply specific optimizations that improve performance for large sets of tokens compared to the generic EQUIV or OR operators.

Use Case

This feature is typically used to limit the search result to documents with specific properties that can have a large number of distinct values. Some concrete use cases are:

  • We know who the user is, and want to restrict to documents written by one of the user's friends
  • We have the topic area the user is interested in, and want to restrict to the top-ranked authors for this topic
  • We have recorded the documents that have been clicked by users in the last 10 minutes, and want to search only in these
Note that in most real use cases the field being searched holds some sort of user ID, topic ID, group ID, or document ID, and can often be modeled as a number, usually in a field of type long (or array<long> if multiple values are needed). If a string field is used, it will usually also hold some sort of ID; for string data intended for searching with WeightedSetItem, using match: word for that field is recommended.

How to Introduce WeightedSetItem Terms in Your Application

The decision to use a WeightedSetItem must be made by application-specific logic. This logic must be implemented as a Container plugin that manipulates the query object as follows:

  • Decide if WeightedSetItem should be used
  • Create a new WeightedSetItem for the field you want to use as filter
  • Find the tokens and optionally weights to insert into the set
  • Combine the new WeightedSetItem with the original query by using an AndItem
Here is a simple code example adding a hardcoded filter containing 10 tokens:
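import com.yahoo.prelude.query.AndItem;
import com.yahoo.prelude.query.Item;
import com.yahoo.prelude.query.WeightedSetItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.query.QueryTree;
import com.yahoo.search.searchchain.Execution;

private Result hardCoded(Query query, Execution execution) {
    // Build the filter for the "author" field; each token carries its own weight.
    WeightedSetItem filter = new WeightedSetItem("author");
    filter.addToken("magazine1", 2);
    filter.addToken("magazine2", 2);
    filter.addToken("magazine3", 2);
    filter.addToken("tv", 3);
    filter.addToken("tabloid1", 1);
    filter.addToken("tabloid2", 1);
    filter.addToken("tabloid3", 1);
    filter.addToken("tabloid4", 1);
    filter.addToken("tabloid5", 1);
    filter.addToken("tabloid6", 1);
    // AND the filter with the original query so that both must match.
    QueryTree tree = query.getModel().getQueryTree();
    Item oldroot = tree.getRoot();
    AndItem newtop = new AndItem();
    newtop.addItem(oldroot);
    newtop.addItem(filter);
    tree.setRoot(newtop);
    // Trace at level 2, appending the rewritten query to the message.
    query.trace("FriendFilterSearcher added hardcoded filter: ", true, 2);
    return execution.search(query);
}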
The biggest challenge here is finding the tokens to insert; normally the incoming search request URL should not contain all the tokens directly. For example, the search request could contain only the user ID, and a lookup (in a database or a Vespa index) would then fetch the friends list.
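The lookup step could be sketched roughly as follows. This is only an illustration: the class FriendTokenLookup, its tokensFor method, and the in-memory map standing in for the database or Vespa index are all hypothetical names, not part of the container API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that resolves a user ID to weighted filter tokens. */
final class FriendTokenLookup {

    // Stand-in for the real data source (database or Vespa index); assumption for this sketch.
    private final Map<String, List<String>> friendsByUser;

    FriendTokenLookup(Map<String, List<String>> friendsByUser) {
        this.friendsByUser = friendsByUser;
    }

    /**
     * Returns a token-to-weight map for the given user, giving every friend
     * the same weight. An unknown user yields an empty map.
     */
    Map<String, Integer> tokensFor(String userId, int weight) {
        Map<String, Integer> tokens = new LinkedHashMap<>();
        for (String friend : friendsByUser.getOrDefault(userId, List.of())) {
            tokens.put(friend, weight);
        }
        return tokens;
    }
}
```

Each entry of the returned map would then be passed to filter.addToken(token, weight), in the same way as the hardcoded tokens in the example above. An empty map is a natural signal to skip adding the filter altogether.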

Since the tokens are inserted directly into the query without going through the Search Container query parsing and query handling, they won't be subject to transforms such as lowercasing, stemming, or phrase generation. This means that if the field is a string field you'll need to insert lowercased tokens only, and only single tokens in the index can be matched.
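Because the lowercasing is left to the application, it may help to normalize tokens in one place before they are added. The following is a minimal sketch under stated assumptions: FilterTokenNormalizer and its normalize method are hypothetical names, and keeping the highest weight when two tokens collide after lowercasing is an arbitrary choice made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that lowercases tokens destined for a string field. */
final class FilterTokenNormalizer {

    private FilterTokenNormalizer() {}

    /**
     * Returns a copy of the input with every token lowercased.
     * Empty tokens are dropped; if two tokens become equal after
     * lowercasing, the higher weight is kept (assumption of this sketch).
     */
    static Map<String, Integer> normalize(Map<String, Integer> raw) {
        Map<String, Integer> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : raw.entrySet()) {
            // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
            String token = entry.getKey().toLowerCase(Locale.ROOT);
            if (token.isEmpty()) {
                continue;
            }
            normalized.merge(token, entry.getValue(), Integer::max);
        }
        return normalized;
    }
}
```

The normalized entries can then be added with filter.addToken(token, weight). For numeric ID fields this step is unnecessary.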

For more examples of how the code might look, see the container Javadoc.