Using Wand with Vespa

This document describes how to use Weak And/WAND for complex type any queries. This document uses conceps and data set from the blog search tutorial

Both Wand query operator implementations in Vespa

  • accepts the targeted number of hits to be returned
  • accepts a score threshold

weakAnd

The weakAnd query operator accepts terms searching over multiple fields and also logical conjuctions using OR/AND etc. It is a search operator that implements the Weak AND/Weighted AND algorithm.

WeakAnd is also called Vespa Wand in this documentation.

When using weakAnd via YQL or a Java Searcher plugin, specify the target for minimum number of hits the operator should produce. The effect of tuning targetHits may not be very intuitive. To ensure that you get the best hits possible with a weakAnd, set the target number somewhat higher than the number of hits returned to the user; setting it 10 times higher should be more than enough. The reason for increasing the target number is that weakAnd uses a very simple ranking function internally to filter away bad hits. If the normal rank expression correlates poorly with this internal filtering formula, increase the target number. The easiest is probably to use the default number (100) and see if that gives good results, only tuning if you see a problem. Also note that if the rank correlates poorly with the filtering criteria, weakAnd may not be a good operator.

Anything similar to classic vector model ranking will work well with weakAnd. We suggest using the nativeFieldMatch or nativeDotProduct feature as a starting point. Note that because weakAnd relies on feedback identifying which hits are used for first phase ranking to increase its threshold for what's considered a good hit, the special unranked rank profile (which turns off ranking completely) may cause weakAnd queries to become slower than a more normal ranking.

With the complete blog search data set, we could imagine a user query blogs about adidas joggers or white nike's or reebook which could be represented in the Vespa Query Language as:

select * from blog_post where (
  default contains "blog" OR default contains "about" OR 
  default contains "adidas" OR default contains "joggers" OR default contains "or" OR 
  default contains "white" OR 
  default contains phrase("nike", "s") OR default contains "or" OR default contains "reebook"
);
The query would recall about 462,857 documents out of a total corpus of 1,196,110 documents, which equals roughly 39% of the total corpus. Vespa will rank all those recalled documents using the configured first-phase ranking function. Using the weakAnd query operator the query could look like:
select * from blog_post where (
  [{"targetHits": 200}] 
    weakAnd(
      default contains "blog", default contains "about", default contains "adidas", 
      default contains "joggers", default contains "or", default contains "white", 
      default contains phrase("nike", "s"), default contains "reebook"
    )
);
We specify that the target number of hits (top k) should be 200 (Default 100) and this number is used per content node. In the case of weakAnd, Vespa returns a total of 3342 documents (out of the actual 462,857 matching the logical disjunction). Those 3342 documents are evaluated using the configured first-phase ranking expression as configured in the rank-profile. The weakAnd implementation scores documents by a simplified scoring function, which uses two core text rank features: Both features could be overridden in the query using annotations:
select * from blog_post where (
  [{"targetHits": 200}] 
    weakAnd(
      default contains "blog", default contains "about", default contains ([{"weight":200}]"adidas")
    )
);
The inner scoring function of the weakAnd operator is:
rank_score = sum_n(term(n).significance * term(n).weight) 
Documents that could not potentially compete with any of the hits already in heap (size targetHits) of top hits are skipped, while the weakAnd implementation still exposes all hits which were evaluated to the first phase ranking function, and not only the top k hits.

Limitations

In the blog search tutorial, a document popularity ranking feature is introduced and which is used in the first-phase ranking function in addition to text matching features:

rank-profile post_popularity inherits default {
    first-phase {
        expression: nativeRank(title, content) + 10 * if(isNan(attribute(popularity)), 0, attribute(popularity))
    }
}
As described earlier, the weakAnd inner ranking function is only considering the term(n).weight and term(n).significance. It has no knowledge of the first-phase function in use. With the above ranking function, the correlation between the simple inner weakAnd scoring and the first phase function would be weak, and the most popular blog posts might not appear at all in the top k weakAnd scored hits.

Benefits

  • It's accepts any field and query terms can search different fields and has great performance.
  • It's integrated with linguistics so whatever stemming and linguistic settings is configured is applied including also bolding of query terms.
  • It is using the same query terms for other more advanced ranking expressions like nativeRank/nativeFieldMatch.

wand

The wand query operator works over a single field only and it accepts a list of features or pre-processed text tokens with weights. It allows full control over both query side weights and document side weights and it's guaranteed that it will find the top K best hits ranked by the dotProduct between the query and the document side feature vectors.

WeakAnd is also called Vespa Wand in this documentation.

See multivalue query operators for how to use wand. Note that this operator does not integrate with linguistic processing. The Vespa rank query operator can be used to create a query tree, which includes a bag of features used in the wand for recall and also other query terms to produce ranking features for first-phase/second-phase ranking. The example below uses the rank operator to pass a query term searching the title field for phrase. This phrase does not impact recall, but creates ranking features for first-phase or second-phase ranking.

select ignoredfield from ignoredsource where rank(
    ([ {"targetHits": 25} ]
      wand(car_types, {"pagani":400,"lamborghini":300,"maserati":250,"ferrari":150,"lancia":50,"alfa":40,"fiat":30})),
      title contains phrase("italian", "car","makers"));

wand example

The field to search must be a weighted set attribute with fast-search. Refer to the wand reference.

Lets imagine we have an application where we would like to search for popular blogs about cars and we have the following schema file. The weighted set attribute car_types is used to tag an article with which car types that are discussed in that article, while the integer attribute popularity is used to track the static popularity of that article. We want to search the car_types field using wand.

We also have two rank profiles: dotproductonly uses the raw dot product score that is calculated by wand, while combined_score combines the dot product score with the static popularity score:

schema article {
  document article {
    field car_types type weightedset<string> {
      indexing: attribute
      attribute: fast-search
    }
    field popularity type int {
      indexing: attribute | summary
    }
  }
  rank-profile dotproductonly {
    first-phase {
      expression: rawScore(car_types)
    }
  }
  rank-profile combined_score {
    first-phase {
      expression: rawScore(car_types) + attribute(popularity)
    }
  }
}
Lets imagine a particular user that is interested in Italian cars, especially sports cars. This user would mainly like to get blogs discussing Italian sports cars, but blogs about more usual Italian cars could also be returned. In the following example we see how to use YQL to create a wand for the car_types field with 25 as the target number of hits to produce. Notice that the weights for sports cars are higher than for others.
yql=select ignoredfield from ignoredsource where [ {"targetHits": 25} ]
 wand(car_types, {"pagani":400,"lamborghini":300,"maserati":250,"ferrari":150,"lancia":50,"alfa":40,"fiat":30});
The same can be added in a Searcher plugin using WandItem:
import com.yahoo.prelude.query.*;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.query.QueryTree;
import com.yahoo.search.searchchain.Execution;

private Result hardCoded(Query query, Execution execution) {
    WandItem filter = new WandItem("car_types", 25);
    filter.addToken("pagini", 400);
    filter.addToken("lamborghini", 300);
    filter.addToken("maserati", 250);
    filter.addToken("ferrari", 150);
    filter.addToken("lancia", 50);
    filter.addToken("alfa", 40);
    filter.addToken("fiat", 30);
    QueryTree tree = query.getModel().getQueryTree();
    Item oldroot = tree.getRoot();
    AndItem newtop = new AndItem();
    newtop.addItem(oldroot);
    newtop.addItem(filter);
    tree.setRoot(newtop);
    query.trace("added hardcoded filter: ", true, 2);
    return execution.search(query);
}

Ranking notes

Note that when using the default rank profile in this example, we are guaranteed to get the top-k hits (according to the raw dot product score) in the final search result. This is however not the case when using the combined_score rank profile. In this case the top-k hits (according to the raw dot product score) are also returned from the wand in the match phase, but in the ranking phase we also consider the static popularity score which may alter what is the final top-k hits according to the expression in the rank profile. This means that when combining the raw dot product score of the wand with something else (in either a first phase or second phase rank expression), you must ensure that these rank scores correlate somewhat. When combining scores like this you also should tune the targetHits used by the wand.

If combining wand with a ranking expression that does not use its raw score, the tuning of targetHits should follow the weakAnd guide lines. A score threshold can be specified when using wand. The internal dot product score for a document must be larger than scoreThreshold in order to be considered a match. Default value is 0.0. Refer to the reference for details.