Search Coverage Degradation

Ideally one want to search all data indexed in a Vespa cluster within the specified timeout but that might not be possible due to several different reasons:

  • The system might be overloaded due to capacity constraints and queries does does not complete within the timeout as they are sitting in the queue (Little's Law)
  • A complex query might take longer time to execute than the specified timeout or the timeout is too low given the complexity of the query and available capacity
This document describes how Vespa could gracefully degrade the result set if the query cannot be completed within the timeout specified.

Definitions:

  • Coverage The percentage of documents indexed which were searched by the query. Ideally we want coverage to be 100%
  • Timeout The total time a query is allowed to run for. Vespa is a distributed system where multiple components are involved in the query execution
  • Soft Timeout Soft timeout is a Vespa configuration switch which allows coverage to be less than 100% but bigger than 0% if the query is going to time out. E.g return what you got when time budget is used

Query timeout and soft timeout

Vespa's default query timeout is 500 ms in Vespa 7 and 5000 ms in Vespa 6 and soft timeout is enabled per default in Vespa 7.

Reference documentation:

The timeout specifies the overall timeout of the query execution and can be configured and also overridden by the timeout run time request parameter or defined in a query profile so that different classes of queries can have a different latency budget/timeout.

The softtimeout setting controls what the content nodes should do in the case where the latency budget has almost been used (timeout times a factor). Return the documents recalled and ranked with the first phase function within the time used or simply don't produce a result. Without softtimeout enabled the Vespa container will return a 504 timeout without any results while with enabled it will return the documents matched and ranked up until the timeout was reached with a 200 OK response along with the reason the result set was degraded. The container might respond with a timeout error with http response code 504 even with softtimeout enabled if the timeout is set so low that that the query does not make it to the content nodes or the container does not have any time left after input and query processing to dispatch the query to the content clusters.


How to tell if the query was degraded and coverage was reduced?

The default JSON renderer template will always render a coverage element below the root element which will contain a degraded element if the query execution was degraded in some way and the coverage field will be less than 100. An example is given below for a request with a query timeout of 200 ms and ranking.softtimeout.enable=true:

/search/?searchChain=vespa&yql=select * from sources * where foo contains bar;&presentation.format=json&timeout=200ms&ranking.softtimeout.enable=true

{
    "root": {
        "coverage": {
            "coverage": 99,
            "degraded": {
                "adaptive-timeout": false,
                "match-phase": false,
                "non-ideal-state": false,
                "timeout": true
            },
            "documents": 167006201,
            "full": false,
            "nodes": 11,
            "results": 1,
            "resultsFull": 0
        },
        "fields": {
            "totalCount": 16469732
        }
    }
}

In this example the result was delivered in 200 ms but the query was degraded as coverage is less than 100 (not all available documents searched and ranked within the specified timeout of 200ms). In this case 167006201 out of x documents where searched and 16469732 documents where matched and ranked using the first-phase ranking expression in the default ranking profile. The degraded field contains the following fields which explains why the result had coverage less than 100:

  • adaptive-timeout is true if Adaptive coverage has been enabled and one or more nodes fail to produce a result at all within the timeout. This could be caused by nodes with degraded hardware making them slower than others in the cluster
  • match-phase is true if the ranking profile in use has defined match phase ranking degradation match-phase can be used to control which documents gets ranked within the timeout
  • non-ideal-state is true in cases where the system is not in ideal state
  • timeout is true if softtimeout was enabled and not all documents could be matched & ranked with the first phase ranking function withing the timeout

Note that the degraded reasons are not mutually exclusive. In our example the softtimeout was triggered and only 99% of of the documents where searched before the time budget was used. One could imagine scenarios where 10 out of 11 nodes involved in the query execution were healthy and triggered soft timeout and delivered a result while the last node was in a bad state (e.g hw issues) and could not produce a result at all and that would cause both timeout and adaptive-timeout to be true.

If you are working on Results in a custom custom searcher you can get the coverage information programmatically:

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        Coverage  coverage = result.getCoverage(false);
        if (coverage != null && coverage.isDegraded()) {
            logger.warning("Got a degraded result for query " + query + " : " + coverage.getResultPercentage() + "% was searched");
        }
        return result;
    }