Metrics and Health API reference

This is the reference documentation for the metrics and health APIs in Vespa. Refer to the example overview of two nodes running Vespa to see where the APIs are set up and how they interact:

Metrics interfaces

/metrics/v2/values

The API is available on every node running a container, at http://host:port/metrics/v2/values

Port is the same as the container's query/feed endpoint, default 8080.

The Vespa container exposes a selected set of metrics for every service on all nodes in the application. The metrics can be retrieved for the whole application or for a single node, to allow integration with graphing and alerting services. For example, the metrics API can be used to pull Vespa metrics into CloudWatch using an AWS Lambda function.

The response is a nodes list (see example output below), where each element represents a node in the application and contains:

  • The node's hostname.
  • The node's role in the Vespa application.
  • A node element containing the node's system metrics, e.g. CPU usage.
  • A services list containing metrics for the node's services. The format of this list is described below.
$ curl http://localhost:8080/metrics/v2/values
{
    "nodes": [
        {
            "hostname": "vespa-container",
            "role": "hosts/vespa-container",
            "services": [
                {
                    "name": "vespa.container",
                    "timestamp": 1634127924,
                    "status": {
                        "code": "up",
                        "description": "Data collected successfully"
                    },
                    "metrics": [
                        {
                            "values": {
                                "memory_virt": 3685253120,
                                "memory_rss": 1441259520,
                                "cpu": 29.1900152827305
                            },
                            "dimensions": {
                                "serviceId": "container"
                            }
                        },
                        {
                            "values": {
                                "jdisc.gc.ms.average": 0
                            },
                            "dimensions": {
                                "gcName": "G1OldGeneration",
                                "serviceId": "container"
                            }
                        },
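
The request above can also be issued from Java. The following is a minimal sketch, assuming Java 11 or later and a container listening on the host and port given on the command line. The class name MetricsV2Fetcher and its helper methods are hypothetical and not part of Vespa. The response body is returned as raw JSON text, to be parsed with a JSON library of choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class, not part of Vespa.
public class MetricsV2Fetcher {

    /** Builds the /metrics/v2/values URI for a container host and port. */
    static URI valuesUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/metrics/v2/values");
    }

    /** Fetches the raw JSON response body from the metrics API. */
    static String fetch(String host, int port) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(valuesUri(host, port))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Only contacts the network when host and port are given explicitly.
            System.out.println(fetch(args[0], Integer.parseInt(args[1])));
        } else {
            System.out.println("Usage: MetricsV2Fetcher <host> <port>, e.g. "
                    + valuesUri("localhost", 8080));
        }
    }
}
```

Running it with the arguments localhost 8080 corresponds to the curl command shown above.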

/prometheus/v1/values

Vespa provides a node metrics API on each node at http://host:port/prometheus/v1/values

Port is the same as the container's query/feed endpoint, default 8080.

The Prometheus API on each node exposes metrics in a text-based format that can be scraped by Prometheus. The metrics are the same as in /metrics/v2/values. See monitoring for a Prometheus / Grafana example.
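
For consumers other than Prometheus itself, the text format can be read line by line. The following is a minimal sketch of such a reader, assuming the standard Prometheus text exposition layout (comment lines starting with #, then name{labels} value with an optional trailing timestamp). The class PrometheusTextSketch is hypothetical, and the simple label handling is a simplification that does not unescape label values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Vespa or Prometheus.
public class PrometheusTextSketch {

    /** One parsed sample: metric name, raw label text (may be empty), numeric value. */
    public static final class Sample {
        public final String name;
        public final String labels;
        public final double value;

        Sample(String name, String labels, double value) {
            this.name = name;
            this.labels = labels;
            this.value = value;
        }
    }

    /** Converts a value token, including the special infinity spellings. */
    static double parseValue(String token) {
        switch (token) {
            case "+Inf": case "Inf": return Double.POSITIVE_INFINITY;
            case "-Inf": return Double.NEGATIVE_INFINITY;
            default: return Double.parseDouble(token); // also accepts "NaN"
        }
    }

    /** Parses sample lines, skipping blank lines and # comments. */
    static List<Sample> parse(String text) {
        List<Sample> samples = new ArrayList<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int open = line.indexOf('{');
            int close = line.lastIndexOf('}');
            String name, labels, rest;
            if (open >= 0 && close > open) {
                name = line.substring(0, open);
                labels = line.substring(open + 1, close);
                rest = line.substring(close + 1).trim();
            } else {
                int space = line.indexOf(' ');
                if (space < 0) continue; // no value present: skip malformed line
                name = line.substring(0, space);
                labels = "";
                rest = line.substring(space + 1).trim();
            }
            if (rest.isEmpty()) continue;
            // The value is the first token; an optional timestamp may follow it.
            samples.add(new Sample(name, labels, parseValue(rest.split("\\s+")[0])));
        }
        return samples;
    }
}
```

The metric names in any test input are up to the caller; the sketch makes no assumption about which names Vespa emits.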

/metrics/v1/values

Vespa provides a node metrics API on each node at http://host:19092/metrics/v1/values

This API can be used for monitoring, using products like Prometheus and Datadog. The response contains a selected set of metrics from each service running on the node. The output is a list of service elements, each with the name, status and metrics for that service. Example:

{
    "services": [
        {
            "name": "vespa.logd",
            "timestamp": 1561469256,
            "status": {
                "code": "up",
                "description": "Data collected successfully"
            },
            "metrics": [
                {
                    "values": {
                        "memory_virt": 111796224,
                        "memory_rss": 14086144,
                        "cpu": 1.0631117111036
                    },
                    "dimensions": {
                        "metrictype": "system",
                        "instance": "logd",
                        "vespaVersion": "7.0.0"
                    }
                }
            ]
        },
        ....
    ]
}

The status for each service is either up, down or (in rare cases) unknown. The unknown status is used, for example, if the service seems to be alive but does not report any metrics.

/state/v1/metrics

Per-process metrics are found at http://host:port/state/v1/metrics

Metrics are reported in snapshots, where the snapshot specifies the time window the metrics were gathered from. Typically, the service aggregates metrics as they are reported. After each snapshot period, a snapshot is taken of the current values, and the values are reset. This approach makes it possible to track min and max values, and to compute values like the 95th percentile for each complete snapshot period.

The from and to times are specified in seconds since the Unix epoch (1970-01-01). Milliseconds or microseconds can be added as decimals.

Vespa supports custom metrics.

Example:

{
    "status" : {
        "code" : "up",
        "message" : "Everything ok here"
    },
    "metrics" : {
        "snapshot" : {
            "from" : 1334134640.089,
            "to" : 1334134700.088
        },
        "values" : [
            {
                "name" : "queries",
                "description" : "Number of queries executed during snapshot interval",
                "values" : {
                    "count" : 28,
                    "rate" : 0.4667
                },
                "dimensions" : {
                    "searcherid" : "x"
                }
            },
            {
                "name" : "query_hits",
                "description" : "Number of documents matched per query during snapshot interval",
                "values" : {
                    "count" : 28,
                    "rate" : 0.4667,
                    "average" : 128.3,
                    "min" : 0,
                    "max" : 10000,
                    "sum" : 3584,
                    "median" : 124.0,
                    "std_deviation": 5.43
                },
                "dimensions" : {
                    "searcherid" : "x"
                }
            }
        ]
    }
}

A flat list of metrics is returned. Each metric value reported by a component should be a separate metric. For related metrics, prefix the metric names with the common part and separate the parts with dots, e.g. memory.free and memory.virtual. Each metric has one or more values set:

Value Description
count Number of times the metric has been set. For a count metric that counts operations, this is the number of operations added during the snapshot period. For a value metric, such as operation latency, this is the number of times a latency has been added to the metric.
average The average of all the values recorded during the snapshot period, typically sum divided by count.
rate The count per second, i.e. count divided by the length of the snapshot period in seconds.
min The smallest value seen in this snapshot period.
max The largest value seen in this snapshot period.
sum The sum of all values seen in this snapshot period.
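
The relationship between these values can be sketched in a few lines. This is a minimal illustration, assuming that rate is count divided by the snapshot length (to minus from) and that average is sum divided by count, as the table describes. The class SnapshotMath is hypothetical and not part of Vespa.

```java
// Hypothetical helper class, not part of Vespa.
public class SnapshotMath {

    /** Length of the snapshot window in seconds, from the "from" and "to" fields. */
    static double periodSeconds(double from, double to) {
        return to - from;
    }

    /** rate: count divided by the snapshot period in seconds. */
    static double rate(long count, double from, double to) {
        return count / periodSeconds(from, to);
    }

    /** average: sum divided by count; returns 0 when nothing was recorded. */
    static double average(double sum, long count) {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        // from, to and count taken from the "queries" metric in the example response.
        double from = 1334134640.089;
        double to = 1334134700.088;
        System.out.println("period (s): " + periodSeconds(from, to));
        System.out.println("rate (1/s): " + rate(28, from, to));
    }
}
```

With the example snapshot, the window is just under 60 seconds, so 28 queries give a rate close to the 0.4667 reported in the response.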

Metrics from custom components

  1. Add a com.yahoo.metrics.simple.MetricReceiver parameter to the constructor of the component - the instance is injected by the container.
  2. Declare the gauges and counters using the declare methods on the metric receiver. Optionally set arbitrary metric dimensions to default values at declaration time - refer to the javadoc for details.
  3. Each time there is some data to measure, invoke the sample method on gauges or the add method on counters. When sampling data, any dimensions can optionally be set.

The gauges and counters declared are inherently thread-safe. Example:

package com.yahoo.example;

import java.util.Optional;
import com.yahoo.metrics.simple.Gauge;
import com.yahoo.metrics.simple.MetricSettings;
import com.yahoo.metrics.simple.MetricReceiver;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class HitCountSearcher extends Searcher {
    private static final String LANGUAGE_DIMENSION_NAME = "query_language";
    private static final String EXAMPLE_METRIC_NAME = "example_hitcounts";
    private final Gauge hitCountMetric;

    // The MetricReceiver is injected by the container.
    public HitCountSearcher(MetricReceiver receiver) {
        // Declare a gauge with no default dimensions and with histograms enabled.
        this.hitCountMetric = receiver.declareGauge(EXAMPLE_METRIC_NAME, Optional.empty(),
                new MetricSettings.Builder().histogram(true).build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        // Sample the total hit count, tagged with the query language as a dimension.
        hitCountMetric
            .sample(result.getTotalHitCount(),
                    hitCountMetric.builder()
                        .set(LANGUAGE_DIMENSION_NAME, query.getModel().getParsingLanguage().languageCode())
                        .build());
        return result;
    }
}

Then look at the metrics output, where the new metric example_hitcounts is available in the list of metrics. The histograms for the last five minutes of logged data are available as CSV per dimension at http://host:port/state/v1/metrics/histograms. In the example, that would include the estimated total hit counts for queries, grouped by language. The underlying implementation of the histograms is HdrHistogram, and the CSV is simply what that library generates itself.

Container Metrics

A few metrics are emitted under multiple names, for compatibility with different metrics frameworks.

Generic Container Metrics

These metrics are output for the server as a whole and are not specific to HTTP.

Metric name Description
serverStartedMillis Time since server started
mem.heap.total Total heap size
mem.heap.free Free heap size
mem.heap.used Used heap size

Thread Pool Metrics

Metrics for the container thread pools. The jdisc.thread_pool.* metrics have a dimension threadpool with the thread pool name, e.g. default-pool for the container's default thread pool. See Container Tuning for details.

Metric name Description
jdisc.thread_pool.size Size of the thread pool
jdisc.thread_pool.active_threads Number of threads that are active
jdisc.thread_pool.max_allowed_size The maximum allowed number of threads in the pool
jdisc.thread_pool.rejected_tasks Number of tasks rejected by the thread pool
jdisc.thread_pool.unhandled_exceptions Number of exceptions thrown by tasks
jdisc.thread_pool.work_queue.capacity Capacity of the task queue
jdisc.thread_pool.work_queue.size Size of the task queue
jdisc.http.jetty.threadpool.thread.max Jetty thread pool: configured maximum number of threads
jdisc.http.jetty.threadpool.thread.min Jetty thread pool: configured minimum number of threads
jdisc.http.jetty.threadpool.thread.reserved Jetty thread pool: configured number of reserved threads or -1 for heuristic
jdisc.http.jetty.threadpool.thread.busy Jetty thread pool: number of threads executing internal and transient jobs
jdisc.http.jetty.threadpool.thread.total Jetty thread pool: current number of threads
jdisc.http.jetty.threadpool.queue.size Jetty thread pool: current size of the job queue

HTTP Specific Metrics

These metrics are specific to HTTP. Metrics that are specific to a connector have a dimension containing the TCP listen port.

Metric name Description
jdisc.http.requests.status Number of requests to the built-in status handler
http.status.1xx Number of responses with a 1xx status
http.status.2xx Number of responses with a 2xx status
http.status.3xx Number of responses with a 3xx status
http.status.4xx Number of responses with a 4xx status
http.status.5xx Number of responses with a 5xx status
serverNumConnections The total number of connections opened
serverNumOpenConnections The current number of open connections
serverConnectionsOpenMax The max number of open connections
serverConnectionDurationMean, -Max, -StdDev The mean/max/stddev of connection duration in ms
serverNumRequests, jdisc.http.requests Number of requests received by the connector
serverNumSuccessfulResponses Number of successful responses sent by the connector
serverNumFailedResponses Number of error responses sent by the connector
serverNumSuccessfulResponseWrites Number of HTTP response chunks that have been successfully written to the network.
serverNumFailedResponseWrites Number of HTTP response chunks that have not been successfully written to the network, due to some kind of I/O error.
serverBytesReceived Number of bytes the connector has received
serverBytesSent Number of bytes the connector has sent
serverTimeToFirstByte Time until the first byte of the response body is sent
serverTotalSuccessfulResponseLatency Time to complete successful responses
serverTotalFailedResponseLatency Time to complete failed responses

/state/v1/health

Per-process health status is found at http://host:port/state/v1/health

Health API

The Health API is most commonly used for heartbeating. Example:

{
    "status" : {
        "code" : "up",
        "message" : "Everything ok here"
    }
}

The status code is one of (see StateMonitor):

  • initializing
  • up
  • down

Assume status down if the page cannot be downloaded.

Containers with the query API enabled return initializing while waiting for content nodes to start (see example). up means that the service is fully up.

The message part is optional. It is normally empty if the service is up, and is set to a textual reason if the service is unavailable.
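
The rules above can be sketched as a small status mapper for a heartbeat check. This is a minimal illustration, assuming the caller has already tried to download the health page and passes an empty Optional when the download failed. The class HealthStatusSketch is hypothetical, and the regular expression is a stand-in for a real JSON parser. Treating an unrecognized or missing code as down is an assumption of this sketch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Vespa.
public class HealthStatusSketch {

    public enum Status { INITIALIZING, UP, DOWN }

    // Matches "code" : "<value>" in the body; a real client would use a JSON parser.
    private static final Pattern CODE =
            Pattern.compile("\"code\"\\s*:\\s*\"([^\"]*)\"");

    /**
     * Maps a /state/v1/health response body to a status.
     * An empty Optional models a page that could not be downloaded.
     */
    static Status statusOf(Optional<String> body) {
        if (!body.isPresent()) {
            return Status.DOWN; // page could not be downloaded: assume down
        }
        Matcher matcher = CODE.matcher(body.get());
        if (!matcher.find()) {
            return Status.DOWN; // no status code found (assumption: treat as down)
        }
        switch (matcher.group(1)) {
            case "up": return Status.UP;
            case "initializing": return Status.INITIALIZING;
            default: return Status.DOWN;
        }
    }

    public static void main(String[] args) {
        String body = "{ \"status\" : { \"code\" : \"up\", \"message\" : \"Everything ok here\" } }";
        System.out.println(statusOf(Optional.of(body)));
        System.out.println(statusOf(Optional.empty()));
    }
}
```

A load balancer or monitoring probe would typically route traffic only when the result is UP, and keep polling while it is INITIALIZING.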