Metrics

Vespa exposes APIs for service metrics and health. The node metrics API allows integration with graphing and alerting services.

Each service also exposes its health status and all of its metrics through the process health API and the process metrics API, respectively.

Node Metrics API

To integrate with monitoring systems, Vespa runs a metrics proxy on all nodes. The metrics proxy exposes a selected set of metrics from each service running on the node, at http://host:19092/metrics/v1/values.

The output from the node metrics API is a list of service elements, with name, status and metrics for that service - example:

{
  "services": [
    {
      "name": "vespa.logd",
      "timestamp": 1561469256,
      "status": {
        "code": "up",
        "description": "Data collected successfully"
      },
      "metrics": [
        {
          "values": {
            "memory_virt": 111796224,
            "memory_rss": 14086144,
            "cpu": 1.0631117111036
          },
          "dimensions": {
            "metrictype": "system",
            "instance": "logd",
            "vespaVersion": "7.0.0"
          }
        }
      ]
    },
    ....
  ]
}
The status for each service is either up, down or (in rare cases) unknown. The unknown status is used, for example, if the service appears to be alive but does not report any metrics.
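
The endpoint can be polled by any monitoring agent over plain HTTP. Below is a minimal Java sketch of such a poller - it is not part of Vespa, the class and package names are made up, and it assumes Jackson (com.fasterxml.jackson.databind) is on the classpath for JSON parsing - which prints every service that does not report status up:

package com.example.monitoring;  // hypothetical package

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class NodeMetricsPoller {

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";

        // Fetch the node metrics from the metrics proxy on port 19092
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":19092/metrics/v1/values"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Each element in "services" has a name, a status and a list of metrics
        JsonNode services = new ObjectMapper().readTree(response.body()).path("services");
        for (JsonNode service : services) {
            String status = service.path("status").path("code").asText();
            if (!"up".equals(status)) {
                System.out.println(service.path("name").asText() + " is " + status);
            }
        }
    }
}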

Scraping metrics with Prometheus

The node metrics API also exposes metrics in a text-based format that can be scraped by Prometheus, at http://host:19092/prometheus/v1/values.

Datadog Integration

A Vespa Datadog integration prototype is currently in beta. It is not included in the Datadog Agent package. To try it out, clone integrations-extras, check out the vespa branch, then follow the steps in the README.

Process Health API

The health API is most commonly used for heartbeating. The health status for each process is found at http://host:port/state/v1/health.

Example:

{
  "status" : {
    "code" : "up",
    "message" : "Everything ok here"
  }
}
The status code is either up or down. Status up means that the service is fully up and ready to serve traffic. If the page cannot be downloaded, a state of down is typically assumed. The message field is optional; it is normally empty when the service is up, and set to a textual reason when the service is unavailable.
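
For heartbeating, only the status code matters, and an unreachable endpoint must be treated as down. Below is a minimal Java sketch of such a check, using only the JDK HTTP client; the class name is made up, and a real integration would parse the JSON body rather than string-matching it:

package com.example.monitoring;  // hypothetical package

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HealthCheck {

    // Returns true only if /state/v1/health responds and reports status code "up"
    public static boolean isUp(String host, int port) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":" + port + "/state/v1/health"))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Crude check for the "up" status code; good enough for a heartbeat sketch
            return response.statusCode() == 200 && response.body().contains("\"up\"");
        } catch (Exception e) {
            // If the page cannot be downloaded, assume the service is down
            return false;
        }
    }
}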

Process Metrics API

Per-process metrics are found at http://host:port/state/v1/metrics.

To find metrics ports, use vespa-model-inspect services to find running services in a cluster, then vespa-model-inspect service <service name> to find ports for the given service (e.g. searchnode).

Metrics are reported in snapshots, where the snapshot specifies the time window the metrics are gathered from. Typically, the service aggregates metrics as they are reported; after each snapshot period, a snapshot is taken of the current values and they are reset. This approach tracks min and max values, and enables values like the 95th percentile for each complete snapshot period.

The from and to times are specified in seconds since 1970. Milliseconds or microseconds can be added as decimals.

Vespa supports custom metrics.

Example:

{
  "status" : {
    "code" : "up",
    "message" : "Everything ok here"
  },
  "metrics" : {
    "snapshot" : {
      "from" : 1334134640.089,
      "to" : 1334134700.088,
    },
    "values" : [
      {
        "name" : "queries",
        "description" : "Number of queries executed during snapshot interval",
        "values" : {
          "count" : 28,
          "rate" : 0.4667
        },
        "dimensions" : {
          "searcherid" : "x"
        }
      },
      {
        "name" : "query_hits",
        "description" : "Number of documents matched per query during snapshot interval",
        "values" : {
          "count" : 28,
          "rate" : 0.4667,
          "average" : 128.3,
          "min" : 0,
          "max" : 10000,
          "sum" : 3584,
          "median" : 124.0,
          "std_deviation": 5.43
        },
        "dimensions" : {
          "searcherid" : "x"
        }
      }
    ]
  }
}
A flat list of metrics is returned. Each metric value reported by a component should be a separate metric. For related metrics, prefix the metric names with a common part and dot-separate the names - e.g. memory.free and memory.virtual. Each metric has one or more values set - valid values:

count - The number of times the metric has been set. For a count metric, e.g. one counting completed operations, this is the number of operations added during the snapshot period. For a value metric, e.g. one recording operation latency, this is the number of times a latency value has been added to the metric.
average - The average of all values received during the snapshot period; typically sum divided by count.
rate - count/s.
min - The smallest value seen in this snapshot period.
max - The largest value seen in this snapshot period.
sum - The total of all values seen in this snapshot period.
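
To make the relationship between these values concrete, here is a small sketch of the snapshot model described above: values are accumulated as they are reported, and taking a snapshot reads out the aggregates and resets the state. This is an illustration only, not Vespa's implementation, and the class name is made up:

// Illustrates how count, sum, min, max, average and rate relate within one snapshot period
public class SnapshotMetric {

    private long count = 0;
    private double sum = 0;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Called each time a value is reported, e.g. the hit count of a query
    public synchronized void addValue(double value) {
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    // Taken at the end of each snapshot period; the accumulated state is then reset
    public synchronized String takeSnapshot(double periodSeconds) {
        double average = count == 0 ? 0 : sum / count;  // average = sum / count
        double rate = count / periodSeconds;            // rate = count per second
        String snapshot = String.format("count=%d sum=%.1f min=%.1f max=%.1f average=%.2f rate=%.4f",
                count, sum, min, max, average, rate);
        count = 0;
        sum = 0;
        min = Double.POSITIVE_INFINITY;
        max = Double.NEGATIVE_INFINITY;
        return snapshot;
    }
}

Note that Vespa also tracks values like the 95th percentile over the same snapshot window; the sketch above only covers the simpler aggregates.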

Metrics from custom components

  1. Add a MetricReceiver (com.yahoo.metrics.simple.MetricReceiver) instance to the constructor of the component in question - it is injected by the container.
  2. Declare the gauges and counters using the declare methods on the metric receiver. Optionally set arbitrary metric dimensions to default values at declaration time - refer to the javadoc for details.
  3. Each time there is some data to measure, invoke the sample method on gauges or the add method on counters. When sampling data, any dimensions can optionally be set.

The gauges and counters declared are inherently thread-safe. Example:

package com.yahoo.example;

import java.util.Optional;
import com.yahoo.metrics.simple.Gauge;
import com.yahoo.metrics.simple.MetricSettings;
import com.yahoo.metrics.simple.MetricReceiver;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class HitCountSearcher extends Searcher {
    private static final String LANGUAGE_DIMENSION_NAME = "query_language";
    private static final String EXAMPLE_METRIC_NAME = "example_hitcounts";
    private final Gauge hitCountMetric;

    // The MetricReceiver is injected by the container; declare the gauge once, with histogram enabled
    public HitCountSearcher(MetricReceiver receiver) {
        this.hitCountMetric = receiver.declareGauge(EXAMPLE_METRIC_NAME, Optional.empty(),
                new MetricSettings.Builder().histogram(true).build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        // Sample the total hit count, tagged with the query language as a dimension
        hitCountMetric
                .sample(result.getTotalHitCount(),
                        hitCountMetric.builder()
                                .set(LANGUAGE_DIMENSION_NAME, query.getModel().getParsingLanguage().languageCode())
                                .build());
        return result;
    }
}
Then look at the metrics API, where the new metric example_hitcounts is available in the list of metrics. The histograms for the last five minutes of logged data are available as CSV per dimension at http://host:port/state/v1/metrics/histograms. In the example, that would include the estimated total hit counts for queries, grouped by language. The underlying implementation of the histograms is HdrHistogram, and the CSV is simply what that library generates itself.

HTTP Server Metrics

The metrics from the built-in HTTP server are available in JSON using the metrics API.

The Container HTTP server is based on Jetty. Some of the metrics are gathered from the Jetty StatisticsHandler or ConnectorStatistics, with names that will be familiar to those used to working with Jetty. Other metrics are Container-specific. Some of the metrics are emitted under two separate names, for compatibility with different metrics frameworks.

In services.xml terminology, a server corresponds to a Jetty connector, while there is always exactly one Jetty server per node in a Container cluster.

Server-wide Metrics

These metrics are output for the server as a whole, across all connectors, as opposed to the per-connector metrics listed in the next section.

Metric name - Description
serverStartedMillis - Time since the server started, in milliseconds
mem.heap.total - Total heap size
mem.heap.free - Free heap size
mem.heap.used - Used heap size
serverThreadPoolSize - Size of the thread pool for request processing
serverActiveThreads - Number of threads actively processing requests
serverRejectedRequests - Number of requests rejected by the thread pool
jdisc.http.requests.status - Number of requests to the built-in status handler
http.status.1xx - Number of responses with a 1xx status
http.status.2xx - Number of responses with a 2xx status
http.status.3xx - Number of responses with a 3xx status
http.status.4xx - Number of responses with a 4xx status
http.status.5xx - Number of responses with a 5xx status

Per-connector Metrics

These metrics are output for each connector in the Jetty server.

Metric name - Description
serverNumConnections - See Jetty ConnectorStatistics
serverNumOpenConnections - See Jetty ConnectorStatistics
serverConnectionsOpenMax - See Jetty ConnectorStatistics
serverConnectionDurationMean, -Max, -StdDev - See Jetty ConnectorStatistics
serverNumRequests, jdisc.http.requests - Number of requests received by the connector
serverNumSuccessfulResponses - Number of successful responses sent by the connector
serverNumFailedResponses - Number of error responses sent by the connector
serverNumSuccessfulResponseWrites - Number of HTTP response chunks successfully written to the network
serverNumFailedResponseWrites - Number of HTTP response chunks that could not be written to the network due to an I/O error
serverBytesReceived - Number of bytes received by the connector
serverBytesSent - Number of bytes sent by the connector
serverTimeToFirstByte - Time until the first byte of the response body is sent
serverTotalSuccessfulResponseLatency - Time to complete successful responses
serverTotalFailedResponseLatency - Time to complete failed responses