This is the reference documentation for the metrics and health APIs in Vespa. The overview below, using an example of two nodes running Vespa, shows where the APIs are exposed and how they interact:
- /metrics/v1/values is the node metrics API, aggregating metrics for the processes running on the node. Each Vespa node runs a metrics-proxy process for this API, default port 19092.
- /state/v1/metrics is the process metrics API, exposing all metrics from an individual service - in the example, each node runs a container and a content node.
- /metrics/v2/values is an aggregation of /metrics/v1/values for all nodes. Served on the metrics-proxy port.
- /prometheus/v1/values is the same as /metrics/v2/values, in Prometheus format. Served on the metrics-proxy port.
- /metrics/v2/values is also replicated on the container port, default 8080.
The API is available on all nodes running a container, at http://host:port/metrics/v2/values
The port is the same as the container's query/feed endpoint, default 8080.
The Vespa container exposes a selected set of metrics for every service on all nodes for the application. The metrics API can, for example, be used to pull Vespa metrics to Cloudwatch using an AWS lambda function.
The response is a nodes list (see example output below), where each element represents a node in the application and contains:
- role in the Vespa application.
- node element containing the node's system metrics, e.g. CPU usage.
- services list containing metrics for the node's services. The format of this list is described below.
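As a sketch of consuming this structure, the snippet below walks a /metrics/v2/values-style response and groups service names by node role. The field names (nodes, role, services, name) follow the description above; the sample payload itself is illustrative, not actual Vespa output.

```python
import json

# Illustrative sample shaped like the documented nodes list; not real output.
sample = json.loads("""
{
  "nodes": [
    {
      "role": "container",
      "node": {"metrics": [{"values": {"cpu.util": 12.5}}]},
      "services": [{"name": "vespa.container", "metrics": []}]
    }
  ]
}
""")

def services_per_role(response):
    """Map each node's role to the names of its services."""
    return {n["role"]: [s["name"] for s in n.get("services", [])]
            for n in response["nodes"]}

print(services_per_role(sample))
```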
$ curl http://localhost:8080/metrics/v2/values
Vespa provides metrics in Prometheus format on each node at http://host:port/prometheus/v1/values
The port is the same as the container's query/feed endpoint, default 8080.
The prometheus API on each node exposes metrics in a text-based format that can be scraped by Prometheus.
The metrics are the same as in /metrics/v2/values.
See monitoring for a Prometheus / Grafana example.
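To illustrate the text-based exposition format, here is a minimal parser sketch. The sample lines are invented for illustration (real output has Vespa metric names), and the sketch ignores the optional trailing timestamps the format allows.

```python
# Illustrative Prometheus text-format lines; not actual Vespa output.
sample = """\
# HELP queries_rate Query rate
# TYPE queries_rate gauge
queries_rate{metrictype="standard"} 0.5
memory_usage_bytes 1024
"""

def parse_exposition(text):
    """Return {metric (with labels) -> float value}, skipping comment lines."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values

print(parse_exposition(sample))
```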
Vespa provides a node metrics API on each node at http://host:19092/metrics/v1/values
This API can be used for monitoring with external monitoring products.
The response contains a selected set of metrics from each service running on the node.
The output is a list of service elements, with name, status and metrics for that service - example:
The status for each service is either up, down or (in rare cases) unknown. The unknown status is for example used if the service seems to be alive, but does not report any metrics.
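A liveness check over this output can simply flag every service whose status is not "up", as sketched below. The payload shape (services with a status code field) is an assumption for illustration, following the description above.

```python
# Illustrative /metrics/v1/values-style payload; not actual Vespa output.
response = {
    "services": [
        {"name": "vespa.container", "status": {"code": "up"}},
        {"name": "vespa.searchnode", "status": {"code": "unknown"}},
    ]
}

def unhealthy_services(resp):
    """Names of services not reporting status "up" (i.e. down or unknown)."""
    return [s["name"] for s in resp["services"]
            if s.get("status", {}).get("code") != "up"]

print(unhealthy_services(response))
```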
Per-process metrics are found at http://host:port/state/v1/metrics
Metrics are reported in snapshots, where each snapshot specifies the time window the metrics are gathered from. Typically, the service aggregates metrics as they are reported; after each snapshot period, a snapshot is taken of the current values, and the values are reset. This approach tracks min and max values, and enables values like the 95th percentile for each complete snapshot period.
The from and to times are specified in seconds since 1970. Milliseconds or microseconds can be added as decimals.
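The sketch below interprets such a pair of timestamps: seconds since 1970, with fractional milliseconds/microseconds as decimals. The snapshot values are illustrative.

```python
from datetime import datetime, timezone

# Illustrative snapshot window; "from"/"to" are epoch seconds with decimals.
snapshot = {"from": 1334134640.089, "to": 1334134700.088}

window_seconds = snapshot["to"] - snapshot["from"]
start = datetime.fromtimestamp(snapshot["from"], tz=timezone.utc)
print(f"snapshot window: {window_seconds:.3f}s starting {start.isoformat()}")
```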
Vespa supports custom metrics.
A flat list of metrics is returned.
Each metric value reported by a component should be a separate metric.
For related metrics, prefix metric names with common parts and dot-separate the names.
Each metric has one or more values set:
| Value | Description |
|-------|-------------|
| count | The number of times the metric has been set. For a count metric counting the number of operations done, it is the number of operations added in that snapshot period. For a value metric, e.g. one recording the latency of operations, it is how many times a latency has been added to the metric. |
| average | The average of all values seen during the snapshot period, typically sum divided by count. |
| min | The smallest value seen in this snapshot period. |
| max | The largest value seen in this snapshot period. |
| sum | The total of all values seen in this snapshot period. |
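Since the average is typically sum divided by count, it can be recomputed from the raw values, which is useful when combining snapshots. A minimal sketch, with illustrative metric values:

```python
# Illustrative snapshot values for one metric.
values = {"count": 4, "sum": 18.0, "min": 2.0, "max": 9.0}

def average(v):
    """Recompute the average from sum and count (0.0 if no samples)."""
    return v["sum"] / v["count"] if v["count"] else 0.0

print(average(values))  # 4.5
```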
The gauges and counters declared are inherently thread-safe.
Then look at the metrics; the new event example_hitcounts will be available in the list of metrics. The histograms for the last five minutes of logged data are available as CSV per dimension at http://host:port/state/v1/metrics/histograms. In the example, that would include the estimated total hit counts for queries, grouped by language. The underlying implementation of the histograms is HdrHistogram, and the CSV is simply what that library generates.
A few metrics are emitted under multiple names, for compatibility with different metrics frameworks.
These metrics are output for the server as a whole and are not specific to HTTP.
| Metric | Description |
|--------|-------------|
| serverStartedMillis | Time since the server started, in milliseconds |
| mem.heap.total | Total heap size |
| mem.heap.free | Free heap size |
| mem.heap.used | Used heap size |
Metrics for the container thread pools. The jdisc.thread_pool.* metrics have a dimension threadpool with the thread pool name, e.g. default-pool for the container's default thread pool.
See Container Tuning for details.
| Metric | Description |
|--------|-------------|
| jdisc.thread_pool.size | Size of the thread pool |
| jdisc.thread_pool.active_threads | Number of threads that are active |
| jdisc.thread_pool.max_allowed_size | The maximum allowed number of threads in the pool |
| jdisc.thread_pool.rejected_tasks | Number of tasks rejected by the thread pool |
| jdisc.thread_pool.unhandled_exceptions | Number of exceptions thrown by tasks |
| jdisc.thread_pool.work_queue.capacity | Capacity of the task queue |
| jdisc.thread_pool.work_queue.size | Size of the task queue |
| jdisc.http.jetty.threadpool.thread.max | Jetty thread pool: configured maximum number of threads |
| jdisc.http.jetty.threadpool.thread.min | Jetty thread pool: configured minimum number of threads |
| jdisc.http.jetty.threadpool.thread.reserved | Jetty thread pool: configured number of reserved threads, or -1 for heuristic |
| jdisc.http.jetty.threadpool.thread.busy | Jetty thread pool: number of threads executing internal and transient jobs |
| jdisc.http.jetty.threadpool.thread.total | Jetty thread pool: current number of threads |
| jdisc.http.jetty.threadpool.queue.size | Jetty thread pool: current size of the job queue |
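These metrics combine naturally into a saturation check: a pool with all threads busy and a backlog in its work queue is a capacity-tuning signal. A sketch using the jdisc.thread_pool.* names from the table above, with illustrative values:

```python
# Illustrative thread-pool metric values for one snapshot.
pool = {
    "jdisc.thread_pool.size": 8,
    "jdisc.thread_pool.active_threads": 8,
    "jdisc.thread_pool.work_queue.capacity": 100,
    "jdisc.thread_pool.work_queue.size": 75,
}

# Fraction of the work queue in use, and a simple saturation heuristic:
# all threads busy while the queue is more than half full.
queue_fill = (pool["jdisc.thread_pool.work_queue.size"]
              / pool["jdisc.thread_pool.work_queue.capacity"])
saturated = (pool["jdisc.thread_pool.active_threads"]
             >= pool["jdisc.thread_pool.size"]) and queue_fill > 0.5
print(f"queue {queue_fill:.0%} full, saturated={saturated}")
```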
These metrics are specific to HTTP. Metrics that are specific to a connector have a dimension containing the TCP listen port.
| Metric | Description |
|--------|-------------|
| jdisc.http.requests.status | Number of requests to the built-in status handler |
| http.status.1xx | Number of responses with a 1xx status |
| http.status.2xx | Number of responses with a 2xx status |
| http.status.3xx | Number of responses with a 3xx status |
| http.status.4xx | Number of responses with a 4xx status |
| http.status.5xx | Number of responses with a 5xx status |
| serverNumConnections | The total number of connections opened |
| serverNumOpenConnections | The current number of open connections |
| serverConnectionsOpenMax | The max number of open connections |
| serverConnectionDurationMean, -Max, -StdDev | The mean/max/stddev of connection duration, in ms |
| serverNumRequests, jdisc.http.requests | Number of requests received by the connector |
| serverNumSuccessfulResponses | Number of successful responses sent by the connector |
| serverNumFailedResponses | Number of error responses sent by the connector |
| serverNumSuccessfulResponseWrites | Number of HTTP response chunks that have been successfully written to the network |
| serverNumFailedResponseWrites | Number of HTTP response chunks that could not be written to the network, due to some kind of I/O error |
| serverBytesReceived | Number of bytes the connector has received |
| serverBytesSent | Number of bytes the connector has sent |
| serverTimeToFirstByte | Time until the first byte of the response body is sent |
| serverTotalSuccessfulResponseLatency | Time to complete successful responses |
| serverTotalFailedResponseLatency | Time to complete failed responses |
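A common use of the http.status.* counters is deriving an error ratio for alerting. A sketch with illustrative counter values, using the metric names from the table above:

```python
# Illustrative per-snapshot status counters.
counters = {
    "http.status.2xx": 950,
    "http.status.4xx": 40,
    "http.status.5xx": 10,
}

# Share of responses that were server errors in this snapshot.
total = sum(counters.values())
error_ratio = counters.get("http.status.5xx", 0) / total if total else 0.0
print(f"5xx ratio: {error_ratio:.1%}")
```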
Per-process health status is found at http://host:port/state/v1/health
The Health API is most commonly used for heartbeating. Example:
The status code is one of (see StateMonitor):
Assume status down if the page cannot be downloaded.
The message part is optional. It is normally empty if the service is up, and set to a textual reason for the unavailability otherwise.
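A heartbeat check following the rule above (assume down when the page cannot be fetched) can be sketched as follows. The default URL and the response shape (a status object with a code field) are assumptions for illustration.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def is_up(url="http://localhost:8080/state/v1/health", timeout=5.0):
    """Heartbeat: True only if the page is fetchable and reports code "up"."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (URLError, OSError, ValueError):
        return False  # unreachable or unparsable -> assume status down
    return body.get("status", {}).get("code") == "up"
```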