This is the reference documentation for the metric and health APIs in Vespa.
Use the example overview of two nodes running Vespa to see where the APIs are set up and how they interact:
/metrics/v1/values is the node metrics API,
aggregating metrics for the processes running on the node.
Each Vespa node has a metrics-proxy process running for this API, default port 19092.
/state/v1/metrics is the process metrics API,
exposing all metrics from an individual service -
here each node runs a container and a content node.
Note:
refer to the multinode
and multinode-HA
sample applications for a practical example of using the APIs.
These apps also include examples of how to find the ports in use with vespa-model-inspect.
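For reference, finding ports with vespa-model-inspect typically looks like this (the searchnode service type is illustrative; service names depend on the application):
vespa-model-inspect services            # list the service types in the application
vespa-model-inspect service searchnode  # list hosts and ports for a given service type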
/metrics/v2/values
The API is found on all nodes running a container, at http://host:port/metrics/v2/values
The port is the same as the container's query/feed endpoint, default 8080.
The Vespa container exposes a selected set of metrics for every service on all nodes in the application.
The metrics API can, for example, be used to
pull Vespa metrics to CloudWatch using an AWS Lambda function.
The metrics API exposes a selected set of metrics for the whole application, or for a single node,
to allow integration with graphing and alerting services.
The response is a nodes list (see example output below),
where each element represents a node in the application and contains:
The node's hostname.
The node's role in the Vespa application.
A node element containing the node's system metrics, e.g. CPU usage.
A services list containing metrics for the node's services.
The format of this list is described below.
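An abridged sketch of the response shape is shown below - hostnames, roles, metric names and values are illustrative placeholders, not a complete output:
{
  "nodes": [
    {
      "hostname": "node1.example.com",
      "role": "container/default/0/0",
      "node": {
        "timestamp": 1662120000,
        "metrics": [
          {"values": {"cpu.util": 28.4}, "dimensions": {}}
        ]
      },
      "services": [
        {
          "name": "vespa.container",
          "timestamp": 1662120000,
          "status": {"code": "up", "description": ""},
          "metrics": [
            {"values": {"queries.rate": 15.2}, "dimensions": {"chain": "default"}}
          ]
        }
      ]
    }
  ]
}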
/prometheus/v1/values
Vespa provides a node metrics API on each node at http://host:port/prometheus/v1/values
The port is the same as the container's query/feed endpoint, default 8080.
The Prometheus API on each node exposes metrics in a text-based format that can be scraped by Prometheus.
The metrics are the same as in /metrics/v2/values.
See monitoring for a Prometheus / Grafana example.
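A minimal Prometheus scrape configuration sketch for this endpoint - the job name and target are placeholder assumptions:
scrape_configs:
  - job_name: 'vespa'
    metrics_path: /prometheus/v1/values
    static_configs:
      - targets: ['vespa-container.example.com:8080']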
/metrics/v1/values
Vespa provides a node metrics API on each node at
http://host:19092/metrics/v1/values
This API can be used for monitoring, using products like Prometheus and Datadog.
The response contains a selected set of metrics from each service running on the node.
The output is a list of service elements, with name, status and metrics for that service - see the abridged example below.
The status for each service is either up, down or (in rare cases) unknown.
The unknown status is used, for example, if the service seems to be alive but does not report any metrics.
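An abridged response sketch - service names, metric names and values are illustrative:
{
  "services": [
    {
      "name": "vespa.searchnode",
      "timestamp": 1662120000,
      "status": {"code": "up", "description": ""},
      "metrics": [
        {
          "values": {"content.proton.documentdb.documents.total.last": 1000},
          "dimensions": {"documenttype": "music"}
        }
      ]
    }
  ]
}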
/state/v1/metrics
Per-process metrics are found at http://host:port/state/v1/metrics
Metrics are reported in snapshots, where the snapshot specifies the
time window the metrics are gathered from.
Typically, the service will aggregate metrics as they are reported, and after each snapshot period,
a snapshot is taken of the current values, and they are reset.
This approach tracks min and max values, and enables values like the 95th percentile for each complete snapshot period.
The from and to times are specified in seconds since 1970.
Milliseconds or microseconds can be added as decimals.
{"status":{"code":"up","message":"Everything ok here"},"metrics":{"snapshot":{"from":1334134640.089,"to":1334134700.088,},"values":[{"name":"queries","description":"Number of queries executed during snapshot interval","values":{"count":28,"rate":0.4667},"dimensions":{"searcherid":"x"}},{"name":"query_hits","description":"Number of documents matched per query during snapshot interval","values":{"count":28,"rate":0.4667,"average":128.3,"min":0,"max":10000,"sum":3584,"median":124.0,"std_deviation":5.43},"dimensions":{"searcherid":"x"}}]}}
A flat list of metrics is returned.
Each metric value reported by a component should be a separate metric.
For related metrics, prefix metric names with common parts and dot separate the names -
e.g. memory.free and memory.virtual.
Each metric has one or more values set:
count - Number of times the metric has been set. For a count metric counting the number of operations done, this is the number of operations added in that snapshot period. For a value metric, for instance one recording operation latency, the count is how many times latencies have been added to the metric.
average - The average of all values recorded during a snapshot period, typically sum divided by count.
Declare the gauges and counters using the declare methods on the metric receiver.
Optionally set arbitrary metric dimensions to default values at declaration time - refer to the javadoc for details.
Each time there is some data to measure,
invoke the sample method on gauges or the add method on counters.
When sampling data, any dimensions can optionally be set.
The gauges and counters declared are inherently thread-safe. Example:
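A sketch along the lines of Vespa's hit-count searcher example - class, metric and dimension names are illustrative, and it assumes the simple metrics API in com.yahoo.metrics.simple:

import com.yahoo.metrics.simple.Gauge;
import com.yahoo.metrics.simple.MetricReceiver;
import com.yahoo.metrics.simple.MetricSettings;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import java.util.Optional;

public class HitCountSearcher extends Searcher {
    private static final String LANGUAGE_DIMENSION_NAME = "query_language";
    private static final String EXAMPLE_METRIC_NAME = "example_hitcounts";
    private final Gauge hitCountMetric;

    public HitCountSearcher(MetricReceiver receiver) {
        // Declare the gauge once, enabling histograms for it
        this.hitCountMetric = receiver.declareGauge(EXAMPLE_METRIC_NAME, Optional.empty(),
                new MetricSettings.Builder().histogram(true).build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        // Sample the estimated total hit count, with the query language as a dimension
        hitCountMetric.sample(result.getTotalHitCount(),
                hitCountMetric.builder()
                        .set(LANGUAGE_DIMENSION_NAME,
                             query.getModel().getParsedQuery().getLanguage().languageCode())
                        .build());
        return result;
    }
}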
Then look at the metrics, where the new metric example_hitcounts is available in the list of metrics.
The histograms for the last five minutes of logged data are available as CSV per
dimension at http://host:port/state/v1/metrics/histograms.
In the example, that would include the estimated total hit counts for queries, grouped by language.
The underlying implementation of the histograms is HdrHistogram,
and the CSV is simply what that library generates itself.
Container Metrics
A few metrics are emitted under multiple names, for compatibility with different metrics frameworks.
Generic Container Metrics
These metrics are output for the server as a whole and are not specific to HTTP.
serverStartedMillis - Time since server started
mem.heap.total - Total heap size
mem.heap.free - Free heap size
mem.heap.used - Used heap size
Thread Pool Metrics
Metrics for the container thread pools.
The jdisc.thread_pool.* metrics have a threadpool dimension containing the thread pool name,
e.g. default-pool for the container's default thread pool.
See Container Tuning for details.
jdisc.thread_pool.size - Size of the thread pool
jdisc.thread_pool.active_threads - Number of threads that are active
jdisc.thread_pool.max_allowed_size - The maximum allowed number of threads in the pool
jdisc.thread_pool.rejected_tasks - Number of tasks rejected by the thread pool
jdisc.thread_pool.unhandled_exceptions - Number of exceptions thrown by tasks
jdisc.thread_pool.work_queue.capacity - Capacity of the task queue
jdisc.thread_pool.work_queue.size - Size of the task queue
jdisc.http.jetty.threadpool.thread.max - Jetty thread pool: configured maximum number of threads
jdisc.http.jetty.threadpool.thread.min - Jetty thread pool: configured minimum number of threads
jdisc.http.jetty.threadpool.thread.reserved - Jetty thread pool: configured number of reserved threads, or -1 for heuristic
jdisc.http.jetty.threadpool.thread.busy - Jetty thread pool: number of threads executing internal and transient jobs
jdisc.http.jetty.threadpool.thread.total - Jetty thread pool: current number of threads
jdisc.http.jetty.threadpool.queue.size - Jetty thread pool: current size of the job queue
HTTP Specific Metrics
These metrics are specific to HTTP.
Metrics that are specific to a connector have a dimension containing the TCP listen port.
jdisc.http.requests.status - Number of requests to the built-in status handler
http.status.1xx - Number of responses with a 1xx status
http.status.2xx - Number of responses with a 2xx status
http.status.3xx - Number of responses with a 3xx status
http.status.4xx - Number of responses with a 4xx status
http.status.5xx - Number of responses with a 5xx status
serverNumConnections - The total number of connections opened
serverNumOpenConnections - The current number of open connections
serverConnectionsOpenMax - The max number of open connections
serverConnectionDurationMean, -Max, -StdDev - The mean/max/stddev of connection duration in ms
serverNumRequests, jdisc.http.requests - Number of requests received by the connector
serverNumSuccessfulResponses - Number of successful responses sent by the connector
serverNumFailedResponses - Number of error responses sent by the connector
serverNumSuccessfulResponseWrites - Number of HTTP response chunks that have been successfully written to the network
serverNumFailedResponseWrites - Number of HTTP response chunks that have not been successfully written to the network, due to some kind of I/O error
serverBytesReceived - Number of bytes the connector has received
serverBytesSent - Number of bytes the connector has sent
serverTimeToFirstByte - Time until the first byte of the response body is sent
serverTotalSuccessfulResponseLatency - Time to complete successful responses
serverTotalFailedResponseLatency - Time to complete failed responses
/state/v1/health
Per-process health status is found at http://host:port/state/v1/health
The Health API is most commonly used for heartbeating. Example:
{"status":{"code":"up","message":"Everything ok here"}}
Assume status down if the page cannot be downloaded.
Containers with the query API enabled return initializing
while waiting for content nodes to start (see
example).
up means that the service is fully up.
The message part is optional - it is normally empty if the service is up,
and set to a textual reason when the service is unavailable.
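A minimal health-check sketch in Java - the endpoint http://localhost:8080 is a placeholder and the status extraction is deliberately naive; it treats any download failure as down, as described above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HealthProbe {

    // Returns the reported status code ("up", "down", "initializing", ...),
    // or "down" if the page cannot be downloaded
    static String probe(String baseUrl) {
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/state/v1/health")).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) return "down";
            // Naive extraction of status.code - a real probe would use a JSON parser
            Matcher m = Pattern.compile("\"code\"\\s*:\\s*\"([^\"]+)\"").matcher(response.body());
            return m.find() ? m.group(1) : "unknown";
        } catch (Exception e) {
            return "down";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("http://localhost:8080"));
    }
}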