Metrics
Vespa provides several HTTP APIs that expose service metrics and health in JSON format:
- The metrics API exposes a selected set of metrics for the whole application, or for a single node, to allow integration with graphing and alerting services.
- Each service exposes its health status via the process health API.
- The process metrics API exposes all metrics from an individual service.
Metrics API
The Vespa container (or your endpoint in hosted Vespa) exposes a selected set of metrics for every service on all nodes of the application. The metrics API can, for example, be used to pull Vespa metrics into CloudWatch using an AWS Lambda function.
For self-hosted Vespa, the URL is: http://<container-host>:<port>/metrics/v2/values, where the port is the same as for searching, e.g. 8080.
For hosted Vespa, just append /metrics/v2/values to your endpoint URL.
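As a minimal sketch of pulling these metrics (the host and port below are placeholders - use your container host, or your hosted Vespa endpoint), the resource can be fetched with any HTTP client, e.g. Java's built-in HttpClient:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchMetrics {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port - replace with your container host, or append the path to your hosted Vespa endpoint
        URI uri = URI.create("http://localhost:8080/metrics/v2/values");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with a top-level "nodes" list, as described below
    }
}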
The response is a nodes list, where each element represents a node in the application and contains:
- The node's hostname.
- The node's role in the Vespa application.
- A node element containing the node's system metrics, e.g. CPU usage.
- A services list containing metrics for the node's services. The format of this list is described below.
{ "nodes": [ { "hostname": "x1234.aws-us-east-1c.vespa-external.aws.oath.cloud", "role": "content/music/0/0", "node": { "timestamp": 1581325707, "metrics": [ { "values": { "cpu.util": 5.0390590408805, "cpu.sys.util": 1.69256381798, "cpu.vcpus": 2, }, "dimensions": { "applicationId": "user.album.default", "host": "x1234.aws-us-east-1c.vespa-external.aws.oath.cloud", "zone": "aws-us-east-1c", "clusterId": "content/music" } } ] }, "services": [ { "name": "vespa.distributor", "timestamp": 1581325707, "status": { "code": "up", "description": "Data collected successfully" }, "metrics": [ { "values": { "serverActiveThreads.average": 8, "mem.heap.free.average": 33107668 }, "dimensions": { "zone": "dev.aws-us-east-1c", "applicationId": "user.album.default", "serviceId": "container-clustercontroller", "clusterId": "content/music" } } ] // end metrics } ] // end services } ] // end nodes }Comments are added for clarity, although not legal json.
Metrics for a single node
Vespa provides a node metrics API on each node at
http://<host>:19092/metrics/v1/values.
This API can be used for monitoring self-hosted Vespa in e.g. Prometheus and DataDog.
The response contains a selected set of metrics from each service running on the node.
The output is a list of service elements, with name, status and metrics for that service - example:
{ "services": [ { "name": "vespa.logd", "timestamp": 1561469256, "status": { "code": "up", "description": "Data collected successfully" }, "metrics": [ { "values": { "memory_virt": 111796224, "memory_rss": 14086144, "cpu": 1.0631117111036 }, "dimensions": { "metrictype": "system", "instance": "logd", "vespaVersion": "7.0.0" } } ] }, .... ] }The status for each service is either
up
,
down
or (in rare cases) unknown
.
The unknown
status is for example used if the service seems to be alive,
but does not report any metrics.
Process Health API
The Health API is most commonly used for heartbeating. Health status for each process is found at http://host:port/state/v1/health
Example:
{ "status" : { "code" : "up", "message" : "Everything ok here" } }The status code can either be
up
or down
.
Status up
means that the service is fully up, ready for serving traffic.
If the page cannot be downloaded, a state of down is typically assumed.
The message part is optional - it is normally empty if the service is up,
while it is set to a textual reason for why it is unavailable, if that is the case.
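A heartbeat check then reduces to fetching this resource and inspecting status.code. The sketch below (host and port are parameters; the Jackson library is assumed for JSON parsing) returns true only if the process reports up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HealthCheck {
    // Returns true if the process at host:port reports status code "up" on /state/v1/health
    static boolean isUp(String host, int port) {
        try {
            URI uri = URI.create("http://" + host + ":" + port + "/state/v1/health");
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) return false; // page cannot be downloaded: assume down
            String code = new ObjectMapper().readTree(response.body()).path("status").path("code").asText();
            return "up".equals(code);
        } catch (Exception e) {
            return false; // unreachable: assume down
        }
    }
}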
Process Metrics API
Per-process metrics are found at http://host:port/state/v1/metrics
To find metrics ports, use vespa-model-inspect services to find running services in a cluster,
then vespa-model-inspect service <service name> to find ports for the given service - examples:
$ vespa-model-inspect service searchnode
$ vespa-model-inspect service distributor
Metrics are reported in snapshots, where the snapshot specifies the time window the metrics are gathered from.
Typically, the service will aggregate metrics as they are reported, and after each snapshot period,
a snapshot is taken of the current values and they are reset.
This approach tracks min and max values, and enables values like the 95th percentile for each complete snapshot period.
The from and to times are specified in seconds since 1970. Milliseconds or microseconds can be added as decimals.
Vespa supports custom metrics.
Example:
{ "status" : { "code" : "up", "message" : "Everything ok here" }, "metrics" : { "snapshot" : { "from" : 1334134640.089, "to" : 1334134700.088, }, "values" : [ { "name" : "queries", "description" : "Number of queries executed during snapshot interval", "values" : { "count" : 28, "rate" : 0.4667 }, "dimensions" : { "searcherid" : "x" } }, { "name" : "query_hits", "description" : "Number of documents matched per query during snapshot interval", "values" : { "count" : 28, "rate" : 0.4667, "average" : 128.3, "min" : 0, "max" : 10000, "sum" : 3584, "median" : 124.0, "std_deviation": 5.43 }, "dimensions" : { "searcherid" : "x" } } ] } }A flat list of metrics is returned. Each metric value reported by a component should be a separate metric. For related metrics, prefix metric names with common parts and dot separate the names - e.g.
memory.free
and memory.virtual
.
Each metric has one or more values set - valid values:
Value | Description |
---|---|
count | Number of times the metric has been set. For a count metric counting operations done, it is the number of operations added during that snapshot period. For a value metric, e.g. one recording operation latency, it is how many times latencies have been added to the metric. |
average | The average of all the values seen during a snapshot period. Typically sum divided by count. |
rate | count/s. |
min | The smallest value seen in this snapshot period. |
max | The largest value seen in this snapshot period. |
sum | The total of all values seen in this snapshot period. |
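To make the relationships concrete, here is a minimal sketch (an illustration only, not Vespa's internal metrics implementation) of how such values could be derived for one snapshot period. With the numbers from the example above, rate = count / (to - from) ≈ 28 / 60 ≈ 0.4667, and average is typically sum / count:

// Illustration only - a simple aggregator mirroring the value semantics above,
// not Vespa's internal metrics implementation.
public class SnapshotValues {
    long count = 0;
    double sum = 0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    void add(double value) {          // called each time the metric is set
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    double average() { return count == 0 ? 0 : sum / count; }

    double rate(double from, double to) { // snapshot boundaries, seconds since 1970 with decimals
        return count / (to - from);       // e.g. 28 / ~60 s ≈ 0.4667
    }
}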
Metrics from custom components
- Add a MetricReceiver (com.yahoo.metrics.simple.MetricReceiver) instance to the constructor of the component in question - it is injected by the container.
- Declare the gauges and counters using the declare methods on the metric receiver. Optionally set arbitrary metric dimensions to default values at declaration time - refer to the javadoc for details.
- Each time there is some data to measure, invoke the sample method on gauges or the add method on counters. When sampling data, any dimensions can optionally be set.
The gauges and counters declared are inherently thread safe. Example:
package com.yahoo.example;

import java.util.Optional;
import com.yahoo.metrics.simple.Gauge;
import com.yahoo.metrics.simple.MetricSettings;
import com.yahoo.metrics.simple.MetricReceiver;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class HitCountSearcher extends Searcher {
    private static final String LANGUAGE_DIMENSION_NAME = "query_language";
    private static final String EXAMPLE_METRIC_NAME = "example_hitcounts";
    private final Gauge hitCountMetric;

    public HitCountSearcher(MetricReceiver receiver) {
        this.hitCountMetric = receiver.declareGauge(EXAMPLE_METRIC_NAME, Optional.empty(),
                new MetricSettings.Builder().histogram(true).build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        hitCountMetric.sample(result.getTotalHitCount(),
                hitCountMetric.builder()
                        .set(LANGUAGE_DIMENSION_NAME, query.getModel().getParsingLanguage().languageCode())
                        .build());
        return result;
    }
}
Then look at the Metrics API where the new event example_hitcounts is available in the list of metrics. The histograms for the last five minutes of logged data are available as CSV per dimension at http://host:port/state/v1/metrics/histograms. In the example, that would include the estimated total hit counts for queries, grouped by language. The underlying implementation of the histograms is HdrHistogram, and the CSV is simply what that library generates itself.
Prometheus
The metrics API on each host exposes metrics in a text-based format that can be scraped by Prometheus at
http://host:19092/prometheus/v1/values.
See the quick-start for a Prometheus / Grafana example.
HTTP Server Metrics
The metrics from the built-in HTTP server are available in JSON using the metrics API.
The Container HTTP server is based on Jetty.
Some of the metrics are gathered from the Jetty StatisticsHandler or ConnectorStatistics,
with names familiar to those used to working with Jetty.
Other metrics are Container-specific.
Some of the metrics are emitted with two separate names, for compatibility with different metrics frameworks.
In services.xml terminology, a server means a Jetty connector,
while there is always only one Jetty server per node in a Container cluster.
Server-wide Metrics
These metrics are output for the server as a whole, across all connectors, as opposed to the per-connector metrics listed in the next section.
Metric name | Description |
---|---|
serverStartedMillis | Time since the server started, in milliseconds |
mem.heap.total | Total heap size |
mem.heap.free | Free heap size |
mem.heap.used | Used heap size |
serverThreadPoolSize | Size of the thread pool for request processing |
serverActiveThreads | Number of threads that are active processing requests |
serverRejectedRequests | Number of requests rejected by the thread pool |
jdisc.http.requests.status | Number of requests to the built-in status handler |
http.status.1xx | Number of responses with a 1xx status |
http.status.2xx | Number of responses with a 2xx status |
http.status.3xx | Number of responses with a 3xx status |
http.status.4xx | Number of responses with a 4xx status |
http.status.5xx | Number of responses with a 5xx status |
Per-connector Metrics
These metrics are output for each connector in the Jetty server.
Metric name | Description |
---|---|
serverNumConnections | See Jetty ConnectorStatistics |
serverNumOpenConnections | See Jetty ConnectorStatistics |
serverConnectionsOpenMax | See Jetty ConnectorStatistics |
serverConnectionDurationMean, -Max, -StdDev | See Jetty ConnectorStatistics |
serverNumRequests, jdisc.http.requests | Number of requests received by the connector |
serverNumSuccessfulResponses | Number of successful responses sent by the connector |
serverNumFailedResponses | Number of error responses sent by the connector |
serverNumSuccessfulResponseWrites | Number of HTTP response chunks that have been successfully written to the network. |
serverNumFailedResponseWrites | Number of HTTP response chunks that have not been successfully written to the network, due to some kind of I/O error. |
serverBytesReceived | Number of bytes the connector has received |
serverBytesSent | Number of bytes the connector has sent |
serverTimeToFirstByte | Time until the first byte of the response body is sent |
serverTotalSuccessfulResponseLatency | Time to complete successful responses |
serverTotalFailedResponseLatency | Time to complete failed responses |