Note:
Refer to the multinode
and multinode-HA
sample applications for a practical example of using the APIs.
These apps also show how to find the ports in use with
vespa-model-inspect.
See the metrics guide for how to get a metric using /metrics/v1/values
and /prometheus/v1/values.
This guide also documents use of custom metrics and histograms.
Metrics proxy
Each Vespa node has a metrics-proxy process running for this API, by default on port 19092 -
use vespa-model-inspect to validate.
It aggregates metrics from all processes on the node, and across nodes.
See the metrics guide for the metrics interfaces hosted by the metrics proxy.
The metrics proxies communicate with each other to build a metrics cache, served on the internal applicationmetrics/v1/ API.
The container replicates this cache on /metrics/v2/values for easy access to all metrics for an application.
The metrics-proxy is started by the config-sentinel and is not configurable.
The metrics-proxy process looks like:
/state/v1/health
Per-process health status is found at http://host:port/state/v1/health
/state/v1/health is most commonly used for heartbeating;
see the reference for details. Example:
{"status":{"code":"up","message":"Everything ok here"}}
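A heartbeat check only needs the status code from this response. The sketch below is one way to do that in Python. The function names is_up and check_health are illustrative, and treating every network or parse failure as "not up" is an assumption about what a monitor would want, not documented Vespa behavior.

```python
import json
import urllib.request


def is_up(health_json: str) -> bool:
    """Return True if a /state/v1/health response body reports status code "up"."""
    payload = json.loads(health_json)
    status = payload.get("status") if isinstance(payload, dict) else None
    return isinstance(status, dict) and status.get("code") == "up"


def check_health(host: str, port: int, timeout: float = 2.0) -> bool:
    """Fetch /state/v1/health from one process; any failure counts as "not up"."""
    url = f"http://{host}:{port}/state/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts, ValueError covers bad JSON.
        return False
```

Calling check_health("localhost", 8080) would probe a container on the default query port. Use vespa-model-inspect to find the port of other processes.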
/state/v1/metrics
Per-process metrics are found at http://host:port/state/v1/metrics
Internally, Vespa builds the aggregated APIs above from the per-process metrics and health APIs.
While most users would use the aggregated APIs,
the per-process metric APIs can be useful in specific cases.
Metrics are reported in snapshots, where each snapshot specifies the
time window the metrics were gathered from.
Typically, the service aggregates metrics as they are reported, and after each snapshot period
it takes a snapshot of the current values and resets them.
This approach tracks min and max values
and makes it possible to report values such as the 95th percentile for each complete snapshot period.
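To illustrate the snapshot approach described above, here is a toy aggregator. It is not Vespa's implementation: the class name SnapshotAggregator, keeping every reported value in memory, and the nearest-rank percentile method are all assumptions made to keep the example short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SnapshotAggregator:
    """Collects reported values and emits one summary per snapshot period."""

    values: list = field(default_factory=list)

    def report(self, value: float) -> None:
        """Record one observation in the current period."""
        self.values.append(value)

    def snapshot(self, period_seconds: float) -> dict:
        """Summarise the values reported so far, then reset for the next period."""
        collected, self.values = self.values, []
        if not collected:
            return {"count": 0, "rate": 0.0}
        ordered = sorted(collected)
        # Nearest-rank 95th percentile (an assumed method, chosen for brevity).
        rank = max(1, math.ceil(0.95 * len(ordered)))
        return {
            "count": len(collected),
            "rate": len(collected) / period_seconds,
            "average": sum(collected) / len(collected),
            "min": ordered[0],
            "max": ordered[-1],
            "sum": sum(collected),
            "last": collected[-1],
            "95percentile": ordered[rank - 1],
        }
```

Because the values are reset after each call, every snapshot describes only its own period.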
{"status":{"code":"up","message":"Everything ok here"},"metrics":{"snapshot":{"from":1334134640.089,"to":1334134700.088},"values":[{"name":"queries","description":"Number of queries executed during snapshot interval","values":{"count":28,"rate":0.4667},"dimensions":{"chain":"vespa"}},{"name":"hits_per_query","description":"Number of hits returned for queries during snapshot interval","values":{"count":28,"rate":0.4667,"average":128.3,"min":0,"max":1000,"sum":3584,"last":72,"95percentile":849.1,"99percentile":672.0},"dimensions":{"chain":"vespa"}}]}}
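The snapshot window and the metric values can be read directly from such a response. The sketch below uses a trimmed version of the example, written as strict JSON. The helper names are illustrative, and the check that rate equals count divided by the window length is inferred from the numbers in the example, not taken from the reference.

```python
import json

# Trimmed version of the example response, written as strict JSON.
SAMPLE = """
{"status":{"code":"up","message":"Everything ok here"},
 "metrics":{"snapshot":{"from":1334134640.089,"to":1334134700.088},
  "values":[{"name":"queries",
             "values":{"count":28,"rate":0.4667},
             "dimensions":{"chain":"vespa"}}]}}
"""


def snapshot_seconds(response: dict) -> float:
    """Length of the snapshot window in seconds."""
    snapshot = response["metrics"]["snapshot"]
    return snapshot["to"] - snapshot["from"]


def metric_values(response: dict, name: str) -> dict:
    """Return the values object of the first metric with the given name."""
    for metric in response["metrics"]["values"]:
        if metric["name"] == name:
            return metric["values"]
    raise KeyError(name)


response = json.loads(SAMPLE)
period = snapshot_seconds(response)
queries = metric_values(response, "queries")
derived_rate = queries["count"] / period
print(f"period={period:.3f}s reported={queries['rate']} derived={derived_rate:.4f}")
```

For this sample the derived rate rounds to the reported 0.4667, which suggests that rate is the count per second over the snapshot window.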
A flat list of metrics is returned.
Each metric value reported by a component should be a separate metric.
For related metrics, give the metric names a common prefix and separate the parts with dots -
e.g. memory.free and memory.virtual.
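A consumer can use this naming convention to regroup the flat list. The following is a small sketch. The function name group_by_prefix is hypothetical, and grouping on the first dot-separated component only is a simplifying assumption.

```python
from collections import defaultdict


def group_by_prefix(metric_names):
    """Group dot-separated metric names by their first component."""
    groups = defaultdict(list)
    for name in metric_names:
        prefix, _, _ = name.partition(".")
        groups[prefix].append(name)
    return dict(groups)


print(group_by_prefix(["memory.free", "memory.virtual", "queries"]))
```

Names without a dot, such as queries, end up in a group of their own.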
/metrics/v1/values
This API can be used for monitoring, using products like
Prometheus and Datadog.
The response contains a selected set of metrics from each service running on the node;
see the reference for details.
Example:
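As a minimal sketch, the node-level API can be read with the Python standard library. The default port 19092 comes from the metrics proxy section. The response shape used here (a top-level services list whose entries have a name) is an assumption to be checked against the reference, and the function names are illustrative.

```python
import json
import urllib.request


def fetch_node_metrics(host: str = "localhost", port: int = 19092) -> dict:
    """GET /metrics/v1/values from the metrics proxy on one node."""
    url = f"http://{host}:{port}/metrics/v1/values"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def service_names(payload: dict) -> list:
    """List the services in a response.

    Assumed shape: {"services": [{"name": ..., "metrics": [...]}, ...]}
    """
    return [service.get("name") for service in payload.get("services", [])]
```

With a running node, service_names(fetch_node_metrics()) would list the services that the metrics proxy reports on.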
/metrics/v2/values
A container service on the same node as the metrics proxy might forward /metrics/v2/values
on its own port, normally 8080.
/metrics/v2/values exposes a selected set of metrics for every service on all nodes for the application.
For example, it can be used to
pull Vespa metrics to CloudWatch using an AWS Lambda function.
The metrics API exposes a selected set of metrics for the whole application, or for a single node,
to allow integration with graphing and alerting services.
The response is a list of nodes with metrics (see example output below);
see the reference for details.
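Graphing and alerting services usually want one row per metric value, not a nested document. The sketch below flattens a nodes list into such rows. The field names hostname, services, name, metrics, values and dimensions are assumptions about the response layout and should be verified against the reference before use.

```python
def flatten_application_metrics(payload: dict) -> list:
    """Flatten an assumed nodes -> services -> metrics structure into rows."""
    rows = []
    for node in payload.get("nodes", []):
        for service in node.get("services", []):
            for entry in service.get("metrics", []):
                for name, value in entry.get("values", {}).items():
                    rows.append({
                        "hostname": node.get("hostname"),
                        "service": service.get("name"),
                        "metric": name,
                        "value": value,
                        "dimensions": entry.get("dimensions", {}),
                    })
    return rows
```

Each row carries its hostname and dimensions, so it can be forwarded to an external system without further lookups.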
/prometheus/v1/values
Vespa provides a node metrics API on each node at http://host:port/prometheus/v1/values
The port and content are the same as for /metrics/v1/values.
The Prometheus API on each node exposes metrics in a text-based format that can be
scraped by Prometheus.
See below for a Prometheus / Grafana example.
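For quick inspection without a Prometheus server, the text format can be parsed by hand. This is a deliberately simplified sketch of the Prometheus text exposition format: it skips comment lines, ignores timestamps, and does not handle a closing brace inside a label value. The label names in the usage example are hypothetical.

```python
import re

SAMPLE_LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {label="value",...}
    r'\s+(?P<value>\S+)'                    # sample value
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'     # optional timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str) -> list:
    """Return (name, labels, value) tuples for each sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = SAMPLE_LINE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

For example, parse_prometheus_text('feed_operations_rate{zone="default"} 0.5') yields one tuple with the metric name, a one-entry label dictionary and the value 0.5.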
Pulling metrics from Vespa
All pull-based solutions use Vespa's metrics API,
which provides metrics in JSON format, either for the full system or for a single node.
Poll at most once every 30 seconds: more frequent polling does
not increase granularity and only adds unnecessary load on your systems.
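One way to respect this limit in a client is to cache the last response and only call the API again once 30 seconds have passed. The class below is a sketch of that idea. MetricsPoller and its parameters are hypothetical names, and the fetch callable stands in for whatever code retrieves the metrics.

```python
import time


class MetricsPoller:
    """Wraps a fetch function so it is called at most once per interval."""

    def __init__(self, fetch, min_interval_seconds: float = 30.0, clock=time.monotonic):
        self._fetch = fetch
        self._min_interval = min_interval_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_fetch_time = None
        self._cached = None

    def get(self):
        """Return fresh metrics if the interval has elapsed, else the cached copy."""
        now = self._clock()
        due = (self._last_fetch_time is None
               or now - self._last_fetch_time >= self._min_interval)
        if due:
            self._cached = self._fetch()
            self._last_fetch_time = now
        return self._cached
```

Callers can invoke get() as often as they like, and the underlying API is still polled no more than once per interval.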
Service
Description
CloudWatch
Metrics can be pulled into CloudWatch from both Vespa Cloud and self-hosted Vespa.
The recommended solution is to use an AWS Lambda function, as described in
Pulling Vespa metrics to Cloudwatch.
Datadog
The Vespa team has created a Datadog Agent integration
to allow real-time monitoring of Vespa in Datadog.
The Datadog Vespa integration
is not packaged with the agent, but is included in Datadog's
integrations-extras repository.
Clone it and follow the steps in the
README.
Note:
The Datadog Agent integration currently works for self-hosted Vespa only.
Prometheus
Vespa exposes metrics in a text-based format that can be
scraped by Prometheus.
For Vespa Cloud, append /prometheus/v1/values
to your endpoint URL. For self-hosted Vespa, the URL is
http://<container-host>:<port>/prometheus/v1/values, where
the port is the same as for searching, e.g. 8080. Metrics for each individual host
can also be retrieved at http://host:19092/prometheus/v1/values.
See below for a Prometheus / Grafana example.
Pushing metrics to CloudWatch
Note: This method currently works for self-hosted Vespa only.
This is presumably the most convenient way to monitor Vespa in CloudWatch.
Steps / requirements:
An IAM user or IAM role that has only the cloudwatch:PutMetricData permission.
Store the credentials for the above user or role in a
shared credentials file on each Vespa node.
If a role is used, provide a mechanism to keep the credentials file updated when keys are rotated.
Configure Vespa to push metrics to CloudWatch -
example configuration for the admin section in services.xml:
This configuration sends the default set of Vespa metrics to the CloudWatch namespace
my-vespa-metrics in the us-east-1 region.
Refer to the metric list for the default metric set.
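The requirement above to keep the credentials file updated when keys are rotated can be met with a small script run on each node. The sketch below rewrites an AWS shared credentials file atomically, so a reader never sees a half-written file. The function name, the default profile, and how the new keys are obtained are all assumptions. The source does not say whether Vespa re-reads the file without a restart.

```python
import configparser
import os
import tempfile


def write_shared_credentials(path: str, access_key_id: str, secret_access_key: str,
                             session_token: str = "", profile: str = "default") -> None:
    """Atomically (re)write an AWS shared credentials file with one profile."""
    # interpolation=None so that '%' in a key is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    values = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    if session_token:
        # Temporary credentials from an assumed role also carry a session token.
        values["aws_session_token"] = session_token
    config[profile] = values

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".credentials-")
    try:
        with os.fdopen(fd, "w") as handle:
            config.write(handle)
        os.chmod(tmp_path, 0o600)   # readable by the owner only
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A rotation job would call this function with the new keys each time they change.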
Monitoring with Grafana
Follow these steps to set up monitoring with Grafana for a Vespa instance.
This guide builds on the quick start
by adding three more Docker containers and connecting these in the Docker monitoring network:
Run the Quick Start:
Complete steps 1-7 (or 1-10), but skip the removal step.
Clone repository:
$ git clone --depth 1 https://github.com/vespa-engine/sample-apps.git && \
cd sample-apps/examples/operations/monitoring/album-recommendation-monitoring
Create a network and add the vespa container to it:
Prometheus is a time-series database,
which holds a series of values associated with a timestamp.
Open Prometheus at
http://localhost:9090/.
The input box auto-completes, which makes it easy to find what data Prometheus has -
e.g. enter feed_operations_rate and click Execute.
Also explore the Status dropdown.
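The same query can be run from a script through the Prometheus HTTP API. The sketch below builds an instant query against /api/v1/query and extracts the values from a vector result. The base URL matches the address used in this guide, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def query_url(expression: str, base_url: str = "http://localhost:9090") -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expression})


def instant_values(api_response: dict) -> list:
    """Return (labels, value) pairs from an instant-query vector result."""
    if api_response.get("status") != "success":
        raise ValueError(f"query failed: {api_response.get('error')}")
    results = api_response["data"]["result"]
    # Each vector item has "value": [timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in results]


def run_query(expression: str, base_url: str = "http://localhost:9090") -> list:
    """Run an instant query against a live Prometheus server."""
    with urllib.request.urlopen(query_url(expression, base_url), timeout=5) as response:
        return instant_values(json.load(response))
```

With the containers from this guide running, run_query("feed_operations_rate") would return one entry per time series that Prometheus has scraped for that metric.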
This launches Grafana.
Grafana is a visualisation tool for building dashboards of important Vespa metrics.
Open
http://localhost:3000/ and find the Grafana login screen - log in with admin/admin (skip changing password).
From the list on the left, click Browse under Dashboards (the symbol with 4 blocks),
then click the Vespa Detailed Monitoring Dashboard.
The dashboard displays detailed Vespa metrics - empty for now.
$ docker build album-recommendation-random-data --no-cache --tag random-data-feeder -o out
$ ls out/usr/local/lib/app.jar
This builds the Random Data Feeder -
it generates random sets of data and feeds them to the Vespa instance.
It also runs queries repeatedly, so that Grafana has data to visualise.
Compiling the Random Data Feeder takes a few minutes.
Graphs will now show up in Grafana and Prometheus - it might take a minute or two.
The Grafana dashboard is fully customisable.
Change the default modes of Grafana and Prometheus by editing the configuration files in
album-recommendation-monitoring.