The Vespa Cloud Console has dashboards for insight into performance metrics; use the METRICS tab in the application zone view.
These metrics can also be pulled into external monitoring tools using the Prometheus metrics API.
The Vespa Cloud metrics dashboard (the METRICS tab in the application zone view) is organized around a symptom → layer → resource workflow, so an investigation that starts from "latency is up" can land on "this specific layer is the bottleneck" without scanning every chart.
The dashboard is organized into seven tabs:
| Tab | What it shows | When to use it |
|---|---|---|
| Overview | Health indicators, request rates, QoS, latency summary, HTTP status codes, resource utilization | Daily health check, first stop during incidents |
| Query | Container- and content-node query latency, per-rank-profile breakdown, match/docsum executors | Investigating read latency, query quality issues |
| Feed | Feed operation rates and latency at each layer, feed blocking | Investigating write latency or throughput issues |
| Nearest Neighbor Search | NNS distance computations, visit efficiency | Tuning HNSW parameters (hidden when not in use) |
| Content Node | Document counts, Proton resource usage, executor utilization, maintenance jobs | Deep investigation of search engine internals |
| Resources | CPU, memory, disk, GPU, JVM, thread pools | Sizing and scaling decisions |
| Health | Cluster state, data consistency, restarts, reindexing, resource limits | Stability monitoring, post-incident review |
Filters at the top apply across all tabs:
Query, Feed, Content Node, Resources, and Health tabs group metrics per cluster — you see all metrics for one cluster before scrolling to the next. Container metrics are grouped per container cluster, content metrics per content cluster.
Annotations are vertical lines drawn on every chart that mark operational events. When a latency or throughput anomaly lines up with an annotation, you get the context for the change without having to infer it from the graph alone.
| Annotation | Triggered by | Why it matters |
|---|---|---|
| Feed blocked in cluster | A content node crosses its disk/memory feed-block limit | Writes are paused cluster-wide until remediated |
| Vespa upgrade | A new Vespa version is rolled out | Brief rolling-restart latency spikes are expected around this marker |
| Data migration | Bucket merges pending exceed a threshold | Explains elevated CPU/IO and latency during redistribution |
| Document re-indexing | A reindexing job is running | Explains elevated CPU and search-side load |
| Auto-scaling | The autoscaler changed the cluster shape | Brief capacity drop during reshuffle |
| Service restart | delta(sentinel_totalRestarts[10m]) > 0 (a Vespa service process restarted on one or more nodes) | Unexpected restarts usually indicate a crash, OOM, or forced stop; outside of planned upgrades these are always worth investigating |
| Core dump | delta(coredumps_processed[1h]) > 0 (a process core-dumped) | Signals a crash; cross-reference with Service restart. Should be extremely rare |
The Overview tab is the fastest place to answer "is anything obviously broken?" and provides everything needed for daily monitoring at a glance.
The Overview tab opens with a dedicated Health Indicators row — five stat panels designed to surface stability issues in a single glance. A row of green zeros is the signal to stop; a non-zero value tells you which tab to visit next.
| Indicator | What it counts | Healthy value |
|---|---|---|
| Core Dumps (1h) | Core dumps processed across all clusters in the last hour | 0 — any non-zero value is a crash to investigate |
| Restarts (1h) | Vespa service restarts across all clusters in the last hour | 0 during steady state; brief spikes are normal during upgrades |
| Feed Blocked | Nodes currently above a feed-block resource limit | 0 — non-zero means writes are being rejected cluster-wide |
| Content: Groups/Nodes Down | Content groups with at least one node down | 0 during steady state. 1 group down is normal during rolling restarts or maintenance; 2 or more should be investigated |
| Container: Services Down | Active container nodes where some service isn't running | 0 during steady state; brief spikes during deployments are expected |
QoS (Quality of Service) shows the percentage of successful requests. Read and write QoS are shown separately; a healthy application should be above 99.9%. If QoS drops, consult the HTTP Response Code Reference row (collapsed by default) for a table explaining every observed status code and its meaning in Vespa context. 4xx responses are client errors; 5xx responses are server errors and should be investigated immediately.
Latency summary separates query and feed latency into read and write rows. Compare averages with p99 — a large gap indicates tail latency that won't show up in averages. As a rule of thumb, if p99 is more than 5× the average, investigate the tail.
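The rule of thumb above can be expressed as a small check. This is a minimal sketch: the function name and the sample numbers are made up for illustration, and the factor of 5 is the only value taken from the text.

```python
def tail_needs_investigation(avg_ms: float, p99_ms: float, factor: float = 5.0) -> bool:
    """Return True when p99 latency exceeds `factor` times the average latency."""
    if avg_ms <= 0:
        # No traffic or a missing sample: there is nothing to compare against.
        return False
    return p99_ms > factor * avg_ms


# Illustrative values only, not measurements from a real application.
print(tail_needs_investigation(avg_ms=12.0, p99_ms=48.0))  # False: 48 <= 5 * 12
print(tail_needs_investigation(avg_ms=12.0, p99_ms=90.0))  # True: 90 > 5 * 12
```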
The bottom row gives a quick view of CPU, memory, and disk across all clusters. Any resource consistently above 80% warrants attention.
When query latency increases, the Query tab helps find the cause layer-by-layer. Metrics are grouped per container cluster (for container-level metrics) and per content cluster (for content-node metrics).
A query flows through multiple layers, each with its own latency metric:
Client
→ HTTP Read Latency (end-to-end including network I/O)
→ Query Container Latency (time in the container itself)
→ Query Latency (container-observed total, excluding HTTP overhead)
→ Search Protocol Latency (time on each content node)
→ Rank Profile Latency (per rank-profile breakdown)
Start with the Query Rate & Latency row:
If HTTP latency is much higher than query latency, the bottleneck is network or payload size. If search protocol latency dominates, the bottleneck is on the content nodes.
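The comparison described above can be sketched as a helper that takes the three latencies read off the dashboard. The cut-offs used here ("much higher" as 2x, "dominates" as at least half of query latency) are assumptions chosen for illustration; the dashboard does not define them.

```python
def locate_query_bottleneck(
    http_ms: float,
    query_ms: float,
    search_protocol_ms: float,
    much_higher: float = 2.0,  # hypothetical cut-off for "much higher"
    dominates: float = 0.5,    # hypothetical cut-off for "dominates"
) -> str:
    """Map the layer latencies to the likely bottleneck named in the text."""
    if query_ms <= 0:
        return "undetermined"
    if http_ms > much_higher * query_ms:
        return "network or payload size"
    if search_protocol_ms >= dominates * query_ms:
        return "content nodes"
    # Neither rule applies: keep comparing the remaining layers by hand.
    return "undetermined"


print(locate_query_bottleneck(http_ms=100.0, query_ms=20.0, search_protocol_ms=5.0))
print(locate_query_bottleneck(http_ms=25.0, query_ms=20.0, search_protocol_ms=15.0))
```

Tune both cut-offs to what is normal for the application; a fixed ratio is only a starting point.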
The Query Quality row shows:
The Query tab groups per-rank-profile metrics into four sub-rows, all filterable by the Rank Profile dropdown:
Things to look for:
See Latency tracking below for a worked example, and the rank profiles documentation for background.
The Query tab also includes Match Executor and Docsum Executor sub-rows (queue size + accepted rate) so you can see whether the content-node thread pools feeding the query and summary paths are saturated. These are not attributable to a rank profile, but often explain tail-latency spikes that aren't visible in rank-profile metrics.
When feed latency increases or throughput drops, the Feed tab shows where in the write path the slowdown occurs. A write operation flows through:
Client
→ HTTP Write Latency (end-to-end)
→ Container Feed Latency (document processing chains, embedders)
→ Distributor Latency (routing based on bucket distribution)
→ Content: Storage Latency (persistence, per document replica)
→ Commit Latency (transaction log)
Start from the top and find where latency increases. If container feed latency is normal but HTTP write latency is high, the bottleneck is network/payload. If distributor latency is high, check for node state issues in the Health tab. If storage latency is high, check disk I/O in the Resources tab.
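The top-down walk above can be sketched as a comparison of current latencies against a baseline taken while the system was healthy. The layer keys and the 1.5x ratio for "high" are assumptions made for this example, not names or thresholds used by the dashboard.

```python
def feed_next_steps(current: dict, baseline: dict, ratio: float = 1.5) -> list:
    """Return the follow-up checks suggested by the text for each slow layer.

    `current` and `baseline` map hypothetical layer keys to latency in ms.
    A layer counts as "high" when it exceeds `ratio` times its baseline.
    """
    def high(layer: str) -> bool:
        return current[layer] > ratio * baseline[layer]

    steps = []
    if high("http_write") and not high("container_feed"):
        steps.append("network or payload size")
    if high("distributor"):
        steps.append("check node states in the Health tab")
    if high("storage"):
        steps.append("check disk I/O in the Resources tab")
    return steps


baseline = {"http_write": 20.0, "container_feed": 10.0, "distributor": 5.0, "storage": 3.0}
current = {"http_write": 22.0, "container_feed": 11.0, "distributor": 6.0, "storage": 9.0}
print(feed_next_steps(current, baseline))
```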
Feed Blocked is the most critical feed metric. When a content node exceeds its
disk or memory resource limit, feeding is paused for
the entire cluster. HTTP clients receive 507 Insufficient Storage.
If feed is being blocked:
The Health tab includes a Resource Limits Reference panel explaining the default limits, the blocking mechanism, and how to remediate.
This tab only appears when the application uses approximate nearest neighbor search — it is automatically hidden when no NNS distance computations are detected.
Vespa supports two NNS modes:
approximate-threshold (default 0.02).
Key metrics:
Tuning parameters (set per rank profile):
approximate-threshold, filter-first-threshold,
target-hits-max-adjustment-factor, exploration-slack.
If the exact NNS ratio is high, consider increasing approximate-threshold
or restructuring filters to be less restrictive.
The Content Node tab shows internals of the Proton search engine running on each content node. All metrics are grouped per content cluster.
Disk and memory usage from Proton's internal accounting. This is distinct from node-level metrics in the Resources tab — these are the values Vespa uses for feed-blocking decisions.
Proton uses several thread pools (executors):
Typical healthy values:
The dashboard renders avg as a solid green line and max as a dashed yellow line, making it easy to spot whether the maximum tracks the average or has concerning spikes.
Proton runs background maintenance jobs that manage data structures. The dashboard includes a reference panel (collapsed) explaining each job and its resource impact:
| Job | Resource impact |
|---|---|
| Attribute Flush | Low |
| Memory Index Flush | Moderate |
| Disk Index Fusion | High — temporary 2× disk usage |
| Document Store Compaction | High — holds file in memory |
| Bucket Move | High — competes with feeding |
| LID-Space Compaction | Moderate |
Latency spikes that correlate with active maintenance are expected but may indicate the cluster needs more headroom.
The Resources tab is the primary tool for sizing decisions. Node-level resources (CPU, memory, disk) are grouped per cluster. Container-specific metrics (JVM, thread pools, GPU, network) are grouped per container cluster.
| Resource | Healthy | Concerning | Action needed |
|---|---|---|---|
| CPU | < 70% | 70–85% | > 85% sustained |
| CPU IOWait | < 5% | 5–10% | > 10% (I/O bottleneck) |
| Memory | < 70% | 70–80% | Approaching feed-block limit |
| Disk | < 70% | 70–80% | Approaching feed-block limit |
| JVM GC Overhead | < 5% | 5–15% | > 15% (severe latency impact) |
| Threadpool utilization | < 70% | 70–90% | Rejected tasks = requests dropped |
Content nodes need extra headroom because maintenance jobs (especially disk index fusion) temporarily increase resource usage.
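For alerting or reporting outside the dashboard, the numeric rows of the table above can be encoded directly. This is a sketch: the resource keys are made-up names, and only the rows with explicit upper and lower bounds are included.

```python
# (healthy below, action needed above), in percent, taken from the table above.
# The dictionary keys are hypothetical names chosen for this example.
THRESHOLDS = {
    "cpu": (70.0, 85.0),
    "cpu_iowait": (5.0, 10.0),
    "jvm_gc_overhead": (5.0, 15.0),
}


def classify(resource: str, percent: float) -> str:
    """Classify a utilization percentage as healthy, concerning or action needed."""
    healthy_below, action_above = THRESHOLDS[resource]
    if percent < healthy_below:
        return "healthy"
    if percent <= action_above:
        return "concerning"
    return "action needed"


print(classify("cpu", 65.0))         # healthy
print(classify("cpu", 80.0))         # concerning
print(classify("cpu_iowait", 12.0))  # action needed
```

The table qualifies the CPU limit with "sustained", so apply the classification to values averaged over a window rather than to single samples.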
Which thread pools exist on a container depends on which elements are configured
in services.xml:
| Thread pool | Present when |
|---|---|
| default-handler-common | Always (handler executor used by anything without its own pool) |
| search-handler | <search> element is present |
| feedapi-handler | <document-api> element is present |
To keep the dashboard free of empty panels, the Resources tab contains three threadpool rows — one per container configuration case — and each row repeats per container cluster that falls into that case:
<search> but no feed API
Classification is automatic: hidden variables derive the cluster list per case, so only relevant rows render for a given deployment. Each pool gets three panels (Utilization, Work Queue Size, Work Queue Utilization), with avg as a solid green line and max as a dashed yellow line.
The Resources tab's JVM row separates the three layers of container memory:
When overall node memory is high but heap and direct look normal, the native layer is usually the answer. This is common on container nodes running embedder or local-LLM components: model weights are memory-mapped and only partially resident, but KV cache and compute buffers are allocated upfront as native memory.
The Health tab tracks cluster stability and data consistency, grouped per content cluster.
Nodes are distributed across states: up (serving), down (unreachable), initializing (starting up), maintenance (temporarily out), retired (being removed). During normal operation: all up, zero down. See content node states.
After scaling events, expect buckets out of sync and pending merges. These should converge back to zero. If they don't, investigate.
Both signals surface in three complementary ways: as per-cluster time series on this tab (for historical context), as at-a-glance counters in the Health Indicators row on the Overview tab, and as Service restart/Core dump annotations drawn as vertical lines on every chart.
Shows memory and disk utilization vs. configured limits. When utilization exceeds the limit, feeding is blocked. The dashboard includes a Resource Limits Reference panel (collapsed) explaining the default limits (disk 80%, memory 80%), the blocking mechanism, and what to do about it.
services.xml.
When monitoring latency in clusters with mixed loads, it is useful to use rank profiles to separate them. As an example, an application might have user queries mixed with agentic, batch-oriented queries. Tracking the Container-level query latencies might look like:
Using Content node level metrics, separated by ranking profile, we see:
From this, we see that query latency varies with the rank profile used. Relevant metrics to export to your monitoring system include:
In short, when debugging latency, look for changes, per rank profile:
The above metrics are a subset of the available metrics. It is a good idea to set a query profile per class of query, and in each query profile, select a distinct rank profile. With this, you can change the rank profile for a given query class by configuration only (no need to change the clients) - a good example is having a lightweight rank profile to use in overload situations. This makes it easier to track the individual query classes, per rank profile.
Prometheus metrics are found at $ENDPOINT/prometheus/v1/values:
$ curl -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
'https://b6718765.b68a1234.z.vespa-app.cloud/prometheus/v1/values'
The metrics can be fed into e.g. your Grafana Cloud or self-hosted Grafana instance. See the Vespa metrics documentation for more information.
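The curl request above can also be issued from a script, for instance to check the endpoint before wiring up a monitoring system. The following is a sketch using only the Python standard library; the function names are made up for this example, and the certificate and key file names are the ones used in the curl command.

```python
import ssl
import urllib.request


def metrics_url(endpoint: str) -> str:
    """Append the Prometheus metrics path to an application endpoint."""
    return endpoint.rstrip("/") + "/prometheus/v1/values"


def fetch_metrics(endpoint: str, cert_file: str, key_file: str, timeout: float = 10.0) -> str:
    """Fetch the metrics text, presenting the data plane certificate as client cert."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with urllib.request.urlopen(metrics_url(endpoint), context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs network access and valid credentials):
# text = fetch_metrics("https://b6718765.b68a1234.z.vespa-app.cloud",
#                      "data-plane-public-cert.pem", "data-plane-private-key.pem")
print(metrics_url("https://b6718765.b68a1234.z.vespa-app.cloud/"))
```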
This section explains how to set up Grafana to consume Vespa metrics using the Prometheus API.
Prometheus is configured using prometheus.yml; find a sample config in
prometheus.
See prometheus-cloud.yml,
which is designed to be easy to set up with any Vespa Cloud instance.
Replace <Endpoint> and <SERVICE_NAME> with the endpoint
for the application and the service name, respectively.
In addition, the paths to the private key and public cert
used for data plane access to the endpoint need to be provided;
refer to security.
Then, configure the Prometheus instance to use this configuration file.
The Prometheus instance will now start retrieving the metrics from Vespa Cloud.
If the Prometheus instance is used for multiple services,
append the target configuration for Vespa to scrape_configs.
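To illustrate what such a target configuration may contain, the sketch below renders one scrape_configs entry from standard Prometheus scrape settings (scheme, metrics_path, tls_config, static_configs). It is not a copy of prometheus-cloud.yml, which may differ; the job name and the helper function are hypothetical, and port 443 is assumed for the HTTPS endpoint.

```python
def render_scrape_job(endpoint_host: str, cert_file: str, key_file: str,
                      job_name: str = "vespa-cloud") -> str:
    """Render one scrape_configs entry as YAML text (job_name is a made-up default)."""
    return "\n".join([
        f"  - job_name: {job_name}",
        "    scheme: https",
        "    metrics_path: /prometheus/v1/values",
        "    tls_config:",
        f"      cert_file: {cert_file}",
        f"      key_file: {key_file}",
        "    static_configs:",
        f"      - targets: ['{endpoint_host}:443']",
    ])


print("scrape_configs:")
print(render_scrape_job("b6718765.b68a1234.z.vespa-app.cloud",
                        "data-plane-public-cert.pem",
                        "data-plane-private-key.pem"))
```

When Prometheus already scrapes other services, only the rendered entry is appended under the existing scrape_configs key.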
Use the provisioning folder as a baseline for further configuration.
The provisioning folder contains several files that help configure Grafana locally.
These work as good examples of default configurations,
but the most important is the file named Vespa-Engine-Advanced-Metrics-External.json.
This is a default dashboard, based upon the metrics the Vespa team uses to monitor performance.
Click the + button on the side and go to import. Upload the file to the Grafana instance. This should automatically load in the dashboard for usage. For now, it will not display any data as no data sources are configured yet.
The Prometheus data source has to be added to the Grafana instance for the visualisation. Click the cog on the left and then "Data Sources". Click "Add data source" and choose Prometheus from the list. Add the URL for the Prometheus instance with appropriate bindings for connecting. The configuration for the bindings will depend on how the Prometheus instance is hosted. Once the configuration details have been entered, click Save & Test at the bottom and ensure that Grafana says "Data source is working".
To verify the data flow, navigate back to the Vespa Metrics dashboard: click the dashboard symbol on the left (4 blocks), click Manage, and then click Vespa Metrics. Data should now appear in the Grafana dashboard. If no data shows up, edit one of the data sets and ensure that it has the right data source selected; the data source name the dashboard expects might differ from the name of your data source. If there is still no data appearing, either the Vespa instance is not being used or some part of the configuration is wrong.
To pull metrics from your Vespa application into AWS CloudWatch, refer to the metrics-emitter documentation for how to set up an AWS Lambda.
The Vespa Grafana Terraform template provides a set of dashboards and alerts. If you are using a different monitoring service and want to set up an equivalent alert set, you can follow this table:
| Metric name | Threshold | Dimension aggregation |
|---|---|---|
| content_proton_resource_usage_disk_average | > 0.9 | max by(applicationId, clusterId, zone) |
| content_proton_resource_usage_memory_average | > 0.8 | max by(applicationId, zone, clusterId) |
| cpu_util | > 90 | max by(applicationId, zone, clusterId) |
| content_proton_resource_usage_feeding_blocked_last | >= 1 | N/A |
All metrics are from the default metric set, and metric names follow the naming scheme of the Prometheus metrics API. Dimension aggregation is optional but reduces alerting noise, e.g. when an entire cluster goes bad. It is recommended to filter all alerts on zones in the prod environment.
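For a monitoring service without built-in rule support, the table above could be evaluated with a small script. The sketch below takes the maximum per (applicationId, zone, clusterId) group and compares it with the threshold. As a simplification it applies the same grouping to the feed-blocked metric, for which the table lists no aggregation. The sample values and the function name are invented for illustration.

```python
# Metric names and thresholds are taken from the alert table above.
ALERT_RULES = {
    "content_proton_resource_usage_disk_average": lambda v: v > 0.9,
    "content_proton_resource_usage_memory_average": lambda v: v > 0.8,
    "cpu_util": lambda v: v > 90,
    "content_proton_resource_usage_feeding_blocked_last": lambda v: v >= 1,
}
GROUP_BY = ("applicationId", "zone", "clusterId")


def firing_alerts(samples):
    """samples: iterable of (metric_name, labels_dict, value) tuples.

    Returns sorted (metric, applicationId, zone, clusterId) keys whose
    per-group maximum crosses the threshold.
    """
    worst = {}
    for name, labels, value in samples:
        if name not in ALERT_RULES:
            continue
        key = (name,) + tuple(labels.get(label, "") for label in GROUP_BY)
        worst[key] = max(value, worst.get(key, value))
    return sorted(key for key, value in worst.items() if ALERT_RULES[key[0]](value))


# Two nodes in one cluster: a single alert is raised for the cluster.
cluster = {"applicationId": "mytenant.myapp.default", "zone": "prod.aws-us-east-1c",
           "clusterId": "content/music"}
samples = [
    ("cpu_util", dict(cluster, hostname="node-1"), 95.0),
    ("cpu_util", dict(cluster, hostname="node-2"), 50.0),
    ("content_proton_resource_usage_memory_average", dict(cluster, hostname="node-1"), 0.7),
]
print(firing_alerts(samples))
```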
Below is a sample request with sample response for Prometheus metrics for a minimal application on Vespa Cloud:
$ curl -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
'https://b6718765.b68a1234.z.vespa-app.cloud/prometheus/v1/values'
...
jdisc_thread_pool_work_queue_size_min{threadpool="default-pool",zone="dev.aws-us-east-1c",applicationId="mytenant.myapp.default",serviceId="logserver-container",clusterId="admin/logserver",hostname="h97490a.dev.us-east-1c.aws.vespa-cloud.net",vespa_service="vespa_logserver_container",} 0.0 1733139324000
jdisc_thread_pool_work_queue_size_min{threadpool="default-handler-common",zone="dev.aws-us-east-1c",applicationId="mytenant.myapp.default",serviceId="logserver-container",clusterId="admin/logserver",hostname="h97490a.dev.us-east-1c.aws.vespa-cloud.net",vespa_service="vespa_logserver_container",} 0.0 1733139324000
# HELP content_proton_documentdb_matching_rank_profile_rerank_time_average
# TYPE content_proton_documentdb_matching_rank_profile_rerank_time_average untyped
content_proton_documentdb_matching_rank_profile_rerank_time_average{rankProfile="rank_albums",documenttype="music",zone="dev.aws-us-east-1c",applicationId="mytenant.myapp.default",serviceId="searchnode",clusterId="content/music",hostname="h104562a.dev.us-east-1c.aws.vespa-cloud.net",vespa_service="vespa_searchnode",} 0.0 1733139324000
content_proton_documentdb_matching_rank_profile_rerank_time_average{rankProfile="unranked",documenttype="music",zone="dev.aws-us-east-1c",applicationId="mytenant.myapp.default",serviceId="searchnode",clusterId="content/music",hostname="h104562a.dev.us-east-1c.aws.vespa-cloud.net",vespa_service="vespa_searchnode",} 0.0 1733139324000
content_proton_documentdb_matching_rank_profile_rerank_time_average{rankProfile="default",documenttype="music",zone="dev.aws-us-east-1c",applicationId="mytenant.myapp.default",serviceId="searchnode",clusterId="content/music",hostname="h104562a.dev.us-east-1c.aws.vespa-cloud.net",vespa_service="vespa_searchnode",} 0.0 1733139324000
...
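Lines like the ones in the response above can be split into name, labels, value and timestamp with a short parser. This is a sketch for ad hoc inspection, not a full parser for the Prometheus text format: it tolerates the trailing comma inside the braces, skips comment lines, and does not unescape label values.

```python
import re

# name{labels} value [timestamp]; the label block is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (name, labels, value, timestamp_ms) or None for non-sample lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return match.group("name"), labels, float(match.group("value")), int(ts) if ts else None


line = ('content_proton_documentdb_matching_rank_profile_rerank_time_average'
        '{rankProfile="default",documenttype="music",clusterId="content/music",} 0.0 1733139324000')
name, labels, value, ts = parse_sample(line)
print(name, labels["rankProfile"], value, ts)
```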
Relevant labels include:
chain This is the name of the search chain in the container that is used for a set of query requests.
This is typically used to get separate metrics, such as latency and the number of queries for each chain over time.
documenttype This is the name of the document type for which a set of queries is run in the content clusters.
This is typically used to get separate content layer metrics, such as latency and the number of queries for each document type over time.
groupId This is the id of the cluster group for which the metric measurement is done.
This is typically used to get separate metrics aggregates per group in a content cluster.
The label is most relevant for metrics from the content clusters running multiple content groups,
see Content Cluster Elasticity.
The value is in the format group 0, group 1, group 2, etc.
rankProfile This label is present for a subset of metrics from the content clusters,
with names starting with content_proton_documentdb_matching_rank_profile_.
The label is typically used in cases where you use multiple rank profiles
and want to analyse performance differences between the different rank profiles,
or to better understand certain types of performance issues and need to narrow down the candidate set.
source This is a label applied on container metrics for classifying query failures by the content cluster
where the failure was triggered.
How you will use labels to separate different kinds of queries depends on the observability backend you use, but you will typically compute weighted averages for query latency and query volume, and split graphs by the relevant labels to better understand system performance and bottlenecks.
For the container-level metrics, you use the chain label to differentiate between different query streams,
while you use the rankProfile label to do the same at the content level.
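As an illustration of the weighted average mentioned above, the sketch below combines per-node average latencies into one figure per label value, weighting each node by its query count. The input shape and the numbers are assumptions made for the example; in practice the observability backend usually performs this computation.

```python
from collections import defaultdict


def weighted_latency_by_label(samples, label: str) -> dict:
    """samples: iterable of (labels_dict, average_latency_ms, query_count).

    Returns the query-count-weighted average latency per value of `label`.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # label value -> [latency * count, count]
    for labels, average_ms, count in samples:
        key = labels.get(label, "unknown")
        totals[key][0] += average_ms * count
        totals[key][1] += count
    return {key: weighted / count for key, (weighted, count) in totals.items() if count > 0}


# Invented numbers: two content nodes reporting for the same rank profile.
samples = [
    ({"rankProfile": "default", "hostname": "node-1"}, 10.0, 100),
    ({"rankProfile": "default", "hostname": "node-2"}, 20.0, 300),
    ({"rankProfile": "unranked", "hostname": "node-1"}, 5.0, 50),
]
print(weighted_latency_by_label(samples, "rankProfile"))
# {'default': 17.5, 'unranked': 5.0}
```

A plain mean of the two node averages would give 15.0 for the default profile; weighting by query count reflects that the slower node served three times as many queries.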