The ConfigServer and Vespa Application Pods have built-in resilience and recovery capabilities; they are automatically recovered during failures and gracefully shut down during maintenance or scaling operations to preserve data integrity.
Vespa relies on standard Kubernetes mechanisms to detect and restart crashed Pods. If a container exits unexpectedly (e.g., it is OOMKilled or the application crashes), the kubelet automatically restarts it.
However, the ConfigServers track the health history of every Pod. To prevent a crash loop from causing cascading failures or constantly churning resources, the system implements a strict throttling mechanism: the ConfigServers allow a maximum of 2 involuntary Pod disruptions per 24-hour period for a given Vespa Application. If this limit is exceeded, the ConfigServers stop automatically recovering these Pods, and human intervention is required to investigate the root cause.
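The throttling behavior can be modeled as a sliding-window limiter. The sketch below is illustrative only, not the ConfigServer's actual implementation; the class name and the `allow_recovery` method are invented for this example, while the limit of 2 disruptions per 24 hours comes from the text above.

```python
import time
from collections import deque

class DisruptionThrottle:
    """Illustrative sliding-window limiter: at most `limit` involuntary
    disruptions per `window` seconds before manual intervention is required."""

    def __init__(self, limit=2, window=24 * 3600):
        self.limit = limit
        self.window = window
        self.events = deque()  # timestamps of recent disruptions

    def allow_recovery(self, now=None):
        now = time.time() if now is None else now
        # Discard disruptions that fell out of the 24-hour window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False  # throttled: a human must investigate
        self.events.append(now)
        return True
```

With the default settings, a third disruption inside the same 24-hour window is rejected, while one arriving after the window has slid past is allowed again.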
To prevent query failures or data loss during termination, a preStop hook is placed on every ConfigServer and Vespa Application Pod. During a voluntary disruption, this hook ensures that existing traffic is drained and that data is flushed before the Pod is terminated.
Two types of disruptions exist in Kubernetes:
| Type | Scenario | Behavior |
|---|---|---|
| Voluntary Disruption | Scaling down, rolling upgrades, or node maintenance. | The preStop hook detects a voluntary disruption, stops the Vespa Container cluster from accepting new traffic, flushes in-memory data to disk for Content clusters, and ensures a clean exit before the Pod is deleted. |
| Involuntary Disruption | Node hardware failure, kernel panic, or eviction. | Kubernetes initiates the termination. The preStop hook attempts to run to flush data and close connections. However, if the Pod is lost abruptly, the hook cannot run, and recovery relies on Vespa's data replication. |
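The ordering that the preStop hook enforces during a voluntary disruption can be sketched as follows. The `FakePod` class and all method names here are hypothetical stand-ins used only to illustrate the sequence from the table above; the real hook runs inside the container.

```python
class FakePod:
    """Minimal stand-in that records the order of shutdown steps."""
    def __init__(self, is_content_node):
        self.is_content_node = is_content_node
        self.steps = []
    def stop_accepting_traffic(self): self.steps.append("drain")
    def wait_for_inflight(self): self.steps.append("wait")
    def flush_to_disk(self): self.steps.append("flush")
    def shutdown(self): self.steps.append("exit")

def pre_stop(pod):
    # Voluntary-disruption sequence: drain first, flush only for
    # Content nodes, then exit cleanly before the Pod is deleted.
    pod.stop_accepting_traffic()
    pod.wait_for_inflight()
    if pod.is_content_node:
        pod.flush_to_disk()
    pod.shutdown()
```

The key design point is that draining always precedes flushing, so no new writes arrive after the flush begins.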
Defining a PodDisruptionBudget (PDB) is not supported for Vespa on Kubernetes. The ConfigServers will override any PDB with their own orchestration policy.
For Vespa Application Pods, the resources for each Pod, the number of Pods in a Vespa cluster, and the group configuration can be updated through the `<services>` element in the application package. Refer to the specification for more details.
ConfigServer Pod resources can be configured by overriding the vespa container's resource specification via the PodTemplate in the VespaSet. The Config Server deduces its heap size from the Pod cgroup limits, which are derived from the requests and limits set on the Pod.
Setting requests and limits to the same value is recommended to ensure the heap size is deduced correctly.
Horizontally scaling the replica count for ConfigServer Pods is not supported.
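The heap deduction described above can be sketched as a function of the cgroup memory limit. The 0.6 fraction below is an illustrative assumption, not Vespa's actual ratio; the point is that the heap scales with the Pod's memory limit, which is why mismatched requests and limits can lead to an unexpected heap size.

```python
def deduce_heap_bytes(cgroup_limit_bytes, fraction=0.6):
    """Derive a heap size from the Pod cgroup memory limit.

    `fraction` is an assumed illustrative ratio, not the value the
    Config Server actually uses.
    """
    return int(cgroup_limit_bytes * fraction)

# With requests == limits == 8Gi, the deduced heap is deterministic,
# because the cgroup limit is exactly the value set on the Pod.
heap = deduce_heap_bytes(8 * 1024**3)
```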
```yaml
apiVersion: k8s.ai.vespa/v1
kind: VespaSet
metadata:
  name: sample-vespaset
spec:
  configServer:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"
    podTemplate:
      spec:
        containers:
          - name: main
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
```
Vespa on Kubernetes provides autoscaling through ranges specified in the resource elements in the application package. Refer to the Autoscaling guide for more details.
Vespa on Kubernetes supports rolling upgrades. To perform an upgrade, update the version field in the VespaSet spec to the new Vespa version.
For example, to upgrade to Vespa Version 8.577, edit the VespaSet like the snippet below. Ensure that the image for 8.577 is already cached in a known location and that $OCI_IMAGE_REFERENCE:8.577 is a valid image reference.
```yaml
# VespaSet configuration for AWS EKS
apiVersion: k8s.ai.vespa/v1
kind: VespaSet
metadata:
  name: vespaset-sample
  namespace: $NAMESPACE
spec:
  version: "8.577"
  configServer:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"
    generateRbac: false
  application:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"
  ingress:
    endpointType: "LOAD_BALANCER"
```
Vespa will then begin a rolling upgrade, proceeding sequentially: first the Config Server Pods are upgraded, then the Application Pods.
Each Pod follows the same upgrade sequence: it is drained of traffic (or its data is flushed if it is a Content Pod), deleted, recreated with the new image, and verified healthy before the upgrade continues to the next Pod.
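The per-Pod sequence above can be sketched as a loop over both Pod groups in order. The Pod objects and method names are hypothetical; the sketch only captures the ordering guarantee that each Pod must be verified healthy before the next one is touched.

```python
def rolling_upgrade(config_server_pods, application_pods, new_version):
    """Illustrative upgrade loop: ConfigServers first, then Application
    Pods, strictly one Pod at a time. Pod objects are hypothetical."""
    for pod in list(config_server_pods) + list(application_pods):
        pod.drain_or_flush()                    # drain traffic; flush if Content
        pod.recreate(image_version=new_version)  # delete and recreate with new image
        if not pod.wait_until_healthy():
            # Halt so at most one Pod is ever unhealthy at once.
            raise RuntimeError(f"{pod.name} failed to converge; upgrade halted")
```

Processing one Pod at a time bounds the capacity loss during the upgrade to a single Pod per cluster.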
Throughout the upgrade, the VespaSet status is updated as each Pod progresses. To confirm that the upgrade has completed, check each Pod’s Converged Version in the VespaSet status.
While a Pod is being upgraded, its phase is reported as UPGRADING. In the example below, the Container Pod default-100 is currently upgrading.
```
$ kubectl describe vespaset vespaset-sample -n $NAMESPACE
Name:         vespaset-sample
Namespace:    $NAMESPACE
Labels:       <none>
Annotations:  <none>
API Version:  k8s.ai.vespa/v1
Kind:         VespaSet
Metadata:
  Creation Timestamp:  2026-01-29T21:32:27Z
  Finalizers:
    vespasets.k8s.ai.vespa/finalizer
  Generation:        1
  Resource Version:  121822902
  UID:               a70f56e9-6625-4011-acd7-9f7cad29dbc2
Spec:
  Application:
    Image:          $OCI_IMAGE_REFERENCE
    Storage Class:  gp3
  Config Server:
    Generate Rbac:  false
    Image:          $OCI_IMAGE_REFERENCE
    Storage Class:  gp3
  Ingress:
    Endpoint Type:  LOAD_BALANCER
  Version:          8.577
Status:
  Bootstrap Status:
    Pods:
      cfg-0:
        Last Updated:       2026-01-29T21:38:45Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.577
      cfg-1:
        Last Updated:       2026-01-29T21:38:09Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.577
      cfg-2:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.577
      default-100:
        Last Updated:       2026-01-29T21:38:45Z
        Message:            Pod is running
        Phase:              UPGRADING
        Converged Version:  8.576
      default-101:
        Last Updated:       2026-01-29T21:38:09Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
      documentation-102:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
      documentation-103:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
      cluster-controller-104:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
      cluster-controller-105:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
      cluster-controller-106:
        Last Updated:       2026-01-29T21:36:32Z
        Message:            Pod is running
        Phase:              RUNNING
        Converged Version:  8.576
  Last Transition Time:  2026-01-29T21:33:55Z
  Message:               All configservers running
  Phase:                 RUNNING
Events:  <none>
```
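Checking that the upgrade has completed amounts to confirming that every Pod in the status reports the target Converged Version and a RUNNING phase. The helper below is an illustrative sketch over a plain dict mirroring the status fields above; it is not part of any Vespa or Kubernetes API.

```python
def upgrade_converged(pod_statuses, target_version):
    """Return True when every Pod is RUNNING at the target version.

    `pod_statuses` maps Pod name -> a dict with the "Phase" and
    "Converged Version" fields shown in the VespaSet status above.
    """
    return all(
        s.get("Phase") == "RUNNING"
        and s.get("Converged Version") == target_version
        for s in pod_statuses.values()
    )
```

Applied to the example output above, this returns False while default-100 is still UPGRADING at 8.576, and True once every Pod reports 8.577.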