Enterprise

Lifecycle Operations for Vespa on Kubernetes

The ConfigServer and Vespa Application Pods have built-in resilience and recovery capabilities; they are automatically recovered during failures and gracefully shut down during maintenance or scaling operations to preserve data integrity.

Automatic Recovery

Vespa relies on standard Kubernetes controllers to detect and restart crashed Pods. If a container exits unexpectedly (e.g., OOMKilled or application crash), the kubelet will automatically restart it.

However, the ConfigServers track the health history of every Pod. To prevent a "crash loop" from causing cascading failures or constantly churning resources, the system implements a strict throttling mechanism. The ConfigServers allow a maximum of 2 involuntary Pod disruptions per 24-hour period for a given Vespa Application. If this limit is exceeded, the ConfigServers stop automatically failing Pods for that application, and an operator must investigate the root cause.
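
The limit described above can be pictured as a small rate limiter. The following is a minimal sketch, not the ConfigServer implementation: the class name `DisruptionThrottle` and its method are hypothetical, and it assumes a sliding 24-hour window, which the text does not specify.

```python
from collections import deque
from datetime import datetime, timedelta


class DisruptionThrottle:
    """Hypothetical limiter: at most `limit` involuntary disruptions per `window`."""

    def __init__(self, limit=2, window=timedelta(hours=24)):
        self.limit = limit
        self.window = window
        self._events = deque()  # timestamps of allowed disruptions, oldest first

    def try_record(self, now):
        """Return True and record the disruption if the budget allows it."""
        # Drop disruptions that have aged out of the window (assumed sliding).
        while self._events and now - self._events[0] >= self.window:
            self._events.popleft()
        if len(self._events) >= self.limit:
            return False  # budget exhausted: leave the Pod for an operator
        self._events.append(now)
        return True


start = datetime(2026, 1, 29, 12, 0)
throttle = DisruptionThrottle()
first = throttle.try_record(start)
second = throttle.try_record(start + timedelta(hours=1))
third = throttle.try_record(start + timedelta(hours=2))
print(first, second, third)
```

In this sketch the third disruption inside the same 24 hours is refused, which corresponds to the point where human intervention is required.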

Graceful Shutdown

To prevent query failures or data loss during termination, a PreStop Hook is placed on every ConfigServer and Vespa Application Pod. During a voluntary disruption, this hook ensures that existing traffic is drained and that data is flushed before killing the Pod.

Two types of disruptions exist in Kubernetes:

Type	Scenario	Behavior
Voluntary Disruption	Scaling down, rolling upgrades, or node maintenance.	The preStop hook detects a voluntary disruption, stops the Vespa Container cluster from accepting new traffic, flushes in-memory data to disk for Content clusters, and ensures a clean exit before the Pod is deleted.
Involuntary Disruption	Node hardware failure, kernel panic, or eviction.	Kubernetes initiates the termination. The preStop hook attempts to run to flush data and close connections. However, if the Pod is lost abruptly, the hook cannot run, and recovery relies on Vespa's data replication.

Pod Disruption Budget

Defining a PodDisruptionBudget (PDB) is not supported for Vespa on Kubernetes. The ConfigServers will override any PDB with their own orchestration policy.

Application Pod Resources

For Vespa Application Pods, the resources for each Pod, the number of Pods in a Vespa cluster, and the group configuration can be updated through the <services> element in the application package. Refer to the specification for more details.

ConfigServer Pod Resources

ConfigServer Pod resources can be configured by overriding the vespa container's resource specification via the PodTemplate in the VespaSet. The ConfigServer deduces its heap size from the Pod cgroup limits, which are derived from the requests and limits set on the Pod. Setting requests and limits to the same value is recommended to ensure the heap size is deduced correctly.

Horizontally scaling the replica count for ConfigServer Pods is not supported.

apiVersion: k8s.ai.vespa/v1
kind: VespaSet
metadata:
  name: sample-vespaset
spec:
  configServer:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"
    podTemplate:
      spec:
        containers:
          - name: vespa
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
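
The recommendation to keep requests and limits equal can be checked before a VespaSet is applied. The sketch below is a hypothetical helper, not part of Vespa: it assumes the manifest has already been parsed into a Python dict with the same layout as the YAML above, and it compares the quantity strings literally.

```python
def mismatched_resources(vespaset, container_name="vespa"):
    """Return resource keys whose request and limit differ for the named container."""
    containers = (
        vespaset.get("spec", {})
        .get("configServer", {})
        .get("podTemplate", {})
        .get("spec", {})
        .get("containers", [])
    )
    problems = []
    for container in containers:
        if container.get("name") != container_name:
            continue
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        # A key present on only one side also counts as a mismatch.
        for key in sorted(set(requests) | set(limits)):
            if requests.get(key) != limits.get(key):
                problems.append(key)
    return problems


manifest = {
    "spec": {"configServer": {"podTemplate": {"spec": {"containers": [
        {"name": "vespa", "resources": {
            "requests": {"cpu": "4", "memory": "8Gi"},
            "limits": {"cpu": "4", "memory": "8Gi"},
        }},
    ]}}}}
}
print(mismatched_resources(manifest))
```

Because the comparison is textual, equivalent quantities written differently (for example "8Gi" and "8192Mi") would be reported as a mismatch; writing both values identically avoids that.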

Autoscaling

Vespa on Kubernetes provides autoscaling through ranges specified in the resource elements in the application package. Refer to the Autoscaling guide for more details.

Upgrades

Vespa on Kubernetes supports rolling upgrades. To perform an upgrade, edit the VespaSet version spec to a new Vespa Version.

For example, to upgrade to Vespa Version 8.577, edit the VespaSet like the snippet below. Ensure that the image for 8.577 is already cached in a known location and that $OCI_IMAGE_REFERENCE:8.577 is a valid image reference.

# VespaSet configuration for AWS EKS
apiVersion: k8s.ai.vespa/v1
kind: VespaSet
metadata:
  name: vespaset-sample
  namespace: $NAMESPACE
spec:
  version: "8.577"

  configServer:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"
    generateRbac: false

  application:
    image: "$OCI_IMAGE_REFERENCE"
    storageClass: "gp3"

  ingress:
    endpointType: "LOAD_BALANCER"

Vespa then begins a rolling upgrade that proceeds sequentially: the ConfigServer Pods are upgraded first, followed by the Application Pods.

Each Pod follows the same upgrade sequence: it is drained of traffic (or its data is flushed if it is a Content Pod), deleted, recreated with the new image, and verified healthy before the upgrade continues to the next Pod.
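
The ordering and the per-Pod health gate described above can be sketched as follows. This only illustrates the control flow, not the operator's code: the function names are hypothetical, `upgrade_pod` and `is_healthy` stand in for the drain/delete/recreate and verification steps, and ConfigServer Pods are assumed to be identifiable by a `cfg-` name prefix.

```python
def plan_upgrade_order(pod_names):
    """ConfigServer Pods first, then Application Pods, keeping input order within each group."""
    config_servers = [p for p in pod_names if p.startswith("cfg-")]
    application = [p for p in pod_names if not p.startswith("cfg-")]
    return config_servers + application


def rolling_upgrade(pod_names, upgrade_pod, is_healthy):
    """Upgrade one Pod at a time and stop at the first Pod that does not become healthy."""
    upgraded = []
    for pod in plan_upgrade_order(pod_names):
        upgrade_pod(pod)  # placeholder for: drain or flush, delete, recreate
        if not is_healthy(pod):
            break  # do not continue to the next Pod
        upgraded.append(pod)
    return upgraded


pods = ["default-100", "cfg-0", "documentation-102", "cfg-1"]
print(plan_upgrade_order(pods))
```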

Throughout the upgrade, the VespaSet status is updated as each Pod progresses. To confirm that the upgrade has completed, check each Pod’s Converged Version in the VespaSet status.

While a Pod is being upgraded, its phase is reported as UPGRADING. In the example below, the Container Pod default-100 is currently upgrading.

$ kubectl describe vespaset vespaset-sample -n $NAMESPACE
Name:         vespaset-sample
Namespace:    $NAMESPACE
Labels:       <none>
Annotations:  <none>
API Version:  k8s.ai.vespa/v1
Kind:         VespaSet
Metadata:
  Creation Timestamp:  2026-01-29T21:32:27Z
  Finalizers:
    vespasets.k8s.ai.vespa/finalizer
  Generation:        1
  Resource Version:  121822902
  UID:               a70f56e9-6625-4011-acd7-9f7cad29dbc2
Spec:
  Application:
    Image:          $OCI_IMAGE_REFERENCE
    Storage Class:  gp3
  Config Server:
    Generate Rbac:    false
    Image:            $OCI_IMAGE_REFERENCE
    Storage Class:    gp3
  Ingress:
    Endpoint Type:  LOAD_BALANCER
  Version:          8.577
Status:
  Bootstrap Status:
    Pods:
      cfg-0:
        Last Updated:  2026-01-29T21:38:45Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      cfg-1:
        Last Updated:  2026-01-29T21:38:09Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      cfg-2:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      default-100:
        Last Updated:  2026-01-29T21:38:45Z
        Message:       Pod is running
        Phase:         UPGRADING
        Converged Version: 8.576
      default-101:
        Last Updated:  2026-01-29T21:38:09Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      documentation-102:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      documentation-103:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-104:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-105:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-106:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
  Last Transition Time:  2026-01-29T21:33:55Z
  Message:               All configservers running
  Phase:                 RUNNING
Events:                  <none>
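
Checking every Pod's Converged Version by eye becomes tedious for larger applications. The sketch below shows one way to automate it, assuming the VespaSet has been fetched as JSON (for example with `kubectl get vespaset vespaset-sample -n $NAMESPACE -o json`) and parsed into a dict. The field names `bootstrapStatus`, `pods`, `phase`, and `convergedVersion` are assumptions inferred from the describe output above and should be verified against the actual resource; the function name is hypothetical.

```python
def unconverged_pods(vespaset, target_version=None):
    """Return the sorted names of Pods that are not yet RUNNING on the target version."""
    target = target_version or vespaset["spec"]["version"]
    pods = vespaset.get("status", {}).get("bootstrapStatus", {}).get("pods", {})
    return sorted(
        name
        for name, pod in pods.items()
        if pod.get("convergedVersion") != target or pod.get("phase") != "RUNNING"
    )


sample = {
    "spec": {"version": "8.577"},
    "status": {"bootstrapStatus": {"pods": {
        "cfg-0": {"phase": "RUNNING", "convergedVersion": "8.577"},
        "default-100": {"phase": "UPGRADING", "convergedVersion": "8.576"},
        "default-101": {"phase": "RUNNING", "convergedVersion": "8.576"},
    }}},
}
print(unconverged_pods(sample))
```

An empty result would indicate that every listed Pod reports the target version and is RUNNING.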