Enterprise: this functionality is not open source and is only available commercially.

Upgrades

Vespa on Kubernetes supports zero-downtime rolling upgrades. An upgrade involves upgrading the vespa-operator via the Helm chart and the ConfigServer and Application (Container and Content) Pods through the VespaSet resource.

We do not support version drift between the vespa-operator and the VespaSet. Accordingly, plan upgrades so that all components are updated together. To ensure availability, perform the steps in the order shown in this guide.

Update the CRD

Some upgrades introduce changes to the VespaSet CRD definition. These changes must be applied to the cluster before the rest of the upgrade is performed. As a rule of thumb, we recommend running this step before every upgrade.

Helm does not manage the lifecycle of a CRD after it is installed (see the official Helm documentation). As a result, CRD updates must be applied manually. With the official Helm chart for Vespa on Kubernetes, this is done by extracting the CRD definition from the OCI package and applying it directly using kubectl.

$ helm show crds $HELM_CHART_REF --version $OPERATOR_VERSION > vespaset-crd.yaml
$ kubectl apply -f vespaset-crd.yaml
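To confirm the updated definition is in place before continuing, the CRD can be inspected after applying it. The CRD name below is inferred from the finalizer shown later in this guide (vespasets.k8s.ai.vespa); verify it against your installation:

```shell
# Check that the VespaSet CRD exists and see when it was created.
$ kubectl get crd vespasets.k8s.ai.vespa
```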

Upgrade the Vespa Operator

The operator is upgraded through Helm by running helm upgrade with the new OPERATOR_VERSION. Replace $NAMESPACE with the namespace where Vespa is installed. Refer to Factory for the latest version. Note that upgrading the operator does not affect the ConfigServer and Application Pods; their upgrade is performed in a subsequent step.

$ helm upgrade vespa-operator vespa/vespa-operator \
  --version $OPERATOR_VERSION \
  --namespace $NAMESPACE \
  --reuse-values

Wait for the operator to finish rolling out before proceeding.

$ kubectl rollout status deployment/vespa-operator -n $NAMESPACE
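Optionally, confirm which operator image is now running after the rollout completes. This is a sketch using standard kubectl; the jsonpath assumes a single-container Deployment named vespa-operator:

```shell
# Print the image of the operator's first (and only assumed) container.
$ kubectl get deployment vespa-operator -n $NAMESPACE \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
```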

Upgrade the VespaSet

To upgrade the ConfigServer and Application Pods, update the spec.version field in the VespaSet resource and apply it. Before proceeding, ensure that the target images are available and pullable on the Kubernetes Nodes at VESPA_OPERATOR_IMAGE:VESPA_VERSION and VESPA_IMAGE:VESPA_VERSION. For example:

$ cat > vespaset.yaml <<EOF
apiVersion: k8s.ai.vespa/v1
kind: VespaSet
metadata:
  name: vespaset-sample
  namespace: ${NAMESPACE}
spec:
  version: 8.566.7 # Specify the version to upgrade to.

  configServer:
    image: "${VESPA_OPERATOR_IMAGE}"
    storageClass: "gp3"
    generateRbac: false

  application:
    image: "${VESPA_IMAGE}"
    storageClass: "gp3"

  ingress:
    endpointType: "NONE"
EOF

$ kubectl apply -f vespaset.yaml
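If nothing else in the spec changes, the version bump can also be applied as a merge patch instead of re-applying the full manifest. This is a sketch using standard kubectl; the resource name vespaset-sample matches the example above:

```shell
# Patch only spec.version; all other fields are left untouched.
$ kubectl patch vespaset vespaset-sample -n $NAMESPACE \
    --type merge -p '{"spec":{"version":"8.566.7"}}'
```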

The ConfigServer Pods detect the change to the VespaSet resource and orchestrate the upgrade of themselves and the Application Pods.

Upgrade Sequence

The upgrade always proceeds in two phases: ConfigServer Pods are upgraded first, followed by Application Pods. This ordering is required because the ConfigServers must be running the new version before they can safely orchestrate Application Pods onto it.

Additionally, the base template used to create a Vespa Pod, whether a ConfigServer or an Application Pod, may have changed as part of the upgrade. The ConfigServer Pods therefore base each recreated Pod on the latest template rather than a stale one, which avoids having to recreate the Pods again with the new template after the upgrade.

During the upgrade procedure, Pods are upgraded sequentially, one at a time. For each Pod, the operator:

  1. Drains the Pod of traffic and flushes any in-memory state to disk.
  2. Deletes the Pod and recreates it with the new image.
  3. Waits for the Pod to become healthy and report its Converged Version in the VespaSet status.
  4. Proceeds to the next Pod.
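The sequential recreation of Pods can be observed live with a standard Pod watch while the upgrade runs:

```shell
# Watch Pods being deleted and recreated one at a time.
$ kubectl get pods -n $NAMESPACE -w
```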

The cluster remains operational throughout this procedure. The remaining ConfigServer Pods continue serving configuration to Application Pods while each Pod is upgraded in turn, and the Dataplane layer continues to serve traffic as normal. For Content Pods, the operator waits for data redistribution to complete before moving to the next Pod, ensuring no data loss during the rollout.

To ensure zero downtime for any applications, ingress should be properly configured so that traffic is correctly load balanced across the Dataplane layer, allowing requests to be seamlessly routed away from Pods undergoing upgrades. Refer to the Ingress page for more details.

Monitoring the Upgrade

Throughout the upgrade, each Pod's status is reflected in the VespaSet status. A Pod that is actively being upgraded reports its phase as UPGRADING. A Pod that has successfully completed the upgrade reports its Converged Version as the new version.

In the example below, the Config Server Pods have all converged to 8.577, while the Application Pod default-100 is currently upgrading and has not yet converged from 8.576.

$ kubectl describe vespaset vespaset-sample -n $NAMESPACE
Name:         vespaset-sample
Namespace:    $NAMESPACE
Labels:       <none>
Annotations:  <none>
API Version:  k8s.ai.vespa/v1
Kind:         VespaSet
Metadata:
  Creation Timestamp:  2026-01-29T21:32:27Z
  Finalizers:
    vespasets.k8s.ai.vespa/finalizer
  Generation:        1
  Resource Version:  121822902
  UID:               a70f56e9-6625-4011-acd7-9f7cad29dbc2
Spec:
  Application:
    Image:          $VESPA_IMAGE
    Storage Class:  gp3
  Config Server:
    Generate Rbac:    false
    Image:            $VESPA_IMAGE
    Storage Class:    gp3
  Ingress:
    Endpoint Type:  LOAD_BALANCER
  Version:          8.577
Status:
  Bootstrap Status:
    Pods:
      cfg-1:
        Last Updated:  2026-01-29T21:38:45Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      cfg-2:
        Last Updated:  2026-01-29T21:38:09Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      cfg-3:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.577
      default-100:
        Last Updated:  2026-01-29T21:38:45Z
        Message:       Pod is upgrading
        Phase:         UPGRADING
        Converged Version: 8.576
      default-101:
        Last Updated:  2026-01-29T21:38:09Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      documentation-102:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      documentation-103:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-104:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-105:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
      cluster-controller-106:
        Last Updated:  2026-01-29T21:36:32Z
        Message:       Pod is running
        Phase:         RUNNING
        Converged Version: 8.576
  Last Transition Time:  2026-01-29T21:33:55Z
  Message:               All configservers running
  Phase:                 RUNNING
Events:                  <none>

The upgrade is complete when every Pod's Converged Version matches the new version and all phases report RUNNING.
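Rather than re-running kubectl describe, convergence can be polled with a jsonpath query. The field paths below are an assumption: they take the status keys to serialize as status.bootstrapStatus.pods with per-Pod phase and convergedVersion fields, mirroring the describe output above. Verify the exact paths with kubectl get vespaset vespaset-sample -o yaml first:

```shell
# Print "<phase> <convergedVersion>" for every Pod in the status map.
$ kubectl get vespaset vespaset-sample -n $NAMESPACE \
    -o jsonpath='{range .status.bootstrapStatus.pods.*}{.phase}{" "}{.convergedVersion}{"\n"}{end}'
```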

Debugging Upgrade Failures

If a Pod fails to converge to the target version (for example, due to an image pull failure, a crash loop, or a failed health check), the ConfigServer continuously retries the upgrade for that Pod until it either succeeds or an administrator intervenes.

In this scenario, the administrator can diagnose the issue by inspecting the ConfigServer logs or the events of the failing Pod in the current upgrade phase. Once the issue is resolved, the ConfigServer automatically retries the upgrade for that Pod and proceeds with the remaining Pods.

For example, suppose the Pod search-106 is failing to upgrade.

$ kubectl logs cfg-1 -n $NAMESPACE
$ kubectl logs cfg-2 -n $NAMESPACE
$ kubectl logs cfg-3 -n $NAMESPACE
$ kubectl describe pod search-106 -n $NAMESPACE

This design prevents a bad upgrade from cascading to the rest of the Pods. Since the ConfigServer refuses to advance past a Pod that has not converged, the remaining Pods stay on the previous known-good version while the administrator investigates.
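Events for the failing Pod can also be filtered directly, which is often quicker than scanning the full describe output. This uses standard kubectl; search-106 is the hypothetical Pod from the example above:

```shell
# Show only events attached to search-106, oldest first.
$ kubectl get events -n $NAMESPACE \
    --field-selector involvedObject.name=search-106 \
    --sort-by=.lastTimestamp
```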