Vespa Cloud

Data management and backup

This guide covers data management operations for Vespa Cloud applications, including automated backups, document export, feeding, and bulk updates and removals.

Automated Backups

Depending on plan, content clusters are automatically backed up when a <backup> element is specified in deployment.xml. Vespa Cloud manages the backup schedule, storage, and lifecycle with no external tooling required. Backups will run at the configured frequency while also respecting any block windows defined for the instance.

<instance id="default">
    <backup frequency="7d" />
    <prod>
        <region>aws-us-east-1c</region>
    </prod>
</instance>

Backups are retained for three backup intervals (e.g. 21 days for a 7-day frequency). The most recent fully completed backup is always retained regardless of age. See Restore from Backup for how to request a restore.

If you prefer to manage backups yourself, documents can be exported manually using vespa visit as shown in the Google Cloud Function example.

Restore from Backup

Restoring from a backup is handled by Vespa Cloud. To initiate a restore, contact Vespa Support. Response time and priority handling are governed by your support plan.

Restore requires a deployed target cluster with:

The same number of content nodes as the backup.
At least as much disk capacity per node as at the time of the backup.

Note that content redistribution is usually required after restoration. See backup reference for details.

Export documents

Note: The examples below use the Vespa CLI. Ensure you have the latest version installed.

To export documents, configure the application to export from, then select the zone, container cluster and schema. For example:

$ vespa config set application vespa-team.vespacloud-docsearch.default

$ vespa visit --zone prod.aws-us-east-1c --cluster default --selection doc | head

Some of the parameters above can be omitted when they are unambiguous. Here, the application is set up using a template found in multinode-HA with multiple container clusters. This example visits documents from the doc schema.

Use a fieldset to export document IDs only:

$ vespa visit --zone prod.aws-us-east-1c --cluster default --selection doc --field-set '[id]' | head

Note: Configuring the documentid field as an attribute in the schema lets this export run without disk access, which speeds it up.

As the name implies, fieldsets select a subset of fields to export. Note that if fetching any field in the fieldset requires disk access, selecting fewer fields does not speed up the export, because the same amount of data is read from the index. It does, however, reduce the amount of data transferred out of the Vespa application.
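As a sketch of selecting a subset of fields, the snippet below builds a fieldset in the doctype:field,field form and prints the resulting vespa visit command instead of running it. The field names path and inlinks are borrowed from the selection examples later in this guide; that these are the fields you want to export is an assumption, so substitute your own.

```shell
# Assumption: "path" and "inlinks" are fields of the "doc" document type.
DOCTYPE=doc
FIELDS="path,inlinks"

# A fieldset naming explicit fields has the form <doctype>:<field>,<field>,...
FIELDSET="${DOCTYPE}:${FIELDS}"

# Print the command rather than run it, so the sketch works without a deployment.
echo "vespa visit --zone prod.aws-us-east-1c --cluster default --selection doc --field-set '${FIELDSET}'"
```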

For copying documents between applications, see cloning applications and data.

Feed

If a document feed is generated with vespa visit (above), it is already in JSON Lines feed-ready format by default:

$ vespa visit | vespa feed - -t $ENDPOINT

Find more examples in cloning applications and data.

A document export generated using /document/v1 is slightly different from the .jsonl output from vespa visit (e.g., fields like a continuation token are added). Extract the document objects before feeding:

$ gunzip -c docs.gz | jq '.documents[]' | \
  vespa feed - -t $ENDPOINT

Delete

To remove all documents in a Vespa deployment, or a selection of them, run a deletion visit. Use the DELETE HTTP method, and fetch only the continuation token from the response:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

# doc.path =~ "^/old/" -- all documents under the /old/ directory:
SELECTION='doc.path%3D~%22%5E%2Fold%2F%22'

continuation=""

while
  token=$( curl -X DELETE -s \
           --cert data-plane-public-cert.pem \
           --key data-plane-private-key.pem \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
           | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done

Each request returns a response after roughly one minute. Change this by specifying the timeChunk parameter (default 60 seconds).
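The SELECTION value in the script above is the percent-encoded form of the expression shown in the comment. A minimal sketch of producing it in plain shell follows; urlencode is a hypothetical helper, not part of the Vespa CLI, and it assumes ASCII input.

```shell
# Percent-encode everything except the URL-unreserved characters A-Z a-z 0-9 . _ ~ -
# Runs in a subshell, so its variables do not leak into the caller.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    s=$rest
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;   # "'c" makes printf emit the character code
    esac
  done
  printf '%s\n' "$out"
)

SELECTION=$(urlencode 'doc.path=~"^/old/"')
echo "$SELECTION"   # prints doc.path%3D~%22%5E%2Fold%2F%22
```

Writing the expression without spaces, as here, reproduces the value used in the script. Spaces would be encoded as %20.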

To remove every document listed in a document export (above), generate a feed with one remove entry per document ID:

$ gunzip -c docs.gz | jq '[ .documents[] | {remove: .id} ]' | head

[
  {
    "remove": "id:open:doc::open/documentation/schemas.html"
  },
  {
    "remove": "id:open:doc::open/documentation/securing-your-vespa-installation.html"
  },

Complete example for a single chunk:

$ gunzip -c docs.gz | jq '[ .documents[] | {remove: .id} ]' | \
  vespa feed - -t $ENDPOINT
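An export usually spans several chunk files. The sketch below emits one remove operation per line (JSON Lines) for every document in every chunk. It assumes the chunks are gzip-compressed /document/v1 responses, such as those written by the export script later in this guide; remove_ops is a hypothetical helper name.

```shell
# Print one {"remove": "<document id>"} operation per line for each chunk file given.
remove_ops() {
  for file in "$@"; do
    # .documents[]? tolerates a chunk that has no "documents" array.
    gunzip -c "$file" | jq -c '.documents[]? | {remove: .id}'
  done
}

# Usage (assumes chunk files named as in the export script):
#   remove_ops open-doc-*.data.gz | vespa feed - -t $ENDPOINT
```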

Update

To update all documents in a Vespa deployment, or a selection of them, run an update visit. Use the PUT HTTP method, and specify a partial update in the request body:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

# doc.inlinks == "some-url" -- the weightedset<string> inlinks has the key "some-url"
SELECTION='doc.inlinks%3D%3D%22some-url%22'

continuation=""

while
  token=$( curl -X PUT -s \
           --cert data-plane-public-cert.pem \
           --key data-plane-private-key.pem \
           --data '{ "fields": { "inlinks": { "remove": { "some-url": 0 } } } }' \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
           | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done

Each request returns a response after roughly one minute. Change this by specifying the timeChunk parameter (default 60 seconds).

Using the /document/v1/ API

To get started with a document export, find the namespace and document type by listing a few IDs. Query /document/v1/ on ENDPOINT and restrict the request to one CLUSTER (see content clusters):

$ curl \
  --cert data-plane-public-cert.pem \
  --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER"

For ID export only, use a fieldset:

$ curl \
  --cert data-plane-public-cert.pem \
  --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER&fieldSet=%5Bid%5D"

From an ID, like id:open:doc::open/documentation/schemas.html, extract:

NAMESPACE: open
DOCTYPE: doc
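The same split can be sketched with shell parameter expansion, relying on the id:<namespace>:<document type>:... layout of the ID shown above:

```shell
id='id:open:doc::open/documentation/schemas.html'

rest=${id#id:}          # drop the leading "id:" scheme
NAMESPACE=${rest%%:*}   # text up to the first colon
rest=${rest#*:}         # drop the namespace and its colon
DOCTYPE=${rest%%:*}     # text up to the next colon

echo "NAMESPACE=${NAMESPACE} DOCTYPE=${DOCTYPE}"   # prints NAMESPACE=open DOCTYPE=doc
```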

Example export script:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

continuation=""
idx=0

while
  ((idx+=1))
  echo "$continuation"
  printf -v out "%05d" "$idx"
  filename=${NAMESPACE}-${DOCTYPE}-${out}.data.gz
  echo "Fetching data..."
  token=$( curl -s \
           --cert data-plane-public-cert.pem \
           --key data-plane-private-key.pem \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=1000&concurrency=4&cluster=${CLUSTER}&${continuation}" \
           | tee >( gzip > "${filename}" ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done

If responses contain only a few documents each, specify wantedDocumentCount (default 1, max 1024) to set a lower bound on the number of documents per response, provided that many documents remain.

Specifying concurrency (default 1, max 100) increases throughput, at the cost of resource usage. This also increases the number of documents per response, and could lead to excessive memory usage in the HTTP container when many large documents are buffered to be returned in the same response.
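The scripts in this guide all stop the same way: jq -e exits with a non-zero status when .continuation is null, which ends the while loop once a response carries no continuation token. The sketch below shows that mechanism in isolation. fetch_page is a hypothetical stand-in for the curl call, and the canned responses are illustrative, not real output.

```shell
# Stand-in for the curl request: the first two pages carry a continuation token,
# the last one does not.
fetch_page() {
  case $1 in
    1) echo '{"documents":[{"id":"id:open:doc::a"}],"continuation":"AAAA"}' ;;
    2) echo '{"documents":[{"id":"id:open:doc::b"}],"continuation":"BBBB"}' ;;
    *) echo '{"documents":[{"id":"id:open:doc::c"}]}' ;;
  esac
}

requests=0
continuation=""

while
  requests=$((requests + 1))
  # jq -e exits non-zero when .continuation is null, so the loop stops after the last page.
  token=$( fetch_page "$requests" | jq -re .continuation )
do
  continuation="continuation=${token}"
done

echo "requests=${requests} last=${continuation}"   # prints requests=3 last=continuation=BBBB
```

Note that the final response is still fetched and processed before the loop exits; only the loop body is skipped.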