This content is applicable to Vespa Cloud deployments.
Data management and backup
This guide covers data management operations for Vespa Cloud applications,
including automated backups, document export, feed, and bulk updates and removals.
Automated Backups
Depending on plan, content clusters are automatically backed up when a
<backup> element is specified in deployment.xml.
Vespa Cloud manages the backup schedule, storage, and lifecycle with no external tooling required. Backups will run at the configured frequency
while also respecting any block windows defined for the instance.
Backups are retained for three backup intervals (e.g. 21 days for a 7-day frequency).
The most recent fully completed backup is always retained regardless of age.
See Restore from Backup for how to request a restore.
If you prefer to manage backups yourself, documents can be exported manually using
vespa visit as shown in the
Google Cloud Function example.
Restore from Backup
Restoring from a backup is handled by Vespa Cloud. To initiate a restore, contact
Vespa Support. Response time and priority handling are governed by your
support plan.
Restore requires a deployed target cluster with:
The same number of content nodes as the backup.
At least as much disk capacity per node as at the time of the backup.
Note that content redistribution is usually required after restoration.
See backup reference for details.
Export documents
Note:
The examples below use the Vespa CLI. Ensure you have the latest version installed.
To export documents, first configure which application to export from,
then select the zone, container cluster and schema. Example:
$ vespa config set application vespa-team.vespacloud-docsearch.default
$ vespa visit --zone prod.aws-us-east-1c --cluster default --selection doc | head
Some of the parameters above can be omitted when they are unambiguous.
Here, the application is set up using a template found in
multinode-HA
with multiple container clusters.
This example visits
documents from the doc schema.
A fieldset selects a subset of fields to export.
Note that this normally does not speed up the export, as the same amount of data is read from the index.
With fewer fields, however, less data is transferred out of the Vespa application.
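As a rough sketch, the visit from the example above could be limited to a single field with the CLI's --field-set option. The field name path is taken from the selection examples later in this guide; the function name export_paths and the output file name are hypothetical, and the application is assumed to be selected already with vespa config set application:

```shell
#!/bin/bash
# Hypothetical helper: export only the "path" field of each "doc" document.
# Assumes the application was already selected with "vespa config set application".
export_paths() {
  vespa visit \
    --zone prod.aws-us-east-1c \
    --cluster default \
    --selection doc \
    --field-set "doc:path"
}

# Usage (writes one JSON document per line to a file of your choice):
# export_paths > doc-paths.jsonl
```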
A document export generated using /document/v1
is slightly different from the .jsonl output from vespa visit
(e.g., fields like a continuation token are added).
Extract the document objects before feeding:
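One possible way to do this with jq, sketched under the assumption that each export file holds one gzipped /document/v1 response with the documents in a top-level documents array (as written by the export script in this guide). The function name extract_documents and the file names are hypothetical:

```shell
#!/bin/bash
# Assumption: each input file is one gzipped /document/v1 visit response,
# with the documents in a top-level "documents" array.
extract_documents() {
  # Emit one document object per line (JSONL);
  # responses without a "documents" array produce no output.
  gunzip -c "$@" | jq -c '.documents[]?'
}

# Hypothetical usage: collect all documents from the export files into one file
# extract_documents open-doc-*.data.gz > docs.jsonl
```

The continuation token and other response-level fields are dropped, leaving only the document objects.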
To remove all documents in a Vespa deployment—or a selection of them—run a deletion visit.
Use the DELETE HTTP method, and fetch only the continuation token from the response:
#!/bin/bash
set -x
# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation
# doc.path =~ "^/old/" -- all documents under the /old/ directory:
SELECTION='doc.path%3D~%22%5E%2Fold%2F%22'
continuation=""
while
  token=$( curl -X DELETE -s \
    --cert data-plane-public-cert.pem \
    --key data-plane-private-key.pem \
    "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
    | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done
Each request returns a response after roughly one minute. To change this, specify timeChunk (default 60).
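The SELECTION value in the script above is the URL-encoded form of the selection expression shown in its comment. As a sketch, such a value can be produced with jq's @uri filter; the helper name urlencode is hypothetical:

```shell
#!/bin/bash
# URL-encode a document selection expression using jq's @uri filter.
urlencode() {
  jq -rn --arg s "$1" '$s | @uri'
}

# Encode the selection from the deletion script's comment
SELECTION=$(urlencode 'doc.path=~"^/old/"')
echo "${SELECTION}"
```

Encoding the expression this way avoids hand-escaping characters such as quotes and slashes.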
To purge all documents in a document export (above),
generate a feed with remove-entries for each document ID, like:
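A minimal sketch of such a conversion with jq, assuming the export is JSONL with one document object per line and a top-level id field (as in the vespa visit output). The function name to_remove_feed and the file names are hypothetical:

```shell
#!/bin/bash
# Assumption: stdin is JSONL, one document object per line,
# each with a top-level "id" field.
to_remove_feed() {
  # Keep only the document ID and wrap it in a remove operation
  jq -c '{remove: .id}'
}

# Hypothetical usage:
# to_remove_feed < docs.jsonl > removes.jsonl
```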
To update all documents in a Vespa deployment—or a selection of them—run an update visit.
Use the PUT HTTP method, and specify a partial update in the request body:
#!/bin/bash
set -x
# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation
# doc.inlinks == "some-url" -- the weightedset<string> inlinks has the key "some-url"
SELECTION='doc.inlinks%3D%3D%22some-url%22'
continuation=""
while
  token=$( curl -X PUT -s \
    --cert data-plane-public-cert.pem \
    --key data-plane-private-key.pem \
    --data '{ "fields": { "inlinks": { "remove": { "some-url": 0 } } } }' \
    "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
    | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done
Each request returns a response after roughly one minute. To change this, specify timeChunk (default 60).
Using the /document/v1/ API
To get started with a document export, find the namespace and document type by listing a few IDs.
Send a request to /document/v1/ at the ENDPOINT,
restricted to one CLUSTER (see content clusters):
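A sketch of such a request, reusing the endpoint, cluster name and certificate files from the scripts in this guide. The helper names visit_url and list_ids are hypothetical:

```shell
#!/bin/bash
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
CLUSTER=documentation

# Build the URL for visiting the documents in one content cluster
visit_url() {
  printf '%s/document/v1/?cluster=%s' "$1" "$2"
}

# Hypothetical helper: print the IDs of the documents in the first response
list_ids() {
  curl -s \
    --cert data-plane-public-cert.pem \
    --key data-plane-private-key.pem \
    "$(visit_url "${ENDPOINT}" "${CLUSTER}")" \
    | jq -r '.documents[].id'
}

# Usage:
# list_ids | head
```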
From an ID, like id:open:doc::open/documentation/schemas.html, extract
NAMESPACE: open
DOCTYPE: doc
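The same extraction can be sketched in bash by splitting the ID on colons. The variable names scheme, keyvalues and rest are illustrative only:

```shell
#!/bin/bash
# A document ID has the form id:<namespace>:<doctype>:<key/value-pairs>:<user-specified>
id="id:open:doc::open/documentation/schemas.html"

# Split on ':'; the last variable receives the remainder, including any further colons
IFS=':' read -r scheme NAMESPACE DOCTYPE keyvalues rest <<< "$id"

echo "NAMESPACE=${NAMESPACE}"   # prints NAMESPACE=open
echo "DOCTYPE=${DOCTYPE}"       # prints DOCTYPE=doc
```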
Example script:
#!/bin/bash
set -x
# The ENDPOINT must be a regional endpoint, do not use '*.g.vespa-app.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation
continuation=""
idx=0
# Loop until a response has no continuation token (jq -e then exits non-zero)
while
  ((idx+=1))
  echo "$continuation"
  # Zero-padded sequence number for the output file name
  printf -v out "%05d" "$idx"
  filename=${NAMESPACE}-${DOCTYPE}-${out}.data.gz
  echo "Fetching data..."
  token=$( curl -s \
    --cert data-plane-public-cert.pem \
    --key data-plane-private-key.pem \
    "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=1000&concurrency=4&cluster=${CLUSTER}&${continuation}" \
    | tee >( gzip > "${filename}" ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done
If only a few documents are returned per response, specify wantedDocumentCount (default 1, max 1024)
to set a lower bound on the number of documents per response, as long as that many documents remain.
Specifying concurrency (default 1, max 100) increases throughput at the cost of higher resource usage.
It also increases the number of documents per response, which can lead to excessive memory usage
in the HTTP container when many large documents are buffered for the same response.