Use the /document/v1/ API to read, write, update and delete documents.
Refer to the document/v1 API reference for API details.
Reads and writes has an overview of alternative tools and APIs
as well as the flow through the Vespa components when accessing documents.
See getting started for how to work with the /document/v1/ API.
$ curl -X POST -H "Content-Type:application/json" --data '
{
"fields": {
"artist": "Coldplay",
"album": "A Head Full of Dreams",
"year": 2015
}
}' \
http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
Important:
Use documenttype.fieldname (e.g. music.artist) in the condition,
not only fieldname.
If the condition is not met, a 412 Precondition Failed is returned:
{"pathId":"/document/v1/mynamespace/music/docid/a-head-full-of-dreams","id":"id:mynamespace:music::a-head-full-of-dreams","message":"[UNKNOWN(251013) @ tcp/vespa-container:19112/default]: ReturnCode(TEST_AND_SET_CONDITION_FAILED, Condition did not match document nodeIndex=0 bucket=20000000000000c4 ) "}
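The condition is passed as the `condition` request parameter and must be URL-encoded. A minimal sketch, assuming the localhost endpoint and document from the example above; `urlencode` and `conditional_write_url` are hypothetical helper names for this illustration, not part of Vespa or the Vespa CLI:

```shell
# Percent-encode a string for use as a query parameter value.
# Assumption: ASCII input only; multi-byte characters are not handled.
urlencode() {
  s="$1"
  out=""
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;                  # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;          # everything else: %XX
    esac
    s="$rest"
  done
  printf '%s' "$out"
}

# Build the URL for a conditional write.
# $1 = endpoint, $2 = namespace, $3 = document type, $4 = document ID part, $5 = condition
conditional_write_url() {
  printf '%s/document/v1/%s/%s/docid/%s?condition=%s' \
    "$1" "$2" "$3" "$4" "$(urlencode "$5")"
}

# Usage (needs a running Vespa instance, so it is left commented out):
# curl -X PUT -H "Content-Type:application/json" \
#   --data '{"fields":{"year":{"assign":2016}}}' \
#   "$(conditional_write_url http://localhost:8080 mynamespace music \
#        a-head-full-of-dreams 'music.artist=="Coldplay"')"
```

Note that the condition uses `music.artist`, the document type followed by the field name.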
Updates to nonexistent documents are supported using
create.
This is often called an upsert — insert a document if it does not already exist, or update it if it exists.
An empty document is created on the content nodes, before the update is applied.
This simplifies client code in the case of multiple writers. Example:
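A minimal sketch of such an upsert, assuming the localhost endpoint and music document used earlier on this page. `create=true` is passed as a request parameter on an update (HTTP PUT); the helper name `upsert_url` and the field value are illustrative only:

```shell
# Build the URL for an update that creates the document if it is missing.
# $1 = endpoint, $2 = namespace, $3 = document type, $4 = document ID part
upsert_url() {
  printf '%s/document/v1/%s/%s/docid/%s?create=true' "$1" "$2" "$3" "$4"
}

# Partial update body: assign a value to a single field.
upsert_body='{"fields":{"artist":{"assign":"Coldplay"}}}'

# Usage (needs a running Vespa instance, so it is left commented out):
# curl -X PUT -H "Content-Type:application/json" --data "$upsert_body" \
#   "$(upsert_url http://localhost:8080 mynamespace music a-head-full-of-dreams)"
```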
Conditional updates and puts can be combined with create.
This has the following semantics:
If the document already exists, the condition is evaluated against the most recent document version available.
The operation is applied if (and only if) the condition matches.
Otherwise (i.e. the document does not exist or the newest document version is a tombstone),
the condition is ignored and the operation is applied as if no condition was provided.
Support for conditional puts with create was added in Vespa 8.178.
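The rules above can be summarized as a small decision function. This is a toy model for illustration only, not Vespa code: the function name and its yes/no arguments are made up for this sketch, and `reject` stands for the 412 Precondition Failed response.

```shell
# Outcome of a conditional write sent with create=true.
# $1 = "yes" if a live (non-tombstone) document exists, otherwise "no"
# $2 = "yes" if the condition matches the newest document version, otherwise "no"
create_with_condition_outcome() {
  if [ "$1" = "yes" ]; then
    # Document exists: the condition decides.
    if [ "$2" = "yes" ]; then
      echo apply
    else
      echo reject
    fi
  else
    # Document missing or tombstoned: the condition is ignored.
    echo apply
  fi
}

# Usage:
# create_with_condition_outcome yes no    # existing document, condition fails
# create_with_condition_outcome no  no    # no document, condition ignored
```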
Warning:
If all existing replicas of a document are missing
when an operation with "create": true is executed, a new document will always be created.
This happens even if a condition has been given.
If the existing replicas become available later,
their version of the document will be overwritten by the newest update since it has a higher timestamp.
Note:
See document expiry
for auto-created documents — it is possible to create documents that do not match the selection criterion.
Note:
Specifying create for a Put operation without a
condition has no observable effect, as unconditional Put operations will always write
a new version of a document regardless of whether it existed already.
Data dump
To iterate over documents, use visiting — sample output:
{"pathId":"/document/v1/namespace/doc/docid","documents":[{"id":"id:namespace:doc::id-1","fields":{"title":"Document title 1"}}],"continuation":"AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"}
Note the continuation token — use this in the next request for more data.
Below is a sample script dumping all data using jq for JSON parsing.
It splits the corpus in 8 slices by default;
using a number of slices at least four times the number of container nodes is recommended for high throughput.
Timeout can be set lower for benchmarking.
(Each request has a maximum timeout of 60s to ensure progress is saved at regular intervals.)
#!/bin/bash
set -eo pipefail

if [ $# -gt 2 ]
then
  echo "Usage: $0 [number of slices, default 8] [timeout in seconds, default 31536000 (1 year)]"
  exit 1
fi

endpoint="https://my.vespa.endpoint"
cluster="db"
selection="true"
slices="${1:-8}"
timeout="${2:-31536000}"
# Each request is capped at 60s so progress is saved at regular intervals
curlTimeout="$((timeout > 60 ? 60 : timeout))"
url="$endpoint/document/v1/?cluster=$cluster&selection=$selection&stream=true&timeout=$curlTimeout&concurrency=8&slices=$slices"
## auth can be something like auth='--key data-plane-private-key.pem --cert data-plane-public-cert.pem'
auth="--key my-key --cert my-cert -H 'Authorization: my-auth'"
curl="curl -sS $auth"
start=$(date '+%s')
doom=$((start + timeout))

# Visit one slice, following continuation tokens until the slice is exhausted or the deadline passes
function visit {
  sliceId="$1"
  documents=0
  continuation=""
  while
    printf -v filename "data-%03g-%012g.json.gz" "$sliceId" "$documents"
    json="$(eval "$curl '$url&sliceId=$sliceId$continuation'" | tee >(gzip > "$filename") | jq '{ documentCount, continuation, message }')"
    message="$(jq -re .message <<< "$json")" && echo "Failed visit for sliceId $sliceId: $message" >&2 && exit 1
    documentCount="$(jq -re .documentCount <<< "$json")" && ((documents += documentCount))
    [ "$(date '+%s')" -lt "$doom" ] && token="$(jq -re .continuation <<< "$json")"
  do
    echo "$documentCount documents retrieved from slice $sliceId; continuing at $token"
    continuation="&continuation=$token"
  done
  time=$(($(date '+%s') - start))
  # Guard against division by zero when the visit finishes within one second
  echo "$documents documents total retrieved in $time seconds ($((documents / (time > 0 ? time : 1))) docs/s) from slice $sliceId" >&2
}

# Visit all slices in parallel and wait for them to finish
for ((sliceId = 0; sliceId < slices; sliceId++))
do
  visit $sliceId &
done
wait
Visiting throughput
Note that visit with selection is a linear scan over all the music documents
in the request examples at the start of this guide.
Each complete visit thus requires the selection expression to be evaluated for all documents.
Running concurrent visits with selections that match disjoint subsets of the document corpus
is therefore a poor way of increasing throughput,
as work is duplicated across each such visit.
Fortunately, the API offers other options for increasing throughput:
Split the corpus into any number of smaller slices,
each to be visited by a separate, independent series of HTTP requests.
This is by far the most effective setting to change,
as it allows visiting through all HTTP containers simultaneously,
and from any number of clients—either of which is
typically the bottleneck for visits through /document/v1.
A good value for this setting is at least a handful per container.
Increase backend concurrency
so each visit HTTP response is promptly filled with documents.
When using this together with slicing (above),
take care to also stream the HTTP responses (below),
to avoid buffering too much data in the container layer.
When a high number of slices is specified, this setting may have no effect.
Stream the HTTP responses.
This lets you receive data earlier, and more of it per request, reducing HTTP overhead.
It also minimizes memory usage due to buffering in the container,
allowing higher concurrency per container.
It is recommended to always use this, but the default is not to, due to backwards compatibility.
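The three options map to the `slices`/`sliceId`, `concurrency` and `stream` request parameters used in the dump script earlier on this page. A minimal sketch that prints one visit URL per slice, assuming that script's endpoint and cluster name; `visit_url` and `slice_urls` are hypothetical helper names:

```shell
# Build a streamed visit URL for a single slice.
# $1 = endpoint, $2 = cluster, $3 = total slices, $4 = sliceId (0-based), $5 = concurrency
visit_url() {
  printf '%s/document/v1/?cluster=%s&stream=true&slices=%s&sliceId=%s&concurrency=%s' \
    "$1" "$2" "$3" "$4" "$5"
}

# Print one URL per slice, one per line.
# $1 = endpoint, $2 = cluster, $3 = total slices, $4 = concurrency
slice_urls() {
  i=0
  while [ "$i" -lt "$3" ]; do
    visit_url "$1" "$2" "$3" "$i" "$4"
    printf '\n'
    i=$((i + 1))
  done
}

# Usage: each URL can be handed to a separate client or process.
# slice_urls https://my.vespa.endpoint db 8 8
```

Each slice is independent, so the URLs can be fetched in parallel, each following its own chain of continuation tokens.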
Getting started
Pro-tip: The Vespa CLI can generate a /document/v1 request for you.
Use the -v option to print the equivalent curl command. Example:
$ vespa document -v ext/A-Head-Full-of-Dreams.json
curl -X POST -H 'Content-Type: application/json' --data-binary @ext/A-Head-Full-of-Dreams.json http://127.0.0.1:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
Success: put id:mynamespace:music::a-head-full-of-dreams
This is a quick guide to dumping random documents from a cluster to get started:
To get documents from a cluster,
look up the content cluster name from the configuration,
like in the
album-recommendation example: <content id="music" version="1.0">.
Use the cluster name to start dumping document IDs (skip jq for full JSON):
wantedDocumentCount is useful to let the operation run longer to find documents,
to avoid an empty result.
This operation is a scan through the corpus,
and it is normal to get an empty result together with a continuation token.
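A minimal sketch of such a request, assuming a local endpoint on port 8080 and the `music` content cluster from the example above. `dump_ids_url` is a hypothetical helper name, and the count and timeout values are arbitrary:

```shell
# Build a visit URL that asks for at least a given number of documents.
# $1 = endpoint, $2 = content cluster name, $3 = wantedDocumentCount
dump_ids_url() {
  printf '%s/document/v1/?cluster=%s&wantedDocumentCount=%s&timeout=60' "$1" "$2" "$3"
}

# Usage (needs a running Vespa instance and jq, so it is left commented out):
# curl -s "$(dump_ids_url http://localhost:8080 music 10)" | jq -r '.documents[].id'
```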
Look up the document with id id:mynamespace:music::love-is-here-to-stay:
{"pathId":"/document/v1/mynamespace/music/docid/love-is-here-to-stay","id":"id:mynamespace:music::love-is-here-to-stay","fields":{"artist":"Diana Krall","year":2018,"category_scores":{"type":"tensor<float>(cat{})","cells":{"pop":0.4000000059604645,"rock":0,"jazz":0.800000011920929}},"album":"Love Is Here To Stay"}}
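The `pathId` in the response shows how a document ID maps to a request path. A minimal sketch of that mapping, assuming IDs without group or number modifiers; `doc_id_to_path` is a hypothetical helper, and the local endpoint in the usage comment is an assumption:

```shell
# Convert id:<namespace>:<document type>::<user part> to a /document/v1 path.
# Assumption: the user part needs no URL encoding; IDs with n= or g= modifiers are rejected.
doc_id_to_path() {
  id="$1"
  case "$id" in
    id:?*:?*:*:*) ;;
    *) echo "unsupported document ID: $id" >&2; return 1 ;;
  esac
  rest="${id#id:}"          # mynamespace:music::love-is-here-to-stay
  namespace="${rest%%:*}"   # mynamespace
  rest="${rest#*:}"         # music::love-is-here-to-stay
  doctype="${rest%%:*}"     # music
  rest="${rest#*:}"         # :love-is-here-to-stay (empty key/value part)
  case "$rest" in
    :*) ;;
    *) echo "modifiers are not handled: $id" >&2; return 1 ;;
  esac
  printf '/document/v1/%s/%s/docid/%s' "$namespace" "$doctype" "${rest#:}"
}

# Usage (needs a running Vespa instance, so it is left commented out):
# curl -s "http://localhost:8080$(doc_id_to_path id:mynamespace:music::love-is-here-to-stay)"
```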
When troubleshooting documents not found using the query API,
use vespa visit to export the documents.
Then compare the id field with other user-defined id fields in the query.
Observe the 404 Not Found.
The Vespa CLI is great for troubleshooting. Use
-v for verbose output, which prints an equivalent curl command:
$ vespa document get -v id:mynamespace:music::non-existing-doc
curl -X GET http://127.0.0.1:8080/document/v1/mynamespace/music/docid/non-existing-doc
Error: Invalid document operation: 404 Not Found
Do not use group or number modifiers with regular indexed mode document types.
These are special cases that only work as expected for document types
with mode=streaming or mode=store-only.
Examples:
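As an illustration, the sketch below builds request paths for IDs with a group or number modifier, where `group/<name>` or `number/<value>` takes the place of `docid` in the path. The helper names, the group `mygroup`, the number `23` and the document type are made-up values for this sketch:

```shell
# Path for an ID like id:<namespace>:<document type>:g=<group>:<user part>
# $1 = namespace, $2 = document type, $3 = group, $4 = user part
group_document_path() {
  printf '/document/v1/%s/%s/group/%s/%s' "$1" "$2" "$3" "$4"
}

# Path for an ID like id:<namespace>:<document type>:n=<number>:<user part>
# $1 = namespace, $2 = document type, $3 = number, $4 = user part
number_document_path() {
  printf '/document/v1/%s/%s/number/%s/%s' "$1" "$2" "$3" "$4"
}

# Usage:
# group_document_path mynamespace music mygroup some-doc
# number_document_path mynamespace music 23 some-doc
```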