/document/v1 API guide

Use the /document/v1/ API to read, write, update and delete documents.

Refer to the document/v1 API reference for API details. Reads and writes has an overview of alternative tools and APIs as well as the flow through the Vespa components when accessing documents. See getting started for how to work with the /document/v1/ API.

Examples:

GET

Get
$ curl http://localhost:8080/document/v1/my_namespace/music/docid/love-is-here-to-stay
Visit
Visit all documents with given namespace and document type:
$ curl http://localhost:8080/document/v1/namespace/music/docid
Visit all documents using continuation:
$ curl "http://localhost:8080/document/v1/namespace/music/docid?continuation=AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"
Visit using a selection:
$ curl "http://localhost:8080/document/v1/namespace/music/docid?selection=music.genre=='blues'"
Visit documents across all non-global document types and namespaces stored in content cluster mycluster:
$ curl "http://localhost:8080/document/v1/?cluster=mycluster"
Visit documents across all global document types and namespaces stored in content cluster mycluster:
$ curl "http://localhost:8080/document/v1/?cluster=mycluster&bucketSpace=global"
Read about visiting throughput below.
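
Selection expressions contain characters such as = and ' that are safer to percent-encode before they are placed in a query string. The following is a minimal sketch, not part of the API: urlencode and visit_url are hypothetical helper names, and the encoder assumes ASCII input.

```shell
#!/bin/bash
# Percent-encode every character except the RFC 3986 unreserved set.
# Assumption: ASCII input only; multi-byte characters need byte-wise handling.
urlencode() {
  local input="$1" output="" char i
  for ((i = 0; i < ${#input}; i++)); do
    char="${input:i:1}"
    case "$char" in
      [a-zA-Z0-9.~_-]) output+="$char" ;;
      *) printf -v char '%%%02X' "'$char"; output+="$char" ;;
    esac
  done
  printf '%s\n' "$output"
}

# Hypothetical helper: build a visit URL for a namespace, document type and selection.
visit_url() {
  local base="$1" namespace="$2" doctype="$3" selection="$4"
  printf '%s/document/v1/%s/%s/docid?selection=%s\n' \
    "$base" "$namespace" "$doctype" "$(urlencode "$selection")"
}

visit_url "http://localhost:8080" "namespace" "music" "music.genre=='blues'"
```

The printed URL can be passed to curl inside double quotes, so the shell does not interpret any remaining special characters.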
POST

Post data in the document JSON format.

$ curl -X POST -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": "Coldplay",
          "album": "A Head Full of Dreams",
          "year": 2015
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
PUT

Do a partial update for a document.

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": {
              "assign": "Warmplay"
          }
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
DELETE

Delete a document by ID:

$ curl -X DELETE http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
Delete all documents in the music schema:
$ curl -X DELETE \
  "http://localhost:8080/document/v1/mynamespace/music/docid?selection=true&cluster=my_cluster"

Conditional writes

A test-and-set condition can be added to Put, Remove and Update operations. Example:

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "condition": "music.artist==\"Warmplay\"",
      "fields": {
          "artist": {
              "assign": "Coldplay"
          }
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams

If the condition is not met, a 412 Precondition Failed is returned:

{
    "pathId": "/document/v1/mynamespace/music/docid/a-head-full-of-dreams",
    "id": "id:mynamespace:music::a-head-full-of-dreams",
    "message": "[UNKNOWN(251013) @ tcp/vespa-container:19112/default]: ReturnCode(TEST_AND_SET_CONDITION_FAILED, Condition did not match document nodeIndex=0 bucket=20000000000000c4 ) "
}

Also see the condition reference.
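
A client script usually needs to tell a failed condition apart from other outcomes. The sketch below maps the status codes mentioned in this guide to short labels; classify_status and its labels are hypothetical names chosen for illustration, and any code not listed is reported as "other".

```shell
#!/bin/bash
# Map the HTTP status code of a /document/v1 request to a short outcome label.
# Only codes named in this guide (plus 200 for success) are distinguished.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    404) echo "not-found" ;;
    412) echo "condition-not-met" ;;
    *)   echo "other" ;;
  esac
}

classify_status 412
```

Against a running instance, the status code can be captured with curl's -w '%{http_code}' option together with -s -o /dev/null, and then passed to the function.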

Create if nonexistent

Upserts

Updates to nonexistent documents are supported using create. This is often called an upsert — insert a document if it does not already exist, or update it if it exists. An empty document is created on the content nodes, before the update is applied. This simplifies client code in the case of multiple writers. Example:

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": {
              "assign": "Coldplay"
          }
      }
  }' \
  "http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-thoughts?create=true"

Conditional updates and puts with create

Conditional updates and puts can be combined with create. This has the following semantics:

  • If the document already exists, the condition is evaluated against the most recent document version available. The operation is applied if (and only if) the condition matches.
  • Otherwise (i.e. the document does not exist or the newest document version is a tombstone), the condition is ignored and the operation is applied as if no condition was provided.

Support for conditional puts with create was added in Vespa 8.178.

$ curl -X POST -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": {
              "assign": "Coldplay"
          }
      }
  }' \
  "http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-thoughts?create=true&condition=music.title%3D%3D%27best+of%27"
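
The two rules for combining a condition with create can be written out as a small decision function. This is only an illustration of the semantics described above, not Vespa code; conditional_create_outcome and its yes/no arguments are hypothetical.

```shell
#!/bin/bash
# Outcome of a conditional write with create=true, following the two rules above.
#   $1: "yes" if a live (non-tombstone) document exists, else "no"
#   $2: "yes" if the condition matches that document, else "no"
conditional_create_outcome() {
  local exists="$1" matches="$2"
  if [ "$exists" = "no" ]; then
    echo "applied"     # condition is ignored for a missing document
  elif [ "$matches" = "yes" ]; then
    echo "applied"     # existing document, condition matched
  else
    echo "rejected"    # existing document, condition did not match
  fi
}

# Print the full truth table.
for exists in yes no; do
  for matches in yes no; do
    printf 'exists=%s matches=%s -> %s\n' "$exists" "$matches" \
      "$(conditional_create_outcome "$exists" "$matches")"
  done
done
```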

Data dump

To iterate over documents, use visiting — sample output:

{
    "pathId": "/document/v1/namespace/doc/docid",
    "documents": [
        {
            "id": "id:namespace:doc::id-1",
            "fields": {
                "title": "Document title 1"
            }
        }
    ],
    "continuation": "AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"
}

Note the continuation token; use it in the next request to get more data. Below is a sample script that dumps all data, using jq for JSON parsing. It splits the corpus into 8 slices by default; using a number of slices at least four times the number of container nodes is recommended for high throughput. The timeout can be set lower for benchmarking. Each request has a maximum timeout of 60s to ensure progress is saved at regular intervals.

#!/bin/bash
set -eo pipefail

if [ $# -gt 2 ]
then
  echo "Usage: $0 [number of slices, default 8] [timeout in seconds, default 31536000 (1 year)]"
  exit 1
fi

endpoint="https://my.vespa.endpoint"
cluster="db"
selection="true"
slices="${1:-8}"
timeout="${2:-31536000}"
curlTimeout="$((timeout > 60 ? 60 : timeout))"
url="$endpoint/document/v1/?cluster=$cluster&selection=$selection&stream=true&timeout=$curlTimeout&concurrency=8&slices=$slices"
## auth can be something like auth='--key data-plane-private-key.pem --cert data-plane-public-cert.pem'
auth="--key my-key --cert my-cert -H 'Authorization: my-auth'"
curl="curl -sS $auth"
start=$(date '+%s')
doom=$((start + timeout))

function visit {
  sliceId="$1"
  documents=0
  continuation=""
  while
    printf -v filename "data-%03g-%012g.json.gz" $sliceId $documents
    json="$(eval "$curl '$url&sliceId=$sliceId$continuation'" | tee >( gzip > $filename ) | jq '{ documentCount, continuation, message }')"
    message="$(jq -re .message <<< $json)" && echo "Failed visit for sliceId $sliceId: $message" >&2 && exit 1
    documentCount="$(jq -re .documentCount <<< $json)" && ((documents += $documentCount))
    [ "$(date '+%s')" -lt "$doom" ] && token="$(jq -re .continuation <<< $json)"
  do
    echo "$documentCount documents retrieved from slice $sliceId; continuing at $token"
    continuation="&continuation=$token"
  done
  time=$(($(date '+%s') - start))
  echo "$documents documents total retrieved in $time seconds ($((documents / (time > 0 ? time : 1))) docs/s) from slice $sliceId" >&2
}

for ((sliceId = 0; sliceId < slices; sliceId++))
do
  visit $sliceId &
done
wait

Visiting throughput

Note that visit with selection is a linear scan over all the music documents in the request examples at the start of this guide. Each complete visit thus requires the selection expression to be evaluated for all documents. Running concurrent visits with selections that match disjoint subsets of the document corpus is therefore a poor way of increasing throughput, as work is duplicated across each such visit. Fortunately, the API offers other options for increasing throughput:

  • Split the corpus into any number of smaller slices, each to be visited by a separate, independent series of HTTP requests. This is by far the most effective setting to change, as it allows visiting through all HTTP containers simultaneously, and from any number of clients—either of which is typically the bottleneck for visits through /document/v1. A good value for this setting is at least a handful per container.
  • Increase backend concurrency so each visit HTTP response is promptly filled with documents. When using this together with slicing (above), take care to also stream the HTTP responses (below), to avoid buffering too much data in the container layer. When a high number of slices is specified, this setting may have no effect.
  • Stream the HTTP responses. This lets you receive data earlier, and more of it per request, reducing HTTP overhead. It also minimizes memory usage due to buffering in the container, allowing higher concurrency per container. It is recommended to always use this, but the default is not to, due to backwards compatibility.
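
As a rough illustration of the slicing option, the sketch below prints one visit URL per slice, with streaming enabled. The query parameters (slices, sliceId, stream, concurrency) are the ones used by the dump script earlier in this guide; slice_urls and recommended_slices are hypothetical helper names, and the factor of four is the rule of thumb given with that script.

```shell
#!/bin/bash
# Print one visit URL per slice, each to be fetched by its own series of requests.
slice_urls() {
  local endpoint="$1" cluster="$2" slices="$3" concurrency="${4:-8}" sliceId
  for ((sliceId = 0; sliceId < slices; sliceId++)); do
    printf '%s/document/v1/?cluster=%s&selection=true&stream=true&concurrency=%s&slices=%s&sliceId=%s\n' \
      "$endpoint" "$cluster" "$concurrency" "$slices" "$sliceId"
  done
}

# Rule of thumb from this guide: at least four slices per container node.
recommended_slices() {
  echo $(( $1 * 4 ))
}

# Example: two container nodes give eight slices.
slice_urls "http://localhost:8080" "mycluster" "$(recommended_slices 2)"
```

Each URL still needs the continuation handling shown in the dump script; this sketch only covers how the corpus is divided.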

Getting started

Pro-tip: It is easy to generate a /document/v1 request by using the Vespa CLI, with the -v option to output a generated /document/v1 request - example:

$ vespa document -v ext/A-Head-Full-of-Dreams.json

  curl -X POST -H 'Content-Type: application/json'
  --data-binary @ext/A-Head-Full-of-Dreams.json
  http://127.0.0.1:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams

  Success: put id:mynamespace:music::a-head-full-of-dreams

See the document JSON format for creating JSON payloads.

This is a quick guide to dumping random documents from a cluster to get started:

  1. To get documents from a cluster, look up the content cluster name from the configuration, like in the album-recommendation example: <content id="music" version="1.0">.
  2. Use the cluster name to start dumping document IDs (skip jq for full json):
    $ curl -s 'http://localhost:8080/document/v1/?cluster=music&wantedDocumentCount=10&timeout=60s' | \
      jq -r .documents[].id
    
    id:mynamespace:music::love-is-here-to-stay
    id:mynamespace:music::a-head-full-of-dreams
    id:mynamespace:music::hardwired-to-self-destruct
    
    wantedDocumentCount is useful to let the operation run longer to find documents, to avoid an empty result. This operation is a scan through the corpus, and it is normal to get an empty result together with a continuation token.
  3. Look up the document with id id:mynamespace:music::love-is-here-to-stay:
    $ curl -s 'http://localhost:8080/document/v1/mynamespace/music/docid/love-is-here-to-stay' | jq .
    
    {
        "pathId": "/document/v1/mynamespace/music/docid/love-is-here-to-stay",
        "id": "id:mynamespace:music::love-is-here-to-stay",
        "fields": {
            "artist": "Diana Krall",
            "year": 2018,
            "category_scores": {
                "type": "tensor<float>(cat{})",
                "cells": {
                    "pop": 0.4000000059604645,
                    "rock": 0,
                    "jazz": 0.800000011920929
                }
            },
            "album": "Love Is Here To Stay"
        }
    }
  4. Read more about document IDs.
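
The steps above go from a document ID to a request path by hand. A minimal sketch of that mapping follows, based on the id and pathId pairs shown in the responses in this guide. docid_to_path is a hypothetical helper; it assumes an ID without number or group modifier and a key that needs no percent-encoding.

```shell
#!/bin/bash
# Convert id:<namespace>:<doctype>::<key> into its /document/v1 path.
# Returns 1 for anything else, such as the hit IDs found in query results.
docid_to_path() {
  local scheme namespace doctype modifier key
  # The last variable receives the rest of the line, so keys may contain colons.
  IFS=: read -r scheme namespace doctype modifier key <<< "$1"
  if [ "$scheme" != "id" ] || [ -n "$modifier" ] || [ -z "$key" ]; then
    echo "unsupported document ID: $1" >&2
    return 1
  fi
  printf '/document/v1/%s/%s/docid/%s\n' "$namespace" "$doctype" "$key"
}

docid_to_path "id:mynamespace:music::love-is-here-to-stay"
```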

Troubleshooting

  • When troubleshooting documents not found using the query API, use vespa visit to export the documents. Then compare the id field with other user-defined id fields in the query.

    $ vespa visit
    
    {
        "id": "id:mynamespace:music::when-we-all-fall-asleep-where-do-we-go",
        "fields": {
            "artist": "Billie Eilish",
            "doc_id": 12345
        }
    }

    Find more details on the components of the document ID.

  • Document not found responses look like:

    $ curl http://127.0.0.1:8080/document/v1/mynamespace/music/docid/non-existing-doc
    
    {
      "pathId": "/document/v1/mynamespace/music/docid/non-existing-doc",
      "id": "id:mynamespace:music::non-existing-doc"
    }

    This might look like an empty document. Use -v for more output:

    $ curl -v http://127.0.0.1:8080/document/v1/mynamespace/music/docid/non-existing-doc
    
    > GET /document/v1/mynamespace/music/docid/non-existing-doc HTTP/1.1
    > Host: 127.0.0.1:8080
    > User-Agent: curl/7.88.1
    > Accept: */*
    >
    < HTTP/1.1 404 Not Found
    < Date: Fri, 26 May 2023 08:53:20 GMT
    < Content-Type: application/json;charset=utf-8
    < Content-Length: 108
    
    {
      "pathId": "/document/v1/mynamespace/music/docid/non-existing-doc",
      "id": "id:mynamespace:music::non-existing-doc"
    }

    Observe the 404 Not Found. The Vespa CLI is also useful for troubleshooting: use -v for verbose output, which prints an equivalent curl command:

    $ vespa document get -v id:mynamespace:music::non-existing-doc
    curl -X GET http://127.0.0.1:8080/document/v1/mynamespace/music/docid/non-existing-doc
    Error: Invalid document operation: 404 Not Found
    
    {
        "pathId": "/document/v1/mynamespace/music/docid/non-existing-doc",
        "id": "id:mynamespace:music::non-existing-doc"
    }
  • Query results can contain hits like:

    {
        "id": "index:mydoctype/3/399f8030300282ca93929939",
        "relevance": 0,
        "source": "test",
        "fields": {
            "sddocname": "testdoc",
            "myfield": 12
        }
    }

    Query result IDs are not the same as Document IDs. Use a separate field for the document ID, if needed.

  • Delete all documents in music schema, with security credentials:

    $ curl -X DELETE \
      --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
      "http://localhost:8080/document/v1/mynamespace/music/docid?selection=true&cluster=my_cluster"
    

Using number and group id modifiers

Do not use group or number modifiers with regular indexed mode document types. These are special cases that only work as expected for document types with mode=streaming or mode=store-only. Examples:

Get
Get a document in a group:
$ curl http://localhost:8080/document/v1/mynamespace/music/number/23/some_key
$ curl http://localhost:8080/document/v1/mynamespace/music/group/mygroupname/some_key
Visit
Visit all documents for a group:
$ curl http://localhost:8080/document/v1/namespace/music/number/23/
$ curl http://localhost:8080/document/v1/namespace/music/group/mygroupname/
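
The path layout in the examples above can be captured in a small builder. This is a sketch under the assumption that the namespace, document type, modifier value and key need no percent-encoding; grouped_path is a hypothetical helper name.

```shell
#!/bin/bash
# Build a /document/v1 path using a number or group modifier.
#   $1 namespace, $2 document type, $3 "number" or "group", $4 modifier value,
#   $5 optional key; when omitted, the path addresses the whole group for visiting.
grouped_path() {
  local namespace="$1" doctype="$2" kind="$3" value="$4" key="${5:-}"
  case "$kind" in
    number|group) ;;
    *) echo "kind must be 'number' or 'group'" >&2; return 1 ;;
  esac
  printf '/document/v1/%s/%s/%s/%s/%s\n' "$namespace" "$doctype" "$kind" "$value" "$key"
}

grouped_path mynamespace music number 23 some_key
grouped_path namespace music group mygroupname
```

The first call prints the path for a single document and the second prints the path for visiting a group, matching the curl examples above.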