/document/v1 API guide

Refer to the document/v1 API reference for details. Examples below refer to the sample application configuration.

Pro tip: The Vespa CLI can generate a /document/v1 request for you. Run it with the -v option to print the generated request, for example:

$ vespa document -v ext/A-Head-Full-of-Dreams.json

  curl -X POST -H 'Content-Type: application/json'
  --data-binary @ext/A-Head-Full-of-Dreams.json
  http://127.0.0.1:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams

  Success: put id:mynamespace:music::a-head-full-of-dreams

See the document JSON format for creating JSON payloads.

Getting started

To get started, here is a quick way to dump a few documents from a cluster:

  1. To get documents from a cluster, look up the content cluster name from the configuration, like in the album-recommendation example: <content id="music" version="1.0">.
  2. Use the cluster name to start dumping document IDs (omit the jq filter to get the full JSON):
    $ curl -s 'http://localhost:8080/document/v1/?cluster=music&wantedDocumentCount=10&timeout=60s' | \
      jq -r .documents[].id
    
    id:mynamespace:music::love-id-here-to-stay
    id:mynamespace:music::a-head-full-of-dreams
    id:mynamespace:music::hardwired-to-self-destruct
    
    Setting wantedDocumentCount lets the operation run longer to find documents, which helps avoid an empty result. The operation is a scan through the corpus, so it is normal to get an empty result together with a continuation token.
  3. Look up the document with id id:mynamespace:music::love-id-here-to-stay:
    $ curl -s 'http://localhost:8080/document/v1/mynamespace/music/docid/love-id-here-to-stay' | jq .
    
    {
        "pathId": "/document/v1/mynamespace/music/docid/love-id-here-to-stay",
        "id": "id:mynamespace:music::love-id-here-to-stay",
        "fields": {
            "artist": "Diana Krall",
            "year": 2018,
            "category_scores": {
                "type": "tensor<float>(cat{})",
                "cells": {
                    "pop": 0.4000000059604645,
                    "rock": 0,
                    "jazz": 0.800000011920929
                }
            },
            "album": "Love Is Here To Stay"
        }
    }
  4. Read more about document IDs.
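The mapping between the document ID and the request path in step 3 can be sketched as a small shell function. This is only an illustration: docid_to_path is a hypothetical helper, not part of Vespa or its CLI, and it assumes a plain ID of the form id:<namespace>:<doctype>::<key> whose key needs no URL encoding.

```shell
# Convert a plain document ID (id:<namespace>:<doctype>::<key>) to its /document/v1 path.
# docid_to_path is a hypothetical helper, not part of Vespa or its CLI.
docid_to_path() (
  rest=${1#id:}                            # drop the "id:" scheme
  namespace=${rest%%:*}; rest=${rest#*:}   # first component is the namespace
  doctype=${rest%%:*};   rest=${rest#*:}   # second component is the document type
  # For IDs without a group or number modifier, rest is now ":<key>"
  case "$rest" in
    :*) key=${rest#:} ;;
    *)  echo "unsupported document ID: $1" >&2; exit 1 ;;
  esac
  printf '/document/v1/%s/%s/docid/%s\n' "$namespace" "$doctype" "$key"
)

docid_to_path 'id:mynamespace:music::love-id-here-to-stay'
```

The printed path can be appended to the container endpoint, for example http://localhost:8080, to form the Get request used in step 3.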

Request examples

GET

Get
$ curl http://localhost:8080/document/v1/mynamespace/music/docid/love-id-here-to-stay

Visit
Visit all documents with given namespace and document type:
$ curl http://localhost:8080/document/v1/namespace/music/docid
Visit all documents using continuation:
$ curl 'http://localhost:8080/document/v1/namespace/music/docid?continuation=AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA'
Visit using a selection:
$ curl "http://localhost:8080/document/v1/namespace/music/docid?selection=music.genre=='blues'"
Visit documents across all non-global document types and namespaces stored in content cluster mycluster:
$ curl 'http://localhost:8080/document/v1/?cluster=mycluster'
Visit documents across all global document types and namespaces stored in content cluster mycluster:
$ curl 'http://localhost:8080/document/v1/?cluster=mycluster&bucketSpace=global'
Read about visiting throughput below.

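Selection expressions often contain characters such as = and ' that are safer to percent-encode before they go into a query string. The sketch below shows one way to do that in plain shell. urlencode is a hypothetical helper written for this example, and it assumes ASCII input only.

```shell
# Percent-encode an ASCII string for use as a query parameter value.
# urlencode is a hypothetical helper; it is only meant for ASCII input.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    s=$rest
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;                  # unreserved characters pass through
      *) out="$out$(printf '%%%02X' "'$c")" ;;          # everything else becomes %XX
    esac
  done
  printf '%s\n' "$out"
)

selection=$(urlencode "music.genre=='blues'")
printf '%s\n' "http://localhost:8080/document/v1/namespace/music/docid?selection=$selection"
```

An alternative is to let curl do the encoding with its -G and --data-urlencode options.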
POST

Post data in the document JSON format.

$ curl -X POST -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": "Coldplay",
          "album": "A Head Full of Dreams",
          "year": 2015
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
PUT

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": {
              "assign": "Warmplay"
          }
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
DELETE

Delete a document by ID:

$ curl -X DELETE http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams
Delete all documents in the music schema:
$ curl -X DELETE \
  "http://localhost:8080/document/v1/mynamespace/music/docid?selection=true&cluster=my_cluster"

Conditional writes

A test-and-set condition can be added to Put, Remove and Update operations. Example:

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "condition": "music.artist==\"Warmplay\"",
      "fields": {
          "artist": {
              "assign": "Coldplay"
          }
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams

If the condition is not met, a 412 Precondition Failed is returned:

{
    "pathId": "/document/v1/mynamespace/music/docid/a-head-full-of-dreams",
    "id": "id:mynamespace:music::a-head-full-of-dreams",
    "message": "[UNKNOWN(251013) @ tcp/vespa-container:19112/default]: ReturnCode(TEST_AND_SET_CONDITION_FAILED, Condition did not match document nodeIndex=0 bucket=20000000000000c4 ) "
}

Also see the condition reference.
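A client usually wants to tell a failed condition apart from other errors. The following sketch maps the HTTP status of a conditional write to a short label. write_outcome is a hypothetical helper, and it assumes that a successful write returns 200; only the 412 case is taken from the example above.

```shell
# Map the HTTP status of a conditional write to a short outcome label.
# write_outcome is a hypothetical helper; 200 as the success status is an assumption.
write_outcome() {
  case "$1" in
    200) echo "applied" ;;
    412) echo "condition-not-met" ;;   # test-and-set condition did not match
    *)   echo "failed" ;;              # any other status is treated as an error
  esac
}

# Usage against a running instance (not executed here); curl's -w '%{http_code}'
# prints the status code after the transfer:
#   status=$(curl -s -o /dev/null -w '%{http_code}' -X PUT \
#     -H "Content-Type:application/json" --data @update.json \
#     http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams)
#   write_outcome "$status"
write_outcome 412
```

The file name update.json in the usage comment is a placeholder for a payload like the one in the example above.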

Create if nonexistent

Updates to nonexistent documents are supported using create. An empty document is created on the content nodes, before the update is applied. This simplifies client code in the case of multiple writers. Example:

$ curl -X PUT -H "Content-Type:application/json" --data '
  {
      "fields": {
          "artist": {
              "assign": "Coldplay"
          }
      }
  }' \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-thoughts?create=true

create can be used in combination with a condition. If the document does not exist, the condition will be ignored and a new document with the update applied is automatically created. Otherwise, the condition must match for the update to take place.
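Combining a condition with create means the payload carries the condition while the URL carries create=true. The sketch below builds such a payload in shell, including the quote escaping that the condition needs inside JSON. conditional_assign_payload is a hypothetical helper; it escapes only the condition and assumes the field name and value need no JSON escaping.

```shell
# Build a partial-update payload that assigns one string field under a condition.
# conditional_assign_payload is a hypothetical helper. It escapes backslashes and
# double quotes in the condition; field name and value are assumed to be JSON-safe.
conditional_assign_payload() (
  condition=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"condition": "%s", "fields": {"%s": {"assign": "%s"}}}\n' "$condition" "$2" "$3"
)

payload=$(conditional_assign_payload 'music.artist=="Warmplay"' artist Coldplay)
printf '%s\n' "$payload"

# To send it with create enabled (not executed here):
#   curl -X PUT -H "Content-Type:application/json" --data "$payload" \
#     "http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-thoughts?create=true"
```

With this request, a missing document is created with artist set to Coldplay, while an existing document is updated only if its artist is Warmplay.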

Visiting throughput

Note that a visit with a selection is a linear scan over all the music documents in the request examples above. Each complete visit thus requires the selection expression to be evaluated for all documents. Running concurrent visits with selections that match disjoint subsets of the document corpus is therefore a poor way of increasing throughput, as work is duplicated across each such visit. Fortunately, the API offers other options for increasing throughput:

  • Split the corpus into any number of smaller slices, each to be visited by a separate, independent series of HTTP requests. This is by far the most effective setting to change, as it allows visiting through all HTTP containers simultaneously, and from any number of clients—either of which is typically the bottleneck for visits through /document/v1. A good value for this setting is at least a handful per container.
  • Increase backend concurrency so each visit HTTP response is promptly filled with documents. When using this together with slicing (above), take care to also stream the HTTP responses (below), to avoid buffering too much data in the container layer. When a high number of slices is specified, this setting may have no effect.
  • Stream the HTTP responses. This lets you receive data earlier, and more of it per request, reducing HTTP overhead. It also minimizes memory usage due to buffering in the container, allowing higher concurrency per container. It is recommended to always use this, but the default is not to, due to backwards compatibility.
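The three options above map to query parameters on the visit request. As a rough sketch, the function below prints one visit URL per slice, using the parameter names that appear in the dump script later in this guide (slices, sliceId, concurrency, stream). slice_urls is a hypothetical helper, and the default concurrency of 8 is simply the value that script uses.

```shell
# Print one visit URL per slice, each to be fetched by an independent series of requests.
# slice_urls is a hypothetical helper; the parameter names come from the dump script
# in this guide.
slice_urls() (
  endpoint=$1; cluster=$2; slices=$3; concurrency=${4:-8}
  sliceId=0
  while [ "$sliceId" -lt "$slices" ]; do
    printf '%s/document/v1/?cluster=%s&stream=true&concurrency=%s&slices=%s&sliceId=%s\n' \
      "$endpoint" "$cluster" "$concurrency" "$slices" "$sliceId"
    sliceId=$((sliceId + 1))
  done
)

slice_urls http://localhost:8080 music 4
```

Each printed URL identifies one slice. A client would follow the continuation tokens for each slice separately until that slice is exhausted.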

Data dump

To iterate over documents, use visiting — sample output:

{
    "pathId": "/document/v1/namespace/doc/docid",
    "documents": [
        {
            "id": "id:namespace:doc::id-1",
            "fields": {
                "title": "Document title 1"
            }
        }
    ],
    "continuation": "AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"
}

Note the continuation token; use it in the next request to get more data. Below is a sample script that dumps all data, using jq for JSON parsing. It splits the corpus into 8 slices by default; using at least four times as many slices as there are container nodes is recommended for high throughput. The timeout can be set lower for benchmarking. (Each request has a maximum timeout of 60s to ensure progress is saved at regular intervals.)

#!/bin/bash
set -eo pipefail

if [ $# -gt 2 ]
then
  echo "Usage: $0 [number of slices, default 8] [timeout in seconds, default 31536000 (1 year)]"
  exit 1
fi

endpoint="https://my.vespa.endpoint"
cluster="db"
selection="true"
slices="${1:-8}"
timeout="${2:-31536000}"
curlTimeout="$((timeout > 60 ? 60 : timeout))"
url="$endpoint/document/v1/?cluster=$cluster&selection=$selection&stream=true&timeout=$curlTimeout&concurrency=8&slices=$slices"
## auth can be something like auth='--key data-plane-private-key.pem --cert data-plane-public-cert.pem'
auth="--key my-key --cert my-cert -H 'Authorization: my-auth'"
curl="curl -sS $auth"
start=$(date '+%s')
doom=$((start + timeout))

function visit {
  sliceId="$1"
  documents=0
  continuation=""
  while
    printf -v filename "data-%03g-%012g.json.gz" $sliceId $documents
    json="$(eval "$curl '$url&sliceId=$sliceId$continuation'" | tee >( gzip > $filename ) | jq '{ documentCount, continuation, message }')"
    message="$(jq -re .message <<< $json)" && echo "Failed visit for sliceId $sliceId: $message" >&2 && exit 1
    documentCount="$(jq -re .documentCount <<< $json)" && ((documents += $documentCount))
    [ "$(date '+%s')" -lt "$doom" ] && token="$(jq -re .continuation <<< $json)"
  do
    echo "$documentCount documents retrieved from slice $sliceId; continuing at $token"
    continuation="&continuation=$token"
  done
  time=$(($(date '+%s') - start))
  # Guard against division by zero when the visit finishes in under a second
  echo "$documents documents total retrieved in $time seconds ($((documents / (time > 0 ? time : 1))) docs/s) from slice $sliceId" >&2
}

for ((sliceId = 0; sliceId < slices; sliceId++))
do
  visit $sliceId &
done
wait

Troubleshooting

  • Query results can contain hits like:
    {
        "id": "index:mydoctype/3/399f8030300282ca93929939",
        "relevance": 0,
        "source": "test",
        "fields": {
            "sddocname": "testdoc",
            "myfield": 12
        }
    }
    Query result IDs are not the same as Document IDs. Use a separate field for the document ID, if needed.
  • Delete all documents in music schema, with security credentials:
    $ curl -X DELETE \
      --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
      "http://localhost:8080/document/v1/mynamespace/music/docid?selection=true&cluster=my_cluster"
    

Using number and group id modifiers

Do not use group or number modifiers with regular indexed mode document types. These are special cases that only work as expected for document types with mode=streaming or mode=store-only. Examples:

Get
Get a document in a group:
$ curl http://localhost:8080/document/v1/mynamespace/music/number/23/some_key
$ curl http://localhost:8080/document/v1/mynamespace/music/group/mygroupname/some_key

Visit
Visit all documents for a group:
$ curl http://localhost:8080/document/v1/namespace/music/number/23/
$ curl http://localhost:8080/document/v1/namespace/music/group/mygroupname/
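The paths above correspond to document IDs that carry a modifier between the document type and the key. The sketch below shows the mapping, assuming IDs of the form id:<namespace>:<doctype>:n=<number>:<key> or id:<namespace>:<doctype>:g=<group>:<key>. modifier_id_to_path is a hypothetical helper, and it assumes no component needs URL encoding.

```shell
# Convert a document ID with an n=<number> or g=<group> modifier to its /document/v1 path.
# modifier_id_to_path is a hypothetical helper, not part of Vespa or its CLI.
modifier_id_to_path() (
  rest=${1#id:}                            # drop the "id:" scheme
  namespace=${rest%%:*}; rest=${rest#*:}
  doctype=${rest%%:*};   rest=${rest#*:}
  modifier=${rest%%:*};  key=${rest#*:}    # e.g. modifier "n=23", key "some_key"
  case "$modifier" in
    n=*) printf '/document/v1/%s/%s/number/%s/%s\n' "$namespace" "$doctype" "${modifier#n=}" "$key" ;;
    g=*) printf '/document/v1/%s/%s/group/%s/%s\n' "$namespace" "$doctype" "${modifier#g=}" "$key" ;;
    *)   echo "no group or number modifier in: $1" >&2; exit 1 ;;
  esac
)

modifier_id_to_path 'id:mynamespace:music:n=23:some_key'
modifier_id_to_path 'id:mynamespace:music:g=mygroupname:some_key'
```

Plain IDs without a modifier are rejected by this function, since they use the docid path form instead.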