
/document/v1 API guide

This is the /document/v1 API guide. Refer to the document/v1 API reference.

Request examples

GET
Get a document:
$ curl http://hostname:8080/document/v1/my_namespace/my_document-type/docid/1
Get a document in a group:
$ curl http://hostname:8080/document/v1/namespace/music/number/23/some_key
$ curl http://hostname:8080/document/v1/namespace/music/group/groupname/some_key
Visit all documents:
$ curl http://hostname:8080/document/v1/namespace/music/docid
Visit all documents using continuation:
$ curl "http://hostname:8080/document/v1/namespace/music/docid?continuation=AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"
Visit using a selection:
$ curl "http://hostname:8080/document/v1/namespace/music/docid?selection=music.genre=='blues'"
Visit all documents for a group:
$ curl http://hostname:8080/document/v1/namespace/music/number/23/
$ curl http://hostname:8080/document/v1/namespace/music/group/groupname/
Visit documents across all non-global document types and namespaces stored in content cluster mycluster:
$ curl "http://hostname:8080/document/v1/?cluster=mycluster"
Visit documents across all global document types and namespaces stored in content cluster mycluster:
$ curl "http://hostname:8080/document/v1/?cluster=mycluster&bucketSpace=global"
Read about visiting throughput below.
POST
Post data in the document JSON format:
$ curl -X POST -H "Content-Type:application/json" --data-binary @document-1.json http://hostname:8080/document/v1/namespace/music/docid/1
{
    "fields": {
        "songs": "Knockin on Heaven's Door; Mr. Tambourine Man",
        "title": "Best of Bob Dylan",
        "url": "https://music.yahoo.com/bobdylan/BestOf"
    }
}
PUT
Do a partial update of a document:
$ curl -X PUT -H "Content-Type:application/json" --data-binary @update.json http://hostname:8080/document/v1/namespace/music/docid/1
{
    "fields": {
        "title": {
            "assign": "New title"
        }
    }
}
DELETE
Delete the document with ID 1:
$ curl -X DELETE http://hostname:8080/document/v1/namespace/music/docid/1
Delete all documents in the my_doctype schema:
$ curl -X DELETE --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/my_namespace/my_doctype/docid?selection=true&cluster=my_cluster"

ID examples

  • Uniform distribution: id:mynamespace:music::mydocid-123
  • Data access is grouped, e.g. personal data (each user has a numeric user id): id:mynamespace:music:n=12345:mydocid-123
  • Using a string identifier to group data: id:mynamespace:music:g=mymusicsite.com:mydocid-123
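
The request examples above suggest how an ID maps to a /document/v1 path: the namespace and document type become path segments, followed by docid, number/<n> or group/<name>, and then the user-specified part. Below is a minimal sketch of that mapping, assuming IDs of exactly the three forms listed here. The function name id_to_path is hypothetical, and the user-specified part is not URL-encoded.

```shell
# Hypothetical helper: map a document ID to its /document/v1 path.
# Assumes the form id:<namespace>:<doctype>:<key/value>:<user-specified>.
id_to_path() {
  local id="$1" scheme namespace doctype keyvalue user
  # Split on the first four colons; the remainder becomes the user-specified part.
  IFS=: read -r scheme namespace doctype keyvalue user <<EOF
$id
EOF
  [ "$scheme" = "id" ] || { echo "not a document ID: $id" >&2; return 1; }
  case "$keyvalue" in
    "")  echo "/document/v1/$namespace/$doctype/docid/$user" ;;
    n=*) echo "/document/v1/$namespace/$doctype/number/${keyvalue#n=}/$user" ;;
    g=*) echo "/document/v1/$namespace/$doctype/group/${keyvalue#g=}/$user" ;;
    *)   echo "unsupported key/value pair: $keyvalue" >&2; return 1 ;;
  esac
}
```

For example, `id_to_path id:mynamespace:music:n=12345:mydocid-123` prints a path of the same shape as the "Get a document in a group" request above.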

Conditional writes

A test-and-set condition can be added to Put, Remove and Update operations. Example:

{
    "update": "id:mynamespace:music::a-head-full-of-dreams",
    "condition": "music.artist==\"Elvis\"",
    "fields": {
        "artist": {
            "assign": "Coldplay"
        }
    }
}

If the condition is not met, a 412 Precondition Failed is returned:

$ curl -X PUT -H "Content-Type:application/json" \
  --data-binary @src/test/resources/A-Head-Full-of-Dreams-update.json \
  http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams

{
    "pathId": "/document/v1/mynamespace/music/docid/a-head-full-of-dreams",
    "id": "id:mynamespace:music::a-head-full-of-dreams",
    "message": "[UNKNOWN(251013) @ tcp/vespa-container:19112/default]:
      ReturnCode(TEST_AND_SET_CONDITION_FAILED,
      Condition did not match document nodeIndex=0 bucket=20000000000000de)"
}

Also see the condition reference.
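
In a script, it can be simpler to branch on the HTTP status than to parse the message. Below is a minimal sketch, assuming the status code is captured with curl's -o and -w '%{http_code}' options. The function name and the outcome strings are illustrative. Only the 412 case is taken from this guide; treating any 2xx status as success is an assumption.

```shell
# Hypothetical helper: interpret the HTTP status of a conditional write.
condition_outcome() {
  case "$1" in
    2??) echo "applied" ;;               # assumption: any 2xx means the write went through
    412) echo "condition not met" ;;     # 412 Precondition Failed, as described above
    *)   echo "other error ($1)" ;;
  esac
}

# Usage sketch (not executed here): discard the body and capture only the status code.
# status="$(curl -s -o /dev/null -w '%{http_code}' -X PUT -H 'Content-Type:application/json' \
#   --data-binary @src/test/resources/A-Head-Full-of-Dreams-update.json \
#   http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-dreams)"
# condition_outcome "$status"
```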

Create if nonexistent

Updates to nonexistent documents are supported using create. An empty document is created on the content nodes, before the update is applied. This simplifies client code in the case of multiple writers. Example:

$ cat src/test/resources/A-Head-Full-of-Dreams-update.json
{
    "fields": { "artist": { "assign": "Coldplay" } }
}

$ curl -X PUT -H "Content-Type:application/json" \
  --data-binary @src/test/resources/A-Head-Full-of-Dreams-update.json \
  'http://localhost:8080/document/v1/mynamespace/music/docid/a-head-full-of-things?create=true'

create can be used in combination with a condition. If the document does not exist, the condition will be ignored and a new document with the update applied is automatically created. Otherwise, the condition must match for the update to take place.
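
The rule above can be summarized as a small decision function. This is an illustrative model of the described behavior only, not Vespa code; the function name and the outcome strings are made up for the example.

```shell
# Illustrative model: outcome of an update sent with create=true and a condition.
# Arguments: exists (true|false), condition_matches (true|false).
update_with_create_outcome() {
  local exists="$1" matches="$2"
  if [ "$exists" = "false" ]; then
    echo "created with update applied"   # the condition is ignored
  elif [ "$matches" = "true" ]; then
    echo "updated"
  else
    echo "not applied"                   # the condition did not match
  fi
}
```

Note that when the document does not exist, the second argument has no influence on the result.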

Visiting throughput

Note that a visit with a selection is a linear scan over all the music documents in the request examples above. Each complete visit thus evaluates the selection expression for every document. Running concurrent visits with selections that match disjoint subsets of the document corpus is therefore a poor way of increasing throughput, as the scan is repeated for each such visit. Fortunately, the API offers other options for increasing throughput:

  • Split the corpus into any number of smaller slices, each to be visited by a separate, independent series of HTTP requests. This is by far the most effective setting to change, as it allows visiting through all HTTP containers simultaneously, and from any number of clients—either of which is typically the bottleneck for visits through /document/v1. A good value for this setting is at least a handful per container.
  • Increase backend concurrency so each visit HTTP response is promptly filled with documents. When using this together with slicing (above), take care to also stream the HTTP responses (below), to avoid buffering too much data in the container layer. When a high number of slices is specified, this setting may have no effect.
  • Stream the HTTP responses. This lets you receive data earlier, and more of it per request, reducing HTTP overhead. It also minimizes memory usage due to buffering in the container, allowing higher concurrency per container. It is recommended to always use this, but the default is not to, due to backwards compatibility.
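
As a sketch of how the three options combine, the function below prints one visit URL per slice. The parameter names (slices, sliceId, concurrency, stream) are the ones used by the dump script in the Data dump section. The function name slice_urls, the host and the default concurrency of 8 are assumptions for illustration.

```shell
# Hypothetical helper: print one visit URL per slice, with streaming and backend concurrency enabled.
# Arguments: base URL, cluster name, number of slices, [concurrency, default 8].
slice_urls() {
  local base="$1" cluster="$2" slices="$3" concurrency="${4:-8}" sliceId=0
  while [ "$sliceId" -lt "$slices" ]; do
    echo "$base/document/v1/?cluster=$cluster&stream=true&concurrency=$concurrency&slices=$slices&sliceId=$sliceId"
    sliceId=$((sliceId + 1))
  done
}
```

Each printed URL starts an independent series of requests: a client follows the continuation token for its own slice until none is returned.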

Data dump

To iterate over documents, use visiting — sample output:

{
    "pathId": "/document/v1/namespace/doc/docid",
    "documents": [
        {
            "id": "id:namespace:doc::id-1",
            "fields": {
                "title": "Document title 1",
                ...
            }
        },
        ...
    ],
    "continuation": "AAAAEAAAAAAAAAM3AAAAAAAAAzYAAAAAAAEAAAAAAAFAAAAAAABswAAAAAAAAAAA"
}

Note the continuation token; use it in the next request to get more data. Below is a sample script that dumps all data, using jq for JSON parsing. It splits the corpus into 8 slices by default; using at least four times as many slices as there are container nodes is recommended for high throughput. The timeout can be set lower for benchmarking. (Each request has a maximum timeout of 60s to ensure progress is saved at regular intervals.)

#!/bin/bash
set -eo pipefail

if [ $# -gt 2 ]
then
  echo "Usage: $0 [number of slices, default 8] [timeout in seconds, default 31536000 (1 year)]"
  exit 1
fi

endpoint="https://my.vespa.endpoint"
cluster="db"
selection="true"
slices="${1:-8}"
timeout="${2:-31536000}"
curlTimeout="$((timeout > 60 ? 60 : timeout))"
url="$endpoint/document/v1/?cluster=$cluster&selection=$selection&stream=true&timeout=$curlTimeout&concurrency=8&slices=$slices"
## auth can be something like auth='--key data-plane-private-key.pem --cert data-plane-public-cert.pem'
auth="--key my-key --cert my-cert -H 'Authorization: my-auth'"
curl="curl -sS $auth"
## start must be set before doom is computed, and is also used for the throughput summary
start=$(date '+%s')
doom=$((start + timeout))

function visit {
  sliceId="$1"
  documents=0
  continuation=""
  while
    printf -v filename "data-%03g-%012g.json.gz" $sliceId $documents
    json="$(eval "$curl '$url&sliceId=$sliceId$continuation'" | tee >( gzip > $filename ) | jq '{ documentCount, continuation, message }')"
    message="$(jq -re .message <<< $json)" && echo "Failed visit for sliceId $sliceId: $message" >&2 && exit 1
    documentCount="$(jq -re .documentCount <<< $json)" && ((documents += $documentCount))
    [ "$(date '+%s')" -lt "$doom" ] && token="$(jq -re .continuation <<< $json)"
  do
    echo "$documentCount documents retrieved from slice $sliceId; continuing at $token"
    continuation="&continuation=$token"
  done
  time=$(($(date '+%s') - start))
  # Guard against division by zero when the visit completes within one second.
  echo "$documents documents total retrieved in $time seconds ($((documents / (time > 0 ? time : 1))) docs/s) from slice $sliceId" >&2
}

for ((sliceId = 0; sliceId < slices; sliceId++))
do
  visit $sliceId &
done
wait

Using fieldsets

When visiting across all document types, some internal document fields (e.g. Geo fields) set by Vespa may be returned as part of the results. To avoid this, limit visiting to just one document type using selection and explicitly filter these internal fields away using fieldSet:

$ curl "http://hostname:8080/document/v1/?cluster=mycluster&selection=mydoctype&fieldSet=mydoctype:%5Bdocument%5D"
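
The %5B and %5D in the example above are the URL-encoded brackets of [document]. Below is a minimal sketch of percent-encoding a parameter value in shell, assuming ASCII input; the function name urlencode is hypothetical. It also encodes the colon as %3A, which decodes to the same value.

```shell
# Hypothetical helper: percent-encode an ASCII string for use as a query parameter value.
urlencode() {
  local rest="$1" out="" c
  local LC_ALL=C
  while [ -n "$rest" ]; do
    c="${rest%"${rest#?}"}"   # first character of the remaining input
    rest="${rest#?}"          # drop that character
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                   # unreserved characters pass through
      *) out="$out$(printf '%%%02X' "'$c")" ;;           # everything else becomes %XX
    esac
  done
  printf '%s\n' "$out"
}

# Usage sketch (not executed here):
# curl "http://hostname:8080/document/v1/?cluster=mycluster&selection=mydoctype&fieldSet=$(urlencode 'mydoctype:[document]')"
```

An alternative is to let curl do the encoding with its -G and --data-urlencode options.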