Writing to Vespa

Documents are written to Vespa either in the Document JSON Format or by constructing them programmatically. The options are:

  • RESTified Document Operation API: REST API for get, put, remove, update, visit.
  • The Vespa HTTP client. A standalone jar which feeds to Vespa either by method calls in Java or from the command line. It provides a simple API while achieving high performance by using multiplexing and multiple parallel async connections. It is recommended in all cases when feeding from a node outside the Vespa cluster.
  • The Document API. This provides direct read and write access to Vespa documents using Vespa's internal communication layer. Use this when accessing documents from Java components in Vespa such as searchers and document processors.
  • vespa-feeder is a utility to feed data with high performance. vespa-get gets single documents, vespa-visit gets multiple.
Refer to the feed sizing guide for feeding performance.
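
As an illustration of the Document JSON Format mentioned above, the following minimal sketch builds a small feed with one put, one partial update and one remove. The namespace `mynamespace`, the document type `music` and its fields are hypothetical, and the `put`/`update`/`remove` keys with an `assign` update reflect the JSON feed format as commonly documented; check the format reference for the deployed Vespa version before relying on it.

```python
import json

def put_op(doc_id, fields):
    """Full write of a document with the given fields."""
    return {"put": doc_id, "fields": fields}

def update_op(doc_id, assignments):
    """Partial update: assign new values to the listed fields only."""
    return {
        "update": doc_id,
        "fields": {name: {"assign": value} for name, value in assignments.items()},
    }

def remove_op(doc_id):
    """Remove the document with the given ID."""
    return {"remove": doc_id}

# Hypothetical document type "music" in the hypothetical namespace "mynamespace".
operations = [
    put_op("id:mynamespace:music::1", {"title": "Example", "year": 2001}),
    update_op("id:mynamespace:music::1", {"year": 2002}),
    remove_op("id:mynamespace:music::1"),
]

# A feed file is a JSON array of operations.
feed = json.dumps(operations, indent=2)
print(feed)
```

The resulting text can be saved to a file and passed to one of the feed clients listed above.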

CRUD operations:

  • Put: Writes a document. A document is a set of name-value pairs referred to as fields. The fields available for a given document are defined by the document type, provided by the application's search definition - see field types. If a document with the same document ID already exists and no test and set condition is given, it is overwritten.
  • Remove: Removes a document. Later requests to access the document will not find it - read more about remove-entries. If the document to be removed is not found, the reply says so; this is not considered a failure. As with put and update, a test and set condition can be specified, in which case the document is removed only when the condition is true.
  • Update: Also referred to as partial update, as it updates parts of a document. If the document to update does not exist, the reply states that no document was found. A test and set condition can be specified for updates, for example to update only documents with given timestamps.
  • Get: Returns the newest document instance. The reply includes the last-updated timestamp of the document.
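
The four operations can be sketched as HTTP requests against the REST API. This is a minimal sketch under stated assumptions, not a definitive client: it assumes the /document/v1 conventions (POST for put, PUT for update, DELETE for remove, GET for get, and a `condition` query parameter for test and set), a container listening on `localhost:8080`, and a hypothetical `music` document type. The request is only built, not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumed endpoint of a container with the document API enabled.
ENDPOINT = "http://localhost:8080"

# Assumed mapping of document operations to HTTP methods in /document/v1.
METHODS = {"put": "POST", "update": "PUT", "remove": "DELETE", "get": "GET"}

def build_request(operation, namespace, doc_type, user_id, fields=None, condition=None):
    """Build (but do not send) a request for one document operation."""
    path = (f"/document/v1/{quote(namespace)}/{quote(doc_type)}"
            f"/docid/{quote(user_id, safe='')}")
    if condition is not None:
        # Test and set: the operation is applied only if the condition is true.
        path += "?" + urlencode({"condition": condition})
    body = None
    if operation in ("put", "update"):
        body = json.dumps({"fields": fields or {}}).encode("utf-8")
    request = Request(ENDPOINT + path, data=body, method=METHODS[operation])
    if body is not None:
        request.add_header("Content-Type", "application/json")
    return request

# Conditional partial update of a hypothetical "music" document.
update = build_request(
    "update", "mynamespace", "music", "1",
    fields={"year": {"assign": 2002}},
    condition="music.year==2001",
)
print(update.get_method(), update.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Error handling is left out to keep the sketch short.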

Feed block

Feed operations fail when a cluster is at disk or memory capacity. Configure resource-limits to tune this - the defaults block feeding before disk or memory is full.

The attribute multivalue mapping and enum store can also fill up and block feeding.

To remedy this, add nodes to the content cluster or use nodes with higher capacity. Data is then redistributed automatically, and feeding is unblocked. These metrics indicate whether feeding is blocked (set to 1 when blocked):

  • content.proton.resource_usage.feeding_blocked: disk or memory
  • content.proton.documentdb.attribute.resource_usage.feeding_blocked: attribute enum store or multivalue

When feeding is blocked, events are logged. Example:
Put operation rejected for document 'id:test:test::0': 'diskLimitReached: {
  action: \"add more content nodes\",
  reason: \"disk used (0.85) > disk limit (0.8)\",
  capacity: 100000000000,
  free: 85000000000,
  available: 85000000000,
  diskLimit: 0.8
}'
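
For alerting or client-side diagnostics, a rejection message like the one above can be picked apart with a few regular expressions. The sketch below is based only on this single example message; the helper name `parse_rejection` is hypothetical, and messages for other limits (memory, attribute structures) may be formatted differently.

```python
import re

# The example message from the text, copied verbatim (including escaped quotes).
MESSAGE = r"""Put operation rejected for document 'id:test:test::0': 'diskLimitReached: {
  action: \"add more content nodes\",
  reason: \"disk used (0.85) > disk limit (0.8)\",
  capacity: 100000000000,
  free: 85000000000,
  available: 85000000000,
  diskLimit: 0.8
}'"""

def parse_rejection(message):
    """Extract the document ID, the limit name and the numbers in the reason."""
    doc = re.search(r"rejected for document '([^']+)'", message)
    limit = re.search(r"'(\w+): \{", message)
    reason = re.search(r"used \(([\d.]+)\) > \w+ limit \(([\d.]+)\)", message)
    if not (doc and limit and reason):
        return None  # not a message in the format assumed here
    return {
        "document": doc.group(1),
        "limit": limit.group(1),
        "used": float(reason.group(1)),
        "configured_limit": float(reason.group(2)),
    }

info = parse_rejection(MESSAGE)
print(info)
```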

Batch delete

Options for batch deleting documents:

  1. Find documents using search, delete, repeat. Pseudocode:
    while true:
       query and read document ids; exit if none are returned
       delete the document ids using /document/v1
       wait a second
    
  2. Like option 1, but use the Java client. Instead of deleting documents one by one, stream remove operations to the API (this requires writing a Java program), or append them to a JSON file and feed it with the client binary:
    $ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --host document-api-host < deletes.json
    
  3. Use a document selection. This deletes all documents not matching the expression. The content node will iterate over the corpus and delete documents (that are later compacted out):
    <documents garbage-collection="true">
        <document type="mytype" selection="mytype.version &gt; 4" />
    </documents>
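
The query-delete-repeat loop of option 1 can be sketched as follows. The two callables `find_ids` and `delete` are hypothetical stand-ins for a search query and an HTTP DELETE against /document/v1, and the mapping from a document ID to a /document/v1 path assumes IDs of the form `id:<namespace>:<type>::<user-specified>`. The in-memory corpus at the end only demonstrates the control flow.

```python
import time
from urllib.parse import quote

def document_v1_path(doc_id):
    """Map 'id:<namespace>:<type>::<user-specified>' to a /document/v1 path."""
    scheme, namespace, doc_type, key_values, user_part = doc_id.split(":", 4)
    if scheme != "id" or key_values:
        raise ValueError(f"unsupported document id: {doc_id}")
    return (f"/document/v1/{quote(namespace)}/{quote(doc_type)}"
            f"/docid/{quote(user_part, safe='')}")

def batch_delete(find_ids, delete, pause_seconds=1.0, sleep=time.sleep):
    """Repeat query-then-delete until the query returns no more ids."""
    deleted = 0
    while True:
        ids = find_ids()
        if not ids:
            return deleted
        for doc_id in ids:
            delete(document_v1_path(doc_id))
            deleted += 1
        sleep(pause_seconds)  # give the cluster time before the next query

# In-memory stand-ins for the search and delete calls.
corpus = {"id:test:test::0", "id:test:test::1", "id:test:test::2"}
removed_paths = []

def find_ids():
    return sorted(corpus)[:2]  # pretend each query returns at most two hits

def delete(path):
    removed_paths.append(path)
    corpus.discard("id:test:test::" + path.rsplit("/", 1)[1])

total = batch_delete(find_ids, delete, sleep=lambda seconds: None)
print(total, removed_paths)
```

Passing the search and delete steps in as functions keeps the loop independent of any particular HTTP library.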
    

Ordering

The Document API uses the document identifier to implement ordering. Documents with the same identifier will have the same serialize id, and a Document API client will ensure that only one operation with a given serialize id is pending at the same time. This ensures that if a client sends multiple operations for the same document, they will be processed in a defined order.

Note: If two put operations are sent to the same document and the first one fails, the second operation, which was already enqueued, is sent next. If the client then simply resends the failed operation, the two operations are applied in the opposite order of what was intended.

If different clients have pending operations on the same document, the order is undefined.
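
The per-document ordering rule and the resend pitfall from the note can be illustrated with a small model. This is a sketch of the idea only, not the Document API implementation: the class `SerializedSender` and its methods are hypothetical, and the document ID itself is used as the serialize id.

```python
from collections import defaultdict, deque

class SerializedSender:
    """Allows one pending operation per document ID; later ones wait in order."""

    def __init__(self, send):
        self._send = send                  # callable that transmits an operation
        self._pending = set()              # ids with an operation in flight
        self._queued = defaultdict(deque)  # id -> operations waiting their turn

    def submit(self, doc_id, operation):
        if doc_id in self._pending:
            self._queued[doc_id].append(operation)
        else:
            self._pending.add(doc_id)
            self._send(doc_id, operation)

    def on_reply(self, doc_id):
        """Called when the pending operation for doc_id succeeded or failed."""
        queue = self._queued.get(doc_id)
        if queue:
            self._send(doc_id, queue.popleft())
        else:
            self._pending.discard(doc_id)
            self._queued.pop(doc_id, None)

sent = []
sender = SerializedSender(lambda doc_id, op: sent.append(op))
sender.submit("id:test:test::0", "put A")
sender.submit("id:test:test::0", "put B")  # queued behind "put A"
sender.submit("id:test:test::1", "put C")  # different id, sent at once
sender.on_reply("id:test:test::0")         # "put A" failed: "put B" goes out
sender.submit("id:test:test::0", "put A")  # the resend now lands after "put B"
sender.on_reply("id:test:test::0")
print(sent)
```

In this run the resent "put A" is transmitted after "put B", which is the switched order the note warns about.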

Timestamps

Write operations such as put, update and remove are assigned a timestamp when they pass through the distributor. The timestamp is guaranteed to be unique within the bucket where the document is stored, and the content layer uses it to decide which operation is newest. Timestamps can also be used when visiting, to process or retrieve only documents within a given time frame. To guarantee uniqueness, timestamps have microsecond resolution, and the microsecond part may be generated or altered to avoid conflicts with other documents.
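
One way to picture per-bucket uniqueness is a clock that bumps the microsecond part whenever two writes to the same bucket would otherwise collide. The sketch below illustrates the idea only; it is not the distributor's actual algorithm, and the class `BucketClock` and the bucket names are hypothetical.

```python
import time

class BucketClock:
    """Hands out microsecond timestamps that never repeat within one bucket."""

    def __init__(self, now=time.time):
        self._now = now   # clock source returning seconds as a float
        self._last = {}   # bucket id -> last timestamp handed out

    def next_timestamp(self, bucket):
        candidate = int(self._now() * 1_000_000)
        last = self._last.get(bucket, -1)
        if candidate <= last:
            candidate = last + 1  # alter the microsecond part to avoid a collision
        self._last[bucket] = candidate
        return candidate

# A frozen clock forces collisions, so the adjustment becomes visible.
clock = BucketClock(now=lambda: 1_700_000_000.0)
first = clock.next_timestamp("bucket-a")
second = clock.next_timestamp("bucket-a")
other = clock.next_timestamp("bucket-b")
print(first, second, other)
```

Uniqueness is tracked per bucket here, so two different buckets may still receive the same timestamp.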

The internal timestamp is often referred to as the last modified time. It is the time of the last write operation that went through the distributor. If documents are migrated from one cluster to another, the target cluster assigns new timestamps to its entries. Likewise, when documents are reprocessed within a cluster, they get new timestamps even if they are not modified.