Writing documents to Vespa

Vespa documents are created according to the Document JSON Format or constructed programmatically. Options for writing documents to Vespa:

  • RESTified Document Operation API: Simple REST API for operations based on document ID (get, put, remove, update, visit).
  • The Vespa HTTP client. This is a small standalone jar which feeds to Vespa either through method calls in Java or by consuming a file from the command line. It provides a simple API while achieving high performance by using multiplexing and multiple parallel async connections. It is recommended in all cases when feeding from a node outside the Vespa cluster.
  • The Document API. This provides direct read and write access to Vespa documents through Vespa's internal communication layer. Use this when accessing documents from Java components in Vespa such as searchers and document processors.
  • vespa-feeder is a utility to feed data with high performance. vespa-get gets single documents, vespa-visit gets multiple.
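
As a minimal sketch of the REST option above, the snippet below assembles a put request without sending it. The `/document/v1/<namespace>/<document-type>/docid/<id>` path layout, the `condition` query parameter, the host name and the `title` field are assumptions made for illustration; verify them against the API reference for the Vespa version in use.

```python
import json
from urllib.parse import quote, urlencode


def document_id(namespace, doc_type, user_id):
    """Return a document ID string such as 'id:test:test::0'."""
    return f"id:{namespace}:{doc_type}::{user_id}"


def build_put_request(host, namespace, doc_type, user_id, fields, condition=None):
    """Build (method, url, body) for a put. The path layout is an assumption."""
    path = "/document/v1/{}/{}/docid/{}".format(
        quote(namespace, safe=""), quote(doc_type, safe=""), quote(user_id, safe="")
    )
    # Optional test-and-set condition, passed as a query parameter (assumed name).
    query = "?" + urlencode({"condition": condition}) if condition else ""
    body = json.dumps({"fields": fields}, sort_keys=True)
    return "POST", host + path + query, body


method, url, body = build_put_request(
    "http://localhost:8080", "test", "test", "0", {"title": "Hello"}
)
print(method, url)  # POST http://localhost:8080/document/v1/test/test/docid/0
print(body)         # {"fields": {"title": "Hello"}}
```

The request is only constructed here; any HTTP client can send it once the endpoint details have been confirmed.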

The CRUD operations are the four basic functions of persistent storage:

Put: Writes a document to Vespa. A document is a set of name-value pairs referred to as fields. The fields available for a given document are given by the document type, provided by the application's search definition - see field types. A document is overwritten if a document with the same document ID exists and no test-and-set condition is given. By specifying a test-and-set condition, one can perform a conditional put that only executes if the condition matches the already existing document.
Remove: Removes a document from Vespa. Later requests to access the document will not find it - read more about remove-entries. If the document to be removed is not found, this is returned in the reply. This is not considered a failure. Like the put and update operations, a test-and-set condition can be specified for remove operations, only removing the document when the condition (document selection) matches the document.
Update: Also referred to as partial update, as it updates parts of a document. If the document to update does not exist, the update returns a reply stating that no document was found. A test-and-set condition can be specified for updates. Example usage is updating only documents with given timestamps.
Get: Returns the newest document instance. The get reply includes the last-updated timestamp of the document, stating when the document was last written.
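
The semantics described above can be illustrated with a toy in-memory store. This is a sketch, not Vespa code: the `ToyStore` class, the reply strings and the use of Python callables as test-and-set conditions are all hypothetical, and timestamps are not modeled.

```python
class ToyStore:
    """In-memory stand-in for a document store; hypothetical, not Vespa code."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, fields, condition=None):
        existing = self.docs.get(doc_id)
        # A conditional put only executes if the condition matches the existing document.
        if condition is not None and (existing is None or not condition(existing)):
            return "condition-not-met"
        self.docs[doc_id] = dict(fields)  # an unconditional put overwrites
        return "ok"

    def update(self, doc_id, changes, condition=None):
        existing = self.docs.get(doc_id)
        if existing is None:
            return "not-found"  # a partial update needs an existing document
        if condition is not None and not condition(existing):
            return "condition-not-met"
        existing.update(changes)  # only the given fields change
        return "ok"

    def remove(self, doc_id, condition=None):
        existing = self.docs.get(doc_id)
        if existing is None:
            return "not-found"  # reported in the reply, not treated as a failure
        if condition is not None and not condition(existing):
            return "condition-not-met"
        del self.docs[doc_id]
        return "ok"

    def get(self, doc_id):
        return self.docs.get(doc_id)


store = ToyStore()
doc = "id:test:test::0"
print(store.remove(doc))                                 # not-found
print(store.put(doc, {"title": "first", "year": 2001}))  # ok
print(store.update(doc, {"title": "second"}, lambda d: d["year"] > 2005))  # condition-not-met
print(store.get(doc)["title"])                           # first
```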

Ordering

The Document API uses the document identifier to implement ordering. Documents with the same identifier will have the same serialize id, and a Document API client will ensure that only one operation with a given serialize id is pending at the same time. This ensures that if a client sends multiple operations for the same document, they will be processed in a defined order.

Note: If two put operations are sent for the same document and the first one fails, the second operation, which was already enqueued, is sent. If the client then simply resends the failed request, the two operations end up applied in the opposite order from the one intended.

If different clients have operations towards the same document pending, the order of operations is undefined.
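
The single-client behavior in this section, including the reordering described in the note, can be sketched with a toy sequencer. The `ToySequencer` class is hypothetical and uses the document ID directly as the sequencing key; the real client derives a serialize id from the identifier.

```python
from collections import defaultdict, deque


class ToySequencer:
    """Allows at most one pending operation per document ID (toy model)."""

    def __init__(self):
        self.pending = set()
        self.queued = defaultdict(deque)
        self.sent = []  # order in which operations left the client

    def send(self, doc_id, op):
        if doc_id in self.pending:
            self.queued[doc_id].append(op)  # wait for the pending operation
        else:
            self.pending.add(doc_id)
            self.sent.append((doc_id, op))

    def on_reply(self, doc_id):
        """Called when the pending operation for doc_id completes or fails."""
        queue = self.queued[doc_id]
        if queue:
            self.sent.append((doc_id, queue.popleft()))  # doc_id stays pending
        else:
            self.pending.discard(doc_id)


seq = ToySequencer()
seq.send("id:test:test::0", "put A")
seq.send("id:test:test::0", "put B")  # queued behind put A
seq.on_reply("id:test:test::0")       # put A failed, so put B is sent next
seq.send("id:test:test::0", "put A")  # a naive resend now lands after put B
seq.on_reply("id:test:test::0")       # put B replied, the resent put A goes out
print([op for _, op in seq.sent])     # ['put A', 'put B', 'put A']
```

A client that needs put B to win would have to resend both operations in order, or detect the failure before enqueuing the second one.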

Timestamps

Write operations (put, update and remove) are assigned a timestamp as they pass through the distributor. This timestamp is guaranteed to be unique within the bucket where it is stored, and the content layer uses it to decide which operation is newest. The timestamps can also be used when visiting, to process or retrieve only documents within a given timeframe. To guarantee uniqueness, timestamps are given in microseconds, and the microsecond part may be generated or altered to avoid conflicts with other documents.
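
One way to picture how the microsecond part can be altered to keep timestamps unique is the sketch below. It is an illustration under assumptions, not the distributor's actual algorithm: the `ToyTimestamper` class is hypothetical and keeps a single counter, whereas the text only promises uniqueness per bucket.

```python
import time


class ToyTimestamper:
    """Hands out strictly increasing microsecond timestamps (toy model)."""

    def __init__(self, clock=time.time):
        self.clock = clock  # returns seconds since the epoch as a float
        self.last = 0

    def next(self):
        now_us = int(self.clock() * 1_000_000)
        # If the clock has not advanced, bump the microsecond part to stay unique.
        self.last = max(now_us, self.last + 1)
        return self.last


def in_timeframe(entries, from_us, to_us):
    """Keep (timestamp, doc_id) entries with from_us <= timestamp < to_us."""
    return [e for e in entries if from_us <= e[0] < to_us]


stamper = ToyTimestamper(clock=lambda: 1700000000.0)  # frozen clock for the demo
stamps = [stamper.next() for _ in range(3)]
print(stamps)  # [1700000000000000, 1700000000000001, 1700000000000002]
```

`in_timeframe` mirrors the visiting use case: given entries tagged with such timestamps, it selects only those written inside a half-open time window.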

Last modified time

The internal timestamp is often referred to as the last modified time. This is the time of the last write operation going through the distributor. If documents are migrated from cluster to cluster, the target cluster will assign new timestamps to its entries, and when documents are reprocessed within a cluster, they get new timestamps even if not modified.

Capacity and feed

Feed operations fail when a cluster is at full capacity. The following limits will block feeding:

resource | default | metric | description
disk | writefilter.disklimit | content.proton.resource_usage.disk | Configure disk limit
memory | writefilter.memorylimit | content.proton.resource_usage.memory | Configure memory limit
attribute enum store | writefilter.attribute.enumstorelimit | content.proton.documentdb.attribute.resource_usage.enum_store | For string attribute fields or attribute fields with fast-search, there is a max limit on the size of the unique values stored for that attribute. The component storing these values is called enum store. The limit is 32GB.
attribute multi-value | writefilter.attribute.multivaluelimit | content.proton.documentdb.attribute.resource_usage.multi_value | For array or weighted set attribute fields, there is a max limit on the number of documents that can have the same number of values. The limit is 128M (2^27) documents. To remedy, either change the attribute field to use huge, or add/change nodes.
To remedy, add nodes to the content cluster or swap nodes with higher capacity. The data will auto-redistribute and feeding will succeed again. These metrics indicate whether feeding is blocked (set to 1 when blocked):
  • content.proton.resource_usage.feeding_blocked: disk / memory
  • content.proton.documentdb.attribute.resource_usage.feeding_blocked: attribute enum store or multi-value
When feeding is blocked, events are logged - examples:
Put operation rejected for document 'id:test:test::0': 'diskLimitReached: {
  action: \"add more content nodes\",
  reason: \"disk used (0.85) > disk limit (0.8)\",
  capacity: 100000000000,
  free: 85000000000,
  available: 85000000000,
  diskLimit: 0.8
}'
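
The rejection reason in the log follows a simple rule: usage above the configured limit blocks feeding. A minimal sketch of that comparison is shown below. The `feed_block_reasons` helper and the usage readings are hypothetical, and the memory limit of 0.8 is an assumption; only the disk limit of 0.8 appears in the example above.

```python
def feed_block_reasons(usage, limits):
    """Return the resources whose usage exceeds their limit, sorted by name."""
    return sorted(name for name, used in usage.items() if used > limits[name])


# Hypothetical readings, expressed as fractions of capacity.
usage = {"disk": 0.85, "memory": 0.60}
limits = {"disk": 0.8, "memory": 0.8}  # memory limit is an assumed value

blocked = feed_block_reasons(usage, limits)
feeding_blocked = 1 if blocked else 0  # mirrors the 0/1 convention of the metrics
print(blocked, feeding_blocked)        # ['disk'] 1
```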

Performance

To improve feed throughput, increase visibility-delay so that writes are batched on the content nodes. This trades latency for throughput - writes take effect in search results only after visibility-delay seconds. This is particularly useful when batch feeding, like initial bootstrap or grid jobs.

Partial updates and garbage collection are slower if searchable-copies is lower than redundancy, because the non-searchable copies have no attributes. Set fast-access on the attributes being updated to ensure they exist for the non-searchable copies as well.