Reads and writes

This guide covers reading and writing documents in Vespa. Documents are stored in content clusters. Writes (PUT, UPDATE, DELETE) and reads (GET) pass through a container cluster. Find a more detailed flow at the end of this article.

[Diagram: Vespa overview]

Vespa's indexing structures are built for high-rate field updates using memory-only operations. Refer to the feed sizing guide for write performance, in particular partial updates for in-memory-only writes.

Vespa supports parent/child relationships for de-normalized data. This can simplify the code that updates application data, as a single write to a parent document takes effect for all its child documents.

Applications can add custom feed document processors and multiple container clusters - see indexing for details.

Vespa is eventually consistent - find details on dynamic behavior in elastic Vespa. Also see the Vespa consistency model. It is recommended to use the same client instance for updating a given document - both for data consistency and for performance (see concurrent mutations). Read more on write operation ordering. For performance, group field updates to the same document into one update operation, as shown below.
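
A minimal sketch of such a grouped update in the JSON document format - the document type, ID and fields (a string artist field, a numeric popularity field) are illustrative assumptions:

    {
        "update": "id:mynamespace:music::123",
        "fields": {
            "artist":     { "assign": "Diana Krall" },
            "popularity": { "increment": 1 }
        }
    }

This applies both field updates in one operation instead of two.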

Applications can auto-expire documents. This feature also blocks PUTs to documents that are already expired - see indexing and document selection. This is a common pitfall when feeding test data with timestamps, as the writes are silently dropped.

Also see troubleshooting.

Get

Get a document by ID.
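
For example, a get over the /document/v1/ API returns the document as JSON - the namespace, document type and fields below are illustrative assumptions:

    {
        "pathId": "/document/v1/mynamespace/music/docid/123",
        "id": "id:mynamespace:music::123",
        "fields": {
            "artist": "Diana Krall",
            "album":  "Turn Up The Quiet"
        }
    }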

Put

Write a document by ID - a document is overwritten if a document with the same document ID exists.
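
A minimal sketch of a put in the JSON document format, assuming the same hypothetical music document type - the document ID plus the fields to write:

    {
        "put": "id:mynamespace:music::123",
        "fields": {
            "artist": "Diana Krall",
            "album":  "Turn Up The Quiet"
        }
    }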

Remove

Remove a document by ID. If the document to be removed is not found, it is not considered a failure. Read more about data-retention. Also see batch deletes.

Update

Also referred to as partial updates, as they update some or all fields of a document by ID. If the document to update is not found, it is not considered a failure.

Update supports create if nonexistent.

Updates can have conditions for test-and-set use cases.

All data structures (attribute, index and summary) are updatable. Note that only assign and remove are idempotent - message re-sending can apply updates more than once. Use conditional writes for stronger consistency.
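
A minimal sketch of a conditional (test-and-set) update in the JSON document format - the document type, fields and condition are illustrative assumptions; the condition is a document selection expression:

    {
        "update": "id:mynamespace:music::123",
        "condition": "music.artist == 'Diana Krall'",
        "fields": {
            "album": { "assign": "Turn Up The Quiet" }
        }
    }

Adding "create": true to an update makes it a create-if-nonexistent operation.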

All field types
  • assign (may also be used to clear fields)
Numeric field types
  • increment, decrement, multiply, divide
Composite types
  • add Add entries to arrays or weighted sets
  • remove Remove entries from arrays, maps or weighted sets
  • match Update an element in a map, array or weighted set
Tensor types
  • modify Modify individual cells in a tensor - can replace, add or multiply cell values (see the sketch after this list)
  • add Add cells to mapped or mixed tensors
  • remove Remove cells from mapped or mixed tensors
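
A minimal sketch of a tensor modify update in the JSON document format, assuming a hypothetical my_tensor field with a mapped dimension named cat - the operation can be replace, add or multiply:

    {
        "update": "id:mynamespace:music::123",
        "fields": {
            "my_tensor": {
                "modify": {
                    "operation": "replace",
                    "cells": [
                        { "address": { "cat": "jazz" }, "value": 0.9 }
                    ]
                }
            }
        }
    }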

API and utilities

Documents are created using JSON or in Java:

/document/v1/ API for get, put, remove, update, visit.
vespa-feed-client
  • Java library and command line client for feeding document operations using /document/v1/ over HTTP/2
  • Asynchronous, high-performance Java implementation, with retries and dynamic throttling
  • Simpler alternative to the Vespa HTTP client (below)
  • Supports a JSON array of feed operations, as well as JSONL: one operation JSON per line - see the example after this list
Vespa HTTP client Note: This will be replaced by the vespa-feed-client. A jar for writing to Vespa, either through method calls in Java or from the command line. It provides a simple API with high performance, using multiplexing and multiple parallel async connections. It is recommended in all cases when feeding from a node outside the Vespa cluster.
Java Document API Provides direct read and write access to Vespa documents using Vespa's internal communication layer. Use this when accessing documents from Java components in Vespa, such as searchers and document processors.
vespa-feeder Utility to feed data with high performance. vespa-get gets single documents, vespa-visit gets multiple.
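
To illustrate the feed formats, a JSON array of feed operations - the document type, IDs and fields are illustrative assumptions; with JSONL, each operation object is instead written on a single line:

    [
        { "put":    "id:mynamespace:music::1", "fields": { "artist": "Diana Krall" } },
        { "update": "id:mynamespace:music::2", "fields": { "artist": { "assign": "Coldplay" } } },
        { "remove": "id:mynamespace:music::3" }
    ]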

Components

Use the vespa-feed-client or the /document/v1/ API directly to read and write documents. (Note that the vespa-http-client will be discontinued; use the vespa-feed-client as a drop-in replacement.) Alternatively, use vespa-feeder to feed files, or the Java Document API.

Next is indexing and/or document processing, where documents are prepared for indexing (and optionally processed using custom code) before being sent to the content node. The distributor maps the document to a bucket and sends it to the proton nodes:

[Diagrams: Feed with feed client / Feed with vespafeeder]

Document processing

The document processing chain is a chain of processors that manipulate documents before they are stored. Document processors can be user defined. When using indexed search, the final step in the chain prepares documents for indexing. The Document API forwards requests to distributors. It calculates the correct distributor using the distribution algorithm and the cluster state. With no known cluster state, the client library sends requests to a random node, which replies with the updated cluster state if the node was incorrect. Cluster states are versioned, so that clients hitting outdated distributors do not override updated states with old states.

Distributor

The distributor keeps track of which content nodes store replicas of each bucket (maximum one replica per node), based on redundancy and information from the cluster controller. A bucket maps to one distributor only. A distributor keeps a bucket database with bucket metadata. The metadata holds which content nodes store replicas of the bucket, the checksum of the bucket content, and the number of documents and meta entries within the bucket. Each document is algorithmically mapped to a bucket and forwarded to the correct content nodes. The distributors detect whether there are enough bucket replicas on the content nodes, and add or remove replicas as needed. Write operations wait for replies from every replica and fail if fewer than redundancy replicas are persisted within the timeout.

Cluster controller

The cluster controller manages the state of the distributor and content nodes. This cluster state is used by the document processing chains to know which distributor to send documents to, and by the distributors to know which content nodes should have which buckets.

Proton

Proton has a bucket management system, which sends requests to a set of document databases, each of which consists of three sub-databases. In short, this node activates and deactivates buckets for queries.

Further reading