Vespa offers configurable data redundancy with eventual consistency across replicas. It's designed for high efficiency under workloads where eventual consistency is an acceptable tradeoff. This document aims to go into some detail on what these tradeoffs are, and what you, as a user, can expect.
Vespa may be considered a limited subset of AP under the CAP theorem.
Under CAP, there is a fundamental limitation of whether any distributed system can offer guarantees on consistency (C) or availability (A) in scenarios where nodes are partitioned (P) from each other. Since there is no escaping that partitions can and will happen, we talk either of systems that are either CP or AP.
Consistency (C) in CAP implies that reads and writes are strongly consistent, i.e. the system offers linearizability. Weaker forms such as causal consistency or "read your writes" consistency is not sufficient. As mentioned initially, Vespa is an eventually consistent data store and therefore does not offer this property. In practice, Consistency requires the use of a majority consensus algorithm, which Vespa does not currently use.
Availability (A) in CAP implies that all requests receive a non-error response regardless of how the network may be partitioned. Vespa is dependent on a centralized (but fault tolerant) node health checker and coordinator. A network partition may take place between the coordinator and a subset of nodes. Operations to nodes in this subset aren't guaranteed to succeed until the partition heals. As a consequence, Vespa is not guaranteed to be strongly available, so we treat this as a "limited subset" of AP (though this is not technically part of the CAP definition).
In practice, the best-effort semantics of Vespa have proven to be both robust and highly available in common datacenter networks.
When a client receives a successful write response, the operation has been written and synced to disk. The replication level is configurable. Operations are by default written on all available replica nodes before sending a response. "Available" here means being Up in the cluster state, which is determined by the fault-tolerant, centralized Cluster Controller service. If a cluster has a total of 3 nodes, 2 of these are available and the replication factor is 3, writes will be ACKed to the client if both the available nodes ACK the operation.
On each replica node, operations are persisted to a write-ahead log before being applied. The system will automatically recover after a crash by replaying logged operations. Writes are guaranteed to be synced to durable storage prior to sending a successful response to the client, so acknowledged writes are retained even in the face of sudden power loss.
If a client receives a failure response for a write operation, the operation may or may not have taken place on a subset of the replicas. If not all replicas could be written to, they are considered divergent (out of sync). The system detects and reconciles divergent replicas. This happens without any required user intervention.
Each document write assigns a new wall-clock timestamp to the resulting document version. As a consequence, configure servers with NTP to keep clock drift as small as possible. Large clock drifts may result in timestamp collisions or unexpected operation orderings.
Vespa has support for conditional writes for individual documents through test-and-set operations. Multi-document transactions are not supported.
If possible, prefer using conditional updates with
create: true instead
of conditional puts. This is because the write repair logic for conditional
updates is capable of handling data consistency edge cases.
After a successful response, changes to the search indexes are immediately visible by default.
Reads are consistent on a best-effort basis and are not guaranteed to be linearizable.
Searches may observe partial updates, as updates are not atomic across index structures. This can only happen after a write has started, but before it's complete. Once a write is complete, all index updates are visible.
Searches may observe transient loss of coverage when nodes go down. Vespa will restore coverage automatically when this happens. How fast this happens depends on the configured searchable-copies value.
If replicas diverge during a Get, Vespa performs a read-repair. This fetches the requested document from all divergent replicas. The client then receives the version with the newest timestamp.
If replicas diverge during a Visit, the behavior is slightly different between the Document V1 API and vespa-visit:
vespa-visitwill by default retry visiting the bucket until it is in sync. This may take a long time if large parts of the system are out of sync.
The rationale for this difference in behavior is that Document V1 is usually
called in a real-time request context, whereas
vespa-visit is usually called
in a background/batch processing context.
Visitor operations iterate over the document corpus in an implementation-specific order. Any given document is returned in the state it was in at the time the visitor iterated over the data bucket containing the document. This means there is no snapshot isolation—a document mutation happening concurrently with a visitor may or may not be reflected in the returned document set, depending on whether the mutation happened before or after iteration of the bucket containing the document.
Reconciliation is the act of bringing divergent replicas back into sync. This usually happens after a node restarts or fails. It will also happen after network partitions.
Unlike several other eventually consistent databases, Vespa doesn't use distributed replica operation logs. Instead, reconciling replicas involves exchanging sets of timestamped documents. Reconciliation is complete once the union set of documents is present on all replicas. Metadata is checksummed to determine whether replicas are in sync with each other.
When reconciling replicas, the newest available version of a document will "win" and become visible. This version may be a remove (tombstone). Tombstones are replicated in the same way as regular documents.
If a test-and-set operation updates at least one replica, it will eventually become visible on the other replicas.
The reconciliation operation is referred to as a "merge" in the rest of the Vespa documentation.
Tombstone entries have a configurable time-to-live before they are compacted away. Nodes that have been partitioned away from the network for a longer period of time than this TTL should ideally have their indexes removed before being allowed back into the cluster. If not, there is a risk of resurrecting previously removed documents. Vespa does not currently detect or handle this scenario automatically.
See the documentation on data-retention-vs-size.