A Vespa instance can grow and shrink while serving queries, reads and writes. To maintain a uniform data distribution due to changes in size or topology, documents are automatically redistributed, with minimal data movement. To resize, just change the nodes and redeploy the application - no restarts needed.
The elasticity mechanism is also used to recover from a node loss - new replicas of documents are auto-created to maintain redundancy. Failed nodes is hence not a problem that requires immediate attention - as long as the instance has sufficient capacity to handle the data and load, it self-heals from node failures.
Planned node retirements is done by marking nodes as retired - this generates replicas on the other nodes, so the set of retired nodes can safely be shut down. This makes it easy to migrate to a new node resource profile, like nodes with more memory.
The auto-elasticity is configured for a normal fail-safe operation, but there are tradeoffs like recovery speed and resource usage. Learn more in procedures.
Documents are mapped to buckets, read more below. When adding a new node, a new ideal state is calculated for all buckets. The buckets mapped to the new node are moved, the superfluous are removed. See redistribution example - add a new node to the system, with redundancy n=2:
The distribution algorithm generates a random node sequence for each bucket. In this example with n=2, replicas map to the two nodes sorted first. The illustration shows how placement onto two nodes changes as a third node is added. The new node takes over as primary for the buckets where it got sorted first, and as secondary for the buckets where it got sorted second. This ensures minimal data movement when nodes come and go, and allows capacity to be changed easily.
No buckets are moved between the existing nodes when a new node is added. Based on the pseudo-random sequences, some buckets change from primary to secondary, or are removed. Multiple nodes can be added in the same deployment. Adding more than one minimizes the total bucket redistribution, but increases the time to get to the ideal state.
Whether a node fails or is retired, the same redistribution happens. If the node is retired, replicas are generated on the other nodes and the node stays up, but with no active replicas. Example of redistribution after node failure, n=2:
Here, node 2 fails. This node held the active replicas of bucket 2 and 6. Once the node fails the secondary replicas are set active. If they were already in a ready state, they start serving queries immediately, otherwise they will index replicas, see searchable-copies. All buckets that no longer have secondary replicas are merged to the remaining nodes according to the ideal state.
Use the group configuration to build a topology of the nodes. This enables replica placement for use cases listed below:
|Cluster upgrade||Take out a full group for upgrade, instead of a node at a time. Read more.|
|Query throughput||Applications with high query rates / high static query cost can confine queries to less nodes. Read more.|
|Topology||Place replicas based on network topology / racking of hosts|
Data redistribution after changing cluster topology is automatic, however note that query coverage is reduced in the redistribution period. Read more in sizing examples.
Tuning group sizes and node resources enables applications to easily find the latency/cost sweet spot, the elasticity operations are automatic and queries / reads / writes work as usual - no downtime!
A document Put or Update is sent to all replicas of the bucket with the document. If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket. A bucket contains tombstones of recently removed documents.
Buckets are split when they grow too large, and joined when they shrink. This is a key feature for high performance in very small to very large instances, and eliminates need for downtime or manual operations when scaling. Read more in buckets.
Buckets can be viewed as sophisticated shards, and are not configured, they are dynamic. Query operations are run ignorant of buckets and makes query sizing independent of bucket distribution.
Ideal state distribution algorithm
The ideal state distribution algorithm uses a variation of the RUSH algorithms to distribute documents. Under normal circumstances, it makes a minimal number of documents move when nodes are added or removed.
Central to the algorithm is the assignment of a node sequence to each bucket. The illustration shows how each bucket uses a pseudo-random sequence of numbers to derive the node sequence. The pseudo-random generator is seeded with the bucket id, so each bucket will always get the same sequence:
Each node has a distribution-key, which is an index in the number sequence. The set of number/index pairs is sorted on descending numbers, and this new sequence of indexes determines which nodes the replicas are placed on, with the first node being host to the primary replica, i.e. the active replica. This specification of where to place a bucket is called the bucket's ideal state.
Consistency is maintained at bucket level. Content nodes calculate checksums based on the bucket contents, and the content nodes compare checksums among the bucket replicas. A bucket merge is issued to resolve inconsistency, when detected. While there are inconsistent bucket replicas, operations are routed to the "best" replica.
As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels. A node may have been down while its buckets have been split or joined. This is called inconsistent bucket splitting. Bucket checksums can not be compared across buckets with different split levels. Consequently, content nodes do not know whether all documents exist in enough replicas in this state. Due to this, inconsistent splitting is one of the highest maintenance priorities. After all buckets are split or joined back to the same level, the content nodes can verify that all the replicas are consistent and fix any detected issues with a merge. Read more.