# Content Cluster Elasticity

Vespa clusters can be grown and shrunk while serving queries and writes. Documents in content clusters are automatically redistributed on changes to maintain an even distribution with minimal data movement. To resize, just change the nodes and redeploy the application - no restarts needed.

Documents are management by Vespa in chunked called buckets. The size and number of buckets are completely managed by Vespa and there is never any need to manually control sharding.

The elasticity mechanism is also used to recover from a node loss: New replicas of documents are created automatically on other nodes to maintain the configured redundancy. Failed nodes is therefore not a problem that requires immediate attention - clusters will self-heal from node failures as long as there are sufficient resources.

When you want to remove nodes from a content cluster, you can have the system migrate data off them in an orderly fashion prior to removal. This is done by marking nodes as retired. This is useful to remove nodes that should be retired, but also to migrate a cluster to entirely new nodes while online: Add the new nodes, mark the old nodes retired, wait for the data to be redistributed and remove the old nodes.

The auto-elasticity is configured for a normal fail-safe operation, but there are tradeoffs like recovery speed and resource usage. Learn more in procedures.

To add or remove nodes from a content cluster, just nodes tag of the content cluster in services.xml and redeploy. Read more in procedures.

When adding a new node, a new ideal state is calculated for all buckets. The buckets mapped to the new node are moved, the superfluous are removed. See redistribution example - add a new node to the system, with redundancy n=2:

The distribution algorithm generates a random node sequence for each bucket. In this example with n=2, replicas map to the two nodes sorted first. The illustration shows how placement onto two nodes changes as a third node is added. The new node takes over as primary for the buckets where it got sorted first, and as secondary for the buckets where it got sorted second. This ensures minimal data movement when nodes come and go, and allows capacity to be changed easily.

No buckets are moved between the existing nodes when a new node is added. Based on the pseudo-random sequences, some buckets change from primary to secondary, or are removed. Multiple nodes can be added in the same deployment.

## Removing nodes

Whether a node fails or is retired, the same redistribution happens. If the node is retired, replicas are generated on the other nodes and the node stays up, but with no active replicas. Example of redistribution after node failure, n=2:

Here, node 2 fails. This node held the active replicas of bucket 2 and 6. Once the node fails the secondary replicas are set active. If they were already in a ready state, they start serving queries immediately, otherwise they will index replicas, see searchable-copies. All buckets that no longer have secondary replicas are merged to the remaining nodes according to the ideal state.

## Grouped distribution

Nodes in content clusters can be placed in groups. A group of nodes in a content cluster will have one or more complete replicas of the entire document corpus.

This is useful in the cases listed below:

Cluster upgrade With multiple groups it becomes safe to take out a full group for upgrade instead of just one a node at a time. Read more. Applications with high query rates and/or high static query cost can use groups to scale to higher query rates since Vespa will automatically send a query to just a single group. Read more. By using groups you can control replica placement over network switches or racks to ensure there is redundancy at the switch and rack level.

Data redistribution after changing cluster topology is automatic, However, if you change between a grouped and non-grouped configuration, there will be a period of reduced query coverage during the transition. Read more in sizing examples.

Tuning group sizes and node resources enables applications to easily find the latency/cost sweet spot, the elasticity operations are automatic and queries and writes work as usual with no downtime.

## Buckets

To manage documents, Vespa groups them in buckets, using hashing or hints in the document id.

A document Put or Update is sent to all replicas of the bucket with the document. If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket. A bucket contains tombstones of recently removed documents.

Buckets are split when they grow too large, and joined when they shrink. This is a key feature for high performance in small to large instances, and eliminates need for downtime or manual operations when scaling. Buckets are purely a content management concept, and data is not stored or indexed in separate buckets, nor does queries relate to buckets in any way. Read more in buckets.

## Ideal state distribution algorithm

The ideal state distribution algorithm uses a variant of the CRUSH algorithm to decide bucket placement. It makes a minimal number of documents move when nodes are added or removed.

Central to the algorithm is the assignment of a node sequence to each bucket. The illustration shows how each bucket uses a pseudo-random sequence of numbers to derive the node sequence. The pseudo-random generator is seeded with the bucket id, so each bucket will always get the same sequence:

Each node has a distribution-key, which is an index in the number sequence. The set of number/index pairs is sorted on descending numbers, and this new sequence of indexes determines which nodes the replicas are placed on, with the first node owning the primary replica, i.e. the active replica. This specification of where to place a bucket is called the bucket's ideal state.

## Consistency

Consistency is maintained at bucket level. Content nodes calculate checksums based on the bucket contents, and the content nodes compare checksums among the bucket replicas. A bucket merge is issued to resolve inconsistency, when detected. While there are inconsistent bucket replicas, operations are routed to the "best" replica.

As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels. A node may have been down while its buckets have been split or joined. This is called inconsistent bucket splitting. Bucket checksums can not be compared across buckets with different split levels. Consequently, content nodes do not know whether all documents exist in enough replicas in this state. Due to this, inconsistent splitting is one of the highest maintenance priorities. After all buckets are split or joined back to the same level, the content nodes can verify that all the replicas are consistent and fix any detected issues with a merge. Read more.