Vespa clusters can be grown and shrunk while serving queries and writes. Documents in content clusters are automatically redistributed on changes to maintain an even distribution with minimal data movement. To resize, just change the nodes and redeploy the application - no restarts needed.
Documents are managed by Vespa in chunks called buckets. The size and number of buckets are completely managed by Vespa and there is never any need to manually control sharding.
The elasticity mechanism is also used to recover from node loss: new replicas of documents are created automatically on other nodes to maintain the configured redundancy. Node failures are therefore not a problem requiring immediate attention - clusters self-heal from node failures as long as there are sufficient resources.
When you want to remove nodes from a content cluster, you can have the system migrate data off them in an orderly fashion prior to removal. This is done by marking nodes as retired. This is useful when decommissioning nodes, but also for migrating a cluster to entirely new nodes while online: add the new nodes, mark the old nodes retired, wait for the data to be redistributed, and remove the old nodes.
The auto-elasticity is configured for normal, fail-safe operation, but there are tradeoffs between recovery speed and resource usage. Learn more in procedures.
To add or remove nodes from a content cluster, just change the nodes tag of the content cluster in services.xml and redeploy. Read more in procedures.
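As a sketch (the cluster id, document type and host aliases below are made up for illustration; hosted setups may instead specify a node count on the nodes tag), growing a three-node content cluster to four nodes is just adding one more node element and redeploying:

```xml
<content id="music" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="album" mode="index"/>
    </documents>
    <nodes>
        <node hostalias="node0" distribution-key="0"/>
        <node hostalias="node1" distribution-key="1"/>
        <node hostalias="node2" distribution-key="2"/>
        <!-- New node: documents are redistributed automatically after redeployment -->
        <node hostalias="node3" distribution-key="3"/>
    </nodes>
</content>
```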
When adding a new node, a new ideal state is calculated for all buckets. The buckets mapped to the new node are moved there, and superfluous replicas are removed. See the redistribution example - adding a new node to the system, with redundancy n=2:
The distribution algorithm generates a random node sequence for each bucket. In this example with n=2, replicas map to the two nodes sorted first. The illustration shows how placement onto two nodes changes as a third node is added. The new node takes over as primary for the buckets where it got sorted first, and as secondary for the buckets where it got sorted second. This ensures minimal data movement when nodes come and go, and allows capacity to be changed easily.
No buckets are moved between the existing nodes when a new node is added. Based on the pseudo-random sequences, some replicas change from primary to secondary, or are removed. Multiple nodes can be added in the same deployment.
Whether a node fails or is retired, the same redistribution happens. If the node is retired, replicas are generated on the other nodes and the node stays up, but with no active replicas. Example of redistribution after node failure, n=2:
Here, node 2 fails. This node held the active replicas of buckets 2 and 6. Once the node fails, the secondary replicas are set active. If they were already in a ready state, they start serving queries immediately; otherwise, they must index their replicas first, see searchable-copies. All buckets that no longer have secondary replicas are merged to the remaining nodes according to the ideal state.
Nodes in content clusters can be placed in groups. A group of nodes in a content cluster will have one or more complete replicas of the entire document corpus.
This is useful in the cases listed below:
| Use case | Description |
|---|---|
| Cluster upgrade | With multiple groups, it becomes safe to take out a full group for upgrade instead of just one node at a time. Read more. |
| Query throughput | Applications with high query rates and/or high static query cost can use groups to scale to higher query rates, since Vespa automatically sends a query to just a single group. Read more. |
| Topology | By using groups, you can control replica placement over network switches or racks to ensure there is redundancy at the switch and rack level. |
Tuning group sizes and node resources lets applications find the latency/cost sweet spot. The elasticity operations are automatic, and queries and writes work as usual with no downtime.
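As a hedged sketch of grouped distribution (cluster id, group names and host aliases are illustrative; see the services.xml reference for the exact group and distribution syntax), a four-node cluster with redundancy 2 can be organized into two groups that each hold a complete copy of the corpus:

```xml
<content id="music" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="album" mode="index"/>
    </documents>
    <group>
        <!-- One replica set to the first group, the rest to the other group(s) -->
        <distribution partitions="1|*"/>
        <group name="group0" distribution-key="0">
            <node hostalias="node0" distribution-key="0"/>
            <node hostalias="node1" distribution-key="1"/>
        </group>
        <group name="group1" distribution-key="1">
            <node hostalias="node2" distribution-key="2"/>
            <node hostalias="node3" distribution-key="3"/>
        </group>
    </group>
</content>
```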
A great elasticity feature is the ability to change topology without service disruption. This is a live change, and documents are automatically redistributed to the new topology. If full query coverage is required during the transition, make sure there is always at least one "full" group:
It is possible to stage such a migration in two steps: a common way to increase capacity is to first add more nodes to the existing groups, wait for re-distribution to complete, and then add new groups.
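A minimal sketch of such a staged change, as a fragment of the group section in the example above (sizes and names are again illustrative, and redundancy and the distribution partitions must be kept in sync with the group count):

```xml
<!-- Step 1: grow the existing groups, redeploy, and wait for
     redistribution to complete (each group stays "full") -->
<group name="group0" distribution-key="0">
    <node hostalias="node0" distribution-key="0"/>
    <node hostalias="node1" distribution-key="1"/>
    <node hostalias="node4" distribution-key="4"/> <!-- new -->
</group>
<group name="group1" distribution-key="1">
    <node hostalias="node2" distribution-key="2"/>
    <node hostalias="node3" distribution-key="3"/>
    <node hostalias="node5" distribution-key="5"/> <!-- new -->
</group>

<!-- Step 2: add a new group of the same size and redeploy again;
     update redundancy and the distribution partitions accordingly -->
<group name="group2" distribution-key="2">
    <node hostalias="node6" distribution-key="6"/>
    <node hostalias="node7" distribution-key="7"/>
    <node hostalias="node8" distribution-key="8"/>
</group>
```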
To manage documents, Vespa groups them in buckets, using hashing or hints in the document id.
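For illustration (the namespace, document type and names are made up), an id without a location modifier is simply hashed to a bucket, while n= or g= modifiers in the id co-locate documents in the same set of buckets:

```
id:music:album::a-head-full-of-dreams          (bucket chosen by hashing the id)
id:music:album:g=alice:a-head-full-of-dreams   (grouped with other g=alice documents)
id:music:album:n=42:a-head-full-of-dreams      (numeric location hint)
```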
A document Put or Update is sent to all replicas of the bucket holding the document. If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket. A bucket contains tombstones of recently removed documents.
Buckets are split when they grow too large, and joined when they shrink. This is a key feature for high performance in small to large instances, and eliminates the need for downtime or manual operations when scaling. Buckets are purely a content management concept; data is not stored or indexed in separate buckets, nor do queries relate to buckets in any way. Read more in buckets.
The ideal state distribution algorithm uses a variant of the CRUSH algorithm to decide bucket placement. It minimizes the number of documents that move when nodes are added or removed. Central to the algorithm is the assignment of a node sequence to each bucket:
Steps to assign a bucket to a set of nodes:
1. Generate a pseudo-random number for each node, seeded by the bucket id and the node's distribution key.
2. Sort the nodes by the generated numbers.
3. Store replicas of the bucket on the first redundancy nodes in the sorted sequence.
Repeat this for all buckets in the system.
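A minimal sketch of this kind of placement (not Vespa's actual implementation - the hashing and seeding are simplified) shows why adding a node only moves the buckets for which the new node now sorts into the top redundancy positions:

```python
import hashlib

def node_sequence(bucket_id: int, nodes: list[int]) -> list[int]:
    """Sort nodes by a pseudo-random score seeded by (bucket id, node distribution key)."""
    def score(node: int) -> int:
        digest = hashlib.sha256(f"{bucket_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(nodes, key=score, reverse=True)

def replica_nodes(bucket_id: int, nodes: list[int], redundancy: int) -> list[int]:
    """The first `redundancy` nodes in the sequence hold the bucket's replicas."""
    return node_sequence(bucket_id, nodes)[:redundancy]

# Adding node 3: only buckets where node 3 sorts into the top 2 gain a replica on it;
# placement among the existing nodes is otherwise unchanged (minimal data movement).
for bucket in range(8):
    before = replica_nodes(bucket, [0, 1, 2], redundancy=2)
    after = replica_nodes(bucket, [0, 1, 2, 3], redundancy=2)
    print(f"bucket {bucket}: {before} -> {after}")
```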
Consistency is maintained at the bucket level. Content nodes calculate checksums based on the bucket contents and compare checksums among the bucket replicas. A bucket merge is issued to resolve inconsistencies when detected. While there are inconsistent bucket replicas, operations are routed to the "best" replica.
As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels. A node may have been down while its buckets were split or joined. This is called inconsistent bucket splitting. Bucket checksums cannot be compared across buckets at different split levels; consequently, content nodes do not know whether all documents exist in enough replicas in this state. Because of this, inconsistent splitting is one of the highest maintenance priorities. After all buckets are split or joined back to the same level, the content nodes can verify that all the replicas are consistent and fix any detected issues with a merge. Read more.