Document Distribution

Documents can be uniformly distributed or in groups. Read elastic Vespa before this article.

Grouped documents

The group element is used to distribute documents - use cases:

Cluster upgrade Use two (or more) groups with one replica each - then shut down one group in the cluster for upgrade. This is quicker than doing it one by one node - the tradeoff is that 50% of the nodes must sustain the full load (in case of two groups), and one must always grow the cluster by two (or more) nodes, one in each group. Latency per search node goes up with a larger index, such grouping will double the index size.
Query throughput Applications with high query load on a small index can benefit of keeping the full index on all nodes - i.e. configuring the same number of groups as nodes. This confines queries to one node only, so less total work is done per query in the cluster. Hence the throughput increase. This increases latency compared with splitting the index on many nodes. It also limits options when growing the index size - at some point, the query latency might grow too high, and the only option then is to add nodes to the group. Having two nodes per group is bad if one node goes down, as one node must hence be able to hold the full index again. Use this option with care.


By default, Vespa distributes buckets uniformly across all distributor and storage nodes. This creates a uniform distribution, but has a few drawbacks:

  • If the cluster is large, some nodes often have less network bandwidth between them than nodes that are closer to each other. Spreading the copies at random, we might not get to utilize this improved bandwidth between nodes that are closer to each other at all.
  • Wanting copies of data to be available at all times, storing few copies, means that you can do maintenance on very few nodes at a time to keep data available. For instance, if you store only 2 copies, you can only do maintenance on one node at a time to ensure all data has at least one copy available, which makes upgrades time consuming.

Hierarchical distribution can remedy the above. With this, you have control over where bucket copies are stored in relation to each other. For instance, if your cluster consists of 10 racks, you can define that for any given bucket, it should store 2 copies in one rack and 2 copies in another rack. That means that you can take down a whole rack, and you are still sure that all your data is available, because you know all data on that rack should have 2 copies on another rack. Also, improved network bandwidth can be utilized. If a node has restarted and it needs to fetch some data that came in while restarting to get up to date, it can now choose to fetch that data from a node on the same rack, which has better connectivity as it is in the same rack. (Here you must of course define one group per rack, and add the nodes in the rack in the respective groups)

This can also be used for simple data center failover. If nodes are divided into two separate geographical locations, you can define a group for each of these locations, and you can for instance configure 2 copies to be stored on each location. That way, your data is always available on both locations, and you can survive a fire without data loss.

Currently, you have no control of what group is selected for which copies though. The failover example works because you divide your copies among N groups with equal amount of copies in each, and you only have N groups available, thus you know each group will have that amount of copies.

Groups can be defined in several layers, creating a tree-structure. That way you can for instance define a top level group to do failover, and below that you can make smaller groups to improve network bandwidth locally and ensure you can take down multiple nodes simultaneously without making any data unavailable.

When configuring the groups, you have to bundle either a set of storage nodes (for a bottom level group) or a set of groups at a lower level to make your group. Then you must specify how data is to be distributed among the group children. Refer to the group documentation for details about how to configure storage groups.

Skew in data distribution when using groups

Which groups are selected as primary, secondary and so on, groups for a given bucket is randomly determined using the same ideal state algorithm we use to pick nodes, described in more detail in the ideal state reference document. Each group is assigned an index, to be used in this algorithm. Because each bucket will have different sets of groups assigned to it, all data should still be equally divided among nodes, even though you have defined that one rack should keep twice as many bucket copies as another rack. If you have two racks, then one will typically store 2 copies for half of the buckets, and the other will store 2 copies of the other half of the buckets.

This will however likely create a bit worse skew globally compared to not using groups. If you're to divide buckets between two groups, you will likely get a little skew. Say one group will store 50.05 % of your data because the ideal state algorithm use pseudo-random numbers and doesn't create perfect distributions. Then the next level might also have a little skew, and as we move down, the cumulative skew will rise a bit.


Here are some examples illustrating how the data placement control feature would be helpful. They all depict a deployment scenario with redundancy 3 (i.e. 3 data copies) and a cluster topology composed of 2 clusters with 2 racks of nodes each. These examples have few groups and nodes to keep the example simple. In a normal case you'd probably have more racks than you want to store copies in for instance, so you'd pick 2 of N racks rather than 2 of 2.

Cautious data placement

A way to reduce the risk of data unavailability, is to spread the data copies across different geographical locations (e.g. data centers). In this example, the aim is to place all the copies in different racks (cautiousness).

Furthermore, this data placement enables fast upgrade procedures without service interruption, as entire groups can be upgraded at a time.

Data placement for performance

Large deployments involving dozens or hundreds of nodes intrinsically imply heterogeneous connectivity between groups of nodes. For instance, nodes located on the same switch will experience a greater connectivity than nodes that are not. And so it is at the rack, cluster and data center levels.

It is possible to reduce cross-level communication patterns by placing data replicas close to each other. In this example, the aim is to place all the copies in the same rack (optimized for performance).

Hybrid data placement

Hybrid data placement trade performance and cautiousness to get a bit of both worlds. In this example, the aim is to place two copies in the same rack, and the third copy in a different rack but still in the same cluster.

If additional cautiousness is desired, the third copy can be placed in the other cluster.