Clients route requests to distributors using the distribution algorithm, which takes node states as input. Distributors use the same algorithm to keep bucket copies on the correct storage nodes. To do this, the algorithm needs to know which storage nodes are available to hold copies and what their capacities are. Service layer nodes may split requests between different partitions, in which case the algorithm also needs to know which partitions are available. To achieve this, state is tracked for all the major components of the content layer:
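The routing described above can be sketched as capacity-weighted rendezvous hashing over the nodes whose state is up. This is a simplified stand-in for the actual distribution algorithm; the node names, scoring function and state representation are illustrative only:

```python
import hashlib
import math

def _score(bucket_id: int, node: str, capacity: float) -> float:
    # Pseudo-random value in (0, 1) derived from the bucket/node pair.
    digest = hashlib.sha256(f"{bucket_id}:{node}".encode()).digest()
    x = (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64
    # Capacity-weighted rendezvous score: higher capacity wins more buckets.
    return -capacity / math.log(x)

def ideal_nodes(bucket_id: int, node_states: dict, redundancy: int = 2) -> list:
    """Pick which nodes should hold copies of a bucket, considering only
    nodes whose state is 'up'. node_states maps name -> (state, capacity)."""
    up = [(name, cap) for name, (state, cap) in node_states.items()
          if state == "up"]
    ranked = sorted(up, key=lambda nc: _score(bucket_id, nc[0], nc[1]),
                    reverse=True)
    return [name for name, _ in ranked[:redundancy]]
```

The point of the sketch is that the mapping is a pure function of the bucket id and the cluster state, so clients and distributors that see the same state agree on placement without coordinating.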
The distributors use node states similar to those of the service layer nodes. However, the maintenance state currently makes little sense for distributors, as only one distributor can handle requests for a given bucket. Setting a distributor in maintenance mode makes the bucket subset owned by that distributor unavailable, and is thus not recommended for distributors. By default, the cluster controller will not mask restarting distributors as being in maintenance mode. This will change when distributor redundancy is implemented.
Retired mode also makes little sense for distributors, as ownership is transferred quickly on any distributor state change anyway. A retired distributor will not be used by any clients.
The content layer allows service layer nodes to be split into multiple partitions, so that parts of a service layer node can be unavailable. If the backend uses a JBOD disk setup and a disk becomes unusable, the node can mark that disk's partition unavailable without losing the capacity of the entire node.
Partitions can currently only be in the up or down state, and changing a partition's state requires restarting the service layer node.
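Since a partition is simply up or down, the usable capacity of a node can be derived by summing its up partitions. A minimal sketch, with an illustrative partition representation:

```python
def usable_capacity(partitions: dict) -> float:
    """Sum the capacity of the partitions that are up. A down partition
    (e.g. a broken JBOD disk) costs only its own share of the capacity,
    not the whole node. partitions maps name -> (state, capacity)."""
    return sum(cap for state, cap in partitions.values() if state == "up")
```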
The cluster controller generates a cluster state by combining the node and partition states of all the nodes. Each time the state is altered in a way that affects the cluster, the cluster state version is incremented, and the new cluster state is broadcast to all the distributor and service layer nodes, provided a minimum time has passed since the last state was created.
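The version bump and its rate limit can be sketched as follows (the class, its fields and the interval value are illustrative, not the cluster controller's actual implementation):

```python
class ClusterStatePublisher:
    """Sketch: increment the version and broadcast only when the change
    affects the cluster and a minimum interval has passed."""

    def __init__(self, min_interval_s: float = 1.0):
        self.version = 0
        self.min_interval_s = min_interval_s
        self.last_publish = float("-inf")
        self.current = {}

    def maybe_publish(self, new_state: dict, now: float) -> bool:
        if new_state == self.current:
            return False  # no effect on the cluster: keep the old version
        if now - self.last_publish < self.min_interval_s:
            return False  # rate-limit how often new versions are created
        self.version += 1
        self.current = dict(new_state)
        self.last_publish = now
        # ...here the new (version, state) pair would be broadcast to all
        # distributor and service layer nodes.
        return True
```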
The cluster controller has settings for what fraction of distributor and service layer nodes must be available for the cluster to be available. If too many nodes are unavailable, sending load to the remaining nodes would overload them and completely fill them with data. This gives a poor user experience, and the cluster would take considerable time to recover afterwards. Thus, if a cluster is missing too many nodes to perform decently, the entire cluster is considered down. While this sounds drastic, it allows building a failover solution, and at the very least, the cluster will return to a usable state faster once enough nodes are available again.
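The availability rule amounts to a simple threshold check. A sketch, where the 0.5 default is a hypothetical stand-in for the cluster controller's configurable limits:

```python
def cluster_available(up_distributors: int, total_distributors: int,
                      up_storage: int, total_storage: int,
                      min_ratio: float = 0.5) -> bool:
    """Take the whole cluster down rather than overload the surviving
    nodes when too few of them are up (min_ratio is illustrative)."""
    return (up_distributors / total_distributors >= min_ratio
            and up_storage / total_storage >= min_ratio)
```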
Note that if a cluster has so many nodes unavailable that it is considered down, the states of the individual nodes are irrelevant to the cluster, and thus new cluster states will not be created and broadcast until enough nodes are back for the cluster to come back up. A cluster state indicating that the entire cluster is down may thus have outdated data at the node level.
State is viewed in different contexts: unit, user and generated states:
The cluster controller fetches node states from all nodes. It attempts to always have a pending node state request to every node, so that any node state change can be reported back to the cluster controller immediately. These states are called unit states. States reported by the nodes themselves are either initializing, up or stopping. If a node cannot be reached, a down state is assumed.
By default, the preferred state of every node in the cluster is up and available. Several other cases exist, though:
User states are stored in a ZooKeeper cluster run by the configuration servers.
When viewing node states from a cluster state, one sees a state calculated from the reported states, the preferred (user) states, and the historical knowledge the cluster controller keeps about the node.
The cluster controller will typically mask a reported down state as a maintenance state for a short time if it has seen a stopping state indicating a controlled restart. Assuming the node will soon be initializing again, masking it as maintenance stops distributors from creating new copies of buckets during that time window. If the node fails to come back quickly, though, the state will become down.
Nodes that come back up, reporting as initializing, may be masked as down by the cluster controller. Based on historical events, the cluster controller might suspect the node of stalling or crashing during initialization, and may keep it unused for the time being to avoid disrupting the cluster again.
Node states seen through the currently active cluster state are called generated states.
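The three contexts can be tied together in a simplified derivation: the generated state combines the reported (unit) state, the user state, and the controller's recent history. This is a sketch only; the precedence rules, the severity ordering and the 60-second grace period are illustrative, not the cluster controller's actual logic:

```python
# Illustrative ordering from "most available" to "least available".
SEVERITY = {"up": 0, "initializing": 1, "retired": 2,
            "maintenance": 3, "down": 4}

def generated_state(reported: str, user: str = "up",
                    saw_controlled_stop: bool = False,
                    seconds_down: float = 0.0,
                    grace_s: float = 60.0) -> str:
    # A controlled restart is briefly masked as maintenance so that
    # distributors do not start creating new bucket copies meanwhile.
    if reported == "down" and saw_controlled_stop and seconds_down < grace_s:
        reported = "maintenance"
    # The generated state is never "better" than what the operator set.
    return reported if SEVERITY[reported] >= SEVERITY[user] else user
```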
State information is available through the State Rest API. This interface can also be used to set user states for nodes. For convenience, use the command line tools; see the --help pages for vespa-get-cluster-state, vespa-get-node-state and vespa-set-node-state.
More detailed information is available on the cluster controller status pages. These can change at any time, and their output may require good knowledge of the content layer design to make sense of. Find the status pages on the STATE port used by the cluster controller component, at /clustercontroller-status/v1/