Cluster Controller

The cluster controller will:

  • Give the cluster a uniform view of its state: The main responsibility of the cluster controller is to give the entire cluster a uniform view of which nodes are available and what their capacities are. To do this, it polls all distributor and content nodes for their health and state information. All the states are combined into a cluster state which is broadcasted back to all the distributors and content nodes.
  • Allow administrators to override node states: If one wants a node to be less available than its health indicates, administrators can override node states in the cluster controller. Maintenance and retired modes offer other ways to stop using a node than just stopping its services, and forcing a node down also means that if someone accidentally starts the services again, the node will not enter the live cluster.
  • Automatically detect and fix cluster issues: The cluster controller may notice that a node is misbehaving, such as constantly crashing. It can detect such issues and fix them by no longer using the problematic nodes.
  • Improve alerting and monitoring: As the cluster controller monitors every node's health and can easily gather node metrics, it is a good place to implement utilities that improve automatic issue detection and cluster monitoring.

Uniform state view

The main responsibility of the cluster controller is to give the cluster a uniform view of its state. This is achieved by gathering node states, combining them into a cluster state which is then broadcasted to all the nodes in the cluster.

Clients use the cluster state to route requests to the correct distributor using the ideal state algorithm. They get updated cluster states through the distributors they talk to. If a client uses the wrong distributor for a bucket, it will get the correct cluster state returned in the reply. The client then calculates the correct distributor and resends the request. That way, clients do not depend directly on the cluster controller, as distributors serve cluster states to clients.
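The resend loop can be sketched as below. This is a minimal illustration, not Vespa's client implementation; ClusterState, Reply, idealDistributor and sendToDistributor are hypothetical stand-ins, and the ideal state algorithm itself is reduced to a placeholder.

// Sketch of the client-side routing loop: pick the distributor from the latest
// known cluster state, and if the reply says the distributor was wrong, adopt
// any newer cluster state from the reply and resend.
final class ClientRoutingSketch {
    static final class ClusterState {
        final int version;
        final int distributorCount;
        ClusterState(int version, int distributorCount) {
            this.version = version;
            this.distributorCount = distributorCount;
        }
    }

    static final class Reply {
        final boolean wrongDistributor;
        final ClusterState newerState;  // null unless the distributor returned one
        Reply(boolean wrongDistributor, ClusterState newerState) {
            this.wrongDistributor = wrongDistributor;
            this.newerState = newerState;
        }
    }

    private ClusterState state = new ClusterState(0, 1);

    // Placeholder for the ideal state algorithm: maps a bucket to a distributor index.
    private int idealDistributor(long bucketId, ClusterState state) {
        return (int) Long.remainderUnsigned(bucketId, state.distributorCount);
    }

    // Network call elided; always succeeds in this sketch.
    private Reply sendToDistributor(int distributorIndex, long bucketId) {
        return new Reply(false, null);
    }

    void send(long bucketId) {
        Reply reply;
        do {
            reply = sendToDistributor(idealDistributor(bucketId, state), bucketId);
            // A newer cluster state piggybacked on the reply replaces the local one,
            // so the next attempt targets the correct distributor.
            if (reply.newerState != null && reply.newerState.version > state.version) {
                state = reply.newerState;
            }
        } while (reply.wrongDistributor);  // a real client would also cap retries
    }
}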

With an updated cluster state, distributors can detect that they are the correct target for the request, and content nodes can verify that the request comes from the correct distributor node. Verifying this allows the cluster to be less confused by misbehaving nodes.

Gathering node states

The cluster state needs availability and capacity information from all the distributor and content nodes. The cluster controller retrieves this information by polling all the nodes for health and state information.

To give the cluster controller an updated view, the cluster controller sends requests containing the currently known node state. If the supplied state is no longer correct, the node returns the correct information immediately. If the state is correct, the request lingers on the node for a while, so that the node can respond immediately if its state changes. After a while, the cluster controller sends a new state request to the node, even if it has one pending. This triggers a reply to the lingering request and makes the new one linger instead. This way, each node has a pending state request at almost all times, so that new information reaches the cluster controller quickly.
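The exchange can be sketched as below. This is an illustration of the long-polling idea only, not Vespa's actual RPC protocol; the NodeState values and the awaitChange/setState methods are made up for the sketch.

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the long-polling exchange: the controller always keeps a state
// request pending, and the node holds it until its state differs from what
// the controller already knows.
final class NodeStatePollingSketch {
    enum NodeState { UP, STOPPING, DOWN }

    // Node side of the exchange.
    static final class Node {
        private volatile NodeState current = NodeState.UP;
        private final SynchronousQueue<NodeState> wakeups = new SynchronousQueue<>();

        // Answer immediately if the controller's view is stale, otherwise linger
        // until the state changes or the wait times out.
        NodeState awaitChange(NodeState knownByController, long maxWaitMs) throws InterruptedException {
            if (current != knownByController) return current;
            NodeState changed = wakeups.poll(maxWaitMs, TimeUnit.MILLISECONDS);
            return changed != null ? changed : current;
        }

        void setState(NodeState state) {
            current = state;
            wakeups.offer(state);  // releases a lingering request, if any
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Node node = new Node();
        NodeState known = NodeState.UP;

        // Simulate a controlled shutdown starting while a request is lingering.
        Thread shutdown = new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            node.setState(NodeState.STOPPING);
        });
        shutdown.start();

        for (int i = 0; i < 2; i++) {            // controller side: keep polling
            NodeState reported = node.awaitChange(known, 1000);
            if (reported != known) {
                System.out.println("node reported " + reported);
                known = reported;
            }
        }
        shutdown.join();
    }
}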

In the case of a node doing a controlled shutdown, it can thus start the shutdown process by responding to the pending state request that it is now stopping. Seeing this reply, the cluster controller knows that the node is doing a controlled shutdown or restart. Note: As controlled restarts and shutdowns are implemented as TERM signals from the config-sentinel, the cluster controller cannot distinguish a controlled shutdown from the process being killed by the OS for using too much memory, or by anything else that sends a TERM signal.

Combining node states into a cluster state

The node states kept in the cluster state do not need to match the states reported by the nodes themselves. For instance, while a node may report to the cluster controller that it is stopping, to indicate a reason for shutting down, the cluster state will say that the node is down or in maintenance mode.

The cluster controller is also free to keep the node temporarily out of service. For instance, if the node died while initializing last time it started, the cluster controller may find it wise to keep the node in the down state until it verifies that the node managed to finish initializing this time around.

On top of this, there may be permanent state overrides for the node. If the cluster controller detects the node constantly restarting, creating unnecessary state flux that interrupts the cluster, it may permanently set the node out of service until an administrator can fix it. Administrators may also manually override the state to set the node temporarily out of service while doing maintenance work, or to remove it permanently due to issues detected manually.

The node states in the cluster state are thus calculated from the states reported by the nodes and the historic knowledge kept in the cluster controller. If the result indicates higher availability than the user state set for the node, the state is adjusted down to the user state.
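As a minimal sketch of this combination rule, assuming a simplified availability ordering of the states (Vespa's actual state definitions and precedence rules are more nuanced):

// Sketch of the state combination: the published state is derived from the
// reported state and the controller's own knowledge, then capped by the user
// state. The DOWN..UP ordering is a simplification for this sketch only.
final class NodeStateCombinerSketch {
    enum State { DOWN, MAINTENANCE, RETIRED, INITIALIZING, UP }  // least to most available

    static State derive(State reported, State controllerAdjusted, State userState) {
        State derived = min(reported, controllerAdjusted);
        return min(derived, userState);  // the user state acts as an upper bound
    }

    private static State min(State a, State b) {
        return a.ordinal() <= b.ordinal() ? a : b;
    }

    public static void main(String[] args) {
        // The node reports UP and the controller has no objections, but an
        // administrator has set the node in maintenance: the cluster state
        // ends up saying MAINTENANCE.
        System.out.println(derive(State.UP, State.UP, State.MAINTENANCE));
    }
}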

Cluster state broadcasting

When node states are received, an internal cluster state within the cluster controller is updated. To limit the number of cluster states broadcasted to the cluster, the cluster controller is not allowed to create a new official state before a given time period has passed since the last one. Additionally, it waits a short period of time after receiving a change, in case other changes happen at about the same time and should be included too. For instance, a distributor and a content node running on the same host often go down at the same time.

If the internal cluster state has changed enough from the last official cluster state for the cluster to care, enough time has passed since the last official state was generated (typically a few seconds), and no node state changes have been seen for a short period (typically less than a second), a new official cluster state is generated from the internal state. The version number is incremented, and the new official cluster state is immediately broadcasted to all the nodes.
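A rough sketch of this decision logic is shown below. The time limits are illustrative values (min_time_between_new_systemstates appears in the configuration example further down), and the "changed enough to care" test is reduced to a boolean.

// Sketch of the broadcast decision: a new official state is generated only if
// the change matters, the minimum interval has passed, and no further changes
// were seen very recently.
final class StateBroadcasterSketch {
    static final long MIN_TIME_BETWEEN_STATES_MS = 5_000;  // cf. min_time_between_new_systemstates
    static final long STABILIZATION_PERIOD_MS = 500;       // wait for related changes

    private long lastOfficialStateMillis = 0;
    private long lastObservedChangeMillis = 0;
    private int version = 0;

    void onNodeStateChange(long nowMillis) {
        lastObservedChangeMillis = nowMillis;
    }

    // Returns the new version if a state was broadcast, or -1 if not.
    int maybeBroadcast(long nowMillis, boolean changedEnoughToCare) {
        boolean intervalOk = nowMillis - lastOfficialStateMillis >= MIN_TIME_BETWEEN_STATES_MS;
        boolean quietPeriodOver = nowMillis - lastObservedChangeMillis >= STABILIZATION_PERIOD_MS;
        if (changedEnoughToCare && intervalOk && quietPeriodOver) {
            lastOfficialStateMillis = nowMillis;
            version++;                            // bump the cluster state version
            // ... broadcast the new official state to all nodes here ...
            return version;
        }
        return -1;
    }
}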

Master election

A set of cluster controllers elects a master among themselves. The master gathers node states from all the distributor and content nodes and combines them into a cluster state. The cluster state is broadcasted to all the nodes every time it changes, allowing all nodes to know which distributors are responsible for handling which parts of the data. Clients pick up an updated cluster state if they try talking to the wrong distributor.

With only a single cluster controller, if that controller goes down for any reason, there is no longer a cluster controller alive to adjust the cluster state. While this is not a single point of failure, as typically a distributor or content node has to go down afterwards for it to matter much, it is still desirable to have more than one cluster controller so that losing one is less critical.

However, with two cluster controllers working at the same time, the cluster could end up confused by the cluster controllers sending contradicting cluster states. One of them may have failed to talk to node 3 while the other may have failed to talk to node 5, and they would then send out two different cluster states. The nodes in the cluster may receive these cluster states in different order, and end up confused about the state of the cluster.

To allow multiple controllers, but avoid confusion, only a single cluster controller is allowed to broadcast cluster states. This cluster controller is known as the master cluster controller. The other cluster controller instances only exist to participate in master election and potentially take over if the master dies.

All cluster controllers vote for the cluster controller with the lowest index that says it is ready. If a cluster controller has more than half of the configured cluster controllers voting for it, it is elected master. A newly elected master will not start broadcasting states before a transition time has passed, giving an old master some time to realize it is no longer the master.

This means that for there to be a master with N cluster controllers down, 2 * N + 1 cluster controllers must be configured. It also means that only the cluster controllers with the N + 1 lowest indexes will ever be able to become master. The remaining cluster controllers will never do anything but participate in the master election voting process.
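As an illustration of the voting rule: with three configured cluster controllers, a majority is two votes, so if controller 0 is down, controllers 1 and 2 both vote for controller 1, which becomes master. The sketch below demonstrates this; its data structures are made up for the example and do not reflect Vespa's implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the voting rule: each controller votes for the lowest-indexed
// controller it sees as ready, and a candidate becomes master only with a
// strict majority of the configured controllers.
final class MasterElectionSketch {
    // Index each controller votes for, given which controllers it sees as ready.
    static int voteOf(SortedSet<Integer> readyIndexesSeen) {
        return readyIndexesSeen.first();
    }

    // Returns the elected master index, or -1 if no candidate has a majority.
    static int electMaster(List<SortedSet<Integer>> viewPerController, int configuredControllers) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (SortedSet<Integer> view : viewPerController) {
            if (view.isEmpty()) continue;                 // a down controller casts no vote
            votes.merge(voteOf(view), 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > configuredControllers / 2) return e.getKey();
        }
        return -1;
    }

    public static void main(String[] args) {
        // Controller 0 is down; controllers 1 and 2 both see {1, 2} as ready,
        // so controller 1 gets two of three possible votes and wins.
        List<SortedSet<Integer>> views = List.of(
                new TreeSet<Integer>(),
                new TreeSet<>(Set.of(1, 2)),
                new TreeSet<>(Set.of(1, 2)));
        System.out.println("master = " + electMaster(views, 3));  // prints master = 1
    }
}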

Historic state

Very little state is persisted by the cluster controllers. If the master changes or is restarted, the historic knowledge built up is typically reset. A few exceptions exist; these are stored in a ZooKeeper cluster:

  • The last cluster state version is stored, so that a cluster controller taking over can continue the version numbering where the old one left off.
  • User states set externally are stored, so that a reset does not undo, for instance, an administrator having set a node down.
Note: ZooKeeper data tends to get corrupted if the disk it is stored on runs full. It is thus recommended not to store ZooKeeper data on partitions that are also used for logs or data storage, which are prone to filling up. If ZooKeeper data is corrupted, it must be deleted, in which case all persisted state is cleared. With multiple ZooKeeper nodes, the state may survive from a node where the data is not corrupt; otherwise the data is lost, user states must be set again, and cluster state versions will start back at zero.

Cluster state version reset

If ZooKeeper data goes missing, the cluster state version will be reset to zero. This is not an issue for distributors and content nodes, as these will trust that the most recently received cluster state is always the most correct one.

Clients use the versions to detect when a cluster state is updated. The cluster controller tries to broadcast the cluster state to all the nodes at the same time, but the distributors may process the new cluster state at slightly different times, so clients may alternately talk to distributors that have the newest cluster state and ones that do not. By using the version numbers, a client does not discard the newest cluster state in favor of one received from a distributor that has not processed the new state yet.

To handle a cluster state version reset, clients have some functionality for detecting whether the distributors seem to agree that the cluster state the client has is wrong, even if its version is newer. In such cases the client discards its cluster state and uses the new one, even if the version is lower. Clients thus handle a version reset, but it may result in a few temporary failures while they are confused.
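A heavily simplified sketch of how such reset detection could work is shown below; the agreement threshold and the bookkeeping are assumptions for illustration, not the actual client logic.

// Sketch of reset detection: normally a lower-versioned state is ignored, but
// if distributors keep answering with the same lower version, the client
// assumes the version counter was reset and adopts that state.
final class ClusterStateVersionTrackerSketch {
    private int currentVersion = 250;        // version the client currently trusts
    private int suspectedResetVersion = -1;  // lower version repeatedly seen
    private int timesSeen = 0;
    private static final int AGREEMENT_THRESHOLD = 3;  // illustrative value

    // Returns true if the client should switch to the received state.
    boolean offer(int receivedVersion) {
        if (receivedVersion >= currentVersion) {         // normal case: never go backwards
            currentVersion = receivedVersion;
            suspectedResetVersion = -1;
            timesSeen = 0;
            return true;
        }
        // Lower version: only accept it if distributors keep agreeing on it,
        // which suggests a reset rather than a stale reply.
        if (receivedVersion == suspectedResetVersion) {
            if (++timesSeen >= AGREEMENT_THRESHOLD) {
                currentVersion = receivedVersion;
                suspectedResetVersion = -1;
                timesSeen = 0;
                return true;
            }
        } else {
            suspectedResetVersion = receivedVersion;
            timesSeen = 1;
        }
        return false;
    }
}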

Monitoring cluster state

State Rest API

Basic cluster and node state information is available through the State Rest API. This API can also be used to set a user state for a node, e.g. forcing a node permanently down.

Command line tools

A few command line tools have been implemented using the State Rest API:

  • vespa-get-cluster-state
  • vespa-get-node-state
  • vespa-set-node-state

Monitoring cluster controller state

There is an internal status page for the cluster controller available at the status port. This page is located at /clustercontroller-status/v1/<clustername>/. If a cluster name is not specified, the available clusters will be listed.

Configuration

To configure the cluster controller, use services.xml and/or add configuration under the services element - example:

<services version="1.0">
  <config name="vespa.config.content.fleetcontroller">
    <min_time_between_new_systemstates>5000</min_time_between_new_systemstates>
  </config>
</services>