## Install

Refer to the multinode install guide for a primer on how to set up a cluster. The required architecture is x86_64.

## System status

- Check logs (see the log example after this list)
- Use performance graphs, System Activity Report (sar), or status pages to track load
- Use query tracing
- Use feed tracing
- Use the cluster controller status page (below) to track the status of search/storage nodes
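
For checking logs, the vespa-logfmt tool formats the Vespa log for reading. A minimal sketch, assuming the default log location $VESPA_HOME/logs/vespa/vespa.log; the level filter shown is an assumption, see vespa-logfmt --help for the exact options:

```
# Format and filter the Vespa log (reads $VESPA_HOME/logs/vespa/vespa.log by default)
$ vespa-logfmt -l warning,error
```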

All Vespa processes have a PID file $VESPA_HOME/var/run/{service name}.pid, where {service name} is the Vespa service name, e.g. container or distributor. It is the same name as used in the administration interface in the config sentinel.

## Status pages

Vespa service instances have status pages for debugging and testing. Status pages are subject to change at any time - take care when automating.

Procedure:

1. Find the port: The status pages run on ports assigned by Vespa. To find status page ports, use vespa-model-inspect to list the services run in the application:

```
$ vespa-model-inspect services
```

To find the status page port for a specific node for a specific service, pick the correct service and run:
```
$ vespa-model-inspect service [Options] <service-name>
```

2. Get the status and metrics: distributor, storagenode, searchnode and container-clustercontroller are content services with status pages. These ports are tagged HTTP. The cluster controller has multiple ports tagged HTTP, where the port tagged STATE is the one with the status page. Try connecting to the root at the port, or to /state/v1/metrics. The distributor and storagenode status pages are available at /:

```
$ vespa-model-inspect service searchnode
searchnode @ myhost.mydomain.com : search
search/search/cluster.search/0
tcp/myhost.mydomain.com:19111 (FS4)
tcp/myhost.mydomain.com:19112 (TEST HACK SRMP)
tcp/myhost.mydomain.com:19113 (ENGINES-PROVIDER RPC)
tcp/myhost.mydomain.com:19114 (HEALTH JSON HTTP)

$ curl http://myhost.mydomain.com:19114/state/v1/metrics
...

$ vespa-model-inspect service distributor
distributor @ myhost.mydomain.com : content
search/distributor/0
tcp/myhost.mydomain.com:19116 (MESSAGING)
tcp/myhost.mydomain.com:19117 (STATUS RPC)
tcp/myhost.mydomain.com:19118 (STATE STATUS HTTP)

$ curl http://myhost.mydomain.com:19118/state/v1/metrics
...

$ curl http://myhost.mydomain.com:19118/
...
```
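
The same state ports also serve a health endpoint, useful for scripted liveness checks. A minimal sketch, reusing the hypothetical host and port from the listing above; /state/v1/health is the health counterpart to /state/v1/metrics:

```
# Query the health status of the searchnode (host/port from the example above)
$ curl http://myhost.mydomain.com:19114/state/v1/health
```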

A status page for the cluster controller is available at the status port at http://hostname:port/clustercontroller-status/v1/<clustername>. If clustername is not specified, the available clusters are listed. The cluster controller leader status page shows whether any nodes are operating with differing cluster state versions, and how many data buckets are pending merging (document set reconciliation) because copies are either missing or out of sync.
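As a sketch, the status port can be found with vespa-model-inspect and the page fetched with curl; the host, port, and cluster name below are hypothetical:

```
# Find the cluster controller's STATE port, then fetch the status page
$ vespa-model-inspect service container-clustercontroller
$ curl http://myhost.mydomain.com:19050/clustercontroller-status/v1/
$ curl http://myhost.mydomain.com:19050/clustercontroller-status/v1/search
```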

### Distributor or content node not existing

Content cluster nodes register in the vespa-slobrok naming service on startup. If a node has not been set up, or fails to start the required processes, the naming service will mark it as unavailable.

Effect on cluster: Calculations of what percentage of the cluster is available include these nodes, even if they have never been seen. If many nodes are configured but not actually available, the cluster may set itself offline, concluding that too many nodes are down. For example, if 10 nodes are configured but 4 were never brought up, the cluster sees at most 60% availability.

### Content node not available on the network

vespa-slobrok requires nodes to ping it periodically. If a node stops sending pings, it will be set as down, and the cluster will restore full availability and redundancy by redistributing load and data to the remaining nodes. There is a time window where a node may be unavailable but not yet set down by slobrok.

Effect on cluster: Nodes that become unavailable are set as down after a few seconds. Before that, document operations will fail and need to be resent. After the node is set down, full availability is restored, and restoration of data redundancy starts.

### Clean mode

There have been rare occasions where nodes ended up with internally inconsistent data. This should be very rare and has only happened once. For those circumstances it is possible to start the node in a special validate_and_sanitize_docstore mode, which will do its best to clean up inconsistent data. However, detecting that this is required is not an easy feat. For this approach to work, all nodes must be stopped before enabling the mode; if not, the poison pill might come back, as the bad data has the same redundancy as the rest of your documents.
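
A minimal sketch of the operational sequence, assuming vespa-stop-services and vespa-start-services are used to stop and start the nodes; the exact mechanism for enabling validate_and_sanitize_docstore is not shown and should be looked up for the running version:

```
# Recovery sketch: every content node must be stopped before the mode is enabled
$ vespa-stop-services                 # run on ALL content nodes first
# ... enable validate_and_sanitize_docstore in the node configuration ...
$ vespa-start-services                # docstore is validated and sanitized on startup
```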

### Distributor or content node crashing

A crashing node restarts in much the same way as a controlled restart. A content node will not finish processing the currently pending requests, causing failed requests. Client resending may hide these failures, as the distributor should be able to process the resent requests quickly, using other copies than the recently lost one.

### Thrashing nodes

An example is the OS disk using an excessive amount of time to complete IO requests. Eventually the maximum number of open files is reached, and as the OS is so dependent on the filesystem, it ends up not being able to do much at all.

get-node-state requests from the cluster controller fetch node metrics from /proc and write them to a temp directory on disk before responding. This causes a thrashing node to time out get-node-state requests, setting the node down in the cluster state.
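
To inspect how a node currently reports its state, the vespa-get-node-state tool can be run on the node in question; a minimal sketch, assuming default options (see vespa-get-node-state --help for flags):

```
# Show the reported state of the content services on this node
$ vespa-get-node-state
```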

Effect on cluster: This has the same effects as the not available on the network issue.

### Constantly restarting distributor or service layer node

A broken node may end up with processes constantly restarting. It may die during initialization due to accessing corrupt files, or it may die when it starts receiving requests of a given type that trigger a node-local bug. This is bad for distributor nodes, as these restarts cause constant ownership transfers between distributors, creating windows where buckets are unavailable.

The cluster controller has functionality for detecting such nodes. If a node restarts in a way that is not detected as a controlled shutdown more than max_premature_crashes times, the cluster controller will set the wanted state of this node to down.
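
Once the underlying problem is fixed, an operator can set the wanted state back up with the vespa-set-node-state tool. A sketch, where the node type and index values are hypothetical and the flag spellings are assumptions; check vespa-set-node-state --help for the exact syntax:

```
# Hypothetical example: clear the down wanted state of storage node 2
$ vespa-set-node-state --type storage --index 2 up
```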

Detecting a controlled restart is currently a bit tricky. A controlled restart is typically initiated by sending a TERM signal to the process. Having no other sign, the content layer has to assume that all TERM signals are caused by controlled shutdowns. Thus, if the process keeps being killed by the kernel due to using too much memory, this will look like controlled shutdowns to the content layer.