Administration procedures and tools

Install

Refer to the multinode install for a primer on how to set up a cluster. Required architecture is x86_64.

System status

Process PID files

All Vespa processes have a PID file $VESPA_HOME/var/run/{service name}.pid, where {service name} is the Vespa service name, e.g. container or distributor. This is the same name that is used in the telnet administration interface in the config sentinel.
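
To check whether a given service is still running, its PID file can be read and the process probed. A minimal sketch, assuming $VESPA_HOME is set and the node runs the distributor service:

  $ kill -0 $(cat $VESPA_HOME/var/run/distributor.pid) && echo distributor is running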

Status pages

Vespa service instances have status pages for debugging and testing. Status pages are subject to change at any time - take care when automating.

Procedure:

  1. Find the port: The status pages run on ports assigned by Vespa. To find the status page ports, use vespa-model-inspect to list the services run in the application.
    $ vespa-model-inspect services
    
    To find the status page port for a specific node for a specific service, pick the correct service and run:
    $ vespa-model-inspect service [Options] <service-name>
    
  2. Get the status and metrics: distributor, storagenode, searchnode and container-clustercontroller are content services with status pages. These ports are tagged HTTP. The cluster controller has multiple ports tagged HTTP, where the port tagged STATE is the one with the status page. Try connecting to the root at the port, or to /state/v1/metrics. The distributor and storagenode status pages are available at /, while the container-clustercontroller status page is found at /clustercontroller-status/v1/[clustername/] - see the additional sketch after the examples. Example:
      $ vespa-model-inspect service searchnode
      searchnode @ myhost.mydomain.com : search
      search/search/cluster.search/0
      tcp/myhost.mydomain.com:19110 (STATUS ADMIN RTC RPC)
      tcp/myhost.mydomain.com:19111 (FS4)
      tcp/myhost.mydomain.com:19112 (TEST HACK SRMP)
      tcp/myhost.mydomain.com:19113 (ENGINES-PROVIDER RPC)
      tcp/myhost.mydomain.com:19114 (HEALTH JSON HTTP)
      $ curl http://myhost.mydomain.com:19114/state/v1/metrics
      ...
      $ vespa-model-inspect service distributor
      distributor @ myhost.mydomain.com : content
      search/distributor/0
      tcp/myhost.mydomain.com:19116 (MESSAGING)
      tcp/myhost.mydomain.com:19117 (STATUS RPC)
      tcp/myhost.mydomain.com:19118 (STATE STATUS HTTP)
      $ curl http://myhost.mydomain.com:19118/state/v1/metrics
      ...
      $ curl http://myhost.mydomain.com:19118/
      ...
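
    The cluster controller status page can be fetched the same way. A sketch, where the host name, port and cluster name are placeholders - find the STATE-tagged port with vespa-model-inspect:
      $ vespa-model-inspect service container-clustercontroller
      ...
      $ curl http://myhost.mydomain.com:19050/clustercontroller-status/v1/search/
      ...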
    

Detecting overload

  • If the number of requests exceeds what the cluster is scaled for, there may be permanent overload until more nodes can be added or some data can be migrated off the cluster
  • If nodes have been disabled, cluster capacity is lowered, which may cause permanent overload
  • Spikes in external load may lead to temporary overload
  • Background tasks with inappropriate priorities may overload the cluster until they complete
  • Cluster incidents requiring maintenance operations to fix may cause temporary overload. Most maintenance operations can be performed at reasonably low priority, but some operations are more important than others and may have been assigned a priority higher than some less critical external traffic
  • Metrics for operation throughput can also be used to detect that the cluster is suddenly doing a lot of one type of operation that it doesn't normally do
  • Disk and/or memory might be exhausted and block feeding - recover from feed block

Queues

The most trustworthy metric for overload is the prioritized partition queues on the content nodes. In an overload situation, operations above some priority level will come in so fast that operations below this priority level will just fill up the queue. The metric to look out for is the .filestor.alldisks.queuesize metric, giving a total value for all the partitions.

While queue size is the most dependable metric, operations can be queued elsewhere too, which may hide the issue. There are some visitor related queues: .visitor.allthreads.queuesize and .visitor.cv_queuesize. Additionally, maintenance operations may be queued merely by not being created yet, as there is no reason to keep more pending operations than is needed to get good throughput.
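
These queue metrics can be polled from a content node's STATE port. A minimal sketch, where the host name and port are placeholders to be found with vespa-model-inspect, and python3 is assumed to be available for pretty-printing the JSON:

  $ vespa-model-inspect service storagenode
  ...
  $ curl -s http://myhost.mydomain.com:19102/state/v1/metrics | python3 -m json.tool | grep queuesize
  ...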

Track system resources

It may seem obvious to detect overload by monitoring CPU, memory and IO usage. However, a fully utilized resource does not necessarily indicate overload. As the content cluster supports prioritized operations, it will typically do as many low priority operations as it is able to when no high priority operations are in the queue. This means that even if there is just a low priority reprocess or a load rebalancing effort going on after a node went down, the cluster may still use all available resources to process these low priority tasks.

Network bandwidth consumption and switch saturation are something to look out for. These communication channels are not prioritized, and if they fill up processing low priority operations, high priority operations may not get through. If latencies get mysteriously high while no queuing is detected, the network is a candidate.
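
To get a quick view of per-interface network utilization on a node, standard OS tooling can be used. A minimal sketch, assuming the sysstat package is installed:

  $ sar -n DEV 1 3

Compare the reported throughput per interface against the known link capacity to spot saturation.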

Rate limiting

Many applications built on Vespa are used by multiple clients and face the problem of protecting clients from each other's overuse or abuse. To solve this problem, use the RateLimitingSearcher to rate limit load from each client type.

Monitor distance to ideal state

Refer to the distribution algorithm. Distributor status pages can be viewed to manually inspect state metrics:

.idealstate.idealstate_diff This metric tries to create a single value indicating distance to the ideal state. A value of zero indicates that the cluster is in the ideal state. Graphed values of this metric give a good indication of how fast the cluster gets back to the ideal state after changes. Note that some issues may hide other issues, so sometimes the graph may appear to stand still or even go up a bit again, as resolving one issue may reveal one or several others.
.idealstate.buckets_toofewcopies Specifically lists how many buckets have too few copies. Compare to the buckets metric to see how big a portion of the cluster this is.
.idealstate.buckets_toomanycopies Specifically lists how many buckets have too many copies. Compare to the buckets metric to see how big a portion of the cluster this is.
.idealstate.buckets The total number of buckets managed. Used by other metrics reporting bucket counts to know how big a part of the cluster they relate to.
.idealstate.buckets_notrusted Lists how many buckets have no trusted copies. Without a trusted copy, operations against the bucket may have poor performance, as requests have to be sent to many copies to try to create consistent replies.
.idealstate.delete_bucket.pending Lists how many buckets need to be deleted.
.idealstate.merge_bucket.pending Lists how many buckets there are where we suspect not all copies store identical document sets.
.idealstate.split_bucket.pending Lists how many buckets are currently being split.
.idealstate.join_bucket.pending Lists how many buckets are currently being joined.
.idealstate.set_bucket_state.pending Lists how many buckets are currently having their active state altered. These are high priority requests which should finish fast, so they should seldom be seen as pending.
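
These metrics can be inspected on a distributor's STATE port. A minimal sketch, reusing the placeholder host and port from the example above and assuming python3 is available for pretty-printing:

  $ curl -s http://myhost.mydomain.com:19118/state/v1/metrics | python3 -m json.tool | grep idealstate
  ...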

Cluster reconfig / restart

Notes:

  • Running vespa-deploy prepare will not change served configurations until vespa-deploy activate is run. vespa-deploy prepare will warn about all config changes that require restart.
  • Refer to search definitions for how to add/change/remove these.
  • Refer to elastic Vespa for details on how to add/remove capacity from a Vespa cluster.
  • See chained components for how to add or remove searchers and document processors.

Add or remove services on a node

It is possible to run multiple Vespa services on the same host. If changing the services on a given host, stop Vespa on that host before running vespa-deploy activate, as services are allocated port numbers depending on what else runs on the host. If some of the changed services are used by services on other hosts, restart services on those hosts too. Procedure (a command sketch follows the list):

  1. Edit services.xml and hosts.xml
  2. Stop Vespa on the nodes that have changes
  3. Run vespa-deploy prepare and vespa-deploy activate
  4. Start Vespa on the nodes that have changes
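
A command sketch of the procedure, assuming the application package is in the directory myapp and that vespa-stop-services / vespa-start-services are used to stop and start the nodes:

  $ vespa-stop-services                                    # on the nodes with changes
  $ vespa-deploy prepare myapp && vespa-deploy activate    # from a node with access to the config server
  $ vespa-start-services                                   # on the nodes with changes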

Add/remove/modify document processor chains

Document processing chains can be added, removed and modified at runtime. Modification includes adding/removing document processors in chains, changing names of chains and processors, etc. In short:

  1. Make the necessary adjustments in services.xml
  2. Run vespa-deploy prepare
  3. Run vespa-deploy activate

Note: when adding or modifying a processing chain in a running cluster while at the same time deploying a new document processor (i.e. a document processor that was unknown to Vespa at the time the cluster was started), also restart the document processor services:
  1. On each docproc node in the cluster, telnet to port 19098
  2. Type ls
  3. Type restart docprocservice, and/or restart docprocservice2, restart docprocservice3 and so on
  4. Type quit
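
A sketch of such a session on one docproc node, using the port from step 1 (output omitted):

  $ telnet localhost 19098
  ls
  restart docprocservice
  quit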

Restart distributor node

When a distributor stops, it will try to respond to any pending cluster state request first. New requests arriving after shutdown has commenced will fail immediately, as the socket is no longer accepting requests. Cluster controllers will thus detect stopping processes almost immediately.

The cluster state will be updated with the new state internally in the cluster controller. The cluster controller will then wait for at most min_time_between_new_systemstates before publishing the new cluster state - this reduces short-term state fluctuations.

The cluster controller has the option of setting states to make other distributors take over ownership of buckets, or mask the change, making the buckets owned by the distributor restarting unavailable for the time being. Distributors restart fast, so the restarting distributor may transition directly from up to initializing. If it doesn't, current default behavior is to set it down immediately.

If transitioning directly from up to initializing, requests going through the remaining distributors will be unaffected. Requests going through the restarting distributor will fail immediately when it shuts down, and are resent automatically by the client. The distributor typically restarts within seconds, and syncs up with the service layer nodes to get metadata on the buckets it owns, after which it is ready to serve requests again.

If the distributor transitions from up to down, and then later to initializing, other distributors will request metadata from the service layer nodes to take over ownership of buckets previously owned by the restarting distributor. Until the distributors have gathered this new metadata from all the service layer nodes, requests for these buckets can not be served and will fail back to the client. When the restarting node comes back up and is marked initializing or up in the cluster state again, the other distributors will dump their knowledge of the extra buckets they previously acquired.

For requests with timeouts of several seconds, the transition should be invisible due to automatic client resending. Requests with a lower timeout might fail, and it is up to the application whether to resend or handle failed requests.

Requests to buckets not owned by the restarting distributor will not be affected. The other distributors will start to do some work though, affecting latency, and distributors will refetch metadata for all buckets they own, not just the additional buckets, which may cause some disturbance.

Restart content node

When a content node restarts in a controlled fashion, it marks itself in the stopping state and rejects new requests. It will process its pending request queue before shutting down. Consequently, client requests are typically unaffected by content node restarts, and currently pending requests will typically be completed. New copies of buckets will be created on other nodes to store new requests at the appropriate redundancy. This happens whether the node transitions through the down or maintenance state. The difference is that if transitioning through the maintenance state, the distributor will not start any effort of synchronizing new copies with existing copies; they will just store the new requests until the maintenance node comes back up.

When coming back up, the content node starts by gathering information on which buckets it has data stored for. While this is happening, the service layer will expose that it is initializing, but not yet done with the bucket list stage. During this time, the cluster controller will not yet mark it initializing in the cluster state. Once the service layer node knows which buckets it has, it reports that it is calculating metadata for the buckets, at which time the node may become visible as initializing in the cluster state. At this point it may start processing requests, but as bucket checksums have not yet been calculated for all buckets, there will be buckets where the distributor does not know whether they are in sync with other copies.

The background load to calculate bucket checksums has low priority, but received load will automatically create metadata for the buckets it touches. In an overloaded cluster, the initializing step may thus not finish before all buckets have been initialized by requests. With a cluster close to max capacity, initializing may take quite some time.

The cluster is mostly unaffected during restart. During the initializing stage, bucket metadata is unknown, so distributors will assume other copies are more appropriate for serving read requests. If all copies of a bucket are in an initializing state at the same time, read requests may be sent to a bucket copy that does not have the most updated state.

Troubleshooting

Distributor or content node not existing

Content cluster nodes will register in the vespa-slobrok naming service on startup. If the nodes have not been set up or fail to start required processes, the naming service will mark them as unavailable.

Effect on cluster: Calculations of how big a percentage of the cluster is available will include these nodes even if they have never been seen. If many nodes are configured but not actually available, the cluster may set itself offline by concluding that too many nodes are down.

Content node not available on the network

vespa-slobrok requires nodes to ping it periodically. If they stop sending pings, they will be set as down, and the cluster will restore full availability and redundancy by redistributing load and data to the rest of the nodes. There is a time window where nodes may be unavailable but still not set down by slobrok.

Effect on cluster: Nodes that become unavailable will be set as down after a few seconds. Before that, document operations will fail and will need to be resent. After the node is set down, full availability is restored. Data redundancy will start to restore.
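
To check which nodes the cluster controller currently considers up, down or in maintenance, one option is the vespa-get-cluster-state tool, assuming it is available on the node:

  $ vespa-get-cluster-state
  ...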

Distributor or content node crashing

A crashing node restarts in much the same way as in a controlled restart. A content node will not finish processing the currently pending requests, causing failed requests. Client resending might hide these failures, as the distributor should be able to process the resent requests quickly, using other copies than the recently lost one.

Thrashing nodes

An example is the OS disk using an excessive amount of time to complete IO requests. The node ends up with the maximum number of files open, and as the OS is so dependent on the filesystem, it ends up not being able to do much at all.

get-node-state requests from the cluster controller fetch node metrics from /proc and write them to a temp directory on the disk before responding. This causes a thrashing node to time out get-node-state requests, setting the node down in the cluster state.

Effect on cluster: This has the same effect as the not available on the network issue above.

Constantly restarting distributor or service layer node

A broken node may end up with its processes constantly restarting. It may die during initialization due to accessing corrupt files, or it may die when it starts receiving requests of a given type that trigger a node-local bug. This is bad for distributor nodes, as the restarts cause constant ownership transfers between distributors, creating windows where buckets are unavailable.

The cluster controller has functionality for detecting such nodes. If a node restarts in a way that is not detected as a controlled shutdown more than max_premature_crashes times, the cluster controller will set the wanted state of this node to down.

Detecting a controlled restart is currently a bit tricky. A controlled restart is typically initiated by sending a TERM signal to the process. Having no other sign, the content layer has to assume that all TERM signals are caused by controlled shutdowns. Thus, if the process keeps being killed by the kernel due to using too much memory, this will look like controlled shutdowns to the content layer.