Config sentinel

The config sentinel starts and stops services - and restart failed services unless they are manually stopped. All nodes in a Vespa system has at least these running processes:

config-proxy proxies config requests between Vespa applications and the configserver node. All configuration is cached locally so that this node can maintain its current configuration, even if the configserver shuts down
config-sentinel registers itself with the config-proxy and subscribes to and enforces node configuration, meaning the configuration of what services should be run locally, and with what parameters
vespa-logd monitors $VESPA_HOME/logs/vespa/vespa.log, which is used by all other services, and relays everything to the log-server

Start sequence:

  1. config-proxy is started. The environment variables VESPA_CONFIGSERVERS and VESPA_CONFIGSERVER_RPC_PORT are used to connect to the config-server(s). It will retry all config servers in case some are down.
  2. config-sentinel is started, and subscribes to node configuration (i.e. a service list) from config-proxy using its hostname as the config id. See Node and Network setup - Concepts and Requirements for details about how the hostname is detected and how to override it. The config for the config-sentinel (the service list) lists the processes to be started, along with the config id to assign to each, typically the logical name of that service instance.
  3. config-proxy subscribes to node configuration from config-server, caches it, and returns the result to config-sentinel
  4. config-sentinel starts the services given in the node configuration, with the config id as argument - see example output below, like id="search/qrservers/qrserver.0". Each service:
    1. Subscribes to configuration from config-proxy
    2. config-proxy subscribes to configuration from config-server, caches it and returns result to the service
    3. The service runs according to its configuration, logging to $VESPA_HOME/logs/vespa/vespa.log. The processes instantiate internal components, each assigned the same or another config id, and instantiating further components.

When new config is deployed to config-servers they propagate the changed configuration to nodes subscribing to it. In turn, these nodes reconfigure themselves accordingly.

User interface

The config sentinel runs a RPC service which can be used to list, start and stop the services supposed to run on that node. This can be useful for testing and debugging. Use the vespa-sentinel-cmd command line tool to trigger these actions. Example output from vespa-sentinel-cmd list:

vespa-sentinel-cmd 'sentinel.ls' OK.
container state=RUNNING mode=AUTO pid=27993 exitstatus=0 id="default/container.0"
container-clustercontroller state=RUNNING mode=AUTO pid=27997 exitstatus=0 id="admin/cluster-controllers/0"
distributor state=RUNNING mode=AUTO pid=27996 exitstatus=0 id="search/distributor/0"
logd state=RUNNING mode=AUTO pid=5751 exitstatus=0 id="hosts/r6-3/logd"
logserver state=RUNNING mode=AUTO pid=27994 exitstatus=0 id="admin/logserver"
searchnode state=RUNNING mode=AUTO pid=27995 exitstatus=0 id="search/search/cluster.search/0"
slobrok state=RUNNING mode=AUTO pid=28000 exitstatus=0 id="admin/slobrok.0"
To learn more about the processes and services, see files and processes. Use vespa-model-inspect host hostname to list services running on a node.

Cluster startup

One can configure Vespa to do a Safe Cluster Startup, setting a minimum node available ratio. Use this to control the Vespa cluster start sequence. The config sentinel will not start services on a node unless it has connectivity to a minimum other nodes.

The safe mode is disabled by default, enable it by setting configuration on top level in services.xml (since Vespa-7.415):

<services>
    <config name="cloud.config.sentinel">
        <connectivity>
            <minOkPercent>40</minOkPercent>
            <maxBadCount>1</maxBadCount>
        </connectivity>
    </config>

Example: minOkPercent 40 means that services will be started only if more than or equal to 40% of nodes are up. If there are 5 nodes (container + content nodes) in the application, the first node started will not start its services - when the second node is started, services will be started on both.

maxBadCount is for connectivity checks where the other node is is up, but we still do not have proper two-way connectivity. Normally, one-way connectivity means network configuration is broken and needs looking into, so this may be set very low (1 or even 0 are the recommended values). If there are some temporary problems (in the example below non-responding DNS which leads to various issues at startup) the config sentinel will loop and retry, so the service startup will just be slightly delayed.

Example log:

[2021-06-15 14:33:25] EVENT   : starting/1 name="sbin/vespa-config-sentinel -c hosts/le40808.ostk (pid 867)"
[2021-06-15 14:33:25] EVENT   : started/1 name="config-sentinel"
[2021-06-15 14:33:25] CONFIG  : Sentinel got 4 service elements [tenant(footest), application(bartest), instance(default)] for config generation 1001
[2021-06-15 14:33:25] CONFIG  : Booting sentinel 'hosts/le40808.ostk' with [stateserver port 19098] and [rpc port 19097]
[2021-06-15 14:33:25] CONFIG  : listening on port 19097
[2021-06-15 14:33:25] CONFIG  : Sentinel got model info [version 7.420.21] for 35 hosts [config generation 1001]
[2021-06-15 14:33:25] CONFIG  : connectivity.maxBadCount = 3
[2021-06-15 14:33:25] CONFIG  : connectivity.minOkPercent = 40
[2021-06-15 14:33:28] INFO    : Connectivity check details: 2086533.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le01287.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23256.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23267.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23297.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23312.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23317.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le23319.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:28] INFO    : Connectivity check details: le30550.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le30553.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:28] INFO    : Connectivity check details: le30556.ostk -> unreachable from me, but up
[2021-06-15 14:33:28] INFO    : Connectivity check details: le30560.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le30567.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40387.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40389.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40808.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40817.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40833.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40834.ostk -> unreachable from me, but up
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40841.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40858.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40860.ostk -> unreachable from me, but up
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40863.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40873.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40892.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40900.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40905.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: le40914.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: sm02318.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: sm02324.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: sm02340.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: zt40672.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: zt40712.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: zt40728.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] INFO    : Connectivity check details: zt41329.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:28] WARNING : 8 of 35 nodes up but with network connectivity problems (max is 3)
[2021-06-15 14:33:28] WARNING : Bad network connectivity (try 1)
[2021-06-15 14:33:30] WARNING : slow resolve time: 'le30556.ostk' -> '1234:5678:90:123::abcd' (5.00528 s)
[2021-06-15 14:33:30] WARNING : slow resolve time: 'le40834.ostk' -> '1234:5678:90:456::efab' (5.00527 s)
[2021-06-15 14:33:30] WARNING : slow resolve time: 'le40860.ostk' -> '1234:5678:90:789::cdef' (5.00459 s)
[2021-06-15 14:33:31] INFO    : Connectivity check details: le23312.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le23319.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le30553.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le30556.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le40834.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le40841.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Connectivity check details: le40860.ostk -> connect OK, but reverse check FAILED
[2021-06-15 14:33:31] INFO    : Connectivity check details: le40863.ostk -> OK: both ways connectivity verified
[2021-06-15 14:33:31] INFO    : Enough connectivity checks OK, proceeding with service startup
[2021-06-15 14:33:31] EVENT   : starting/1 name="searchnode"
...