
Configuration Servers

Vespa Configuration Servers host the endpoint where application packages are deployed, and serve generated configuration to all services - see the overview and Vespa configuration for details. In other words, Vespa cannot be configured without config servers, and services cannot run without them.

It is useful to understand the Vespa start sequence. Refer to the sample applications multinode and multinode-HA for practical examples of multi-configserver configuration.

Vespa configuration is set up using one or more configuration servers (config servers). A config server uses ZooKeeper as distributed data storage for the configuration system. In addition, each node runs a config proxy to cache configuration data - find an overview at services start.


The config servers are defined in VESPA_CONFIGSERVERS, services.xml and hosts.xml:

$ VESPA_CONFIGSERVERS=myserver0.mydomain.com,myserver1.mydomain.com,myserver2.mydomain.com

    <admin version="2.0">
        <configservers>
            <configserver hostalias="admin0" />
            <configserver hostalias="admin1" />
            <configserver hostalias="admin2" />
        </configservers>
    </admin>

    <host name="myserver0.mydomain.com"><alias>admin0</alias></host>
    <host name="myserver1.mydomain.com"><alias>admin1</alias></host>
    <host name="myserver2.mydomain.com"><alias>admin2</alias></host>

VESPA_CONFIGSERVERS must be set on all nodes. This is a comma- or whitespace-separated list with the hostnames of all config servers, like myhost1.mydomain.com,myhost2.mydomain.com,myhost3.mydomain.com.
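
The list format can be illustrated with a small sketch. The helper name parse_configservers is hypothetical and not part of Vespa; it only shows how a comma- or whitespace-separated value splits into hostnames, for example when validating that all nodes carry the same value.

```python
import re


def parse_configservers(value: str) -> list[str]:
    """Split a VESPA_CONFIGSERVERS-style value on commas and/or whitespace.

    Hypothetical helper for illustration; Vespa does its own parsing.
    """
    return [host for host in re.split(r"[,\s]+", value.strip()) if host]


hosts = parse_configservers(
    "myhost1.mydomain.com,myhost2.mydomain.com myhost3.mydomain.com"
)
print(hosts)
```

Comparing the parsed list across nodes is one way to catch a node that was left with an outdated value.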

When there are multiple config servers, the config proxy will pick a config server randomly (to achieve load balancing between config servers). The config proxy is fault-tolerant and will switch to another config server (if there is more than one) if the one it is using becomes unavailable or there is an error in the configuration it receives.
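
As a rough illustration of the behavior described above (random selection with failover), here is a minimal sketch. It is not the config proxy's actual implementation; the function pick_config_server and the is_healthy callback are assumptions made for this example.

```python
import random


def pick_config_server(servers, is_healthy, rng=random):
    """Try the servers in random order and return the first healthy one.

    Returns None when no server is reachable. Illustration only: the
    real config proxy has its own selection and retry logic.
    """
    candidates = list(servers)
    rng.shuffle(candidates)  # random order spreads load across servers
    for server in candidates:
        if is_healthy(server):
            return server
    return None


servers = ["cfg0.example.com", "cfg1.example.com", "cfg2.example.com"]
# Pretend cfg0 is down: the choice falls on one of the other two.
print(pick_config_server(servers, lambda s: s != "cfg0.example.com"))
```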

For the system to tolerate n failures, ZooKeeper by design requires (2*n)+1 nodes. Consequently, only an odd number of nodes is useful, and at least 3 nodes are needed for a fault-tolerant config system.
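
The arithmetic behind the (2*n)+1 rule can be sketched as follows. The function names are made up for this example; the only assumption is the standard majority-quorum rule, where more than half of the nodes must be available.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return nodes - quorum_size(nodes)


for n in range(1, 6):
    print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating more failures, which is why an even node count adds no fault tolerance.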

Even when using just one config server, the application will work if the server goes down (but deploying application changes will not work). Since the config proxy runs on every node and caches configs, it will continue to serve config to the services on that node. However, restarting a node when config servers are unavailable means that services on the node will be unable to start since the cache will be destroyed when restarting the config proxy.

Refer to the admin model reference for more details on services.xml.

Start sequence

To bootstrap a Vespa application instance, the high-level steps are:

  • Start config servers
  • Deploy config
  • Start Vespa nodes

multinode-HA is a great guide on how to start a multinode Vespa application instance - try this first. Detailed steps for config server startup:

  1. Set VESPA_CONFIGSERVERS on all nodes, using fully qualified hostnames and the same value on all nodes, including the config servers.
  2. Start the config server on the nodes configured in services/hosts.xml. Make sure the startup is successful by inspecting /state/v1/health, default on port 19071:
    $ curl http://localhost:19071/state/v1/health
    {
        "time" : 1651147368066,
        "status" : {
            "code" : "up"
        },
        "metrics" : {
            "snapshot" : {
                "from" : 1.651147308063E9,
                "to" : 1.651147367996E9
            }
        }
    }
    If there is no response on the health API, two things can have happened:
    • The config server process did not start - inspect logs using vespa-logfmt, or check $VESPA_HOME/logs/vespa/vespa.log, normally /opt/vespa/logs/vespa/vespa.log.
    • The config server process started, and is waiting for ZooKeeper quorum:
    $ vespa-logfmt -S configserver
    configserver     Container.com.yahoo.vespa.zookeeper.ZooKeeperRunner	Starting ZooKeeper server with /opt/vespa/var/zookeeper/conf/zookeeper.cfg. Trying to establish ZooKeeper quorum (members: [node0.vespanet, node1.vespanet, node2.vespanet], attempt 1)
    configserver     Container.com.yahoo.container.handler.threadpool.ContainerThreadpoolImpl	Threadpool 'default-pool': min=12, max=600, queue=0
    configserver     Container.com.yahoo.vespa.config.server.tenant.TenantRepository	Adding tenant 'default', created 2022-04-28T13:02:24.182Z. Bootstrapping in PT0.175576S
    configserver     Container.com.yahoo.vespa.config.server.rpc.RpcServer	Rpc server will listen on port 19070
    configserver     Container.com.yahoo.container.jdisc.state.StateMonitor	Changing health status code from 'initializing' to 'up'
    configserver     Container.com.yahoo.jdisc.http.server.jetty.Janitor	Creating janitor executor with 2 threads
    configserver     Container.com.yahoo.jdisc.http.server.jetty.JettyHttpServer	Threadpool size: min=22, max=22
    configserver     Container.org.eclipse.jetty.server.Server	jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm
    configserver     Container.org.eclipse.jetty.server.handler.ContextHandler	Started o.e.j.s.ServletContextHandler@341c0dfc{19071,/,null,AVAILABLE}
    configserver     Container.org.eclipse.jetty.server.AbstractConnector	Started configserver@3cd6d147{HTTP/1.1, (http/1.1, h2c)}{}
    configserver     Container.org.eclipse.jetty.server.Server	Started @21955ms
    configserver     Container.com.yahoo.container.jdisc.ConfiguredApplication	Switching to the latest deployed set of configurations and components. Application config generation: 0
    It will hang until quorum is reached and the health status log line (Changing health status code from 'initializing' to 'up') is emitted. Root causes for missing quorum can be:
    • No connectivity between the config servers. ZooKeeper logs the members like (members: [node0.vespanet, node1.vespanet, node2.vespanet], attempt 1). Verify that the nodes running config server can reach each other on port 2181.
    • Missing connectivity can be caused by incorrect network configuration. multinode-HA uses a Docker network; make sure there are no underscores in the hostnames.
  3. Once all config servers return up on /state/v1/health, an application package can be deployed. If deploy fails, verify the config server health first. If the config servers are up and deploy still fails, the issue is most likely in the application package - if so, refer to application packages.
  4. A successful deployment logs the following, for the prepare and activate steps:
    Container.com.yahoo.vespa.config.server.ApplicationRepository	Session 2 prepared successfully.
    Container.com.yahoo.vespa.config.server.deploy.Deployment	Session 2 activated successfully using no host provisioner. Config generation 2. File references: [file '9cfc8dc57f415c72']
    Container.com.yahoo.vespa.config.server.session.SessionRepository	Session activated: 2
  5. Start the Vespa nodes. Technically, they can be started at any time. When troubleshooting, it is easier to make sure the config servers are started successfully, and deployment was successful - before starting any other nodes. Refer to the Vespa start sequence and Vespa start / stop / restart.

Make sure to look for logs on all config servers when debugging.
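
The health check in step 2 can be scripted. The sketch below assumes the response has the shape shown in the curl example above, with the state under status.code; the function names and the 5-second timeout are choices made for this example.

```python
import json
import urllib.request


def health_code(body: str) -> str:
    """Extract status.code from a /state/v1/health JSON response body."""
    return json.loads(body)["status"]["code"]


def fetch_health_code(host: str, port: int = 19071, timeout: float = 5.0):
    """Return the health code, or None if the server gives no usable answer."""
    url = f"http://{host}:{port}/state/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return health_code(response.read().decode("utf-8"))
    except (OSError, ValueError, KeyError):
        # No connection, timeout, invalid JSON or unexpected structure.
        return None


sample = """
{
  "time": 1651147368066,
  "status": {"code": "up"},
  "metrics": {"snapshot": {"from": 1.651147308063E9, "to": 1.651147367996E9}}
}
"""
print(health_code(sample))  # up
```

Calling fetch_health_code for every host in VESPA_CONFIGSERVERS before deploying gives a quick view of which config servers are not yet up.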

Scaling up

Add a config server node for increased fault tolerance or when replacing a node. Add nodes one at a time. Procedure:

  1. Install Vespa on the new config server node.
  2. Add the config server node to VESPA_CONFIGSERVERS on all nodes (both config server nodes and other nodes).
  3. Restart the config server on the original config server nodes and start it on the new one, one at a time. To verify that the new config is in use, run /opt/vespa/libexec/vespa/vespa-curl-wrapper https://localhost:19071/status | jq .configserverConfig.zookeeperserver after each restart and confirm that the config servers are defined as expected.
  4. Update services.xml and hosts.xml with the new set of config servers, then run vespa-deploy prepare and vespa-deploy activate.
  5. Restart other nodes (all except config servers) one by one to start using the new config servers.

Note: ZooKeeper will automatically redistribute the application data.

Scaling up by Majority

When increasing from 1 to 3 nodes or from 3 to 7, the blank nodes constitute a majority in the cluster. After restarting the config servers, they might not contain the old application data, because the blank nodes might win the ZooKeeper master election, depending on restart timing. To avoid this, scale up in smaller steps so that the blank nodes never form a majority. For example:

  1. Scale from 1 to 2
  2. Scale from 2 to 3
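
Whether a planned step is risky can be checked with simple arithmetic. This is a sketch under the assumption that the risk exists exactly when the new, blank nodes alone can form a majority of the target cluster; the function names are invented for this example.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def blank_nodes_form_majority(current: int, target: int) -> bool:
    """True if the nodes added in one step could form a quorum on their own."""
    blank = target - current
    return blank >= quorum_size(target)


for current, target in [(1, 3), (3, 7), (1, 2), (2, 3), (3, 5)]:
    risky = blank_nodes_form_majority(current, target)
    print(f"{current} -> {target}: {'risky' if risky else 'ok'}")
```

The steps 1 to 3 and 3 to 7 come out as risky, while 1 to 2 followed by 2 to 3 does not, matching the example above.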

Scaling down

Remove a config server from a cluster:

  1. Remove the config server node from VESPA_CONFIGSERVERS on all Vespa nodes.
  2. Restart other nodes one by one to start using the new set of config servers.
  3. Restart remaining config servers one by one.
  4. Verify that these nodes have data, by using vespa-get-config or vespa-zkcli ls (see below). If they are blank, redo vespa-deploy prepare and vespa-deploy activate. Also see health checks.
  5. Pull removed node from production.

Replacing nodes

  • Make sure to replace only one node at a time.
  • If you have fewer than 3 config servers, first scale up with a new node, then scale down by removing the old node. Repeat for each node you want to replace.
  • If you have 3 or more, replace one of the old nodes in VESPA_CONFIGSERVERS with the new one instead of adding one; otherwise follow the same procedure as in Scaling up. Repeat for each node you want to replace.

Tools

    Tools to access config:


    ZooKeeper handles data consistency across multiple config servers. The config server Java application runs a ZooKeeper server, embedded with an RPC frontend that the other nodes use. ZooKeeper stores data internally in nodes that can have sub-nodes, similar to a file system.

    When starting or restarting the config server, the configuration file for ZooKeeper, $VESPA_HOME/var/zookeeper/conf/zookeeper.cfg, is generated based on the contents of VESPA_CONFIGSERVERS. Hence, all config servers must be restarted if VESPA_CONFIGSERVERS changes on a config server node.

    At vespa-deploy prepare, the application's files, along with global configurations, are stored in ZooKeeper. The application data is stored under /config/v2/tenants/default/sessions/[sessionid]/userapp. At vespa-deploy activate, the newest application is activated live by writing the session id into /config/v2/tenants/default/applications/default:default:default. It is at that point the other nodes get configured.

    Use vespa-zkcli to inspect state, replace with actual session id:

    $ vespa-zkcli ls  /config/v2/tenants/default/sessions/sessionid/userapp
    $ vespa-zkcli get /config/v2/tenants/default/sessions/sessionid/userapp/services.xml
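
To avoid typing the path by hand, the commands can be generated from a session id. This sketch only builds strings from the path layout given above; the helper names are hypothetical, and the default tenant is assumed.

```python
def session_userapp_path(session_id: int, tenant: str = "default") -> str:
    """ZooKeeper path holding the application files of one session."""
    return f"/config/v2/tenants/{tenant}/sessions/{session_id}/userapp"


def zkcli_commands(session_id: int) -> list[str]:
    """Build the two vespa-zkcli inspection commands for a session."""
    base = session_userapp_path(session_id)
    return [
        f"vespa-zkcli ls {base}",
        f"vespa-zkcli get {base}/services.xml",
    ]


for command in zkcli_commands(2):
    print(command)
```

Session 2 is used here because it is the session id in the deployment log example; use the id reported by your own deployment.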

    The ZooKeeper server logs to $VESPA_HOME/logs/vespa/zookeeper.configserver.0.log (files are rotated with sequence number)

    ZooKeeper Recovery

    If the config server(s) should experience data corruption, for instance a hardware failure, use the following recovery procedure. One example of such a scenario is if $VESPA_HOME/logs/vespa/zookeeper.configserver.0.log says java.io.IOException: Negative seek offset at java.io.RandomAccessFile.seek(Native Method), which indicates ZooKeeper has not been able to recover after a full disk. There is no need to restart Vespa on other nodes during the procedure:

    1. vespa-stop-configserver
    2. vespa-configserver-remove-state
    3. vespa-start-configserver
    4. vespa-deploy prepare <application path>
    5. vespa-deploy activate

    This procedure completely cleans out ZooKeeper's internal data snapshots and deploys from scratch.

    Note that by default the cluster controller that maintains the state of the content cluster uses the same shared ZooKeeper instance, so the content cluster state is also reset when removing state. Manually set state will be lost (e.g. a node with user state down). It is possible to run cluster controllers in standalone ZooKeeper mode - see standalone-zookeeper.

    ZooKeeper barrier timeout

    If the config servers are heavily loaded, or the applications being deployed are big, the internals of the server may time out when synchronizing with the other servers during deploy. To work around this, increase the timeout by setting VESPA_CONFIGSERVER_ZOOKEEPER_BARRIER_TIMEOUT to 600 (seconds) or higher, and restart the config servers.


    To access config from a node not running the config system (e.g. doing feeding via the Document API), use the environment variable VESPA_CONFIG_SOURCES:

    $ export VESPA_CONFIG_SOURCES="myadmin0.mydomain.com:19071,myadmin1.mydomain.com:19071"

    Alternatively, for Java programs, use the system property configsources and set it programmatically or on the command line with the -D option to Java. The syntax for the value is the same as for VESPA_CONFIG_SOURCES.
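
The value has a regular host:port form, so it can be generated from a host list. A minimal sketch, assuming the same port for every config server; the helper name is invented, and the port 19071 is taken from the example above.

```python
def config_sources(hosts: list[str], port: int) -> str:
    """Join hosts into a VESPA_CONFIG_SOURCES-style value (host:port,...)."""
    return ",".join(f"{host}:{port}" for host in hosts)


value = config_sources(["myadmin0.mydomain.com", "myadmin1.mydomain.com"], 19071)
print(f'export VESPA_CONFIG_SOURCES="{value}"')
```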

    System requirements

    The minimum heap size for the JVM the config server runs under is 128 MB, and the maximum heap size is 2 GB (which can be changed with a setting). The config server writes a transaction log that is regularly purged of old items, so little disk space is required. Note that running on a server that has a lot of disk I/O will adversely affect performance and is not recommended.


    The config server RPC port can be changed by setting VESPA_CONFIGSERVER_RPC_PORT on all nodes in the system.

    Changing HTTP port requires changing the port in $VESPA_HOME/conf/configserver-app/services.xml:

        <server port="19079" id="configserver" />

    When deploying, use the -p option if the port is changed from the default.


    Health checks

    Verify that a config server is up and running using /state/v1/health, see start sequence. Status code is up if the server is up and has finished bootstrapping.

    Alternatively, use http://localhost:19071/status.html which will return response code 200 if server is up and has finished bootstrapping.

    Metrics are found at /state/v1/metrics. Use vespa-model-inspect to find host and port number, port is 19071 by default.


    When having more than one config server, consistency between the servers is crucial. http://localhost:19071/status can be used to check that settings for config servers are the same for all servers.

    vespa-config-status can be used to check config on nodes.

    http://localhost:19071/application/v2/tenant/default/application/default displays the active config generation, which should be the same on all servers and the same as in the response from running vespa-deploy.
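
Once the generation has been read from each server, the comparison is straightforward. The sketch below assumes the generations have already been collected into a mapping from hostname to generation number; how they are fetched and the function name are left as assumptions.

```python
def inconsistent_servers(generations: dict[str, int], expected: int) -> list[str]:
    """Return the servers whose active generation differs from the expected one."""
    return sorted(
        host for host, generation in generations.items() if generation != expected
    )


observed = {
    "myserver0.mydomain.com": 2,
    "myserver1.mydomain.com": 2,
    "myserver2.mydomain.com": 1,  # lagging behind in this made-up example
}
print(inconsistent_servers(observed, expected=2))
```

Here expected would be the config generation reported by vespa-deploy activate.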

    Bad Node

    If running with more than one config server and one of these goes down or has a hardware failure, the cluster will still work and serve config as usual (clients will switch to one of the good servers). It is not necessary to remove a bad server from the configuration.

    Deploying applications will take longer, as vespa-deploy will not be able to complete a deployment on all servers when one of them is down. If this is troublesome, lower the barrier timeout (default value is 120 seconds).

    Note also that if you have not configured cluster controllers explicitly, these will run on the config server nodes and the operation of these might be affected. This is another reason for not trying to manually remove a bad node from the config server setup.

    Stuck filedistribution

    The config system distributes binary files (such as jar bundle files) using file-distribution - use vespa-status-filedistribution to see detailed status if it gets stuck.


    Insufficient memory on the host / in the container running the config server will cause startup or deploy / configuration problems - see Docker containers.


    The following can be caused by a full disk on the config server, or clocks out of sync:

    at com.yahoo.vespa.zookeeper.ZooKeeperRunner.startServer(ZooKeeperRunner.java:92)
    Caused by: java.io.IOException: The accepted epoch, 10 is less than the current epoch, 48

    Users have reported that "Copying the currentEpoch to acceptedEpoch fixed the problem".