Vespa Benchmarking

Benchmarking a Vespa application measures how well a given test configuration performs, and is hence a key part of sizing a search cluster. Benchmarking a cluster can answer the following questions:

  • What throughput and latency to expect from a search node?
  • Which resource is the bottleneck in the system?

These in turn indirectly answer other questions, such as how many nodes are needed, and whether it will help to upgrade disk or CPU. Benchmarking thus helps find the optimal Vespa configuration, one that uses all resources well, which in turn lowers cost.

A good rule is to benchmark whenever the workload changes. Benchmarking should also be done when adding new features to queries.

Having an understanding of the query mix and SLA will help to set the test parameters. Before benchmarking, consider:

  • What is the expected query mix? A representative query mix is essential in order to get valid results. Splitting the mix into different query types is also a useful way to find which type of query is the heaviest.
  • What is the expected SLA, both in terms of latency and throughput?
  • How important is real-time behavior? What is the rate of incoming documents, if any?

If benchmarking using Vespa Cloud, see Vespa Cloud Benchmarking.


vespa-fbench

Vespa provides a query load generator tool, vespa-fbench, to run queries and generate statistics - much like a traditional web server load generator. It can run any number of clients (the more clients, the higher the load) for any length of time, and lets each client wait a configurable time before issuing the next query. It outputs the throughput, max, min, and average latency, as well as the 25, 50, 75, 90, 95 and 99 percentiles. This gives a fairly accurate picture of how well the system manages the workload.

Note: It is possible to list several hostnames and ports. The different hostnames will be distributed to the clients in a round-robin manner, such that, with two hosts, client 0, 2, …, 38 makes requests to the first host, while client 1, 3, …, 39 makes requests to the second host.
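
The assignment is simply client number modulo host count - a small illustration, using hypothetical host names and four clients:

```shell
# Round-robin host assignment as described above:
# even-numbered clients hit the first host, odd-numbered the second
hosts=(host-a host-b)
for client in 0 1 2 3; do
  echo "client $client -> ${hosts[client % 2]}"
done
# client 0 -> host-a
# client 1 -> host-b
# client 2 -> host-a
# client 3 -> host-b
```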

Disclaimer: vespa-fbench is a tool to drive load for benchmarking and tuning. It is not a tool for finding the maximum load or latencies in a production setting. This is due to the way it is implemented: it runs a fixed number of clients (-n) per run. This is good for testing, as proton can be observed at different levels of concurrency. In the real world, the number of active clients follows a different distribution, which impacts the 95 and 99 percentiles.

Prepare queries

vespa-fbench uses query files for GET and POST queries - see the reference. Example GET query (URL-encoded) and POST body:

/search/?yql=select+%2A+from+sources+%2A+where+true

{"yql" : "select * from sources * where true"}

A common way to make query files is to use queries from production installations, or to generate queries from the document feed or the expected queries.
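
As a sketch of the latter, the following generates a GET query file from a plain list of search terms (terms.txt is an assumed input; real terms may need URL-encoding):

```shell
# Hypothetical input: one search term per line
printf 'vespa\nranking\n' > terms.txt

# Emit one GET request per term - terms containing spaces or
# special characters would additionally need URL-encoding
while read -r term; do
  printf '/search/?query=%s\n' "$term"
done < terms.txt > query000.txt
```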

vespa-fbench runs each client in a separate thread. Split the query files into one file per client:

$ gunzip -c queries.gz | grep '/search/' | vespa-fbench-split-file 512

Make sure to generate as many files as there are concurrent clients (the -n parameter) - use -r to reuse query files. Each client uses the query and output file given by the file name pattern and its client number, i.e. client 1 uses query file query001.txt and output file output001.txt.
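
The %03d pattern is plain printf-style formatting, mapping client numbers to zero-padded file names:

```shell
# How the query%03d.txt pattern maps client numbers to file names
for client in 0 1 15; do
  printf 'client %d -> query%03d.txt\n' "$client" "$client"
done
# client 0 -> query000.txt
# client 1 -> query001.txt
# client 15 -> query015.txt
```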

Run queries

A typical vespa-fbench command looks like:

$ vespa-fbench -n 8 -q query%03d.txt -s 300 -c 0 myhost.mydomain.com 8080

This starts 8 clients, each using queries from a query file prefixed with query, followed by the client number - client 0 uses query000.txt, client 1 query001.txt, and so on. The -s parameter sets the benchmark to run for 300 seconds. The -c parameter states that each client should wait 0 milliseconds between queries. The last two parameters are the container hostname and port. Multiple hosts and ports can be provided, and the clients will be distributed uniformly across the containers, round-robin.

A more complex example, using Docker, hitting a Vespa Cloud endpoint:

$ docker run -v /Users/myself/tmp:/testfiles \
      -w /testfiles --entrypoint '' vespaengine/vespa \
      /opt/vespa/bin/vespa-fbench \
          -C data-plane-public-cert.pem -K data-plane-private-key.pem -T /etc/ssl/certs/ca-bundle.crt \
          -n 10 -q query%03d.txt -o output%03d.txt -xy -s 300 -c 0 \
          myapp.mytenant.aws-us-east-1c.z.vespa-app.cloud 443

Post Processing

After each run, a summary is written to stdout (and possibly an output file from each client) - example:

***************** Benchmark Summary *****************
clients:                      30
ran for:                    1800 seconds
cycle time:                    0 ms
lower response limit:          0 bytes
skipped requests:              0
failed requests:               0
successful requests:    12169514
cycles not held:        12169514
minimum response time:      0.82 ms
maximum response time:   3010.53 ms
average response time:      4.44 ms
25 percentile:              3.00 ms
50 percentile:              4.00 ms
75 percentile:              6.00 ms
90 percentile:              7.00 ms
95 percentile:              8.00 ms
99 percentile:             11.00 ms
actual query rate:       6753.90 Q/s
utilization:               99.93 %

Take note of the number of failed requests, as a high number here can indicate that the system is overloaded, or that the queries are invalid.

The -xy options output benchmark data to the output files, one per client. Note that saving all responses to disk might impact the performance of the benchmark itself. If only the summary is needed, it is recommended not to use output files.

Use vespa-fbench-result-filter.pl to format the above output into a space-separated format - this simplifies import into spreadsheets or plotting.
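
Alternatively, selected fields can be pulled out of a saved summary with standard tools - a sketch, assuming the summary above was saved to summary.txt:

```shell
# Save a (truncated) sample summary for illustration
cat > summary.txt << 'EOF'
failed requests:               0
99 percentile:             11.00 ms
actual query rate:       6753.90 Q/s
EOF

# Extract selected fields, trimming the padding whitespace,
# for tracking results across benchmark runs
awk -F':' '/failed requests|99 percentile|actual query rate/ {
  gsub(/^ +/, "", $2); print $1 "=" $2
}' summary.txt
# failed requests=0
# 99 percentile=11.00 ms
# actual query rate=6753.90 Q/s
```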


  • In some modes of operation, vespa-fbench waits before sending the next query. "utilization" represents the fraction of time that vespa-fbench spends sending queries and waiting for responses. For example, a utilization of 50% means that vespa-fbench is stress-testing the system 50% of the time, and doing nothing the remaining 50% of the time
  • Do not run vespa-fbench on the same machine as the search node, as it can impact the performance of the Vespa system. If running it on the container node, track CPU usage to make sure the impact is minimal
  • vespa-fbench latency results include network latency. Measure and subtract the network latency to obtain the true Vespa query latency
  • If many of the queries return zero results, the average latency will be artificially low
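
A worked example of the latency correction, using the average from the summary above and an assumed 0.60 ms network round-trip:

```shell
# Approximate server-side latency by subtracting the network
# round-trip (hypothetical numbers) from the fbench average
avg_ms=4.44   # fbench "average response time"
rtt_ms=0.60   # measured network round-trip time, e.g. via ping
awk -v a="$avg_ms" -v r="$rtt_ms" \
  'BEGIN { printf "approx server latency: %.2f ms\n", a - r }'
# approx server latency: 3.84 ms
```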


Benchmark

Strategy: first find the optimal requestthreads number, then find capacity by increasing the number of parallel test clients:

  1. Test with a single client (-n 1) and a single search thread to find a latency baseline. For each test run, increase the number of threads (1, 2, 4, 8, …) and measure query latency (vespa-fbench output) and CPU utilization (metric - see below). The number of search threads is configured in services.xml (element names per the content tuning reference):

    <content id="search" version="1.0">
        <tuning>
            <searchnode>
                <requestthreads>
                    <search>1</search>
                </requestthreads>
            </searchnode>
        </tuning>
        ...
    </content>

    Note: after deploying the thread config change, proton must be restarted for the new thread setting to take effect (look for ONLINE):

    $ vespa-stop-services && vespa-start-services && sleep 60 && vespa-proton-cmd --local getProtonStatus
      "documentdb:search","OK","state=ONLINE configstate=OK",""

  2. Use the #threads sweet spot, then increase the number of clients, observing latency and CPU. Rule of thumb: do not exceed 70% CPU

  • Allow a 1-2 minute warmup after a proton restart. Alternatively, let each test run last 5-10 minutes, and use the stable latency values
  • Normally, looking at CPU metrics for the node running proton suffices - but make sure the node running the container is not saturated
  • Test validity: when scaling down from the full data set to a single node, make sure to extract data that is representative, i.e. the number of ranked hits per node should be the same as in the full data set. Look at zero-hit queries in the fbench output to evaluate this
  • Grouping is multi-phase, so behavior in a larger system will differ, depending on the data set


Metrics

The container nodes expose the /metrics/v2/values interface - use it to dump metrics during benchmarks. Example - output all metrics from a content node:

$ curl http://localhost:8080/metrics/v2/values | \
  jq '.nodes[] | select(.role=="content/mysearchcluster/0/0") | .node.metrics[].values'

Output CPU utilization only:

$ curl http://localhost:8080/metrics/v2/values | \
  jq '.nodes[] | select(.role=="content/mysearchcluster/0/0") | .node.metrics[].values."cpu.util"'

Elastic Block Store

If benchmarking Vespa using EBS or a similar credit-based storage system, pay attention to IO operations. EBS generally has high performance as long as there are free IO credits. Use CloudWatch, navigate to Elastic Block Store and track the Burst Balance Average - make sure it never drops to 0.

Note that even though tests succeed for normal write and read load, there are operational use cases that radically increase IO load. The worst case is adding more nodes to a system, where the current nodes de-index documents as they are written to the new nodes, which themselves also have a high write load. The more documents, the more IO credits are spent.
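
A rough illustration of how credits drain, using gp2's published figures of 5.4 million initial IO credits and a baseline of 3 IOPS per GiB (assumptions - verify against current AWS documentation):

```shell
# Back-of-the-envelope: a 100 GiB gp2 volume has a 300 IOPS baseline;
# a sustained 1000 IOPS load drains the burst bucket at 700 IOPS net
awk -v credits=5400000 -v baseline=300 -v load=1000 \
  'BEGIN { printf "burst bucket empty after ~%.0f s\n", credits / (load - baseline) }'
# burst bucket empty after ~7714 s
```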

Summary: if using EBS or similar storage, make sure all operational use cases are tested. The Vespa Team normally recommends instances with local SSDs for applications with large data volumes or high IO load.