# Vespa Serving Scaling Guide

Will it scale? is a question we hear a lot? The answer is usually yes as Vespa can scale in any dimension:

• Scale document volume and write (update/put) volume
• Scale query throughput
• Scale serving latency to meet service level agreement (SLA)
The question you try to answer during a sizing exercise is "what the total cost would be to serve a use case using Vespa". This document helps sizing an application correctly with as low cost as possible. Vespa is used to implement many use cases, and this document is relevant for all of them:
With Vespa it is possible to do benchmarking on a few nodes to infer the overall performance and cost of the chosen deployment architecture, and as Vespa supports live resizing, it is easy to scale from a prototype to a full production size deployment.

This document covers sizing and capacity planning for serving, see feed performance sizing for feed performance sizing and Vespa serving feature tuning. It also covers the following topics:

## Data distribution in Vespa - flat versus grouped

The basic element in the Vespa search architecture is a content node, which is part of a content cluster. A Vespa deployment can have several content clusters, which can be scaled independently.

A content node holds a fraction of the entire data corpus. Data is distributed to nodes using a distribution algorithm, which goal is to uniformly distribute data over the set of nodes. The goal is also to avoid distribution skew, while at the same time supporting re-distribution of data, with minimal data movement, if the size of the content cluster changes. Read elastic Vespa to learn how data is distributed across nodes, and how adding or removing nodes works. See also Vespa's consistency model documentation.

With a flat distribution, the content is distributed to content nodes using the ideal state distribution algorithm. A query is dispatched in parallel from a container instance to all content nodes in the content cluster. Each content node searches the active part of the ready sub-database. The above figure illustrates a deployment using 4 nodes with redundancy 2 and searchable-copies 2 - see the availability section.

When using flat data distribution, the only way to scale query throughput is to reduce the search latency. Given a fixed occupancy (users, load clients), this relationship between query throughput and latency is described by Little's law - more on this in content cluster scalability model section.

With a grouped distribution, content is distributed to a configured set of groups, such that the entire document collection is contained in each group. A group contains a set of content nodes where the content is distributed using the distribution algorithm. In the above illustration, there are 4 nodes in total, 2 groups with 2 nodes in each group. redundancy is 2 and searchable-copies is also 2. As can seen from the figure with this grouped configuration, the content nodes only have a populated ready sub-database. A query is dispatched in parallel to all nodes in one group at a time using a dispatch-policy. The default policy is round-robin, which gives an uniform query distribution over the set of groups.

### High Data Availability

Ideally, the data is available and searchable at all times, even during node failures. High availability costs resources due to data replication. How many replicas of the data to configure, depends on what kind of availability guarantees the deployment should provide. Configure availability vs cost:

redundancy Defines the total number of copies of each piece of data the cluster will store and maintain to avoid data loss. Example: with a redundancy of 2, the system tolerates 1 node failure before any further node failures may cause data to become unavailable. Configures how many of the copies (as configured with redundancy) to be indexed (ready) at any time. Configuring searchable-copies to be less than redundancy saves resources (memory, disk, cpu), as not all copies are indexed (ready). In case of node failure, the remaining nodes needs to a index the not ready documents which belonged to the failed node. In this transition period, the search has reduced search coverage.

The above figure illustrates the three sub-databases inside a Vespa content node.

• The documents in the Ready DB are indexed, but only the documents in Active state are searchable. In a flat distributed system there is only one active instance of the same document, while with grouped distribution there is one active instance per group.
• The documents in the Not Ready DB are stored but not indexed.
• The documents in the Removed are stored but blacklisted, hidden from search. The documents are permanently deleted from storage by Proton maintenance jobs.
If the availability guarantees tolerate temporary search coverage loss during node failures (e.g. searchable-copies=1), this is by far the most optimal for for serving performance - the query work per node is less as index structures like posting lists are smaller. The index structures only contains documents in Active state, not including Not Active documents.

With searchable-copies=2 and redundancy=2, each replica is fully indexed on separate content nodes. Only the documents in Active state is searchable, the posting lists for a given term is (up to) doubled as compared to searchable-copies=1.

See Content cluster Sizing example deployments for examples using grouped and flat data distribution.

## Life of a query in Vespa

Vespa executes a query in two protocol phases (or more if using result grouping features) to optimize the network footprint of the parallel query execution. The first protocol phase executes the query in parallel over content nodes in a group to find the global top hits, the second protocol phase fetches the data of the global top hits.

During the first phase, content nodes match and rank documents using the selected rank-profile/model. The hits are returned to the stateless container for merging and potentially blending when multiple content clusters are involved.

When the global top ranking documents are found, the second protocol phase fetch the summary data for the global best hits (e.g. summary snippets, the original field contents, and ranking features). By doing the query in two protocol phases one avoids transferring summary data for hits which will not make it into the global best hits.

Components Involved in Query Execution:

• Container
• Parses the API request and the query and run time context features.
• Modify the query according to the search definition specification (stemming, etc) for a text search application or creating run time query or user context tensors for a ML serving application.
• Invokes chains of custom container components/plugins which can work on the request and query input and also the results.
• Dispatching of query to content nodes in the content cluster(s) for parallel execution. With flat distribution, queries are dispatched to all content nodes while with a grouped distribution the query is dispatched to all content nodes within a group and the queries are load-balanced between the groups using a dispatch-policy.
• Blending of global top ranking results from cluster(s).
• Fetching the top ranking results with document summaries from cluster(s).
• Result processing and possible top-k re-ranking and finally rendering of results back to client.
• Search (Proton)
• Finding all documents matching the query specification. For a ML serving use case, the selection might be a subset of the content pool (e.g limit the model to only be evaluated for content-type video documents), while for a text ranking application it might be a WAND text matching query.
• Calculating the score (which might be a text ranking relevancy score or the inferred score of a Machine Learned model) of each hit, using the chosen rank-profile. See ranking with Vespa.
• Aggregating information over all the generated hits using result grouping.
• Sorting hits on relevancy score (text ranking) or inference score (e.g ML model serving), or on attribute(s).
• Processing and returning the document summaries of the selected top hits (during summary fetch phase after merging and blending has happened on levels above).
The detailed execution inside Proton during the matching and ranking first protocol phase is:
1. Build up the query tree from the serialized network representation.
2. Lookup the query terms in the index and B-tree dictionaries and estimate the number of hits each term and parts of the query tree will produce. Terms which searches attribute fields without fast-search will be given a hit count estimate equal to the total number of documents.
3. Optimize and re-arrange the query tree for most efficient performance trying to move terms or operators with the lowest hit ratio estimate first in the query tree.
4. Prepare for query execution, by fetching posting lists from the index and B-tree structures.
5. Multi-threaded execution per search starts using the above information. Each thread will do its own thread local setup.
6. Each search thread will evaluate the query over its document space.
7. The search threads complete first phase and agree which hits will continue to second phase ranking (if enabled per the used rank-profile). The threads operate over a shared heap with the global top ranking hits.
8. Each thread will the complete second phase and grouping/aggregation/sorting.
9. Merge all threads results and return back up to the container.
Container clusters are stateless and hence easy to scale horizontally, and don't require any data distribution during re-sizing. The set of stateful content clusters can be scaled independently and re-sized which requires re-distribution of data. Re-distribution of data in Vespa, is supported and designed to be done without significant serving impact. Altering the number of nodes or groups in a Vespa content cluster does not require re-feeding of the corpus, so it's easy to start out with a sample prototype and scale it to production scale workloads.

## Content cluster scalability model

Vespa is a parallel computing platform where the work of matching and ranking is parallelized across a set of nodes and processors. The speedup we can get by altering the number of nodes in a Vespa content group follows Amdahl's law, which is a formula used to find the maximum improvement possible by improving a particular part of a system. In parallel computing, Amdahl's law is mainly used to predict the theoretical maximum speedup for program processing using multiple processors. In Vespa, as in any parallel computing system, there is work which can be parallelized and work which cannot. The relationship between these two work types determine how to best scale the system, using a flat or grouped distribution. We introduce the following concepts:

 static query work Portion of the query work on a content node that does not depend on the number of documents indexed on the node. This is an administrative overhead caused by system design and abstractions, e.g. number of memory allocations per query term. Typically a large query tree means higher static work, and this work cannot be parallelized over multiple processors, threads or nodes. The static query work portion is described in step 1 to 4 and step 9 in the detailed life of a query explanation above. dynamic query work Portion of the query work on a content node that depends on the number of documents indexed and active on the node. This portion of the work scales mostly linearly with the number of matched documents. The dynamic query work can be parallelized over multiple processors and nodes. Referenced later as DQW. The DQW also depends on the phase two protocol summary fill where the actual contents of the global best documents is fetched from the content nodes which produced the hit in the first protocol phase. Total query work The total query work is given as the dynamic query work (DQW) + static query work (SQW).
Adding content nodes to content cluster (keeping the total document volume fixed) configured with flat distribution, reduces the dynamic query work per node (DQW) but does not reduce the static query work (SQW). The overall system cost also increases as you need to rent another node.

Since DQW depends and scales almost linearly with the number of documents on the content nodes, we can try to distribute the work over more nodes. Amdahl's law specifies that the maximum speedup we achieve by parallelizing the dynamic work (DQW) is given by the formula:

$$\text{max_speedup}_{\text{group}} = \frac{1}{1 - \frac{DQW}{SQW+DQW}}$$

For example, if we through metrics see that the DQW = 0.50, the maximum speedup we can get by increasing parallelism by using more nodes and decreasing DQW is 2. With fixed occupancy (number of users, clients or load), Little's Law' tells us that we cans achieve 2 times throughput if we are able to speed up the latency by 2 times:

$$\frac{1}{1 - \frac{0.5}{0.5+0.5}} = 2$$

When SQW is no longer significantly less than DQW, adding more nodes in a flat distributed cluster just increases the overall system cost. This without any serving performance gain, except increasing overall supported feed throughput, which increases almost linearly with number of nodes.

Two different DQW/(DQW+SQW) factors are are illustrated in the figures below. The overall query work TQW is the same for both cases (10 ms), but the DQW portion of the TQW is different. The throughput (QPS) is a function of the latency (Little's Law) and the number of cpu cores * nodes. Using 1 node with 24 v-cpu cores and 10 ms service time (TQW), we expect to reach close to 2400 QPS at 100% utilization (unless there are other bottlenecks like network or stateless container processing).

In the first figure the overall latency is 10 ms, but the dynamic query work (latency) is only 50% and given Amdahl's law it follows that the maximum speedup we can get is 2. This is true regardless of how many processors or nodes we distribute the dynamic query work over. No matter how many nodes we add, we don't get above 4800 queries/s. The only thing we achieve by adding more nodes is increasing the cost without any performance benefits.

In the second figure we illustrate a system where the dynamic work portion is much higher (0.9), and the theoretical maximum speedup becomes bound by 10x as given by Amdahl's law. Note that both figures are with a single flat distributed content cluster with a fixed document volume.

Given the theoretical explanation above we can provide two rules of thumb for scaling throughput and latency:

Add nodes in a flat distribution When DQW/TQW is large (close to 1.0), throughput QPS can be scaled by just adding more content nodes in a system using flat distribution. This will reduce the number of documents per node, and thus reduce the DQW per node. When DQW/TQW is low, one can no longer just add more content nodes to scale throughput and must instead use a grouped distribution to scale throughput.

## Scaling latency in a content group

Whether we have a single group (flat distribution) or multiple groups, the serving latency depends on the factors already described; DQW and SQW. For use cases where DQW dominates the total query work TQW, we can effectively scale latency down by parallelizing the DQW over more nodes per group.

It is important to decide on a latency service level agreement (SLA) before sizing the Vespa deployment for the application and query features. A latency SLA is often specified as a latency percentile at a certain throughput level - example:

• SLA Example 1: 95.0 percentile < 100 ms @ 2000 QPS
• SLA Example 2: 95.0 percentile < 40 ms @ 8000 QPS
As we have seen in the previous section, different use cases might have different performance characteristics, depending on how the dynamic work query portion is compared to the static query work portion. This graph illustrates the relationship between overall latency versus number of documents indexed per node:

• For the yellow use case, the measured latency is as we can see almost independent of the total document volume. The observed latency at 10M documents per node is almost the same as with 1M documents per node. This illustrates a use case with low dynamic query work portion. Such a use case would be most cost effective by storing as many documents as possible (within the memory/disk/feeding constrains set by the concurrency settings and node flavor) and scale throughput by using a grouped distribution.
• For the blue use case, the measured latency shows a clear correlation with the document volume. This tells us that the dynamic query work portion is high, and that adding nodes to the group will help reduce the DQW per node. The sweet spot is found where we meet the targeted latency SLA at a given document volume. This sweet spot depends on which model or ranking features are in use, e.g how expensive the model is per hit. E.g. a xgboost model with 3000 trees might breach the targeted SLA already at 200K documents, while a 300 tree model might be below our SLA at 2M documents. See also feature tuning.

It is possible to reduce latency of queries where the dynamic query work portion is high, query throughput is relatively low, and using a multi-cpu core node. Using multiple threads per search for a use case where the static query work is high, will be as wasteful as adding nodes to a flat distribution, as demonstrated in the previous sections.

Using more threads per search will reduce DQW and latency as long as there are cpu cores available. Typically there is a small synchronization overhead when concurrency becomes higher, as the searcher threads needs to communicate through a shared heap containing the best hits found. A search request with 4 threads will occupy all 4 threads until the last thread has completed, and the intra-node per thread document space partitioning must be balanced to give optimal results. By default this number is 1, as that gives the best resource usage measured as cpu/query. The optimal threads per search depends on the query use case, and should be evaluated by benchmarking.

The threads per search settings globally is tuned by persearch. This can be overridden to a lower value in rank profiles so that different query use cases can use different number of threads per search. Using multiple threads per search allows better utilization of multi-core cpu architectures for low query volume applications.

## Scaling document volume per node

One want to fit as many documents as possible into a node given the node constrains (e.g available cpu, memory, disk) while maintaining:

• The targeted search latency SLA
• The targeted feed and update rate, and feed latency
With the latency SLA in mind, benchmark with increasing number of documents per node and watch system level metrics and Vespa metrics. If latency is within the stated latency SLA and the system meets the targeted sustained feed rate, overall cost is reduced by fitting more documents into each node (e.g. by increasing memory, cpu and disk constraints set by the node flavor).

Vespa will block feed operations if resource limits have been reached.

### Disk usage sizing

Disk usage of a content node increases as the document volume increases: The disk usage per document depends on various factors like the number of document search definitions, the number of indexed fields and their settings, and most important the size of the fields that are indexed and stored. The simplest way to determine the disk usage is to simply index documents and watch the disk usage along with the relevant metrics. The redundancy (number of copies) impact the disk usage footprint, obviously.

Note that content node maintenance jobs temporarily increases disk usage. E.g. index fusion is an example, where new index files are written, causing an increase in used disk space while running. Space used depends on configuration and data - headroom must include the temporary usage. See metrics for capacity planning.

### Memory usage sizing

The memory usage on a content node increases as the document volume increases. The memory usage increases almost linearly with the number of documents. The Vespa vespa-proton-bin process (content node) uses the full 64 bit virtual address space, so the virtual memory usage reported might be very high, as both index and summary files are mapped into memory using mmap and pages are paged into memory as needed.

The memory usage per document depends on the number of fields, the raw size of the documents and how many of the fields are defined as attributes. See attribute memory usage sizing and see metrics for capacity planning.

## Scaling Throughput

As seen in the previous sections, when the static query work (SQW) becomes large, scale throughput using grouped distribution. Regardless, if throughput is scaled by grouped distribution for use cases with high static query work portion or a flat distribution for high dynamic query work portion, we should identify how much throughput the total system supports.

Finding where the latency starts climbing exponentially versus throughput is important in order to make sure that the deployed system is scaled well below this throughput threshold. Also, that it has capacity to absorb load increases over time, as well as having sufficient capacity to sustain node outages during peak traffic.

At some throughput level, some resource(s) in the system will be fully saturated, and requests will be queued up causing latency to spike up exponentially, as requests are spending more time in the queue. The more queries/s we try to push by increasing the load, the longer a request needs to be in the queue waiting to be served. This behaviour is illustrated in the figure below:

A small increase in serving latency is observed as throughput increases, until saturated at approximately 2200 QPS. Pushing more QPS than this only increases queueing time, and overall latency increases.

## Scaling for failures and headroom

It is important to also measure behaviour under non-ideal circumstances, to avoid getting too good results. E.g. simulating node failures or node replacements, verifying feeding concurrency versus search and serving. See Serving availability tuning.

Generally, the higher utilization a system has in production, the more fragile it becomes when changing query patterns or ranking models.

The target system utilization should be kept sufficiently low for the response times to be reasonable and within latency SLA, even with some extra traffic occurring at peak hours. See also graceful degradation.

## Metrics for Vespa Sizing

The relevant Vespa Metrics for measuring the cost factors, in addition to system level metrics like cpu util, are:

Metric capturing static query work (SQW) at content nodes
content.proton.documentdb.matching.rank_profile.query_collateral_time

Metric capturing dynamic query work (DQW) at content nodes
content.proton.documentdb.matching.rank_profile.match_time


By sampling these metrics, we can calculate the theoretical speedup we can achieve by increasing number of nodes using flat distribution, by using Amdahl's law:

$$\text{max_speedup}_{\text{}} = \frac{1}{1 - \frac{query\_collateral\_time}{query\_collateral\_time+match\_time}}$$

In addition, the following metrics are used to find number of matches per query per node:

content.proton.documentdb.matching.rank_profile.docs_matched
content.proton.documentdb.matching.rank_profile.queries

Disk usage:
• documentdb: vespa.content.proton.documentdb.disk_usage.last
• transaction log: vespa.content.proton.transactionlog.disk_usage.last
Memory usage:
• documentdb: vespa.content.proton.documentdb.memory_usage.allocated_bytes.last