
Vespa Feed Sizing Guide

Vespa is optimized to sustain a high feed load while serving - also during planned and unplanned changes to the instance. Vespa supports feed rates at memory speed; this guide goes through how to configure, test and size the application for optimal feed performance.

Read reads and writes first. This has an overview of Vespa, where the key takeaway is the separation between the stateless container cluster and the stateful content cluster. Documents PUT to Vespa are processed in the container cluster, which runs both Vespa-internal processing like tokenization and custom application code in document processing. The stateless cluster is primarily CPU bound - read indexing for how to separate search and write to different container clusters. Other than that, make sure the container cluster has enough memory to avoid excessive GC - the heap must be big enough - and allocate enough CPU for the indexing load.
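
As a sketch of the container-side setup, the services.xml fragment below separates write and query traffic into two container clusters. The cluster ids, hostaliases and heap size are illustrative, not prescriptive:

<!-- services.xml fragment (sketch) - separate container clusters for write and query traffic -->
<container id="feed" version="1.0">
    <document-api/>
    <document-processing/>
    <nodes>
        <!-- illustrative heap size - size the heap to avoid excessive GC -->
        <jvm options="-Xms4g -Xmx4g"/>
        <node hostalias="feed0"/>
    </nodes>
</container>
<container id="query" version="1.0">
    <search/>
    <nodes>
        <node hostalias="query0"/>
    </nodes>
</container>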

All operations are written and synced to the transaction log on each content node. This is sequential (not random) IO, but it might impact overall feed performance if running on NAS-attached storage, where the sync operation has a much higher cost than on locally attached storage (e.g. SSD). See sync-transactionlog.
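
Transaction log syncing is tuned in services.xml. A hedged sketch, assuming the sync-transactionlog element placement of the services-content reference (cluster id is illustrative, other elements elided):

<content id="music" version="1.0">
    <engine>
        <proton>
            <!-- whether to sync the transaction log to disk after writes -->
            <sync-transactionlog>true</sync-transactionlog>
        </proton>
    </engine>
    <!-- redundancy, documents and nodes elided -->
</content>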

The remainder of this document concerns how to configure the application to optimize feed performance on the content node.

Content node

Each content node runs the vespa-proton-bin process. The following illustrates the components involved during feeding, see proton for details.

Proton feed overview

As there are multiple components and data structures involved, this guide starts with the simplest examples and then adds optimizations and complexity.

Document store

Documents are written to the document store in all indexing modes - this is where the copy of the PUT document is persisted. See Summary Manager + Document store in illustration above.

PUT-ing documents to the document store can be thought of as appending to files using sequential IO, expecting a high write rate, using little memory and CPU. Each PUT is one write (simplified). Writing a new version of a document (PUT a document that already exists) is handled the same way as writing a new document - the in-memory mapping from document id to data file position is updated to point to the latest version in both cases.

A partial UPDATE to a document incurs a read from the document store to get the current fields. Then the new field values are applied and the new version of the document is written. Hence, it is like a PUT with an extra read.
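
For illustration, a PUT and a partial UPDATE in the document JSON format (document type, id and field name are made up for this example) - the PUT is an append-style write, while the UPDATE first reads the current version from the document store:

{
    "put" : "id:namespace:doctype::1",
    "fields" : { "myText" : "hello world" }
}
{
    "update" : "id:namespace:doctype::1",
    "fields" : { "myText" : { "assign" : "hello again" } }
}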

Attribute

To real-time update fields in high volume, use attribute fields. See Attribute Manager + Attributes in illustration above.

schema ticker {
    document ticker {
        field price type float {
            indexing: summary | attribute
        }
    }
}

Attribute fields are not stored in the document store, so there is no IO (except sequential flushing). This enables applications to write to Vespa at memory speed - a 10k update rate per node is possible. See partial updates for details.
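
For example, a partial update assigning a new value to the price attribute above is a memory-only operation on the content node (the document id is illustrative):

{
    "update" : "id:namespace:ticker::1",
    "fields" : {
        "price" : { "assign" : 102.5 }
    }
}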

Some applications have a limited set of documents, with a high change-rate to fields in the documents (e.g. stock prices - the number of stocks is almost fixed, prices change constantly). Such applications are easily write bound.

Redundancy settings

To achieve memory-only updates, make sure all attributes to update are ready, meaning the content node has loaded the attribute field into memory:

  • One way to ensure this is to set searchable copies equal to redundancy - i.e. all nodes that have a replica of the document have loaded it as searchable
  • Another way is by setting fast-access on each attribute to update
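
Sketches of both options, with illustrative cluster id and values - searchable-copies equal to redundancy in services.xml (assuming the element placement of the services-content reference):

<content id="ticker" version="1.0">
    <redundancy>2</redundancy>
    <engine>
        <proton>
            <searchable-copies>2</searchable-copies>
        </proton>
    </engine>
    <!-- documents and nodes elided -->
</content>

and fast-access on the attribute in the schema:

field price type float {
    indexing: summary | attribute
    attribute: fast-access
}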

Debugging performance

When debugging update performance, it is useful to know if an update hits the document store or not. Enable spam log level and look for SummaryAdapter::put - then do an update:

$ vespa-logctl searchnode:proton.server.summaryadapter spam=on
.proton.server.summaryadapter    ON  ON  ON  ON  ON  ON OFF  ON

$ vespa-logfmt -l all -f | grep 'SummaryAdapter::put'
[2019-10-10 12:16:47.339] SPAM    : searchnode       proton.proton.server.summaryadapter	summaryadapter.cpp:45 SummaryAdapter::put(serialnum = '12', lid = 1, stream size = '199')

Existence of such log messages indicates that the update was accessing the document store.

Index

Changes to index fields are written to the document store and the index. See Index Manager + Index in illustration above. Note that an UPDATE operation requires a read-modify-write to the document store and limits throughput. Changes to the index based on the updated field value are memory only operations, but use CPU to update the index. Refer to partial updates for more details.

schema music {
    document music {
        field artist type string {
            indexing: summary | index
        }
    }
}

Thread pools

Several thread pools are involved when handling write operations on a content node. These are summarized below. Metrics are available for each thread pool, see searchnode metrics for details.

To analyse performance and bottlenecks, the most relevant metrics are .utilization and .queuesize. In addition, .saturation is relevant for the field writer thread pool. See bottlenecks for details.

master

  Updates the document metastore, prepares tasks for the index and summary threads, and splits a write operation into a set of tasks to update individual attributes, executed by the threads in the field writer.

  Threads: 1
  Instances: one instance per document database
  Metric prefix: content.proton.documentdb.threading_service.master.

index

  Manages writing of index fields in the memory index. It splits a write operation into a set of tasks to update individual index fields, executed by the threads in the field writer.

  Threads: 1
  Instances: one instance per document database
  Metric prefix: content.proton.documentdb.threading_service.index.

summary

  Writes documents to the document store.

  Threads: 1
  Instances: one instance per document database
  Metric prefix: content.proton.documentdb.threading_service.summary.

field writer

  The threads in this thread pool are used to invert index fields, write changes to the memory index, and write changes to attributes. Index fields and attribute fields across all document databases are randomly assigned to one of the threads in this thread pool. A field that is costly to write or update might become the bottleneck during feeding.

  Threads: many, controlled by feeding concurrency
  Instances: one instance shared between all document databases
  Metric prefix: content.proton.executor.field_writer.

shared

  The threads in this thread pool are used, among other things, to compress and de-compress documents in the document store, merge files as part of disk index fusion, and prepare for inserting a vector into an HNSW index.

  Threads: many, controlled by feeding concurrency
  Instances: one instance shared between all document databases
  Metric prefix: content.proton.executor.shared.

Multivalue attribute

Multivalue attributes are weightedset, array of struct/map, map of struct/map and tensor. These attributes have different characteristics, which affect write performance. Generally, updates to multivalue fields become more expensive as the field size grows:

weightedset

  Updating is a memory-only operation: read the full set, update, write back. Make the update as inexpensive as possible by using numeric types instead of strings, where possible. Example: a weightedset of string with many (1000+) elements. Adding an element to the set means an enum store lookup/add and an add/sort of the attribute multivalue map - details in attributes. Use a numeric type instead to speed this up - it has no string comparisons.

array/map of struct/map

  An update to array of struct/map or map of struct/map requires a read from the document store and will reduce the update rate - see #10892.

tensor

  Updating tensor cell values is a memory-only operation: copy the tensor, update, write back. For large tensors, this implies reading and writing a large chunk of memory for a single-cell update.
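
As an example of the numeric-type advice for weightedset above, a sketch of a partial update adding an element to a weightedset<int> attribute - field and document names are illustrative, assuming a schema field like field myTags type weightedset<int> { indexing: attribute | summary }:

{
    "update" : "id:namespace:doctype::1",
    "fields" : {
        "myTags" : { "add" : { "123" : 1 } }
    }
}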

Parent/child

Parent documents are global, i.e. have a replica on all nodes. Writing to fields in parent documents often simplifies logic, compared to the de-normalized case where all (child) documents are updated. Write performance depends on the average number of child documents vs. the number of nodes in the cluster - examples:

  • 10-node cluster, avg number of children=100, redundancy=2: A parent write means 10 writes, compared to 200 writes, or 20x better
  • 50-node cluster, avg number of children=10, redundancy=2: A parent write means 50 writes, compared to 20 writes, or 2.5x worse

Hence, the more children per parent, the greater the performance benefit of writing to the parent.
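
A sketch of a parent/child setup - schema and field names are illustrative. The child references the global parent document and imports an attribute from it; the parent document type must also be declared with global="true" in services.xml:

schema artist {
    document artist {
        field name type string {
            indexing: attribute | summary
        }
    }
}

schema album {
    document album {
        field artist_ref type reference<artist> {
            indexing: attribute
        }
    }
    import field artist_ref.name as artist_name {}
}

With this, updating name on one artist document means one write per node holding the global replica, instead of one write per album document referencing that artist.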

Conditional updates

A conditional update looks like:

{
    "update" : "id:namespace:myDoc::1",
    "condition" : "myDoc.myField == \"abc\"",
    "fields" : { "myTimestamp" : { "assign" : 1570187817 } }
}

If the document store is accessed when evaluating the condition, performance drops. Conditions should be evaluated using attribute values for high performance - in the example above, myField should be an attribute.
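
For the example above, a sketch of defining myField as an attribute so the condition can be evaluated in memory:

field myField type string {
    indexing: attribute | summary
}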

Note: If the condition uses struct or map, values are read from the document store:

    "condition" : "myDoc.myMap{1} == 3"

This is true even though all struct fields are defined as attribute. Improvements to this are tracked in #10892.

Client roundtrips

Consider the difference when sending two field assignments to the same document:

{
    "update" : "id:namespace:doctype::1",
    "fields" : {
        "myMap{1}" : { "assign" : { "timestamp" : 1570187817 } }
        "myMap{2}" : { "assign" : { "timestamp" : 1570187818 } }
    }
}

vs.

{
    "update" : "id:namespace:doctype::1",
    "fields" : {
        "myMap{1}" : { "assign" : { "timestamp" : 1570187817 } }
    }
}
{
    "update" : "id:namespace:doctype::1",
    "fields" : {
        "myMap{2}" : { "assign" : { "timestamp" : 1570187818 } }
    }
}

In the first case, one update operation is sent from vespa feed - in the latter, the client will send the second update operation after receiving an ack for the first. When updating multiple fields, put the updates in as few operations as possible. See ordering details.

Feed vs. search

A content node normally has a fixed set of resources (CPU, memory, disk). Configure the CPU allocation for feeding vs. searching in concurrency - a value from 0 to 1.0 - a higher value means more CPU resources for feeding.
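
A sketch of setting feeding concurrency in services.xml, assuming the tuning element placement of the services-content reference (cluster id and value are illustrative):

<content id="music" version="1.0">
    <engine>
        <proton>
            <tuning>
                <searchnode>
                    <feeding>
                        <concurrency>0.8</concurrency>
                    </feeding>
                </searchnode>
            </tuning>
        </proton>
    </engine>
    <!-- redundancy, documents and nodes elided -->
</content>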

In addition, you can control the priority of feed versus search - or rather, how nice feeding should be. Since a process needs root privileges to increase its priority, we have opted to reduce the priority of (be nice to) feeding instead. This is controlled by a niceness number from 0 to 1.0 - a higher value favors search over feed. 0 is the default.

Feed testing

When testing for feeding capacity:

  1. Use vespa feed.
  2. Test using one content node to find its capacity.
  3. Test feeding performance by adding feeder instances. Make sure network and CPU usage (on content and container nodes) increases, until saturation.
  4. See troubleshooting at end to make sure there are no errors.

Other scenarios: test feeding capacity for sustained load in a system in steady state, during state changes, and during query load.
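
A minimal way to run step 1 with the Vespa CLI, assuming the feed operations are in a file named docs.jsonl (the file name is illustrative):

$ vespa feed docs.jsonl

To test client-side scaling (step 3), run multiple vespa feed instances against disjoint parts of the feed.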

Troubleshooting

Metrics

Use metrics from content nodes and look at queues - queue wait time (in milliseconds) and queue size:

vds.filestor.averagequeuewait.sum
vds.filestor.queuesize

Check content node metrics across all nodes to see if there are any outliers. Also check latency metrics per operation type:

vds.filestor.allthreads.put.latency
vds.filestor.allthreads.update.latency
vds.filestor.allthreads.remove.latency
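
These metrics can be pulled from each node's metrics API - a sketch, assuming the default metrics proxy port 19092:

$ curl -s http://localhost:19092/metrics/v1/values | grep filestor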

Bottlenecks

One of the threads used to handle write operations might become the bottleneck during feeding. Look at the .utilization metrics for all thread pools:

content.proton.documentdb.threading_service.master.utilization
content.proton.documentdb.threading_service.index.utilization
content.proton.documentdb.threading_service.summary.utilization
content.proton.executor.field_writer.utilization
content.proton.executor.shared.utilization

If utilization is high for field writer or shared, adjust feeding concurrency to allow more CPU cores to be used for feeding.

For the field writer also look at the .saturation metric:

content.proton.executor.field_writer.saturation

If this is close to 1.0 and higher than .utilization it indicates that one of its worker threads is a bottleneck. The reason can be that this particular thread is handling a large index or attribute field that is naturally expensive to write and update. Use the custom component state API to find which index and attribute fields are assigned to which thread (identified by executor_id), and look at the detailed statistics of the field writer to find which thread is the actual bottleneck:

state/v1/custom/component/documentdb/mydoctype/subdb/ready/index
state/v1/custom/component/documentdb/mydoctype/subdb/ready/attributewriter
state/v1/custom/component/threadpools/fieldwriter
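
A sketch of querying these resources - use vespa-model-inspect to find the searchnode's state port (shown as a placeholder below):

$ vespa-model-inspect service searchnode
$ curl -s http://localhost:<searchnode-state-port>/state/v1/custom/component/threadpools/fieldwriter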

Failure rates

Inspect these metrics for failures during load testing:

vds.distributor.updates.latency
vds.distributor.updates.ok
vds.distributor.updates.failures.total
vds.distributor.puts.latency
vds.distributor.puts.ok
vds.distributor.puts.failures.total
vds.distributor.removes.latency
vds.distributor.removes.ok
vds.distributor.removes.failures.total

Blocked feeding

This metric should be 0 - refer to feed block:

content.proton.resource_usage.feeding_blocked

Concurrent mutations

Multiple clients updating the same document concurrently will stall writes:

vds.distributor.updates.failures.concurrent_mutations
Mutating client operations towards a given document ID are sequenced on the distributors. If an operation is already active towards a document, a subsequently arriving one will be bounced back to the client with a transient failure code. Usually this happens when users send feed from multiple clients concurrently without synchronisation. Note that feed operations sent by a single client are sequenced client-side, so this should not be observed with a single client only. Bounced operations are never sent on to the backends and should not cause elevated latencies there, although the client will observe higher latencies due to automatic retries with back-off.

Wrong distribution

vds.distributor.updates.failures.wrongdistributor
Indicates that clients keep sending to the wrong distributor. Normally this happens infrequently (but it does happen on client startup or distributor state transitions), as clients update and cache all state required to route directly to the correct distributor (Vespa uses a deterministic CRUSH-based algorithmic distribution). Some potential reasons for this:
  1. Clients are being constantly re-created with no cached state.
  2. The system is in some kind of flux where the underlying state keeps changing constantly.
  3. The client distribution policy has received so many errors that it throws away its cached state to start with a clean slate to e.g. avoid the case where it only has cached information for the bad side of a network partition.
  4. The system has somehow failed to converge to a shared cluster state, causing parts of the cluster to have a different idea of the correct state than others.

Cluster out of sync

update_puts/gets indicate "two-phase" updates:

vds.distributor.update_puts.latency
vds.distributor.update_puts.ok
vds.distributor.update_gets.latency
vds.distributor.update_gets.ok
vds.distributor.update_gets.failures.total
vds.distributor.update_gets.failures.notfound

If replicas are out of sync, updates cannot be applied directly on the replica nodes as they risk ending up with diverging state. In this case, Vespa performs an explicit read-consolidate-write (write repair) operation on the distributors. This is usually a lot slower than the regular update path because it doesn't happen in parallel. It also happens in the write-path of other operations, so risks blocking these if the updates are expensive in terms of CPU.

Replicas being out of sync is by definition not the expected steady state of the system. For example, replica divergence can happen if one or more replica nodes are unable to process or persist operations. Track (pending) merges:

vds.idealstate.buckets
vds.idealstate.merge_bucket.pending
vds.idealstate.merge_bucket.done_ok
vds.idealstate.merge_bucket.done_failed