Proton is Vespa's search core. Proton maintains disk and memory structures for documents. As the data is dynamic, these structures are periodically optimized by maintenance jobs and the resource footprint of these background jobs are mainly controlled by the concurrency setting.
The internal structure of a proton node:
The search node consists of a bucket management system which sends requests to a set of document databases, which each consists of three sub-databases.
When the node starts up it first needs to get an overview of what partitions it have, and what buckets each partition currently stores. Once it knows this, it is in initializing mode, able to handle load, but distributors do not yet know bucket metadata for all the buckets, and thus can't know whether buckets are consistent with copies on other nodes. Once metadata for all buckets are known, the content nodes transitions from initializing to up state. As the distributors wants quick access to bucket metadata, it keeps an in-memory bucket database to efficiently serve these requests.
It implements elasticity support in terms of the SPI. Operations are ordered according to priority, and only one operation per bucket can be in-flight at a time. Below bucket management is the persistence engine, which implements the SPI in terms of Vespa search. The persistence engine reads the document type from the document id, and dispatches requests to the correct document database.
Each document database is responsible for a single document type. It has a component called FeedHandler which takes care of incoming documents, updates, and remove requests. All requests are first written to a transaction log, then handed to the appropriate sub-database, based on the request type.
There are three types of sub-databases, each with its own document meta store and document store. The document meta store holds a map from the document id to a local id. This local id is used to address the document in the document store. The document meta store also maintains information on the state of the buckets that are present in the sub-database.
The sub-databases are maintained by the index maintainer. The document distribution changes as the system is resized. When the number of nodes in the system changes, the index maintainer will move documents between the Ready and Not Ready sub-databases to reflect the new distribution. When an entry in the Removed sub-database gets old it is purged. The sub-databases are:
|Not Ready||Holds the redundant documents that are not searchable, i.e. the not ready documents. Documents that are not ready are only stored, not indexed. It takes some processing to move from this state to the ready state.|
Maintains an index of all ready documents and keeps them searchable.
One of the ready copies is active while the rest are not active:
|Removed||Keeps track of documents that have been removed. The id and timestamp for each document is kept. This information is used when buckets from two nodes are merged. If the removed document exists on another node but with a different timestamp, the most recent entry prevails.|
Content nodes have a transaction log to persist mutating operations. The transaction log persists operations by file append. Having a transaction log simplifies proton's in-memory index structures and enables steady-state high performance, read more below.
Operations are written immediately, irrespective of visibility-delay. Operations are not sync'ed to the file system - hence, writes are not guaranteed to exist in the transaction log. Flush jobs (below) sync the transaction log, for consistency.
Default, proton will flush components like attribute vectors and memory index on shutdown, for quicker startup.
Index fields are string fields, used for text search. Other field types are attributes.
For all indexed fields, proton has a memory index for the recent changes, implemented using B-trees. This is periodically flushed to a disk-based posting list index. Disk-based indexes are subsequently merged.
Updating the in-memory B-trees is lock-free, implemented using copy-on-write semantics. This gives high performance, with a predictable steady-state CPU/memory use. The driver for this design is the requirement for a sustained, high change rate, with stable, predictable read latencies and small temporary increases in CPU/memory. This compared to index hierarchies, merging smaller real-time indices into larger, causing temporary hot-spots.
When updating an indexed field, the document is read from the document store, the field is updated, and the full document is written back to the store. At this point, the change is searchable, and an ACK is returned to the client. Use attributes to avoid such document disk accesses and hence increase performance for partial updates. Find more details in feed performance.
To increase write throughput, writes to the memory index can be batched using visibility-delay.
Proton stores position information in text indices by default, for proximity relevance - posocc (below). All the occurrences of a term is stored in the posting list, with its position. This provides superior ranking features, but is somewhat more expensive than just storing a single occurrence per document. For most applications it is the correct tradeoff, since most of the cost is usually elsewhere and relevance is valuable.
Applications that only need occurrence information for filtering can use rank: filter to optimize performance, using only boolocc-files (below).
The memory index has a dictionary per index field. This contains all unique words in that field with mapping to posting lists with position information. The position information is used during ranking, see nativeRank for details on how a text match score is calculated.
The disk index stores the content of each index field in separate folders. Each folder contains:
- Dictionary. Files: dictionary.pdat, dictionary.spdat, dictionary.ssdat.
- Compresssed posting lists with position information. File: posocc.dat.compressed.
- Posting lists with only occurrence information (bitvector). These are generated for common words. Files: boolocc.bdat, boolocc.idx.
Proton maintenance jobs
There is only one instance of each job at a time - e.g. attributes are flushed in sequence. When a job is running, its metric is set to 1 - otherwise 0. Use this to correlate observed performance with job runs - see Run metric.
Flush an attribute vector from memory to disk,
based on configuration in the
This controls memory usage and query performance.
This also makes proton starts quicker - see
flush on shutdown.
|memory index flush||
Flush a memory index to disk, then trigger disk index fusion.
The goal is to shrink memory usage by adding to the disk-backed indices.
Performance characteristics for this flush is similar to indexing.
Note: A high feed rate can cause multiple smaller flushed indices, like
- see the high index number.
Multiple smaller indices is a symptom of too small memory indices compared to feed rate -
to fix, increase
flushstrategy > native > component > maxmemorygain.
|disk index fusion||
Merge the primary disk index with smaller indices generated by
memory index flush -
triggered by the memory index flush.
|document store flush||
Flushes the document store.
|document store compaction||
Defragment and sort
document store files as documents are updated and deleted,
in order to reduce disk space and speed up
The file is sorted in bucket order on output. Triggered by
Triggered by nodes going up/down, refer to
elastic Vespa and
Causes documents to be indexed or de-indexed, similar to feeding.
This moves documents in or out of ready/active sub-databases.
As bucket move, however moves documents within a sub-database.
This is often triggered when a cluster grows with more nodes,
documents are redistributed to new nodes and each nodes has less documents -
a LIDspace compaction is hence triggered. This inplace defragments the
document meta store.
Resources are freed on a subsequent attribute flush.
|removed documents pruning||
Prunes the deleted documents sub-database which keeps IDs for deleted documents.
Default runs once per hour.