The root element of a Content cluster definition.
Creates a content cluster. A content cluster stores and/or indexes documents.
The xml file may have zero or more such tags.
Name of the content cluster.
If none is supplied, the cluster name will be content.
Cluster names must be unique within the application.
If multiple clusters are configured, the name must be set for all but one of them at minimum.
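A minimal sketch of a content cluster definition in services.xml, assuming a single node. The cluster id, host alias and the document type music are illustrative names, not requirements:

```xml
<content id="music" version="1.0">
    <redundancy>1</redundancy>
    <documents>
        <!-- "music" is a hypothetical document type -->
        <document type="music" mode="index"/>
    </documents>
    <nodes>
        <node hostalias="node0" distribution-key="0"/>
    </nodes>
</content>
```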
Renaming a cluster is the same as dropping the current cluster and adding a new one.
This makes data unavailable or lost, depending on hosting model.
Deploying with a changed cluster id will therefore fail with a validation override requirement:
Content cluster 'music' is removed. This will cause loss of all data in this cluster.
To allow this add <allow until='yyyy-mm-dd'>content-cluster-removal</allow> to validation-overrides.xml,
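As a sketch, a validation-overrides.xml that allows the removal could look like the following. The date is a placeholder and must be replaced with a real date shortly after the deployment:

```xml
<validation-overrides>
    <!-- Placeholder date: replace with the date until which the override should apply -->
    <allow until="2030-01-31">content-cluster-removal</allow>
</validation-overrides>
```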
Contained in content.
Defines which document types should be routed to this content cluster using the default route,
and what documents should be kept in the cluster if the garbage collector runs.
Read more on expiring documents.
Also holds some backend-specific configuration for whether documents should be searchable or not.
A document selection,
restricting documents that are routed to this cluster.
Defaults to a selection expression matching everything.
This selection can be specified to match document identifier specifics
that are independent of document types.
For restrictions that apply only to a specific document type,
this must be done within that particular document type's
document element.
Using document type references in this selection causes an error during deployment.
The selection given here will be merged with per-document
type selections specified within document tags, if any,
meaning that any document in the cluster must match both selections to be accepted and kept.
If true, regularly verify the documents stored in the cluster to see if
they belong in the cluster, and delete them if not.
If false, garbage collection is not run.
Time (in seconds) between garbage collection cycles.
Note that the deletion of documents is spread over this interval, so more resources will be
used for deleting a set of documents with a small interval than with a larger interval.
The mode of storing and indexing.
In this documentation, index is assumed unless
streaming or store-only is explicitly mentioned.
Refer to streaming search for store-only, as documents are stored
the same way for both cases.
Changing mode requires an indexing-mode-change validation override,
and documents must be re-fed.
A document selection,
restricting documents that are routed to this cluster.
Defaults to a selection expression matching everything.
This selection must apply to fields in this document type only.
Selection will be merged together with selection for other types and global selection from
documents to form a full expression for what documents belong to this cluster.
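The selection and garbage collection settings above could be combined as in this sketch. It assumes the attributes are named selection, garbage-collection and garbage-collection-interval on documents, and selection and mode on document; the namespace value and the timestamp field are hypothetical:

```xml
<!-- Cluster-wide selection: may only use document id specifics, not document types -->
<documents selection='id.namespace == "music-ns"'
           garbage-collection="true"
           garbage-collection-interval="3600">
    <!-- Per-type selection: keep music documents whose (assumed) timestamp
         field is less than one day (86400 seconds) old -->
    <document type="music" mode="index"
              selection="music.timestamp &gt; now() - 86400"/>
</documents>
```

A document must match both selections to be accepted and kept in the cluster.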
true / false
Set to true to distribute all documents of this type to all nodes
in the content cluster in which it is defined.
Fields in global documents can be imported into documents to implement joins -
read more in parent/child.
Vespa will detect when a new (or outdated) node is added to the cluster
and prevent it from taking part in searches until it has received all global documents.
Changing from false to true or vice versa requires a global-document-change validation override.
First, stop services
on all content nodes.
Then, deploy with the validation override.
Finally, start services
on all content nodes.
Note: global is only supported for mode="index".
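A sketch of a cluster with one global (parent) type and one regular (child) type, assuming the attribute is named global. The document types artist and album are hypothetical:

```xml
<documents>
    <!-- Distributed to all nodes, so its fields can be imported by child documents -->
    <document type="artist" mode="index" global="true"/>
    <!-- Regular document type, distributed according to redundancy -->
    <document type="album" mode="index"/>
</documents>
```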
Contained in documents.
Vespa Search specific configuration for which document processing cluster and chain to run index preprocessing.
Container cluster on content node
Name of a document-processing
container cluster that does index preprocessing.
Use cluster to specify an alternative cluster, other than the default cluster on content nodes.
indexing chain
A document processing chain in the container cluster specified by cluster
to use for index preprocessing.
The chain must inherit the indexing chain.
Example - the container cluster enables document-processing,
referred to by the content cluster:
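A sketch of such a setup, assuming a container cluster with the illustrative id my-indexing-cluster. The content element is shown as a fragment, without redundancy and nodes:

```xml
<container id="my-indexing-cluster" version="1.0">
    <!-- Enables document processing in this container cluster -->
    <document-processing/>
</container>

<content id="music" version="1.0">
    <documents>
        <document type="music" mode="index"/>
        <!-- Run index preprocessing in the container cluster above -->
        <document-processing cluster="my-indexing-cluster"/>
    </documents>
</content>
```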
To add document processors either before or after the indexer,
declare a chain (inherit indexing) in a document-processing container cluster
and add document processors.
Annotate document processors with before=indexingStart or after=indexingEnd.
Configure this cluster and chain as the indexing chain in the content cluster - example:
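A sketch under the same assumptions as the previous example. The chain id, the document processor class com.example.MyDocProc and the bundle name are hypothetical:

```xml
<container id="my-indexing-cluster" version="1.0">
    <document-processing>
        <!-- The chain must inherit the indexing chain -->
        <chain id="my-document-processors" inherits="indexing">
            <documentprocessor id="com.example.MyDocProc" bundle="my-bundle">
                <!-- Run this processor before the indexer -->
                <before>indexingStart</before>
            </documentprocessor>
        </chain>
    </document-processing>
</container>

<content id="music" version="1.0">
    <documents>
        <document type="music" mode="index"/>
        <document-processing cluster="my-indexing-cluster"
                             chain="my-document-processors"/>
    </documents>
</content>
```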
Also note the document-api configuration:
applications can set up this API on the same nodes as document-processing.
Find details in indexing.
Contained in content.
The minimum total data copies the cluster will maintain.
This can be set instead of (or in addition to) redundancy to ensure that a
minimum number of copies are always maintained regardless of other configuration.
Example: If min-redundancy is 2 and there is 1 content group, there will be 2
data copies in the group (2 copies for the cluster). If the number of groups is
changed to 2 there will be 1 data copy in each group (still 2 copies for the cluster).
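The example above corresponds to a setting like the following sketch, shown as a fragment of the content element:

```xml
<content id="music" version="1.0">
    <!-- At least 2 copies in total: 2 per group with 1 group, 1 per group with 2 groups -->
    <min-redundancy>2</min-redundancy>
</content>
```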
Contained in content.
The total data copies the cluster will maintain to avoid data loss (or, the data copies per group in
the Vespa Cloud documentation).
Example: with a redundancy of 2, the system tolerates 1 node failure before data becomes unavailable
(until the system has managed to create new replicas on other online nodes).
Redundancy can be changed without node restart - replicas will be added or removed automatically.
Contained in content.
Defines the set of content nodes in the cluster - parent for node-elements.
Contained in nodes or group.
Configures a content node to the cluster.
Sets the distribution key of a node. Changing this for a given node is not recommended.
It is recommended (but not required) that the set of distribution keys
in the cluster are contiguous and starting at 0.
Example: If the biggest distribution key is 499, then the distribution algorithm
needs to calculate 500 random numbers to calculate the correct target.
It is hence recommended to not leave too many gaps in the distribution key range.
Distribution keys are used to identify nodes and groups for the
distribution algorithm.
If a node changes distribution key, the distribution algorithm regards it as a new node,
so buckets are redistributed.
When merging clusters, one might need to change distribution keys:
content nodes need unique node distribution keys across the whole cluster,
as the key is also used as a node identifier where group information is not specified.
Capacity of this node, relative to other nodes.
A node with capacity 2 will get double the data and feed requests of a node with capacity 1.
This feature is expert mode only. Don't use if you don't know what you are doing.
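A sketch of a flat node list with contiguous distribution keys starting at 0, as recommended above. The host aliases are illustrative:

```xml
<nodes>
    <!-- Keys are contiguous and start at 0; never reassign a key to another node -->
    <node hostalias="node0" distribution-key="0"/>
    <node hostalias="node1" distribution-key="1"/>
    <node hostalias="node2" distribution-key="2"/>
</nodes>
```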
Contained in content or
group - groups can be nested.
Defines the hierarchical structure of the cluster.
Cannot be used in conjunction with the nodes element.
Groups can contain other groups or nodes, but not both.
There can only be a single level of leaf groups under the top group.
Sets the distribution key of a group. It is not allowed to change this for a given group.
Group distribution keys only need to be unique among groups that share the same parent group.
The name of the group, used for access from status pages and the like.
There is no deployment-time verification that the distribution key remains unchanged for any given node or group.
Consequently, take great care when modifying the set of nodes in a content cluster.
Assigning a new distribution key to an existing node is undefined behavior.
In the best case, the existing data will be temporarily unavailable until the error has been corrected.
In the worst case, there is a risk of crashes or data loss.
See Vespa Serving Scaling Guide
for when to consider using grouped distribution
and Examples for example deployments
using flat and grouped distribution.
distribution (in group)
Contained in group.
Defines the data distribution to subgroups of this group.
distribution should not be in the lowest level group containing storage nodes,
as here the ideal state algorithm is used directly.
In higher level groups, distribution is mandatory.
required if there are subgroups in the group
String conforming to the partition specification:
Partition specification: Description
*: Distribute all copies over 1 of N groups
1|*: Distribute all copies over 2 of N groups
1|1|*: Distribute all copies over 3 of N groups
The partition specification is used to evenly distribute content copies across groups.
Set a number or * per group separated by pipes (e.g. 1|* for two groups).
See sample deployment configurations.
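A sketch of a grouped cluster with two leaf groups under the top group, assuming the partition specification is set in a partitions attribute on distribution. Group names and host aliases are illustrative:

```xml
<content id="music" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="music" mode="index"/>
    </documents>
    <group>
        <!-- One copy in one group, the remaining copies in the other group -->
        <distribution partitions="1|*"/>
        <group name="group0" distribution-key="0">
            <node hostalias="node0" distribution-key="0"/>
            <node hostalias="node1" distribution-key="1"/>
        </group>
        <group name="group1" distribution-key="1">
            <!-- Node keys stay unique across the whole cluster -->
            <node hostalias="node2" distribution-key="2"/>
            <node hostalias="node3" distribution-key="3"/>
        </group>
    </group>
</content>
```

Group distribution keys only need to be unique among sibling groups, while node distribution keys are unique across the cluster.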
Contained in content.
Specify the content engine to use, and/or adjust tuning parameters for the engine.
Allowed engines are proton and dummy,
the latter being used for debugging purposes. If no engine is given, proton is used.
Sub-element: proton.
Contained in proton.
Default value is 2, or redundancy, if lower.
If set to less than redundancy, only some of the stored copies are ready for searching at any time.
This means that node failures cause temporary data unavailability
while the alternate copies are being indexed for search.
The benefit is using less memory, trading off availability during transitions.
Refer to bucket move.
If updating documents or using document selection for garbage collection,
consider setting fast-access
on the subset of attribute fields used for this to make sure that these attributes are always kept
in memory for fast access.
Note that this is only useful if searchable-copies is less than redundancy.
Read more in proton.
searchable-copies can be changed without node restart. Note that when reducing
searchable-copies, resource usage will not be reduced until content nodes are restarted.
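A sketch of a cluster that stores two copies but keeps only one ready for search, trading availability during transitions for lower memory use. The values are illustrative:

```xml
<content id="music" version="1.0">
    <redundancy>2</redundancy>
    <engine>
        <proton>
            <!-- Only 1 of the 2 stored copies is indexed and searchable -->
            <searchable-copies>1</searchable-copies>
        </proton>
    </engine>
</content>
```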
Contained in proton, optional.
Tune settings for the search nodes in a content cluster - sub-element:
Contained in searchnode, optional.
Tune the number of request threads used on a content node,
see thread-configuration for details.
Number of search threads.
Number of search threads.
Number of search threads used per search,
see the Vespa serving scaling guide
for an introduction of using multiple threads per search per node to reduce query latency.
Number of threads per search can be adjusted down per rank-profile
using num-threads-per-search.
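A sketch of request thread tuning, assuming the sub-elements are named search, persearch and summary under requestthreads. The thread counts are illustrative, not recommendations:

```xml
<engine>
    <proton>
        <tuning>
            <searchnode>
                <requestthreads>
                    <search>64</search>
                    <!-- Up to 2 threads per query; can be lowered per rank profile -->
                    <persearch>2</persearch>
                    <summary>16</summary>
                </requestthreads>
            </searchnode>
        </tuning>
    </proton>
</engine>
```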
The total maximum memory gain (in bytes) for all components
before running flush, default 4294967296 (4 GB)
Trigger flush if the total disk gain (in bytes) for all components is larger
than the factor times current total disk usage, default 0.2
The maximum memory gain (in bytes) by a single component
before running flush, default 1073741824 (1 GB)
Trigger flush if the disk gain (in bytes) by a single component is larger than
the given factor times the current disk usage by that component, default 0.2
The maximum age (in seconds) of unflushed content for a single component
before running flush, default 111600 (31h)
The total maximum size (in bytes) of transaction logs
for all document types before running flush, default 21474836480 (20 GB)
When resource-limits (in proton) for memory is reached,
flush more often by downscaling total.maxmemorygain and
component.maxmemorygain, default 0.5
When resource-limits (in proton) for disk is reached,
flush more often by downscaling transactionlog.maxsize, default 0.5
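A sketch of the flush settings above written out with the default values stated in the text. The element names (flushstrategy, native, total, component, transactionlog, conservative and their children) are assumptions based on the names referred to in the descriptions, such as total.maxmemorygain and transactionlog.maxsize:

```xml
<searchnode>
    <flushstrategy>
        <native>
            <total>
                <maxmemorygain>4294967296</maxmemorygain>  <!-- 4 GB -->
                <diskbloatfactor>0.2</diskbloatfactor>
            </total>
            <component>
                <maxmemorygain>1073741824</maxmemorygain>  <!-- 1 GB -->
                <diskbloatfactor>0.2</diskbloatfactor>
                <maxage>111600</maxage>                    <!-- 31 hours -->
            </component>
            <transactionlog>
                <maxsize>21474836480</maxsize>             <!-- 20 GB -->
            </transactionlog>
            <conservative>
                <memory-limit-factor>0.5</memory-limit-factor>
                <disk-limit-factor>0.5</disk-limit-factor>
            </conservative>
        </native>
    </flushstrategy>
</searchnode>
```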
Contained in searchnode, optional.
Tune settings related to how the search node (proton) is initialized. Optional sub-elements:
The number of initializer threads used for loading structures from disk at proton startup.
The threads are shared between document databases when the value is larger than 0.
Default value is the number of document databases + 1.
When set to larger than 1, document databases are initialized in parallel.
When set to 1, document databases are initialized in sequence.
When set to 0, 1 separate thread is used per document database,
and they are initialized in parallel.
Contained in searchnode, optional.
Tune settings related to how lidspace is managed. Optional sub-elements:
Maximum bloat allowed before lidspace compaction is started. Compaction is moving a document
from a high lid to a lower lid. Cost is similar to feeding a document and removing it.
Also see description in lidspace compaction maintenance job.
Default value is 0.01 or 1% of total lidspace. Will be increased to target of 0.50 or 50%.
Contained in searchnode, optional.
Tune proton settings for feed operations. Optional sub-elements:
A number between 0.0 and 1.0 that specifies the concurrency when handling feed operations, default 0.5.
When set to 1.0, all cores on the cpu can be used for feeding. Changing this value requires restart of
node to take effect.
A number between 0.0 and 1.0 that specifies the niceness of the feeding threads, default 0.0 => not any nicer than anyone else.
Increasing this number will reduce priority of feeding compared to search. The real world effect is hard to predict as the magic
exists in the OS level scheduler. Changing this value requires restart of node to take effect.
Contained in searchnode, optional.
Tune various aspects of the handling of disk and memory indexes. Optional sub-elements:
Controls io read options used during search,
values={mmap,populate}, default mmap. Using populate will eagerly touch all pages when index is loaded (after re-start or after index fusion is complete).
Specifies in seconds how long the index shall be warmed up before being switched in for serving.
During warmup, it will receive queries and posting lists will be iterated, but results are ignored
as they are duplicates of the live index. This will pull the most important ones into the cache.
However, as warming up an index will occupy more memory, do not turn it on unless you suspect you need it.
And always benchmark to see if it is worth it.
It's only potentially relevant for fields with indexing setting index,
which have regular disk based indexes,
and where the disk indexes are merged/fused in the background.
When switching the index, warmup can be used.
Also note that state-v1-health
is independent of warmup - the node can be "up" before warmup.
Controls whether all posting features are pulled in to the cache, or only the most important.
values={true, false}, default false.
Contained in searchnode, optional.
Tune various aspects of the database of removed documents. Optional sub-elements:
Specifies how long (in seconds) we must remember removed documents before we can prune them away.
Default is 2 weeks.
This sets the upper limit on how long a node can be down and still be accepted back in the system,
without having the index wiped.
There is no point in having this any higher than the age of the documents.
If corpus is re-fed every day, there is no point in having this longer than 24 hours.
Specifies how often (in seconds) to prune old documents. Default is 3.36 hours (prune age / 100).
No need to change default. Exposed here for reference and for testing.
Contained in searchnode, optional.
Tune various aspects of the handling of document summaries. Optional sub-elements:
Controls io read options used during reading of stored documents.
Values are directio, mmap and populate.
Default is mmap. populate will do an eager mmap and touch all pages.
cache: Used to tune the cache used by the document store.
Enabled by default, using up to 5% of available memory.
The maximum size of the cache in bytes.
If set, it takes precedence over maxsize-percent.
Default is unset.
The maximum size of the cache in percent of available memory. Default is 5%.
The compression type of the documents while in the cache.
Possible values are none, lz4 and zstd.
Default is lz4.
The compression level of the documents while in cache.
Default is 6
Used to tune the actual document store implementation (log-based).
The maximum size (in bytes) per summary file on disk. Default value is 1GB.
Maximum size (in bytes) of a chunk. Default value is 64KB.
Compression type for the documents: none, lz4 or zstd.
Default is zstd.
Compression level for the documents. Default is 3.
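A sketch of document summary tuning using the defaults stated above. The nesting and element names (io, read, store, cache, logstore, chunk, compression) are assumptions about the schema and should be checked against the services.xml reference:

```xml
<searchnode>
    <summary>
        <io>
            <read>mmap</read>
        </io>
        <store>
            <cache>
                <!-- Up to 5% of available memory -->
                <maxsize-percent>5</maxsize-percent>
                <compression>
                    <type>lz4</type>
                    <level>6</level>
                </compression>
            </cache>
            <logstore>
                <maxfilesize>1073741824</maxfilesize>  <!-- 1 GB per summary file -->
                <chunk>
                    <maxsize>65536</maxsize>           <!-- 64 KB per chunk -->
                    <compression>
                        <type>zstd</type>
                        <level>3</level>
                    </compression>
                </chunk>
            </logstore>
        </store>
    </summary>
</searchnode>
```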
Contained in proton. Default value is true.
If set to true, search nodes will flush a set of components (e.g. memory index, attributes) to disk
before shutting down such that the time it takes to flush these components
plus the time it takes to replay the transaction log
after restart is as low as possible.
The time it takes to replay the transaction log depends on the amount of data to replay,
so by flushing some components before restart, the transaction log will be pruned,
and we reduce the replay time significantly.
Refer to Proton maintenance jobs.
Contained in proton. Default value is true.
If true, the transaction log is synced to disk after every write.
This enables the transaction log to survive power failures and kernel panics.
The sync cost is amortized over multiple feed operations.
The faster you feed the more operations it is amortized over.
So with a local disk this is not known to be a performance issue.
However, if using NAS (Network Attached Storage) like EBS on AWS one can see significant
feed performance impact. For one particular case, turning off sync-transactionlog for EBS gave a 60x improvement.
resource-limits (in proton)
Contained in proton.
Specifies resource limits used by proton to reject both external and internal write operations (on this content node) when a limit is reached.
These proton limits should almost never be changed directly.
Instead, change resource-limits
that controls when external write operations are blocked in the entire content cluster.
Be aware of the risks of tuning resource limits as seen in the link.
The local proton limits are derived from the cluster limits if not specified, using this formula:
Contained in search.
Specifies the query timeout in seconds for queries against the search interface on the content nodes.
The default is 0.5 (500ms), the max is 600.0.
For query timeout also see the request parameter timeout.
One cannot override this value using the
timeout request parameter.
Contained in search.
Declares search coverage configuration for this content cluster. Optional sub-elements are
minimum, min-wait-after-coverage-factor and max-wait-after-coverage-factor.
Search coverage configuration controls how many nodes the query dispatcher process
should wait for, trading search coverage versus search performance.
Contained in coverage.
Declares the minimum search coverage required before returning the results of a query.
This number is in the range [0, 1], with 0 being no coverage and 1 being full coverage.
The default is 1; unless configured otherwise a query will not return
until all search nodes have responded within the specified timeout.
Contained in coverage.
Declares the minimum time for a query to wait for full coverage once the declared
minimum has been reached. This number is a factor that is
multiplied with the time remaining at the time of reaching minimum coverage.
The default is 0; unless configured otherwise a query will return as soon as the
minimum coverage has been reached, and the remaining search nodes appear to be lagging.
Contained in coverage.
Declares the maximum time for a query to wait for full coverage once the declared
minimum has been reached.
This number is a factor that is multiplied with the time remaining
at the time of reaching minimum coverage.
The default is 1; unless configured otherwise a query is allowed to wait its full
timeout for full coverage even after reaching the minimum.
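A sketch of a coverage configuration, assuming the minimum coverage element is named minimum and the maximum wait element mirrors min-wait-after-coverage-factor. The values are illustrative:

```xml
<search>
    <query-timeout>0.5</query-timeout>
    <coverage>
        <!-- Return once 90% of the nodes have responded, subject to the waits below -->
        <minimum>0.9</minimum>
        <min-wait-after-coverage-factor>0.2</min-wait-after-coverage-factor>
        <max-wait-after-coverage-factor>0.3</max-wait-after-coverage-factor>
    </coverage>
</search>
```

As a worked example under these values: if minimum coverage is reached 100 ms into a 500 ms timeout, 400 ms remain. The query then waits at least 0.2 × 400 = 80 ms and at most 0.3 × 400 = 120 ms more for full coverage.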
Contained in tuning.
The bucket is the fundamental unit of distribution
and management in a content cluster.
Buckets are auto-split, so there is no need to configure this for most applications.
Maximum number of documents per content bucket.
Buckets are split in two if they have more documents than this.
Keep this value below 16K.
Maximum size (in bytes) of a bucket.
This is the sum of the serialized size of all documents kept in the bucket.
Buckets are split in two if they have a larger size than this.
Keep this value below 100 MiB.
Override the ideal distribution bit count configured for this cluster.
Prefer to use the distribution type
setting instead if the default distribution bit count does not fit the cluster.
This variable is intended for testing and to work around possible distribution bit issues.
Most users should not need this option.
This is configuration for the cluster controller.
Most users are normally looking for
which controls how many nodes can be down before query load is routed to other groups.
Contained in tuning.
States a lower bound requirement on the ratio of nodes within individual groups
that must be online and able to accept traffic before the entire group is automatically taken out of service.
Groups are automatically brought back into service when the availability
of its nodes has been restored to a level equal to or above this limit.
Elastic content clusters are often configured to use multiple groups
for the sake of horizontal traffic scaling and/or data availability.
The content distribution system will try to ensure a configured number of replicas is always present
within a group in order to maintain data redundancy.
If the number of available nodes in a group drops too far,
it is possible for the remaining nodes in the group to not have sufficient capacity to take over
storage and serving for the replicas they now must assume responsibility for.
Such situations are likely to result in increased latencies and/or feed rejections caused by resource exhaustion.
Setting this tuning parameter allows the system to instead automatically take down the remaining nodes in the group,
allowing feed and query traffic to fail completely over to the remaining groups.
Valid parameter is a decimal value in the range [0, 1].
Default is 0, which means that the automatic group out-of-service functionality will not automatically take effect.
Example: assume a cluster has been configured with n groups of 4 nodes each
and the following tuning config:
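A minimal sketch of such a config. The ratio 0.75 is an assumption, chosen so that 3 of 4 nodes (0.75) satisfies the limit while 2 of 4 nodes (0.5) does not:

```xml
<tuning>
    <!-- At least 75% of the nodes in a group must be available -->
    <min-node-ratio-per-group>0.75</min-node-ratio-per-group>
</tuning>
```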
This tuning allows for 1 node in a group to be down. If 2 or more nodes go down,
all nodes in the group will be marked as down, letting the n-1 remaining groups handle all the traffic.
This configuration can be changed live as the system is running and altered limits will take effect immediately.
distribution (in tuning)
Contained in tuning.
Tune the distribution algorithm used in the cluster.
loose | strict | legacy
When the number of nodes configured in a system changes over certain limits, the system will
automatically trigger major redistributions of documents. This is to ensure that
the number of buckets is appropriate for the number of nodes in the cluster. This enum
value specifies how aggressive the system should be in triggering such distribution changes.
The default of loose strikes a balance between rarely altering the distribution
of the cluster and keeping the skew in document distribution low. It is recommended that you
use the default mode unless you have empirically observed that it causes too much skew in load
or document distribution.
Note that specifying minimum-bits under bucket-splitting
overrides this setting and effectively "locks" the distribution in place.
Contained in tuning.
Controls the running time of the bucket maintenance process.
Bucket maintenance verifies bucket content for corruption.
Most users should not need to tweak this.
Start of daily maintenance window, e.g. 02:00
End of daily maintenance window, e.g. 05:00
Day of week for starting full file verification cycle, e.g. monday.
The full cycle is more costly than partial file verification.
Contained in tuning.
Defines throttling parameters for bucket merge operations.
Maximum number of parallel active bucket merge operations.
Maximum size of the merge bucket queue, before reporting BUSY back to the distributors.
Contained in tuning.
Defines the number of persistence threads per partition on each content node.
A content node executes bucket operations against the persistence engine synchronously in each of these threads.
8 threads are used by default. Override with the count attribute.
Contained in tuning.
Tuning parameters for visitor operations.
Might contain max-concurrent.
The maximum number of threads in which to execute visitor operations.
A higher number of threads may increase performance, but may use more memory.
Maximum size of the pending visitor queue, before reporting BUSY back to the distributors.
Contained in visitors.
Defines how many visitors can be active concurrently on each storage node.
The number allowed depends on priority - lower priority visitors should not block higher priority visitors completely.
To implement this, specify a fixed and a variable number.
The maximum active is calculated by adjusting the variable component using the priority,
and adding the fixed component.
The variable component of the maximum active count
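A sketch of visitor tuning, assuming the attributes are named thread-count and max-queue-size on visitors, and fixed and variable on max-concurrent. All values are illustrative:

```xml
<tuning>
    <visitors thread-count="16" max-queue-size="1024">
        <!-- fixed: always allowed; variable: scaled by visitor priority -->
        <max-concurrent fixed="16" variable="64"/>
    </visitors>
</tuning>
```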
Contained in tuning.
Specifies resource limits used to decide whether external write operations should be blocked in the entire content cluster,
based on the reported resource usage by content nodes.
See feed block for more details.
Warning: The content nodes require resource headroom to handle
extra documents as part of re-distribution during node failure,
and spikes when running
maintenance jobs.
Tuning these limits should be done with extreme care,
and setting them too high might lead to permanent data loss.
They are best left untouched, using the defaults, and cannot be set
in Vespa Cloud.
float [0, 1]
Fraction of total space on the disk partition used on a content node before feed is blocked
float [0, 1]
Fraction of physical memory that can be resident memory in anonymous mapping on a content node before feed is blocked.
Total physical memory is sampled as the minimum of sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE)
and the cgroup (v1 or v2) memory limit. Nodes with 8 GiB or less memory in Vespa Cloud have a limit of 0.75.
Contained in tuning.
Tune the query dispatch behavior - child elements:
No capping: Return all
Maximum number of hits to return from a content node.
By default, a query returns the requested number of hits +
offset from every content node to the container.
The container orders the hits globally according to the query,
then discards all hits beyond the number requested.
In a system with a large fan-out,
this consumes network bandwidth and the container nodes easily become network saturated.
Containers will also sort and discard more hits than optimal.
When there are sufficiently many search nodes,
assuming an even distribution of the hits,
it suffices to only return a fraction of the requested number of hits from each node.
Note that changing this number will have global ordering impact.
See top-k-probability below for improving performance with fewer hits.
best-of-random-2 / adaptive
Configure policy for choosing which group shall receive the next query request.
However, multiphase requests that either require or benefit from hitting the same group
in all phases are always hashed.
Relevant only for grouped distribution:
best-of-random-2: Selects 2 random groups and selects the one with the lowest latency.
adaptive: Measures latency, preferring lower latency groups, selecting a group with probability latency/(sum latency over all groups).
With grouped distribution:
If true, or by default, all groups that are within min-active-docs-coverage of the median
of the document count of other groups will be used to service queries. If set to false, only
groups within min-active-docs-coverage of the max document count will be used,
with the consequence that full coverage is prioritized over availability when
multiple groups are lacking content, since the remaining groups may not be able
to service the full query load.
A float percentage
With grouped distribution:
The percentage of active documents one needs to have
compared to average of other groups in order to be active for serving queries.
Because of measurement timing differences, it is not advisable to tune this above 99 percent.
Probability that the top K hits will be the globally best.
Based on this probability, the dispatcher will fetch enough hits from each node to achieve this.
The only way to guarantee a probability of 1.0 is to fetch K hits from each partition.
However, by reducing the probability from 1.0 to 0.99999, one can significantly reduce number of hits fetched
and save both bandwidth and latency.
The number of hits to fetch from each partition is computed as:
where qT is a Student's t-distribution.
With n=10 partitions, k=200 hits and p=0.99999, only 45 hits per partition is needed,
as opposed to 200 when p=1.0.
Use this option to reduce network and container cpu/memory in clusters with many nodes per group -
see Vespa Serving Scaling Guide.
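A sketch of dispatch tuning for a grouped cluster with many nodes per group. The element names are assumptions matching the settings described above, and the values are illustrative:

```xml
<tuning>
    <dispatch>
        <dispatch-policy>adaptive</dispatch-policy>
        <prioritize-availability>true</prioritize-availability>
        <!-- Percentage; values above 99 are not advisable -->
        <min-active-docs-coverage>97</min-active-docs-coverage>
        <!-- Slightly below 1.0 to fetch fewer hits per partition -->
        <top-k-probability>0.99999</top-k-probability>
    </dispatch>
</tuning>
```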
Contained in tuning.
Tuning parameters for the cluster controller managing this cluster - child elements:
If the initialization progress count has not been altered for this amount of seconds,
the node is assumed to have deadlocked and is set down.
Note that initialization may actually be prioritized lower now,
so setting a low value here might cause false positives.
However, if it is set down for the wrong reason,
it will finish initialization and then be set up again.
The transition time states how long (in seconds) a node will be in maintenance mode
during what looks like a controlled restart.
Keeping a node in maintenance mode during a restart allows a restart
without the cluster trying to create new copies of all the data immediately.
If the node has not started or got back up within the transition time,
the node is set down, in which case, new full bucket copies will be created.
Note separate defaults for distributor and storage (i.e. search) nodes.
The maximum number of crashes allowed before a content node is permanently
set down by the cluster controller.
If the node has a stable up or down state for more than the stable-state-period,
the crash count is reset.
However, resetting the count will not re-enable the node again if it has been disabled -
restart the cluster controller to reset.
A ratio for the number of content groups that are allowed to be
down simultaneously. A value of 0.5 means that 50% of the groups are
allowed to be down. The default is to allow only one group to be down
at a time.
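A sketch of cluster controller tuning. The element names are assumptions matching the settings described above, and the values are illustrative, not defaults:

```xml
<tuning>
    <cluster-controller>
        <!-- Seconds without initialization progress before the node is set down -->
        <init-progress-time>600</init-progress-time>
        <!-- Seconds a node stays in maintenance mode during a controlled restart -->
        <transition-time>60</transition-time>
        <max-premature-crashes>4</max-premature-crashes>
        <stable-state-period>7200</stable-state-period>
        <!-- Allow up to half of the groups to be down simultaneously -->
        <groups-allowed-down-ratio>0.5</groups-allowed-down-ratio>
    </cluster-controller>
</tuning>
```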