services.xml - 'content'
content [version, id, distributor-base-port] documents [selection, garbage-collection, garbage-collection-interval] document [type, selection, mode] document-processing [cluster, chain] redundancy nodes node [baseport, hostalias, jvmargs??, preload, distribution-key, capacity] group [distribution-key, name] distribution node [baseport, hostalias, jvmargs??, preload, distribution-key, capacity] group [distribution-key, name] engine proton searchable-copies tuning flush-on-shutdown resource-limits disk memory search query-timeout visibility-delay coverage minimum min-wait-after-coverage-factor max-wait-after-coverage-factor dispatch num-dispatch-groups group node [distribution-key] tuning bucket-splitting [max-documents, max-size, minimum-bits] min-node-ratio-per-group distribution [type] maintenance [start, stop, high] merges [max-per-node, max-queue-size] persistence-threads [lowest-priority-to-block-others, highest-priority-to-block] thread [lowest-priority, count] visitors [thread-count, max-queue-size] max-concurrent [fixed, variable] dispatch max-hits-per-partition dispatch-policy min-group-coverage min-active-docs-coverage use-local-node cluster-controller init-progress-time transition-time max-premature-crashes stable-state-period min-distributor-up-ratio min-storage-up-ratio
The root element of a Content cluster definition. Creates a content cluster. A content cluster stores and/or indexes documents. The xml file may have zero or more such tags.
- version (required): Must be set to '1.0' in this version of Vespa.
- id (required for multiple clusters): Name of the content cluster. If none is supplied, the cluster name will be 'content'. Cluster names must be unique within application, so if several clusters are setup, name must be set for all but one at minimum. Suggested set by everyone for cluster to have a meaningful name. Allows you to add clusters later without having to rename existing one for the names to make sense.
- distributor-base-port (optional): If a specific port is required for access to the distributor, override it with this attribute.
Defines which document types should be routed to this content cluster using the default route,
and what documents should be kept in the cluster if the garbage collector runs.
Read more on expiring documents.
Also have some backend specific configuration for whether documents should be searchable or not. Attributes:
A document selection string,
defaults to a selection expression matching everything -
restricts the documents that are routed to this cluster.
This selection can be specified to match document identifier specifics
that are independent of document types.
For restrictions that apply only to a specific document type, this must be done within
that particular document type's |
|garbage-collection||optional||true / false||false||If true, regularly verify the documents stored in the cluster to see if they belong in the cluster, and delete them if not. If false, garbage collection is not run.|
|garbage-collection-interval||optional||integer||3600||Time (in seconds) between garbage collection cycles.|
The document type to be routed to this content cluster. Attributes:
|type||required||string||Document type name|
|mode||required||index / store-only / streaming||
The mode of storing and indexing. In this documentation, index is assumed unless explicitly mentioned streaming or store-only. Refer to streaming search for store-only, as documents are stored the same way for both cases.
Changing mode requires an indexing-mode-change validation override, and documents must be re-fed.
|selection||optional||string||A document selection string, defaults to a selection expression matching everything - restricts the documents that are routed to this cluster. This selection must apply to fields in this document type only. Selection will be merged together with selection for other types and global selection from documents to form a full expression for what documents belong to this cluster.|
|global||optional||true / false||false||
Set to true to distribute all documents of this type to all nodes. Fields in global documents can be imported into documents to implement joins - read more in parent/child. Vespa will detect when a new (or outdated) node is added to the cluster and prevent it from taking part in searches until it has received all global documents.
Note: global is only supported for mode="index".
Vespa Search specific configuration for which document processing cluster and chain to run index pre processing. Attributes:
|cluster||optional||string||Container cluster on content node||Name of a document-processing container cluster that does index pre processing. Use cluster to specify an alternative cluster, other than the default cluster on content nodes.|
A document processing chain in the container cluster specified by cluster to use for index pre processing.
The chain must inherit the |
<container id="my-indexing-cluster" version="1.0"> <document-processing/> </container> <content id="music" version="1.0"> <documents> <document-processing cluster="my-indexing-cluster"/> </documents> </content>To add document processors either before or after the indexer, declare a chain (inherit indexing) in a document-processing container cluster and add document processors. Annotate document processors with
after=indexingEnd. Configure this cluster and chain as the indexing chain in the content cluster - example:
<container id="my-indexing-cluster" version="1.0"> <document-processing> <chain id="my-document-processors" inherits="indexing"> <documentprocessor id="MyDocproc"> <before>indexingStart</before> </documentprocessor> <documentprocessor id="MyOtherDocproc"> <after>indexingEnd</after> </documentprocessor> </chain> </document-processing> </container> <content id="music" version="1.0"> <documents> <document-processing cluster="my-indexing-cluster" chain="my-document-processors" /> </documents> </content>
Defines the total number of copies of each piece of data the cluster will maintain to avoid data loss.
Example: with a redundancy of 2, the system tolerates 1 node failure before data becomes unavailable
(until the system has managed to create new replicas on other online nodes).
redundancy can be changed without node restart.
Defines the set of content nodes in the cluster - parent for node-elements.
Sets the distribution key of a node. It is not recommended to change this for a given node. It is recommended (but not required) that the set of distribution keys in the cluster are contiguous and starting at 0. Example: If the biggest distribution key is 499, then the distribution algorithm need to calculate 500 random numbers to calculate the correct target. It is hence recommended to not leave too many gaps in the distribution key range.
Distribution keys are used to identify nodes and groups for the distribution algorithm. If a node changes distribution key, the distribution algorithm regards it as a new node, hence buckets are redistributed. When merging clusters, one might need to change distribution keys - details on merging clusters.
Content nodes need unique node distribution keys across the whole cluster, as the key is also used as a node identifier where group information is not specified.
|capacity||optional||double||1||Capacity of this node, relative to other nodes. A node with capacity 2 will get double the data and requests of a node with capacity 1.|
group - groups can be nested.
Defines the hierarchical distribution structure of the cluster.
Can not be used in conjunction with the
If a non-flat structure is desired, use this element instead.
Groups can contain other groups or nodes, but not both.
Read more on using groups. Attributes:
|distribution-key||required||integer||Sets the distribution key of a group. It is not allowed to change this for a given group. Group distribution keys only need to be unique among groups that share the same parent group.|
|name||required||string||The name of the group, used for access from status pages and the like.|
There is currently no deployment-time verification that the distribution key remains unchanged for any given node or group. Consequently, take great care when modifying the set of nodes in a content cluster. Assigning a new distribution key to an existing node is undefined behavior; Best case, the existing data will be temporarily unavailable until the error has been corrected. Worst case, risk crashes or data loss.
Example with two groups, where each group has all copies of half of the data set:
<group name="top-group" distribution-key="0"> <distribution partitions="*"/> <group name="bottom-1" distribution-key="0"> <node distribution-key="0" hostalias="node1"/> </group> <group name="bottom-2" distribution-key="1"> <node distribution-key="1" hostalias="node2"/> </group> </group>
distribution (in group)
Defines the data distribution to subgroups of this group.
distribution should not be in the lowest level group containing storage nodes,
as here the ideal state algorithm is used directly.
In higher level groups, distribution is mandatory. Attributes:
- partitions (required, if there are subgroups in the group): String conforming to the partition specification
|*||Place all copies into 1 of N groups|
|*|*||Place all copies into 2 of N groups|
|*|*|*||Place all copies into 3 of N groups|
|n|*||Place n copies in 1 group, the rest in another|
|n|*|*||Place n copies in 1 group, and divide the rest in 2 other groups|
|n||invalid - use * and set redundancy to specify number of copies|
|n|m||invalid - use n|* and set redundancy to specify number of copies in the second group|
|*|m||invalid - non-asterisk values must be placed first in specification|
|n|*|m||invalid - non-asterisk values must be placed first in specification|
Partitions like 1|2|* and 2|1|* are identical. After replacing asterisks with real numbers depending on the redundancy to split, the partitions will be sorted so the highest numbers appear first. This is because the highest priority child will get the first assignment, and then the second highest and so on. The highest priority groups for a bucket keeps most copies, to reduce amount of copies needing to be created and removed when groups go up or down. Thus, the order of numbers in the partition string is irrelevant.
An asterisk is forced to be in an expression to handle changes to global redundancy. Also, when using multiple group levels one might divide different amount of copies, depending on which group bucket has been assigned to. For instance, if global redundancy is 5 and the top level group partitions into 2|*, then one group gets two copies and another group gets 3. But each of these groups will be primary group (3 copies) and secondary groups (2 copies) for a lot of buckets, but it can only configure one partition specification. So if that subgroup then stores 1|*|* for instance, it will store 1|1|1 for the buckets where it is asked to keep 3 copies, and 1|1 for the buckets where it is asked to keep 2 copies.
If the redundancy at some level is lower than the partition spec, Vespa stores less copies than the partition spec - starting leftwards. If the spec is 2|2|* and configures to store 3 copies only, Vespa first fills the primary group with 2, then gives the secondary group the 1 remaining copy, and there will be no third group as there are no more copies to store. This may be a valid case if using multiple levels of groups and a single group needs to store differently for buckets depending on which buckets it is primary or secondary group for.
Specify the content engine to use, and/or adjust tuning parameters for the engine.
Allowed engines are
the latter being used for debugging purposes. If no engine is given, proton is used.
Sub-elements: one of
If specified, the content cluster will use the Proton content engine.
This engine supports storage, indexed search and secondary indices.
Optional sub-elements are
Default value is 2, or redundancy, if lower.
If set to less than redundancy, only some of the stored copies are ready for searching at any time.
This means that node failures causes temporary data unavailability
while the alternate copies are being indexed for search.
The benefit is using less memory, trading off availability during transitions.
Refer to bucket move.
If updating documents or using document selection for garbage collection,
on the subset of attribute fields used for this to make sure that these attributes are always kept
in memory for fast access.
Note that this is only useful if
searchable-copies is less than
searchable-copies can be changed without node restart.
proton. Default value is true.
If set to true, search nodes will flush a set of components (e.g. memory index, attributes) to disk
before shutting down such that the time it takes to flush these components
plus the time it takes to replay the transaction log
after restart is as low as possible.
The time it takes to replay the transaction log depends on the amount of data to replay,
so by flushing, some components before restart the transaction log will be pruned
and we reduce the replay time significantly.
Refer to Proton maintenance jobs.
|writefilter.disklimit||Fraction of total space on the disk partition used before put and update operations are rejected|
|writefilter.memorylimit||Fraction of physical memory that can be resident memory in anonymous mapping by proton before put and update operations are rejected|
<proton> <resource-limits> <disk>0.90</disk> <memory>0.95</memory>
Specifies the query timeout in seconds for queries against the search interface on the content nodes.
The default is 5.0, the max is 600.0.
For query timeout also see the request parameter timeout.
Note: You will not be able to override the configured value using the request parameter timeout.
Specifies the maximum amount of time (in seconds) that should pass from a write operation is
performed, to the change is visible, in search results.
The default value is 0 seconds.
Configuring a larger value then 0 will add a results-oriented cache at the container level
where time to live (ttl) is set to the same value as the visibility-delay.
Note that by increasing this value you should also expect an increase in throughput during batch feeding.
When benchmarking batch feeding for a given test set, we got the following improvements
in throughput when setting
visibility-delay to 4.0 seconds:
+20% during initial feeding, +15% during re-feeding and +120% during removing of 1M documents.
These improvements depend on how many index and attribute fields are in the search definition,
the content of the documents and the
Benchmarking is required to establish the particular improvements for a given application.
Declares the minimum search coverage required before returning the results of a query.
This number is in the range
[0, 1], with 0 being no coverage and 1 being full coverage.
The default is 1; unless configured otherwise a query will not return until all search nodes have responded.
Declares the minimum time for a query to wait for full coverage once the declared
minimum has been reached. This number is a factor that is
multiplied with the time remaining at the time of reaching minimum coverage.
The default is 0; unless configured otherwise a query will return as soon as the minimum coverage has been reached, and the remaining search nodes appear to be lagging.
Declares the maximum time for a query to wait for full coverage once the declared
minimum has been reached.
This number is a factor that is multiplied with the time remaining
at the time of reaching minimum coverage.
The default is 1; unless configured otherwise a query is allowed to wait its full timeout for full coverage even after reaching the minimum.
Defines the multi-level structure of dispatchers (scatter-gather nodes) in this cluster.
By adding this element we get a hierarchy of mid-level dispatchers, ordered in dispatch groups,
with content/search nodes at the leaf level.
This can be used in a system with a huge amount (hundreds) of content/search nodes
where the fan-out from the top-level dispatchers causes the network to be a bottleneck.
In the following example we create 2 mid-level dispatch groups, each containing 3 content/search nodes (referenced by the distribution key of the actual nodes). Each dispatch group also consists of 3 mid-level dispatchers that will be located on the content/search node hosts. The nodes of a dispatch group will typically be located on the same physical switch in a production setup.
In this setup the top-level dispatchers will see 2 mid-level dispatch groups, and each query is passed to 1 of the 3 dispatchers in each group. The mid-level dispatchers will pass the query to all its underlying content/search nodes:
<dispatch> <group> <node distribution-key='0'/> <node distribution-key='1'/> <node distribution-key='2'/> </group> <group> <node distribution-key='3'/> <node distribution-key='4'/> <node distribution-key='5'/> </group> </dispatch>
Defines the number of dispatch groups to be used in the multi-level dispatch setup.
This can be specified instead of explicit dispatch groups.
In this case the content/search nodes of this cluster is automatically assigned to the specified number of dispatch groups
(in the same order they are specified in this cluster).
NOTE: Should NOT be used for production.
group (in dispatch)
node (in dispatch)
Defines a node in a mid-level dispatch group.
This is a reference to the actual content/search node that should be part of this dispatch group.
A mid-level dispatcher will also be located on the host of the content/search node. Attribute:
- distribution-key (required): Reference to the distribution key of the actual content/search node
content, optional. Optional tuning parameters are:
The bucket is the fundamental unit of distribution
and management in a content cluster.
Buckets are auto-split, no need to configure for most applications.
Streaming search latency is linear with bucket size. Attributes:
|max-documents||optional||integer||1024||Maximum number of documents per content bucket. Buckets are split in two if they have more documents than this. Keep this value below 16K.|
|max-size||optional||integer||32MiB||Maximum size (in bytes) of a bucket. This is the sum of the serialized size of all documents kept in the bucket. Buckets are split in two if they have a larger size than this. Keep this value below 100MiB.|
|minimum-bits||optional||integer||Override the ideal distribution bit count configured for this cluster. Prefer to use the distribution type setting instead if the default distribution bit count does not fit the cluster. This variable is intended for testing and to work around possible distribution bit issues. Most users should not need this option.|
States a lower bound requirement on the ratio of nodes within individual groups
that must be online and able to accept traffic before the entire group is automatically taken out of service.
Groups are automatically brought back into service when the availability
of its nodes has been restored to a level equal to or above this limit.
Elastic content clusters are often configured to use multiple groups for the sake of horizontal traffic scaling and/or data availability. The content distribution system will try to ensure a configured number of replicas is always present within a group in order to maintain data redundancy. If the number of available nodes in a group drops too far, it is possible for the remaining nodes in the group to not have sufficient capacity to take over storage and serving for the replicas they now must assume responsibility for. Such situations are likely to result in increased latencies and/or feed rejections caused by resource exhaustion. Setting this tuning parameter allows the system to instead automatically take down the remaining nodes in the group, allowing feed and query traffic to fail completely over to the remaining groups.
Valid parameter is a decimal value in the range [0, 1]. Default is 0, which means that the automatic group out-of-service functionality will not automatically take effect.
Example: assume a cluster has been configured with n groups of 4 nodes each and the following tuning config:
<tuning> <min-node-ratio-per-group>0.75</min-node-ratio-per-group> </tuning>This tuning allows for 1 node in a group to be down. If 2 or more nodes go down, all nodes in the group will be marked as down, letting the n-1 remaining groups handle all the traffic.
This configuration can be changed live as the system is running and altered limits will take effect immediately.
distribution (in tuning)
Lets you tune the distribution algorithm used in the cluster. Attributes:
type (optional): loose | strict | legacy. Defaults to
When the number of a nodes configured in a system changes over certain limits, the system will automatically trigger major redistributions of documents. This is to ensure that the number of buckets is appropriate for the number of nodes in the cluster. This enum value speficies how aggressive the system should be in triggering such distribution changes.
The default of
loosestrikes a balance between rarely altering the distribution of the cluster and keeping the skew in document distribution low. It is recommended that you use the default mode unless you have empirically observed that it causes too much skew in load or document distribution.
Note that specifying
minimum-bitsunder bucket-splitting overrides this setting and effectively "locks" the distribution in place.
Controls the running time of the bucket maintenance process.
Bucket maintenance verifies bucket content for corruption.
Most users should not need to tweak this. Attributes:
- start (required): Time string in HH:MM form, e.g. 02:00 Start of daily maintenance window.
- stop (required): Time string in HH:MM form, e.g. 05:00 End of daily maintenance window.
- high (required): Week day name string, e.g. monday Day of week for starting full file verification cycle (more costly than partial file verification)
tuning. Defines throttling parameters for bucket merge operations. Attributes:
- max-per-node (optional): Maximum number of parallel active bucket merge operations.
- max-queue-size (optional): Maximum size of the merge bucket queue, before reporting BUSY back to the distributors.
Defines the number of persistence threads per partition on each content node.
A content node executes bucket operations against the persistence engine synchronously in each of these threads.
By default, four threads are created that can handle any priority operation,
as well as two threads reserved for high priority operations.
Optionally, add one or more
thread elements. Attributes:
- lowest-priority-to-block-others (optional): Priority indicator (e.g. VERY_HIGH) If an operation has equal to or higher priority than this, operations with low enough priority to be blocked will not be able to start running in other persistence threads for the same partition.
- highest-priority-to-block (optional): Priority indicator (e.g. NORMAL_1) If an operation has a priority lower than or equal to this priority, and there are already operations being processed that have high enough priority to block others, this operation will not be started yet, even if there is a free persistence thread.
Adds a number of threads to process persistence operations on each partition. Attributes
- lowest-priority (optional):
Priority indicator (e.g. NORMAL_1)
The lowest priority operation these threads are allowed to process. Defaults to LOWEST. Note that in this context LOWEST refers to the lowest possible priority. While in the context of setting operation priority, LOWEST is the lowest user settable priority, but the content layer itself can create lower priority operations if it wants.
Note: You should always have at least 1 thread capable of processing operations with any priority, as the priority of internal operations is undefined from the perspective of the end-user and some of these may have a very low priority (but still be important to eventually process). Failing to do so results in operations filling up partition queues that can never be performed.
- count (optional): The number of these threads to create.
- thread-count (optional): The maximum number of threads in which to execute visitor operations. A higher number of threads may increase performance, but may use more memory.
- max-queue-size (optional): Maximum size of the pending visitor queue, before reporting BUSY back to the distributors.
Defines how many visitors can be active concurrently on each storage node.
The number allowed depends on priority - lower priority visitors should not block higher priority visitors completely.
To implement this, you can specify a fixed and a variable number.
For a given visitor, the maximum active allowed is calculated by adjusting the variable component using the priority,
and adding the fixed component. Attributes:
- fixed (optional): Number. The fixed component of the maximum active count.
- variable (optional): Number. The variable component of the maximum active count.
Tune the query-dispatch behavior - child elements:
|max-hits-per-partition||optional||Declares the maximum number of hits to return from a single node. By default, a query returns the requested number of hits + offset from every search node up to the dispatcher, which in turns orders them according to the query, then discards all hits beyond the number requested. In a system with a large fan-out, this can consume a lot of bandwidth. When there is sufficiently many search nodes, assuming an even distribution of the hits, it should suffice to only return some fraction of the request number of hits from each node. Note that changing this number will have global ordering impact. How much is determined by the total number of search nodes involved in the query and the magnitude of the hits/offset parameters.|
|dispatch-policy||optional||round-robin / adaptive||round-robin||
Configure policy for choosing which group/row shall receive the next request. However multi-phase
requests that either requires or benefits from hitting the same group in all phases are always hashed.
|min-group-coverage||optional||100||Coverage required in order to serve from a group - default full coverage. Relevant only for grouped distribution.|
|min-active-docs-coverage||optional||50||Percentage of active documents a group needs to have compared to average of other groups in order to be active for serving queries. Because of measurement timing differences, it is not advisable to tune this above 99 percent. Relevant only for grouped distribution.|
|use-local-node||optional||true / false||false||Specifies that each dispatcher only uses search node(s) on the same host as the dispatcher, so each query is only sent to local search node(s). Note that this tuning should only be used for grouped distribution where the entire document collection is placed on the search node(s) on each host - otherwise this tuning gives in-complete query results. Refer to QPS Scaling in an Indexed Content Cluster|
Tuning parameters for the cluster controller managing this cluster - child elements:
|init-progress-time||optional||If the initialization progress count have not been altered for this amount of seconds, the node is assumed to have deadlocked and is set down. Note that initialization may actually be prioritized lower now, so setting a low value here might cause false positives. Though if it is set down for wrong reason, when it will finish initialization and then be set up again.|
|transition-time||optional||storage_transition_time distributor_transition_time||The transition time states how long (in milliseconds) a node will be in maintenance mode during what looks like a controlled restart. Keeping a node in maintenance mode during a restart allows a restart without the cluster trying to create new copies of all the data immediately. If the node has not started initializing or got back up within the transition time, the node is set down, in which case, new full bucket copies will be created. Note separate defaults for distributor and storage (i.e. search) nodes.|
|max-premature-crashes||optional||max_premature_crashes||The maximum number of crashes allowed before a content node is permanently set down by the cluster controller. If the node has a stable up or down state for more than the stable-state-period, the crash count is reset. However, resetting the count will not reenable the node again if it has been disabled - restart the cluster controller to reset.|
|stable-state-period||optional||stable_state_time_period||If a content node's state doesn't change for this many seconds, it's state is considered stable, clearing the premature crash count.|
|min-distributor-up-ratio||optional||min_distributor_up_ratio||The minimum ratio of distributors that are required to be up for the cluster state to be up.|
|min-storage-up-ratio||optional||min_storage_up_ratio||The minimum ratio of content nodes that are required to be up for the cluster state to be up.|