This documents the syntax and content of schemas, document types and fields.
This is a reference, see schemas for an overview and examples.
Syntax
Throughout this document, a string in square brackets represents some argument.
The whole string, including the brackets, is replaced by a concrete string in a schema.
Constructs in schemas have a regular syntax.
Each element starts by the element identifier,
possibly followed by the name of this particular occurrence of the element,
possibly followed by a space-separated list of interleaved attribute names
and attribute values, possibly followed by the element body.
Thus, one will find elements of these varieties:
The root element of schemas.
A schema describes a type of data and what we should compute over it.
A schema must be defined in a file named [schema-name].sd.
schema [name] inherits [name] {
[body]
}
The inherits attribute is optional.
If a schema is inherited, this schema will include all the constructs of it as if they
were defined in this (except the parent document type).
The document type in this must declare that it inherits the document type of the parent schema.
A field not contained in the document.
Use synthetic fields (outside document) to derive new field values
to be placed in the indexing structure from document fields.
Find examples in reindexing.
Contained in schema and describes a document type.
This can also be the root of the schema, if the document is not to be queried directly.
document [name] inherits [name-list] {
[body]
}
The document name is optional, it defaults to the containing schema
element's name. If there is no containing schema element, the document name is required. If the document with a name is defined inside a schema, the document name must match the schema element's name. The reference to document type in the documentation refers to the document name defined here.
The inherits attribute is optional
and has as value a comma-separated list of names of other document types.
A document type may inherit the fields of one or more other document types,
see document inheritance for examples.
If no document types are explicitly inherited,
the document inherits the generic document type.
The body of a document type is optional and may contain:
Specifies compression options for documents of this document type in storage.
struct
Contained in document.
Defines a composite type.
A struct consists of zero or more fields that the user can access together as one.
The struct has to be defined before it is used as a type in a field specification.
Contained in schema,
document,
struct or
annotation.
Defines a named value with a type and (optionally) how this field
should be stored, indexed, searched, presented and how it should influence ranking.
field [name] type [type-name] {
[body]
}
Do not use names that are used for other purposes in the indexing language
or other places in the schema file. Reserved names are:
attribute
body
case
context
documentid
else
header
hit
host
if
index
position
reference
relevancy
sddocname
summary
switch
tokenize
Other names not to use include any words that start with a number or includes special characters.
The type attribute is mandatory - supported types:
Field type
Description
annotationreference
Use to define a field (inside annotation, or inside e.g. a
struct used by a field in an annotation) with a reference to another annotation.
Should only be used for fields declared inside annotation,
or as a base type by the use of any of the compound types listed above, inside annotation.
To define a such a field, you must first create an annotation type.
The struct must be defined inside the schema.
To declare an annotationreference field in an annotation, use the annotation name to identify the field type:
annotation foo {
field baz type annotationreference<bar> { }
}
annotation bar { }
Index
N/A
Attribute
N/A
Summary
N/A
array<type>
For single-value (primitive) types, use array<type> to create an array field of the element type:
Index
Each element is indexed separately
Attribute
Added as an array attribute
Summary
Added as an array summary field
Also use to create an array field of the given struct type.
The struct type must be defined separately. Example:
struct person {
field first_name type string {}
field last_name type string {}
}
field people type array<person> {
indexing: summary
summary: matched-elements-only
struct-field first_name {
indexing: attribute
attribute: fast-search
}
}
The entire people field is part of document summary.
The struct fieldfirst_name is defined as an attribute for searching,
with fast-search.
A subset, or all, of the struct fields can be defined as attributes.
Use the sameElement
operator to ensure matches in same struct field instance.
Use matched-elements-only
to reduce the amount of data that is returned in document summary.
Important:key and value are reserved words in an array<struct>,
as these are used to implement map.
Do not use these as struct-field names.
Restrictions:
Array of struct types does not support ranking features
and can only be used for matching and filtering.
All struct arrays can be fed, retrieved and used in document summaries.
Some parts of struct arrays can be searched in indexed search mode,
while all parts of struct arrays can be searched in streaming search.
See table below for supported cases.
Index
Only supported in streaming search.
Set this on the top-level struct array field to make all parts searchable.
Attribute
Only supported for struct fields
that have primitive types (string, int, long, byte, float, double).
Any struct field must be defined as an attribute to be used for searching.
The struct type can still contain fields of non-primitive types,
as long as these are not defined as attributes.
Summary
Added as an array summary field
bool
Use for boolean values.
field alive type bool {
indexing: summary | attribute
}
Index
Not supported
Attribute
Added as a boolean
Summary
Added as a boolean value (true or false)
Important:
Defaults to false if not specified.
byte
Use for single 8-bit numbers.
field smallnumber type byte {
indexing: summary | attribute
}
Index
Not supported. An attribute will automatically be used instead
Attribute
Added as a byte which supports range searches
Summary
Added as a byte
double
Use for high precision floating point numbers (64-bit IEEE 754 double).
field mydouble type double {
indexing: summary | attribute
}
Index
Not supported. An attribute will automatically be used instead
Attribute
Added as a 64-bit IEEE 754 double which supports range searches
Summary
Added as a 64-bit IEEE 754 double
float
Use for floating point numbers (32-bit IEEE 754 float).
field myfloat type float {
indexing: summary | attribute
}
Index
Not supported. An attribute will automatically be used instead
Attribute
Added as a 32-bit IEEE 754 float which supports range searches
Summary
Added as a 32-bit IEEE 754 float
int
Use for single 32-bit integers.
field release_year type int {
indexing: summary | attribute
}
Index
Not supported. An attribute will automatically be used instead
Attribute
Becomes integer attributes, which supports range grouping and range searches
Summary
Added as a 32-bit integer
long
Use for single 64-bit integers.
field bignumber type long {
indexing: summary | attribute
}
Index
Not supported. An attribute will automatically be used instead
Attribute
Becomes a 64-bit integer attribute, which supports range grouping and range searches
Summary
Added as a 64-bit integer
map<key-type,value-type>
Use to create a map where each unique key is mapped to a single value.
Any primitive type can be used as key-type and any primitive type or Vespa struct type as value-type.
Example of a map of primitive types,
where the key and value fields are specified as attributes:
field my_map type map<string, int> {
indexing: summary
struct-field key { indexing: attribute }
struct-field value { indexing: attribute }
}
Note that a Vespa map entry is handled as a struct
with a key and value field
with key-type and value-type as types.
This explains the struct-field syntax above.
The full my_map field is configured into the document summary.
A more complex example is a map of struct:
struct person {
field first_name type string {}
field last_name type string {}
field age type int {}
}
field identities type map<string, person> {
indexing: summary
summary: matched-elements-only
struct-field key {
indexing: attribute
attribute: fast-search
}
struct-field value.first_name {
indexing: attribute
attribute: fast-search
}
struct-field value.last_name {
indexing: attribute
attribute: fast-search
}
}
This example illustrates that the struct elements are configured individually -
there is no field configuration for age -
one can define a subset of the struct fields as attributes.
Here, key, value.first_name and value.last_name are defined as attributes.
This makes them available for searching and grouping.
A common use case is requiring matches in the same map entry
(e.g. match both first and last name for the same person),
see the sameElement operator
for how to implement this.
Use matched-elements-only
to reduce the amount of data in the document summary.
fast-search
is used to make query access faster by creating an index structure for lookups.
As a map alternative,
an array<struct> can contain the same element multiple times and maintains order.
Restrictions:
Map of struct or primitive types do not support ranking features
and can only be used for matching and filtering.
All map types can be fed, retrieved and used in document summaries.
Some map types can be searched in indexed search mode,
while all map types can be searched in streaming search.
See table below for supported cases.
Index
Only supported in streaming search.
Set this on the top-level map field to make all struct fields in the map field searchable.
Attribute
Only supported for struct fields where value-type is either a
primitive type (string, int, long, byte, float, double) or a
struct type with fields of primitive types.
Any struct field must be defined as an attribute to be used for searching.
The value-type struct can still contain fields of non-primitive types,
as long as these are not defined as attributes.
Summary
Added as a map.
position
Used to filter and/or rank documents by distance to a position in the query,
see Geo search.
field location type position {
indexing: attribute
}
Index
Not supported
Attribute
Added as an interleaved 64-bit integer
(see Z-order curve) -
queries are implemented by doing a set of range searches in the attribute.
This attribute has fast-search set implicitly
field predicate_field type predicate {
indexing: attribute
index {
arity: 2 # mandatory
lower-bound: 3
upper-bound: 200
dense-posting-list-threshold: 0.25
}
}
Index
Not supported
Attribute
Indexed in-memory in a variable size binary format that is optimized for application during query evaluation
Summary
Added as-is
raw
Use for binary data
field rawfield type raw {
indexing: summary | attribute
}
Index
Not supported
Attribute
Added as raw data. Not searchable.
Summary
Added as raw data. Outputted as a base64-encoded string.
See JSON feed format for details.
reference<document-type>
A reference<document-type> field is a reference to an instance of a document-type -
i.e. a foreign key.
Reference fields are not supported in
streaming search.
field artist_ref type reference<artist> {
indexing: attribute
}
The reference is the document id of the document-type instance.
References are used to join documents in a parent-child relationship.
A reference can only be made to global documents.
The following type of references are not supported:
Self-reference
Cyclic reference: If document type foo has a reference to bar,
then bar cannot have a reference to foo
A reference attribute field can be searched using the document id of the parent document-type instance as query term.
Note that this will be a linear scan as fast-search is not supported.
Index
Invalid - deployment will fail
Attribute
As string - a reference must be an attribute.
Can be empty string or point to a non-existing document.
The reference size is 24 byte in memory, mapped from the ID string to internal references.
Use for a text field of any length.
String fields may only contain text characters, as defined by isTextCharacter in
com.yahoo.text.Text
field surname type string {
indexing: summary | index
}
Index
Refer to linguistics
for details on normalization, tokenization and stemming.
Attribute
Added as-is. match exact or prefix is
supported types of searches in string attributes. Searches are however
case-insensitive. A query for BritneY.spears will match a
document containing BrItNeY.SpEars
Summary
Added as-is
struct
Use to define a field with a struct datatype.
Create a struct type inside the document definition and
declare the struct field in a document or struct using the struct type name as the field type:
struct person {
field first_name type string {}
field last_name type string {}
}
field my_person type person {
indexing: summary
}
Restrictions:
Struct fields can not be searched in indexed search mode
(but array of struct and
map type are searchable, with some restrictions).
Struct fields can be fed, retrieved and used in document summaries.
Index
Only supported in streaming search.
Set this on the top-level field to make all parts searchable.
field tensorfield type tensor<float>(x{},y{}) {
indexing: attribute | summary
}
field tensorfield type tensor<float>(x[2],y[2]) {
indexing: attribute | summary
}
Index
Supported for tensor types with:
One indexed dimension - single vector per document
One or more mapped dimensions and one indexed dimension - multiple vectors per document
Added as-is in an attribute to be used for ranking and nearest neighbor search.
Summary
Added as-is.
uri
Use for URL type matching.
URI-fields are not supported in
streaming search.
Index
The URL is split into the different components which are indexed
separately. Note that only URLs can be indexed this way, not other URIs.
The different components are as defined by the HTTP standard:
Scheme, hostname, port, path, query and fragment. Example:
It is not possible to index uri-typed fields into a common index, i.e. it has
to be indexed separately from other fields. If you need to combine URLs
with other fields you could store it in a string-field instead, but then
you can not search in the different parts of the URL (scheme, hostname,
port, path, query and fragment).
Aliasing also works different for URL fields - you
are allowed to create aliases both to the index (as usual) and to the
components of it. Use
alias [component]: [alias]
to create an alias to a component. For example, given this field:
field surl type uri {
indexing: summary | index
alias: url
alias hostname: site
}
a search in "surl" and "url" will search in the entire url,
while "surl.hostname" or "site" will search the hostname.
Attribute
Not allowed
Summary
Added as-is as a string
weightedset<element-type>
Use to create a multivalue field of the element type,
where each element is assigned a signed 32-bit integer weight.
field tag type weightedset<string> {
indexing: attribute | summary
}
The element type can be any single value type. Prefer not to use floating point number types
like float or double.
To access a weighted set in ranking when
using attribute, see
attribute the match features,
or convert the weighted set to a tensor using the tensorFromWeightedSet(field, dimensionName) feature.
To access a weighted set in ranking when
using index, see
ranking features
for indexed multivalued fields. Note that when using index with weightedset, queries
are matching across elements in the set.
It is possible to specify that a new key should be created if it does not exist before the update,
and that it should be removed if the weight is set to zero -
see the reference.
The weightedset field does not support filtering on weight.
If you need that use the map type and
sameElement query operator -
see this example.
Index
Each token present in the field is indexed separately.
Information indexed includes element number, element weight and a
list of token occurrence positions for each element in which the token is present
Attribute
Added as a multivalue weighted attribute
Summary
Added as a multivalue summary field if this is an attribute
The body of a field is optional for schema,
document and
struct, and disallowed for
annotation. It may contain the following elements:
Name
Occurrence
Description
alias
Zero to many
Make an index or attribute available in queries under an additional name.
This has minimal performance impact and can safely be added to running applications. Example:
A subfield of a field of type struct. The struct must have been defined to
contain this subfield in the struct definition. If you want the subfield to
be handled differently from the rest of the struct, you may specify it within
the body of the struct-field.
Fields can not have default values.
See the document guide for how to auto-set field values.
It is not possible to query for fields without value (i.e. query for NULL) -
see the query language reference.
Fields without value are not returned in query results.
Fields can be declared outside the document block in the schema.
These fields are not part of the document type but behave like regular fields for queries.
Since they are not part of the document they cannot be written directly,
but instead take their values from document fields, using the input expression:
indexing: input my_document_field | embed | summary | index
This is useful e.g. to index a field in multiple ways,
or to change the field value, something which is not allowed with document fields.
When the document field(s) used as input are updated, these fields are updated with them.
struct-field
Contained in field or
struct-field.
Defines how this struct field (a subfield of a struct) should be stored,
indexed, searched, presented and how it should influence ranking.
The field in which this struct field is contained must be of
type struct or a collection of type struct:
struct-field [name] {
[body]
}
The body of a struct field is optional and may contain the following elements:
The indexing statements used to create index structure additions from this field.
For indexed search only attribute is supported, which makes the struct
field a searchable in-memory attribute that can also be used for e.g. grouping and ranking.
For streaming searchindex and summary are supported in addition.
The fields in the fieldset should be as similar as possible in terms of indexing clause and match mode.
If they are not, test the application thoroughly.
Having different match modes for the fields in the fieldset generates a warning during application deployment.
If specific match settings for the fieldset is needed, such as exact, specify it using match:
fieldset myfieldset {
fields: a,b,c
match {
exact
}
}
Use query-commands in the field set to set search settings. Example:
Contained in document.
If a compression level is set within this element,
lz4 compression is enabled for whole documents.
compression {
[body]
}
The body of a compression specification is optional and may contain:
Name
Occurrence
Description
type
Zero to one
LZ4 is the only valid compression method.
level
Zero to one
Enable compression. LZ4 is linear and 9 means HC(high compression)
threshold
Zero to one
A percentage (multiplied by 100) giving the maximum size that
compressed data can have to keep the compressed value.
If the resulting compressed data is higher than this,
the document will be stored uncompressed. Default value is 95.
rank-profile
Contained in schema or equivalently in separate files in the
application package, named
[profile-name].profile in any directory below schemas/[schema-name]/.
A rank profile is a named set of ranking expression functions and
settings which can be
selected in the query).
Whether defined inline in the schema or in a separate .profile file, the syntax of a rank profile is
The inherits list is optional and may contain the name of other rank profiles
in this schema or one it inherits.
Elements not defined in this rank profile will then be inherited from those profiles.
Inheriting multiple profiles which define the same elements leads to an error at deployment.
Current default is 1. If you suspect the fixed cost per thread is too high
increasing this number might be a good idea.
Especially if most of your queries are cheap,
but you have increased the num-threads-per-search
in order to reduce latency for your costly queries covering a lot of documents.
The default might change, or the optimal value might be adaptive rendering overrides ignored or counterproductive.
num-search-partitions
Zero or one
Number of logical partitions the corpus on a searchnode is divided in.
By default, this is the same as num-threads-per-search.
A partition is the smallest unit a search thread will handle.
If you have a locality in time when searching and feeding documents,
you might want to split it into more, smaller partitions.
That way you avoid that one costly partition leaves some threads idle while others are working hard.
If you have 8 threads per search,
you might have 10x as many partitions at 80 reducing max skew with a similar factor.
Note that a value of zero turns on adaptive partitioning which tries to solve this optimally.
Note:
If num-search-partitions is set to 0 (work sharing is enabled),
make sure termwise-limit is set to 1.0 (termwise evaluation is disabled).
This is to avoid redoing termwise evaluation when work is passed from one thread to another.
termwise-limit
Zero or one
If estimated number of hits > corpus * termwise-limit,
it will prune candidates with a cpu cache friendly
TAAT
with the terms not needed for ranking,
prior to doing DAAT.
Current default is 1.0 which turns it off.
A value between 0.05 and 0.20 can be a good starting point.
This is particularly useful if you have many weak filters.
Note that this is a manual override.
The default might change, or the optimal value might be adaptive rendering overrides ignored or counterproductive.
post-filter-threshold
Zero or one
Threshold value (in the range [0.0, 1.0]) deciding if a query with an approximate
nearestNeighbor
operator combined with filters is evaluated using post-filtering instead of the default pre-filtering.
Post-filtering is chosen when the estimated filter hit ratio of the query is larger than this threshold.
The default value is 1.0, which disables post-filtering.
See
Controlling the filtering behavior with approximate nearest neighbor search
for more details.
With post-filtering the targetHits value
used when searching the HNSW index is auto-adjusted in an effort to expose targetHits hits
to first-phase ranking after post-filtering has been applied. The following formula is used:
Threshold value (in the range [0.0, 1.0]) deciding if a query with an approximate
nearestNeighbor
operator combined with filters is evaluated by searching the
HNSW
graph for approximate neighbors with pre-filtering,
or performing an
exact nearest neighbor search with pre-filtering.
The fallback to exact search is chosen when the estimated filter hit ratio of the query is less
than this threshold.
The default value is 0.05.
See
Controlling the filtering behavior with approximate nearest neighbor search
for more details.
Value (in the range [1.0, inf]) used to control the auto-adjustment of
targetHits used when evaluating an approximate
nearestNeighbor
operator with post-filtering.
The default value is 20.0.
Setting this value to 1.0 disables auto-adjustment of targetHits.
See post-filter-threshold for more details.
Threshold value (in the range [0.0, 1.0]) deciding when matching in index fields should be treated as filters.
This happens for query terms with estimated hit ratios (in the
range [0.0, 1.0]) that are above the filter-threshold.
Use this to optimize query performance when searching large text index fields,
by allowing a per query combination of rank: filter and rank: normal behavior.
This parameter can be overridden per index field, see field-level filter-threshold
for a more detailed description with tradeoffs.
In testing with various text datasets (e.g. Wikipedia), a filter-threshold setting of 0.05 has shown to be a good starting point.
Tunes the weakAnd algorithm to automatically
exclude terms and documents with expected low query significance based on
document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
match-phase
Contained in rank-profile.
The match-phase feature lets you increase performance by limiting hits
exposed to first-phase ranking to the highest (lowest) values of some attribute.
The performance gain may be substantial, especially with an expensive first-phase function.
The quality loss is dependent on how well the chosen attribute correlates with the
first-phase score.
Documents which have no value of the chosen attribute will be taken as having the value 0.
match-phase {
attribute: [numeric single value attribute]
order: [ascending | descending]
max-hits: [integer]
}
Name
Description
attribute
The quality attribute that decides which documents are a match if the match phase
estimates that there will be more than max-hits hits.
The attribute must be single-value numeric with fast-search enabled.
It should correlate with the order which would be produced by a full query evaluation.
No default.
order
Whether the attribute should be used in descending order (prefer documents with a high value)
or ascending order (prefer documents with a low value).
Usually it is not necessary to specify this,
as the default value descending is by far the most common.
max-hits
The max hits each content node should attempt to produce in the match phase.
Usually a number like 10000 works well here.
strict
Contained in rank-profile.
True or false. By default, Vespa uses loose type checking, where any query feature used but not defined in
a query profile type is assumed to be a floats.
Set true to cause a deploy failure on missing query property type definitions instead.
strict: true
diversity
Contained in rank-profile.
Diversity is used to guarantee diversity in the different query phases.
If you have match-phase, it will provide diverse results from match-phase to first-phase.
If you have second-phase, it will provide diverse results from first-phase to second-phase.
Specify the name of an attribute that will be used to provide diversity.
Result sets are guaranteed to get at least min-groups
unique values from the diversity attribute from this phase,
but no more than max-hits. For match-phase max-hits = match-phase max-hits.
For second-phase max-hits = rerank-count
A document is considered as a candidate if:
The query has not yet reached the max-hits
number produced from this phase.
The query has not yet reached the max number of candidates in one group.
This is computed by the max-hits
of the phase divided by min-groups
Which attribute to use when deciding diversity.
The attribute must be a single-valued numeric, string or reference type.
min-groups
Specifies the minimum number of groups returned from the phase.
Using this with match-phase
often means one can reduce max-hits.
In second-phase
you might reduce rerank-count and still good and diverse results.
first-phase
Contained in rank-profile.
The config specifying the first phase of ranking.
See
phased ranking with Vespa.
This is the initial ranking performed on all matching documents, you should therefore avoid doing
computationally expensive relevancy calculations here.
By default, this will use the ranking feature nativeRank.
first-phase {
[body]
}
The body of a firstphase-ranking statement consists of:
Specify the ranking expression to be used for first phase of ranking -
see ranking expressions.
keep-rank-count
How many documents to keep the first phase top rank values for. Default value is 10000.
rank-score-drop-limit
Drop all hits with a first phase rank score less than or equal to this floating point number.
Use this to implement a rank cutoff.
Default is -Double.MAX_VALUE
The second format is primarily a convenience feature when using long expressions, enabling them
to be split over multiple lines.
Expressions can also be loaded from a separate file. This is useful when dealing with the long
expressions generated by e.g. MLR. The syntax is:
expression: file:[path-to-expressionfile]
The path is relative to the location of the schema definition file.
The file itself must end with .expression. This suffix is optional in the schema.
Therefore expression: file:mlrranking.expression and
expression: file:mlrranking are identical.
Both refer to a file called mlrranking.expression in the schemas directory.
Any number of ranking features can be listed on each line, separated by space.
inputs
Contained in rank-profile.
List of inputs: Query features consumed by ranking expressions in this profile.
Query features are set either as a request property,
or equivalently from a Searcher, by calling
query.getRanking().getFeatures().put("query(myInput)", myValue).
Query feature types can also be declared in
query profile types,
but declaring inputs in the profile needing them is usually preferable.
Inputs are inherited from inherited profiles.
inputs {
name [type]? (: value)?
}
Name
Description
name
The name of the inputs, written either the full feature name query(myName), or just as name.
type
The type of the constant, either double or a tensor type.
If omitted, the type is double.
value
An optional default module, used if this input is not set in the query.
A number, or a tensor on literal form.
Contained in rank-profile.
List of constants available in ranking expressions, resolved and optimized at configuration time.
Constants are inherited from inherited profiles, and from the schema itself.
constants {
name [type]?: value|file:[path]
}
Name
Description
name
The name of the constant, written either the full feature name constant(myName), or just as name.
type
The type of the constant, either double or a tensor type.
If omitted, the type is double.
value
A number, a tensor on literal form,
or file: followed by a path from the application package root to a
file containing the constant.
The file must be stored in a valid tensor JSON Format
and end with .json. The file may be lz4 compressed, in which case the ending must be
.json.lz4.
Contained in rank-profile.
List of generic properties, in the form of key/value pairs to be used by ranking features.
Examples.
rank-properties {
key: value
}
Name
Description
key
Name of the property.
value
A number or any string. Must be quoted if it contains spacing.
function (inline)? [name]
Contained in rank-profile.
Define a named function that can be referenced as a part of the ranking expression,
or (if having no arguments) as a feature.
A function accepts any number of arguments.
function [name]([arg1], [arg2], [arg3]) {
expression: …
}
You can not include functions that accept arguments in summary features.
Adding the inline modifier will inline this function in the calling expression
if it also has no arguments.
This is faster for small and cheap functions (and more expensive for others).
second-phase
Contained in rank-profile.
The config specifying the second phase of ranking.
See
phased ranking with Vespa.
This is the optional re-ranking phase performed on the top ranking hits from the
first-phase, and where you should put any advanced relevancy calculations.
For example Machine Learned Ranking (MLR) models.
By default, no second-phase ranking is performed.
second-phase {
[body]
}
The body of a secondphase-ranking statement consists of:
Specify the ranking expression to be used for second phase of ranking. (for a description,
see the ranking expression documentation.
Hits not reranked might be re-scored using a linear function to avoid a greater rank score than the worst reranked hit. This linear function will normally attempt to map the first phase rank score range of reranked hits to the reranked rank score range
rank-score-drop-limit
When set, drop all hits with a second phase rank score (possibly a re-scored rank score) less than or equal to this floating point number.
Use this to implement a second phase rank cutoff.
By default, this value is not set. This can also be set in the query.
rerank-count
Optional argument. Specifies the number of hits to be re-ranked in the second phase.
Default value is 100. This can also be set in the query.
Note that this value is local to each node involved in a query.
Hits not reranked might be re-scored.
global-phase
Contained in rank-profile.
The config specifying the global phase of ranking.
See phased ranking with Vespa.
This is an optional re-ranking phase performed on the top ranking hits in
the stateless container after merging hits from all the content nodes.
The "top ranking" here means as scored by the first-phase ranking expression
or (if specified) second-phase ranking expression.
Typically used for computing large ONNX models which would be expensive
to compute on all content nodes.
By default, no global-phase ranking is performed.
global-phase {
[body]
}
The body of a global-phase ranking statement consists of:
Specify the ranking expression to be used for global phase of ranking. (for a description,
see the ranking expression documentation.
rerank-count
Optional argument. Specifies the number of hits to be re-ranked in the global phase.
Default value is 100.
Note for complex setups: Applied to hits from one schema at a time, so if
a query searches in multiple schemas simultaneously, global-phase may run for 100 hits
per schema as default.
rank-score-drop-limit
When set, drop all hits with a global phase rank score (possibly a re-scored rank score) less than or equal to this floating point number.
Use this to implement a global phase rank cutoff.
By default, this value is not set. This can also be set in the query.
If not specified, the features are as specified in the parent profile (if any).
To inherit the features from the parent profile and specify additional features,
specify explicitly that the features should be inherited from the parent as shown below.
Refer to schema inheritance for examples.
The rank features specified here are computed in the
fill phase of multiphased queries.
Note:
Rank-features references in summary-features
are re-calculated during the fill protocol phase
for the hits which made it into the global top ranking hits (from all nodes).
See match-features for an alternative.
If not specified, the features are as specified in the parent profile (if any).
To inherit the features from the parent profile and specify additional features,
specify explicitly that the features should be inherited from the parent as shown below,
also see schema inheritance.
To disable match-features from parent rank profiles, use match-features {}.
match-features is similar to summary-features,
but the rank features specified here are computed in the first protocol phase of
multiprotocol query execution,
also called the match protocol phase.
This gives a different performance trade-off, for details see
feature values in results.
Any number of ranking features separated by space can be listed on each line.
mutate
Contained in rank-profile.
Specifies mutating operations you can do to each of the documents that make it through the 4 query phases,
on-match, on-first-phase, on-second-phase and on-summary.
Prefer to define constants in the rank profiles that need them,
with rank profile inheritance to avoid repetition.
See constants.
Contained in schema.
This defines a named constant tensor located in a file with a given type
that can be used in ranking expressions using the rank feature
constant(name):
constant [name] {
[body]
}
The body of a constant must contain:
Name
Description
Occurrence
file
Path to the file containing this constant, relative from the application package root.
The file must be stored in a valid tensor JSON Format
and end with .json. The file may be lz4 compressed, in which case the ending must be
.json.lz4.
One
type
The type of the constant tensor, refer to
tensor-type-spec for reference.
When an application with tensor constants is deployed,
the files are distributed to the content nodes
before the new configuration is being used by the search nodes.
Incremental changes to constant tensors is not supported.
When changed, replace the old file with a new one and re-deploy the application
or create a new constant with a new name in a new file.
raw-as-base64-in-summary
Contained in schema.
Whether raw fields should be rendered as a base64 encoded string in summary,
the same way as in json feed format,
rather than an escaped string. This is default true.
onnx-model
Contained in rank-profile or schema.
This defines a named ONNX model located in a file
that can be used in ranking expressions using the "onnx" rank feature.
Prefer to define onnx models in the rank profiles using them.
Onnx models are inherited from parent profiles, and from the schema.
onnx-model [name] {
[body]
}
The body of an ONNX model must contain:
Name
Occurrence
Description
file
One
Path to the location of the file containing the ONNX model.
The path is relative to the root of the application package containing this schema.
input
Zero to many
An input to the ONNX model. The ONNX name as given in the model as well as the
source for the input is specified.
output
Zero to many
An output of the ONNX model. The ONNX name as given in the model as well as the
name for use in Vespa is specified. If no output are defined and are not
referred to from the rank feature, the first output defined in the model is used.
gpu-device
Zero or one
Set the GPU device number to use for computation, starting at 0, i.e.
if your GPU is /dev/nvidia0 set this to 0. This must be an Nvidia CUDA-enabled GPU. Currently only
models used in global-phase can make use of GPU-acceleration.
intraop-threads
Zero or one
The number of threads available for running operations with multithreaded implementations.
interop-threads
Zero or one
The number of threads available for running multiple operations in parallel.
This is only applicable for parallel execution mode.
execution-mode
Zero or one
Controls how the operators of a graph are executed, either sequential or parallel.
Contained in schema.
An explicitly defined document summary. By default, a document summary
named default is created. Using this element, other document
summaries containing a different set of fields can be created.
The inherits attribute is optional.
If defined, it contains the name of other document summaries in the same schema (or a parent)
which this should inherit the fields of.
Refer to schema inheritance for examples.
Specifies that summary-features should be omitted from this document summary.
Use this to reduce CPU cost in
multiphase searching
when using multiple document summaries to fill hits,
and only some of them need the summary features that are specified in the
rank-profile.
Contained in field,
schema or
index.
Sets how to stem a field or an index, or how to stem by default.
Read more on stemming.
stemming: [stemming-type]
The stemming types are:
Type
Description
none
No stemming: Keep words unchanged
best
Use the 'best' stem of each word according to some heuristic scoring.
This is the default setting
shortest
Use the shortest stem of each word
multiple
Use multiple stems. Retains all stems returned from the linguistics library
Note:
When combining multiple fields in a fieldset,
all fields should use the same stemming type.
normalizing
Contained in field.
Sets normalizing to be done on this field.
Default is to normalize.
normalizing: [normalizing-type]
Type
Description
none
No normalizing.
dictionary
Contained in field,
and specifies details of the dictionary used in the inverted index of the field.
Applies only to attributes annotated with fast-search.
You can specify either btree or hash, or both.
Note:
Note that prefix search for strings and range search for numeric fields will fall back to full scan
if using hash.
It is primarily intended for use when you have many unique terms with few occurrences (short posting lists),
where the dictionary lookup cost could be significant.
Normally, btree is your best choice as it offers reasonable performance
for both exact, prefix and range type of dictionary lookups.
This is also the default.
Find more details in attribute index structures.
Use hash for fields with high uniqueness (high cardinality),
for example an 'id' field which is unique in the corpus where the posting list is always of size 1.
In addition, one can specify uncased or cased dictionary for string attributes,
default is uncased.
This setting is sanity checked against the field match:casing setting.
In an uncased dictionary,
casing is normalized by lowercasing so that 'bear' equals 'Bear' equals 'BEAR'.
In a cased dictionary, they will all be different.
Example of a string field with a cased hash dictionary. Note that for string fields with
dictionary type hash, the dictionary block must also include cased.
Only supported for tensor field types with at least one mapped dimension.
Ensures that the per-document tensors are stored in-memory using a format that is more optimal for
ranking expression evaluation. This comes at the cost of using more memory.
Without this setting these tensors are serialized in-memory,
which requires de-serialization as part of ranking expression evaluation.
See tensor performance.
paged
This can reduce memory footprint by allowing paging the attribute data out of memory to disk.
Not supported for tensor with fast-rank and
predicate types.
See paged attributes for details.
Do not enable paged before fully understanding the consequences.
Specifies the distance metric to use with the nearestNeighbor
query operator. Only relevant for tensor attribute fields.
mutable
Marks the attribute as a special mutable attribute that can be updated by a
mutate operation during query evaluation.
An attribute is multivalued
if assigning it multiple values during indexing,
by using a multivalued field type like array or map,
or by using e.g. split /
for_each
or by letting multiple fields write their value to the attribute field.
Note that normalizing and
tokenization
is not supported for attribute fields.
Queries in attribute fields are not normalized, nor stemmed.
Use index on fields to enable.
Both index and attribute can be set on a field.
sorting
Contained in attribute or
field.
Specifies how sorting should be done.
sorting : [property]
or
sorting {
[property]
…
}
Property
Description
order
ascending (default) or descending.
Used unless overridden using order by in query.
function
Sort function:
uca (default), lowercase or raw.
Note that if no language or locale is specified in the query, the field,
or generally for the query, lowercase will be used instead of uca.
See order by for details.
strength
UCA sort strength, default primary -
see strength for values.
Values set in the query overrides the schema definition.
locale
UCA locale, default none,
indicating that it is inferred from query.
It should only be set here if the attribute is filled with data in one language only.
See locale for details.
Values set in the query overrides the schema definition.
distance-metric
Specifies the distance metric to use with the nearestNeighbor
query operator to calculate the distance between document positions and the query position.
Only relevant for tensor attribute fields, where each tensor holds one or multiple vectors.
Which distance metric to use depends on the model used to produce the vectors;
it must match the distance metric used during representation learning (model learning).
If you are using an "off-the-shelf" model to vectorize your data, please ensure
that the distance metric matches the distance metric suggested for use with the model.
Different type of vectorization models use different type of distance metrics.
Important:
When changing the distance-metric or max-links-per-node,
the content nodes must be restarted to rebuild the HNSW index - see
changes that require restart but not re-feed
The calculated distance will be used to
select the closest hits for nearestNeighbor query operator, but also to
build the HNSW index (if specified) and
to produce the distance and
closeness ranking features.
distance-metric: [metric]
These are the available metrics; the expressions given for distance and closeness
assume a query vector qv = [x0, x1, ...] and an attribute vector av = [y0, y1, ...]
with same dimension of size n for all vectors.
Only useful for binary tensors using <int8> precision, see note below.
; range: [0,8*n]
euclidean
The default metric is Euclidean distance
which is just the length of a line segment between the two points.
To compute the Euclidean distance directly in a ranking expression
instead of fetching one already computed in a nearestNeighbor query
operator, use the
Euclidean_distance helper function:
function mydistance() {
expression: euclidean_distance(attribute(myembedding), query(myqueryvector), mydim))
}
angular
The angular distance metric computes the angle between the vectors.
Its range is [0,pi], which is the angular distance.
This is also known as ordering by
cosine similarity
where the score function is just the cosine of the angle.
To compute the angular distance directly in a ranking expression, use the
cosine_similarity helper function:
function angle() {
expression: acos(cosine_similarity(attribute(myembedding), query(myqueryvector), mydim))
}
Conversely, the cosine similarity can be recovered from the
distance rank-feature
when using a nearestNeighbor query operator:
If possible, it's slightly better for performance to normalize both query and document vectors to the same
L2 norm and use the prenormalized-angular metric instead; but note that returned
distance and closeness will be different.
dotproduct
The dotproduct distance metric is used to mathematically
transform a "maximum inner product" search into a form where
it can be solved by nearest neighbor search, where the dotproduct
is used as a score directly (large positive dotproducts are
considered "nearby").
Internally an extra dimension is added (ensuring that all vectors
are normalized to the same length) and a distance similar to
prenormalized-angular is used to build the HNSW index.
For details, see
this high level guide based on
section 3.1 Order Preserving Transformations in this paper.
Note that the distance and closeness rank-features will not have the
usual semantic meanings when using the dotproduct distance metric.
In particular, closeness will just return the dot product
which may have any negative
or positive value, and distance is just the negative dot product.
If a normalized closeness in range [0,1] is needed, an appropriate
sigmoid function must be applied.
For example, if your attribute is named "foobar", and the maximum dotproduct seen is
around 4000, the expression
sigmoid(0.001*closeness(field,foobar))
could be a possible choice.
The dotproduct distance metric is useful for some vectorization
models, including matrix factorization, that use "maximum inner
product" (MIP), with vectors that aren't normalized.
These models use both direction and magnitude.
prenormalized-angular
The prenormalized-angular distance metric
must only be used when both query and document vectors are normalized.
This metric was previously named "innerproduct" and required unit length vectors;
the new version computes the length of the query vector once and
assumes all other vectors are of the same length.
Using prenormalized-angular with vectors that are not
normalized causes unpredictable nearest neighbor search, and is
observed to give very bad results both for performance and quality.
The length, magnitude, or norm of a vector x is calculated as length = sqrt(sum(pow(xi,2))).
The unit length normalized vector is then given by [xi/length].
Zero vectors may not be used at all.
The Vespa prenormalized-angular computes the
cosine similarity
and uses 1.0 - cos(angle) as the distance metric.
It gives exactly the same ordering as angular distance,
but with a distance in the range [0,2], since cosine similarity has range [1,-1],
so the end result is 0.0 for same direction vectors,
1.0 for a right angle, and 2.0 for vectors with exactly opposite directions.
Getting the cosine score (or angle) is therefore easy:
To compute the cosine similarity directly in a ranking expression
instead of fetching one already computed in a nearestNeighbor query
operator, use the
cosine_similarity helper function:
function mysimilarity() {
expression: cosine_similarity(attribute(myembedding), query(myqueryvector), mydim))
}
geodegrees
The geodegrees distance metric is only valid for
geographical coordinates (two-dimensional vectors containing
latitude and longitude on Earth, in degrees). It computes the
great-circle distance (in kilometers) between two geographical
points using the
Haversine formula.
See
geodegrees system test
for an example.
hamming
The Hamming distance
metric counts the number of dimensions where the
vectors have different coordinates.
This isn't useful for floating-point data since it means you only
get 1 bit of information from each floating-point number.
Instead, it should be used for binary data where each bit is
considered a separate coordinate. Practically, this means you
should use the
int8cell value type
for your tensor, with the usual encoding from bit pattern to numerical value, for example:
00000000 → 0 (hex 00)
00010001 → 17 (hex 11)
00101010 → 42 (hex 2A)
01111111 → 127 (hex 7F)
10000000 → -128 (hex 80)
10000001 → -127 (hex 81)
11111110 → -2 (hex FE)
11111111 → -1 (hex FF)
Feeding data for this use case may be done with
"hex dump" format
instead of numbers in range [-128,127] both to have a more natural format
for representing binary data, and to avoid the overhead of parsing a large json array of numbers.
bolding
Contained in field or summary.
Highlight matching query terms in the summary:
bolding: on
The default is no bolding, set bolding: on to enable it.
Note that this command is overridden by summary: dynamic.
If both are specified, bolding will be ignored.
The difference between using bolding instead of summary: dynamic
is the latter will provide a dynamic abstract in addition to highlighting query terms,
while the first only highlights.
Bolding is only supported for index fields of type string or array<string>.
The default XML element used to highlight the search terms is <hi> -
to override, set container.qr-searchers configuration. Example using <strong>:
Maximum field byte length for bolding is 64Mb -
field values larger than this will be represented as a snippet as in summary: dynamic.
id
Contained in field.
Sets the numerical id of this field.
All fields have a document-internal id internally for transfer and storage.
Ids are usually determined programmatically as a 31-bit number.
Some storage and transfer space can be saved by instead explicitly setting id's to a 7-bit number.
id: [positive integer]
An id must satisfy these requirements:
Must be a positive integer
Must be less than 100 or larger than 127
Must be unique within the document and all documents this document inherits
index
Contained in field or schema.
Sets index parameters.
Content in string-fields with index is normalized and
tokenized by default.
The field can be single- or multivalued (e.g. array<string>).
Deprecated:
If index-name is specified, it will be used instead of the field name as name of the index.
This use is deprecated, use a synthetic field with the wanted name outside the document block instead -
see an example.
Set the stemming of this index.
Indexes without a stemming setting get their stemming setting from
the fields added to the index. Setting this explicitly is useful if
fields with conflicting stemming settings are added to
this index.
Enable this index field to be used with the
bm25 rank feature.
This creates posting lists for the
indexes
for this field that have interleaved features in the document id streams.
This makes it fast to compute the bm25 score.
See the BM25 reference for details and example use.
Specifies optional parameters for an HNSW index to enable faster, approximate nearest neighbor search.
Only supported for tensor attribute fields with tensor types with:
One indexed dimension - single vector per document
One or more mapped dimensions and one indexed dimension - multiple vectors per document
Used in combination with the
nearestNeighbor
query operator.
hnsw
Contained in index.
Specifies optional parameters for an HNSW index to enable faster, approximate nearest neighbor search
using the nearestNeighbor query operator.
Note:
Specifying the index keyword in the indexing statement of a tensor creates an HNSW index with default settings, even if this block is not specified!
This implements a modified version of the
Hierarchical Navigable Small World (HNSW) graphs algorithm (paper).
Only supported for the following tensor attribute field types:
Single vector per document: Tensor type with one indexed dimension.
Example: tensor<float>(x[3])
Multiple vectors per document: Tensor type with one or more mapped dimensions and one indexed dimension.
Examples: tensor<float>(m{},x[3]), tensor<float>(m{},n{},x[3])
The following parameters are used when building the index graph:
Parameter
Description
max-links-per-node
Specifies how many links per HNSW node to select when building the graph. Default value is 16.
In HNSWlib (implementation based on the paper)
this parameter is known as M.
neighbors-to-explore-at-insert
Specifies how many neighbors to explore when inserting a document in the HNSW graph. Default value is 200.
In HNSWlib this parameter is known as ef_construction.
The distance metric specified on the attribute
is used when building and searching the graph. Example:
field text_embedding type tensor<float>(x[384]) {
indexing: summary | attribute | index
attribute {
distance-metric: prenormalized-angular
}
index {
hnsw {
max-links-per-node: 24
neighbors-to-explore-at-insert: 200
}
}
}
Contained in field or
struct-field.
One or more Indexing Language instructions used to produce index, attribute
and summary data from this field. Indexing instructions has pipeline
semantics similar to unix shell commands. The value of the field
enters the pipeline during indexing and the pipeline puts the value
into the desired index structures, possibly doing transformations and
pulling in other values along the way.
If the field containing this is defined outside the document, it
must start by an indexing statement which
outputs a value (either "input [fieldname]" to fetch a field value,
or a literal, e.g "some-value" ). Fields in documents will use the value of the enclosing
field as input (input [fieldname]) if one isn't explicitly provided.
Specify the operations separated by the pipe (|) character.
For advanced processing needs,
use the indexing language,
or write a document processor.
Supported expressions for fields are:
expression
description
attribute
Attribute is used to make a field available for sorting,
grouping, ranking and searching using match mode word.
index
Creates a searchable index for the values of this field
using match mode text.
All strings are lower-cased before stored in the index.
By default, the index name will be the same as the name of the schema field.
Use a fieldset to combine fields in the same set for searching.
Includes the value of this field in a summary field.
Modify summary output by using summary: (e.g. to generate dynamic teasers).
When combining both index and attribute in the indexing statement for a field,
e.g indexing: summary | attribute | index,
the match mode becomes text for the field.
So searches in this field will not search the contents in the attribute but the index.
Find examples and more details in the Text Matching guide.
match
Contained in field, fieldset or
struct-field.
Sets the matching method to use for this field to something else than the default token matching.
match: [property]
or
match {
[property]
[property]
…
}
Whether the match type is text, word or exact,
all term matching will be done after normalization
and locale independent lowercasing (in that order).
Default for string fields with index. Can not be combined with exact matching.
The field is matched per token.
exact
index, attribute
Can not be combined with text matching.
The field is matched exactly:
Strings containing any characters whatsoever will be indexed and matched as-is.
In queries, the exact match string ends at the exact match terminator (below).
A field with match: exact is considered to be
a filter field, just as if rank: filter was specified.
This is because there is only one word per field
(or per item in the case of multivalued types such as array<string>),
so there is little ranking information.
Turn off the implicit rank: filter by adding rank: normal.
exact-terminator
index, attribute
Only valid for match: exact. Default is @@.
Specify terminator in query strings.
If the query strings can contain @@, set a different terminator,
or use match: word, see below. Example - use:
match {
exact
exact-terminator: "@%"
}
on a field called tag to make query tag:a b c!@%
match documents with the string a b c!
Example using the default terminator: If tag is an exact match field, the query:
someword AND (tag:!*!@@ OR tag:(kanoo)@@)
matches documents with someword
and either !*! or (kanoo) as a tag.
Note that without the @@ terminating the second tag string,
the second tag value would be (kanoo)).
word
index, attribute
This is the default matching mode for string attributes.
It cannot be combined with text matching.
Match word means that the entire content of the field is indexed as a single word.
Word matching is like exact matching, but with more advanced query parsing.
The query terms are heuristically parsed, taking into account some usual query syntax characters;
one can also use double quotes to include space, star, or exclamation marks.
Example: If artist is a string attribute, the query:
foo AND (artist:"'N Sync" OR artist:"*NSYNC" OR artist:A*teens OR artist:"Wham!")
matches documents with foo and at least one of
'N Sync or *NSYNC or A*teens or Wham!
in the artist field
Note that without the quotes, the space in 'N Sync would end that word
and would result in a search for just 'N,
similarly the ! would mean to increase the weight of a Wham term if not quoted.
Set default match mode to substring for the field.
Only available in streaming search.
As the data structures in streaming search support substring searches,
one can always set substring matching in the query,
without setting the field to substring default.
Also see regular expressions.
Enable case-sensitive matching. Only relevant for string attributes.
uncased
index, attribute
Enable case-insensitive matching. This is the default for all string fields.
max-length
index
Limit the length of the field that will be used for matching.
By default, only the first 1M characters are indexed.
Example.
When adjusting this limit, it might also be needed to adjust
max-occurrences.
max-occurrences
index
Configure the max number of occurrences that will be indexed for
each unique token/term in the field for a given document.
It this limit is reached, consecutive occurrences
of the same token/term are ignored for that document.
The default value is 10000.
Adjusting this limit might be needed when using the
phrase,
near, or
onear
query operators
to query documents with large field values (see max-length)
that contain more than 10000 occurrences of common tokens/terms.
When using these operators it is only possible to match among the
first max-occurrences of a token/term in a document.
max-token-length
index
Configure the max length of tokens that will be indexed for the field.
Longer tokens are silently ignored.
The unit is characters (cf. java.lang.String.length()).
The default value is 1000.
gram
index
This field is matched using n-grams.
For example, with the default gram size 2 the string "hi blue" is tokenized to "hi bl lu ue"
both in the index and in queries to the index.
N-gram matching is useful mainly as an alternative to
segmentation in CJK languages.
Typically, it results in increased recall and lower precision.
However, as Vespa usually uses proximity in ranking,
the precision offset may not be of much importance.
Grams consume more resources than other matching methods
because both indexes and queries will have more terms,
and the terms contains repetition of the same letters.
On the other hand, CPU intensive CJK segmentation is avoided.
It may also be used for substring matching in general.
gram-size
index
A positive, nonzero, number, default 2.
Sets the gram size when gram matching is used. Example:
match {
gram
gram-size: 3
}
rank
Contained in field, struct-field or
rank-profile.
Set the kind of ranking calculations which will be done for the field. Even though the
actual ranking expressions decide the ranking, this setting tells Vespa which preparatory calculations
and which data structures are needed for the field.
rank [field-name]: [ranking settings]
or
rank {
[ranking setting]
}
The field name should only be specified when used inside a rank-profile.
The following ranking settings are supported in addition to the default:
Ranking setting
Description
filter
Indicates that matching in this field should use fast bit vector data structures only.
This saves CPU during matching, but only a few simple ranking features will be available for the field.
This setting is appropriate for fields typically used for filtering or simple boosting purposes,
like filtering or boosting on the language of the document.
For index fields, this setting does not change
index formats but helps choose the most compact representation when matching
against the field.
For attribute fields with fast-search this setting builds additional
posting list representations (bit vectors) which can speed up query evaluation significantly.
See feature tuning and
the practical search performance guide.
normal
The reverse of filter.
Matching in this field will use normal data structures and give normal match information for ranking.
Used to turn off implicit rank: filter when using match: exact.
If both filter and normal are set somehow,
the effect is as if only normal was specified.
Related: See the filter query annotation
for how to annotate query terms as filters.
Threshold value (in the range [0.0, 1.0]) deciding when matching in this index field should be treated as a filter.
This happens for query terms with estimated hit ratios (in the
range [0.0, 1.0]) that are above the filter-threshold.
Then fast bitvector data structures are used, similar to when the field is set to rank: filter.
This saves CPU and Disk I/O during matching and typically results in faster query evaluation,
with the downside being that only a boolean signal is available for ranking (the document being a match or not).
BM25 handles this by assuming one occurrence of the query term in the document,
and the field length being equal to the average field length.
Use this to optimize query performance when searching large text index fields with e.g.
the weakAnd query operator and BM25 ranking.
Query terms that are common in the corpus (e.g. stopwords) are treated as filters with faster matching and simplified ranking,
while other query terms are handled as usual with full ranking.
In testing with various text datasets (e.g. Wikipedia),
a filter-threshold setting of 0.05 has shown to be a good starting point.
Read more.
This setting is only relevant for index fields,
and cannot be used in combination with rank: filter.
Has no effect in streaming search.
query-command
Contained in fieldset, field or
struct-field.
Specifies a function to be performed on query terms to the indexes of this field when searching.
The Search Container server has support for writing Vespa Searcher plugins which processes these commands.
query-command: [an identifier or quoted string]
If you write a plugin searcher which needs some index-specific
configuration parameter, that parameter can be set here.
There is one built-in query-command available: phrase-segmenting.
If this is set, terms connected by non-word characters in user queries (such as "a.b")
will be parsed to a phrase item, instead of by default, an AND item where these terms have connectivity
set to 1.
rank-type
Contained in field or
rank-profile.
Selects the low-level rank settings to be used for this field when using nativeRank.
rank-type [field-name]: [rank-type-name]
The field name can be skipped inside fields. Defined rank types are:
Type
Description
identity
Used for fields which contains only what this document
is, e.g. "Title". Complete identity hits will get a
high rank.
about
Some text which is (only) about this document,
e.g. "Description". About hits get high rank on partial
matches and higher for matches early in the text and
repetitive matches.
This is the default rank type.
tags
Used for simple tag fields of type tag. The tags rank type uses a logarithmic table to give more relative boost in the low range: As tags are added they should have significant impact on rank score, but as more and more tags are added, each new tag should contribute less.
empty
Gives no relevancy effect on matches. Used for fields you just
want to treat as filters.
For nativeRank one can specify a rank type per field.
If the supported rank types do not meet requirements,
one can explicitly configure the native rank features using rank-properties.
See the native rank reference for more information.
Tunes the weakAnd algorithm to automatically
exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
weakand {
[body]
}
Note that all document frequency calculations are done using content node-local document
statistics (i.e. global significance
does not have an effect). This means results may differ across different content nodes and/or
content node groups.
The body of a weakand statement consists of:
Property
Occurrence
Description
stopword-limit
Zero to one
A number in the range [0, 1]. Represents the maximum
normalized document frequency a query term can have in the
corpus (i.e. the ratio of all documents where the term occurs at least once) before
it's considered a stopword and dropped entirely from being a part of the
weakAnd evaluation. This makes matching faster at the cost of
potentially producing more hits. Dropped terms are not exposed as part of ranking.
Example:
stopword-limit: 0.60
This will drop all query terms that occur in at least 60% of the documents.
Using stopword-limit is similar to explicitly removing stopwords
from the query up front, but has the benefit of dynamically adapting to the
actual document corpus and not having to know—or specify—a set of stopwords.
Read more.
adjust-target
Zero to one
A number in the range [0, 1] representing
normalized document frequency.
Used to derive a per-query document score threshold, where documents scoring
lower than the threshold will not be considered as potential hits from the
weakAnd operator. The score threshold is selected to be equal to
that of the query term whose document frequency is closest to the
configured adjust-target value.
This can be used to efficiently exclude documents that only match terms that
occur very frequently in the document corpus. Such terms are likely to be stopwords
that have low semantic value for the query, and excluding documents only containing
them is likely to only have a minor impact on recall.
This makes overall matching faster by reducing the number of hits produced by
the weakAnd operator.
Example:
adjust-target: 0.01
This excludes documents that only have terms that occur in more than approximately 1%
of the document corpus. The actual threshold is query-specific and based on the query
term score whose document frequency is closest to 1%.
adjust-target can be used together with stopword-limit
to efficiently prune both terms and documents with low significance when processing queries.
Read more.
Contained in field or
struct-field.
Specifies the name of the document summaries which should contain this field.
summary-to: [summary-name], [summary-name], …
Fields with summary will always be part of the default summary regardless of this setting. Use explicit document-summary instead.
See also document summaries.
The summary name can be skipped if this is set inside a
field. The name will then be the same as the name of the source
field.
full summary is the default. Long field values (like document
content fields) should be made dynamic.
The body of a summary may contain:
Name
Occurrence
Description
full
Zero to one
Returns the full field value in the summary (the default).
bolding: on
Zero to one
Specifies whether the content of this field should be bolded.
Only supported for index fields of type string or array<string>.
dynamic
Zero to one
Make the value returned in results from this summary field be a dynamic abstract of the source
field by extracting fragments of text around matching query terms. Matching query terms will also be highlighted, in
similarity with the bolding feature.
This highlighting is not affected by the query-argument bolding.
The default XML element used to highlight query terms is
<hi> - refer to bolding for how to configure.
dynamic is only supported for index fields of type string or array<string>.
For array<string> fields, a dynamic abstract is created per string item in the array.
source
Zero to one
Specifies the name of the field or fields from which the value of this summary field should be fetched.
If multiple fields are specified, the value will be taken from the first field if that has a value,
from the second if the first one is empty and so on.
source: [field-name], [field-name], …
When this is not specified, the source field is assumed to be the field with the same name as the summary field.
Specifies the name of the document summaries this should be included in.
to: [document-summary-name], [document-summary-name], …
This can only be specified in fields, not in explicit document
summaries. When this is not specified, the field will go to the
default document summary.
matched-elements-only
Zero to one
Specifies that only the matched elements in a searchable
array of primitive,
weightedset,
array of struct or
map type field are returned as part of document summary.
For array of struct or map type fields this is typically used in accordance with the
sameElement operator,
but it can also be used when searching directly on a sub struct field.
Is also supported when the field is imported.
Make the value returned in results from this summary field be an array
of the tokens indexed in the source field. Multiple tokens at the same
location are put into a nested array.
The source field must be specified,
and it must be an index or
attribute field of type string,
array<string> or weightedset<string>. If the source field
is of type weightedset<string> then the summary field is
rendered as if the source field was of type array<string>,
weights are not shown.
Contained in field.
The weight of a field - the default is 100.
The field weight is used when calculating the rank scores.
weight: [positive integer]
weightedset
Contained in field of type weightedset.
Properties of a weighted set.
weightedset: [property]
or
weightedset {
[property]
[property]
…
}
Property
Occurrence
Description
create-if-nonexistent
Zero to one
If the weight of a key is adjusted in a document using a partial update increment or decrement command,
but the key is currently not present, the command will be ignored by default.
Set this to make keys to be created in this case instead.
This is useful when the weight is used to represent the count of the key.
field tag type weightedset<string> {
indexing: attribute | summary
weightedset {
create-if-nonexistent
remove-if-zero
}
}
remove-if-zero
Zero to one
This is the companion of create-if-nonexistent for the converse case:
By default keys may have zero as weight.
With this turned on, keys whose weight is adjusted (or set) to zero, will be removed.
annotation
Contained in schema.
Defines an annotation type, to be used by the Annotations API.
A name of the annotation is mandatory, the body is optional.
annotation [name] {
[body]
}
import field
Contained in schema.
Using a reference to a document type,
import a field from that document type into this schema to be used for matching, ranking, grouping and sorting.
Only attribute fields can be imported.
Importing fields are not supported in
streaming search.
The imported field inherits all but the following properties from the parent field:
Refer to parent/child for a complete example.
Note that the imported field is put outside the document type:
schema myschema {
document myschema {
field parentschema_ref type reference<parentschema> {
indexing: attribute
}
}
import field parentschema_ref.name as parent_name {}
}
Extra restrictions apply for some of the field types:
Field type
Restriction
array of struct
Can be imported if at least one of the struct fields has an attribute.
All struct fields with attributes must have primitive types.
Only the struct fields with attributes will be visible.
map of struct
Can be imported if the key field has an attribute and
at least one of the struct fields has an attribute.
All struct fields with attributes must have primitive types.
Only the key field and the struct fields with attributes will be visible.
map
Can be imported if both key and value fields have primitive types and have attributes.
position
Can be imported if it has an attribute.
array of position
Can be imported if it has an attribute.
To use an imported field in summary, create an explicit
document summary containing the field.
Imported fields can be used to expire documents, but read this first.
Document and search field types
Note that it is possible to make a document field
of one type into one or more instances of another search field, by
declaring a field outside the document, which uses other fields as
input. For example, to create an integer attribute for a
string containing a comma-separated list of integers in the document,
do like this:
schema example {
document example {
field yearlist type string { # Comma-separated years
}
}
field year type array<int> { # Search field using the yearlist value
indexing: input yearlist | split "," | attribute
}
}
Modifying schemas
This section describes how a schema in a live application can be modified—categories:
When running vespa prepare on a new application package,
the changes in the schema files are compared with the files in the current active package.
If some of the changes require restart or re-feed, the output from vespa prepare
specifies which actions are needed.
Important:
For changes that are not covered below,
and no output is returned from vespa prepare,
the impact is undefined and in no way guaranteed to allow a system to stay live until re-feeding.
Changes not related to the schema are discussed
in admin procedures.
Valid changes without restart or re-feed
Procedure:
Run vespa prepare on the changed application
Run vespa activate. The changes will take effect immediately
Changes:
Change
Description
Add a new document field
Add a new document field as index, attribute, summary or any combinations of these.
Existing documents will implicitly get the new field with no content.
Documents fed after the change can specify the new field.
If the field has existed with same type earlier,
then old content may or may not reappear
Remove a document field
Existing documents will no longer see the removed field,
but the field data is not completely removed from the search node
Add or remove an existing document field from document summary
Add an existing field to summary or any number of summary classes,
and remove an existing field from summary or any number of summary classes. Example:
document-summary short-summary {
summary artist {}
}
A change adding an attribute field with a new name to a summary class
using source does not require restart or re-feed:
field artist type string {
indexing: summary | attribute
}
document-summary rename-summary {
summary artist_name {
source: artist
}
}
Remove the attribute aspect from a field that is also an index field
This is the only scenario of changing the attribute aspect of a document field that is allowed without restart
Add, change or remove field sets
Change fieldsets used to group fields together for searching
Change the alias or sorting attribute settings for an attribute field
Add, change or remove rank profiles
Change document field weights
Add, change or remove field aliases
Add, change or remove rank settings for a field
Exception: Changing rank: filter on an attribute field in mode index requires restart.
See details in next section
Add or remove a schema
Removing a schema definition file will make proton
drop all documents of that type - subsequently releasing memory and disk.
Changes that require restart but not re-feed
Procedure:
Run vespa prepare on the changed application.
Output specifies which restart actions are needed
Run vespa activate
Restart services on the services specified in the prepare output
Changes:
Change
Description
Change the attribute aspect of a document field
Add or remove a field as attribute.
When adding, the attribute is populated based on the field value in stored documents during restart.
When removing, the field value in stored documents is updated based on the content in the attribute during restart.
Change the attribute settings for an attribute field
Change the following attribute settings: fast-search, fast-access, fast-rank, paged.
Change the rank filter setting for an attribute field
Add or remove rank: filter on an attribute field.
Change the hnsw index settings for a tensor attribute field
Adding or removing the hnsw index on a tensor attribute field,
or changing the distance-metric or max-links-per-node requires a restart to rebuild the index.
Changing neighbors-to-explore-at-insert requires a restart, but does not rebuild the index.
Change the distance metric for a tensor attribute field
Change, add or remove the distance metric on a tensor attribute field.
If no distance metric is specified, euclidean is used as the default.
Example: Given a content cluster mycluster with mode index:
schema test {
document test {
field f1 type string { indexing: summary }
}
}
Then add field f1 as an attribute:
schema test {
document test {
field f1 type string { indexing: attribute | summary }
}
}
The following is output from vespa prepare -
which restart actions are needed:
WARNING: Change(s) between active and new application that require restart:
In cluster 'mycluster' of type 'search':
Restart services of type 'searchnode' because:
1) Document type 'test': Field 'f1' changed: add attribute aspect
Changes that require reindexing
All the changes listed below require reindexing
of all documents. Unlike re-feed, which requires an external source of data, reindexing is done
using documents stored in Vespa, and is automatic (once triggered). It can also run concurrently
with feed and serving, but until reindexing is complete, affected fields will be empty
or have potentially wrong annotations not matching the query processing. Procedure:
Run vespa prepare on the changed application.
Output specifies which reindexing actions are needed
Run vespa prepare and vespa activate again to start reindexing process
Changes:
Change
Description
Change index aspect of a document field
This changes the document processing pipeline before documents arrive in the backend.
Only documents fed after index aspect was added
will have annotations and be present in the reverse index.
Only documents fed after index aspect was removed
will avoid disk bloat due to unneeded annotations.
Change fields from static to dynamic summary, or vice versa
Switch stemming/normalizing on or off
This changes the document processing pipeline before documents
arrive in the backend, and what annotations are made for an indexed field.
Important:
If not re-feeding after such a change, serving works,
but recall is undefined as the index has been produced using a different setting
than the one used when doing stemming/normalizing of the query terms.
Switch bolding on or off
Add, change or remove match settings for a field
Example: Adding match: word to a field.
This changes the document processing pipeline before documents
arrive in the backend, and what annotations are made for an indexed field.
Important:
If not reindexing after such a change, serving works,
but recall is undefined as the index has been produced using one match mode
while run-time is using a different match mode.
Add or remove a new non-attribute document field from document summary
A change adding an index or summary field
(without attribute)
with a new name to a summary class using source requires re-index:
field artist type string {
indexing: summary | index
}
document-summary rename-summary {
summary artist_name {
source: artist
}
}
Example: Given a content cluster mycluster with mode index:
schema test {
document test {
field f1 type string { indexing: summary }
}
}
Then add field f1 as an index:
schema test {
document test {
field f1 type string { indexing: index | summary }
}
}
The following is output from vespa prepare -
which reindex actions are needed:
WARNING: Change(s) between active and new application that require re-index:
Reindex document type 'test' in cluster 'mycluster' because:
1) Document type 'test': Field 'f1' changed: add index aspect, indexing script: '{ input f1 | summary f1; }' -> '{ input f1 | tokenize normalize stem:"SHORTEST" | index f1 | summary f1; }'
Changes that require re-feed
All the changes listed below require re-feeding of all documents.
Unless a change is listed in the above sections treat it as if it was listed here.
Until re-feed is complete, affected fields will be empty
or have potentially wrong annotations not matching the query processing. Procedure:
Run vespa prepare on the changed application.
Output specifies which re-feed actions are needed
Stop feeding, wait until done
Run vespa activate
Re-feed all documents
Changes:
Change
Description
Change a document field's data type or collection type
Existing documents will no longer have any content for this field.
To populate the field, re-feed the existing documents using the new type for this field.
There will be no automatic conversion from old to new field type.
Important:
If not re-feeding after such a change, serving works,
but searching this field will not give any results
Change a tensor attribute's tensor type
Example: Given a content cluster mycluster with mode index:
schema test {
document test {
field f1 type string { indexing: summary }
}
}
Then change field f1 to hold an int:
schema test {
document test {
field f1 type int { indexing: summary }
}
}
The following is output from vespa prepare -
which re-feed actions are needed:
WARNING: Change(s) between active and new application that require re-feed:
Re-feed document type 'test' in cluster 'mycluster' because:
1) Document type 'test': Field 'f1' changed: data type: 'string' -> 'int'