Rank feature reference

This is the list of the rank features in Vespa. These features are available during document ranking for combination into a complete rank score by a ranking expression. The features are a mix of coarse-grained features suitable for hand-written expressions and finer-grained features suitable for machine learning.

See also the overview of the ranking framework, and rank feature configuration parameters.

  • Types: All rank feature values are floats. Ints are converted to exact whole value floats. String values are converted to exact whole value floats using a hash function. String literals in rank expressions are converted using the same hash function, to enable equality tests on string values.
  • Features which are normalized are between 0 and 1, where 0 is always the minimum and 1 the maximum. Normalized features should normally be preferred because they are more easily combined by ranking expressions into a complete normalized score.
  • A query may override any rank feature value by submitting that value as a feature with the query.
  • Some features have parameters. It is always allowed to quote parameters with ". Nested quotes are not allowed and must be escaped using \. Parameters that can be parsed as feature names may be left unquoted. Examples: foo(bar(baz(5.5))), foo("bar(\"baz(\\\"5.5\\\")\")"), foo("need quote")

Feature list

Feature name Default Description
Query features
query(key) 0 An application specific feature submitted with the query. See the doc on using the query feature for how to set a default value in the rank profile and how to submit the feature value with the query.
term(n).significance 0

A normalized number (between 0.0 and 1.0) describing the significance of the term; used as a multiplier or weighting factor by many other rank features.

This should ideally be set by a searcher in the container for global correctness, as each node will otherwise estimate the significance values from its local corpus. See the Query API for setting significance.

As a fallback, a significance based on Robertson-Sparck-Jones term weighting is used; it is logarithmic from 1.0 for very rare terms down to 0.5 for very common terms (those occurring in every document seen).

Note that "very rare" is defined as a document frequency of 0.000001 or less, i.e. the fraction of all observed documents that contain the term. Consequently, the fallback cannot reach 1.0 until the search process has actually seen a large number of documents (at least one million).

See numTerms config.

term(n).weight 100 The importance of matching this query term given in the query
term(n).connectedness 0.1 The normalized strength with which this term is connected to the previous term in the query. Must be assigned to query terms in a searcher.
queryTermCount 0 The total number of terms in this query, including both user and synthetic terms in all fields.
Document features
fieldLength(name) 1000000 The number of terms in this field if one or more query term matched the field, 1000000 if no query term matched the field.
attribute(name) null The value of a single value numeric attribute, or null/NaN if not set. Use isNan() to check whether the value is undefined. Using undefined values in ranking expressions leads to undefined behavior.
attribute(name,n) 0 The value at index n (base 0) of a numeric array attribute with the given name. The order of the items in an array attribute is the same as the order they have in the input feed. If items are added using partial updates they are added to the end of the existing items list.
attribute(name,key).weight 0 The weight found at a given key in a weighted set attribute
attribute(name,key).contains 0 1 if the given key is present in a weighted set attribute, 0 otherwise
attribute(name).count 0 The number of elements in the attribute with the given name.
Field match features. These features provide a good measure of the degree to which a query matches the text of a field, but are expensive to calculate and therefore often only suitable for second-phase ranking. See the string segment match document for details on the algorithm computing this set.
- normalized
fieldMatch(name) 0 A normalized measure of the degree to which this query and field matched (the long name of this default output is match). Use this if you do not want to create your own combination function of the finer-grained fieldMatch features.
fieldMatch(name).proximity 0

Normalized proximity - a value which is close to 1 when matched terms are close inside each segment, and close to zero when they are far apart inside segments. Relatively more connected terms influence this value more. This is absoluteProximity/average connectedness for the query terms for this field.

Note that if all the terms are far apart, the proximity will be 1, but the number of segments will be high. Proximity is only concerned with closeness within segments, a total score must also take the number of segments into account.

fieldMatch(name).completeness 0

The normalized total completeness, where field completeness is more important:

queryCompleteness * ( 1 - fieldCompletenessImportance ) + fieldCompletenessImportance * fieldCompleteness
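This combination can be sketched in Python (the function name is hypothetical, and the 0.05 default for fieldCompletenessImportance is an assumption for illustration; the configured value applies in practice):

```python
def completeness(query_completeness, field_completeness,
                 field_completeness_importance=0.05):
    """Blend query and field completeness; the importance parameter
    (0.05 assumed here) controls how much fieldCompleteness contributes."""
    return (query_completeness * (1 - field_completeness_importance)
            + field_completeness_importance * field_completeness)

# All query terms matched, but only 10% of the field covered:
print(completeness(1.0, 0.1))
```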
fieldMatch(name).queryCompleteness 0

The normalized ratio of query tokens matched in the field:

matches/query terms searching this field
fieldMatch(name).fieldCompleteness 0

The normalized ratio of field tokens that were matched by the query:

matches/fieldLength
fieldMatch(name).orderness 0

A normalized metric of how well the order of the terms agrees in the chosen segments:

1-outOfOrder/pairs
fieldMatch(name).relatedness 0

A normalized measure of the degree to which different terms are related (occurring in the same segment):

1-(segments-1)/(matches-1)
fieldMatch(name).earliness 0

A normalized measure of how early the first segment occurs in this field.

fieldMatch(name).longestSequenceRatio 0

A normalized metric of the relative size of the longest sequence:

longestSequence/matches
fieldMatch(name).segmentProximity 0

A normalized metric of the closeness (inverse of spread) of segments in the field:

1-segmentDistance/fieldLength
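Several of the ratio-based outputs above follow directly from the raw counts listed under the not-normalized features further down. A Python sketch (the helper name and the zero-denominator fallbacks are assumptions for illustration):

```python
def field_match_ratios(matches, query_terms, field_length, segments,
                       out_of_order, pairs, longest_sequence, segment_distance):
    """Compute the normalized ratio features from raw match counts."""
    return {
        "queryCompleteness": matches / query_terms if query_terms else 0.0,
        "fieldCompleteness": matches / field_length if field_length else 0.0,
        "orderness": 1 - out_of_order / pairs if pairs else 1.0,
        "relatedness": 1 - (segments - 1) / (matches - 1) if matches > 1 else 0.0,
        "longestSequenceRatio": longest_sequence / matches if matches else 0.0,
        "segmentProximity": 1 - segment_distance / field_length if field_length else 0.0,
    }

# A perfect single-segment, in-order match of all 4 query terms in a 10-token field:
ratios = field_match_ratios(matches=4, query_terms=4, field_length=10,
                            segments=1, out_of_order=0, pairs=3,
                            longest_sequence=4, segment_distance=0)
```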
fieldMatch(name).unweightedProximity 0 The normalized proximity of the matched terms, not taking term connectedness into account. This number is close to 1 if all the matched terms are following each other in sequence, and close to 0 if they are far from each other or out of order.
fieldMatch(name).absoluteProximity 0 Returns the normalized proximity of the matched terms, weighted by the connectedness of the query terms. This number is 0.1 if all the matched terms follow each other in sequence and have default or lower connectedness, close to 1 if they follow in sequence and have a high connectedness, and close to 0 if they are far from each other in the segments or out of order.
fieldMatch(name).occurrence 0

Returns a normalized measure of the number of occurrences of the terms of the query. This is 1 if there are many occurrences of the query terms in absolute terms, or relative to the total content of the field, and 0 if there are none.

This is suitable for occurrence in fields containing regular text.

fieldMatch(name).absoluteOccurrence 0

Returns a normalized measure of the number of occurrences of the terms of the query:

$$\frac{\sum_{\text{query terms}} \min(\text{occurrences of the term}, maxOccurrences)}{\text{query term count} \times 100}$$

This is 1 if there are many occurrences of the query terms, and 0 if there are none.

This number is not relative to the field length, so it is suitable for using occurrences to denote relative importance between matched terms (i.e. fields containing keywords, not normal text).
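A Python sketch of this formula (the helper name is hypothetical; maxOccurrences is assumed to be 100, matching the 100 in the denominator):

```python
def absolute_occurrence(occurrences_per_term, max_occurrences=100):
    """Sum each query term's occurrence count, capped at max_occurrences,
    then normalize by (query term count * 100)."""
    capped = sum(min(n, max_occurrences) for n in occurrences_per_term)
    return capped / (len(occurrences_per_term) * 100)

# Two query terms: one saturated at the cap, one absent from the field:
print(absolute_occurrence([250, 0]))  # → 0.5
```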

fieldMatch(name).weightedOccurrence 0 Returns a normalized measure of the number of occurrences of the terms of the query, weighted by term weight. This number is close to 1 if there are many occurrences of highly weighted query terms, in absolute terms or relative to the total content of the field, and 0 if there are none.
fieldMatch(name).weightedAbsoluteOccurrence 0

Returns a normalized measure of the number of occurrences of the terms of the query, taking weights into account so that occurrences of higher weighted query terms have more impact than those of lower weighted terms.

This is 1 if there are many occurrences of the highly weighted terms, and 0 if there are none.

This number is not relative to the field length, so it is suitable for using occurrences to denote relative importance between matched terms (i.e. fields containing keywords, not normal text).

fieldMatch(name).significantOccurrence 0

Returns a normalized measure of the number of occurrences of the terms of the query, in absolute terms or relative to the total content of the field, weighted by term significance.

This number is 1 if there are many occurrences of the highly significant terms, and 0 if there are none.

- normalized and relative to the whole query
fieldMatch(name).weight 0

The normalized weight of this match relative to the whole query: the sum of the weights of all matched terms/the sum of the weights of all query terms. If all the query terms were matched, this is 1. If no terms were matched, or these matches have weight zero, this is 0.

As the sum of this number over all the terms of the query is always 1, sums over all fields of normalized rank features for each field multiplied by this number for the same field will produce a normalized number.

Note that this scales with the number of matched query terms in the field. If you want a component which does not, divide by matches.

fieldMatch(name).significance 0

Returns the normalized term significance of the terms of this match relative to the whole query: the sum of the significance of all matched terms/the sum of the significance of all query terms. If all the query terms were matched, this is 1. If no terms were matched, or if the significance of all the matched terms is zero, this number is zero.

This metric has the same properties as weight.

See the term(n).significance feature for how the significance for a single term is calculated.

fieldMatch(name).importance 0 Returns the average of significance and weight. This has the same properties as those metrics.
- not normalized
fieldMatch(name).segments 0 The number of field text segments which are needed to match the query as completely as possible
fieldMatch(name).matches 0 The total number of query terms that were matched in this field
fieldMatch(name).degradedMatches 0 The number of degraded query terms that were matched in this field. A degraded term is a term where no occurrence information is available during calculation. The number of degraded matches is less than or equal to the total number of matches.
fieldMatch(name).outOfOrder 0 The total number of out of order token sequences within matched field segments
fieldMatch(name).gaps 0 The total number of position jumps (backward or forward) within field segments
fieldMatch(name).gapLength 0 The summed length of all gaps within segments
fieldMatch(name).longestSequence 0 The size of the longest matched continuous, in-order sequence in the field
fieldMatch(name).head 0 The number of tokens in the field preceding the start of the first matched segment
fieldMatch(name).tail 0 The number of tokens in the field following the end of the last matched segment
fieldMatch(name).segmentDistance 0 The sum of the distances between all segments making up a match to the query, measured as the number of token positions separating the start of each adjacent field segment.
Query and field similarity. Normalized feature set measuring the approximate similarity between a field and the query. These features are suitable in cases where the query is as large as the field (i.e. the query is itself a document), such that we are interested in the similarity between the query and the entire field. They are cheap to compute even if the query is very large.
textSimilarity(name) 0 A weighted sum of the individual similarity measures.
textSimilarity(name).proximity 0 A measure of how close together the query terms appear in the field.
textSimilarity(name).order 0 A measure of the order in which the query terms appear in the field compared to the query.
textSimilarity(name).queryCoverage 0 A measure of how much of the query the field covers when a single term from the field can only cover a single term in the query. Query term weights are used during normalization.
textSimilarity(name).fieldCoverage 0 A measure of how much of the field the query covers when a single term from the query can only cover a single term in the field.
Query term and field match features
fieldTermMatch(name,n).firstPosition 1000000 The position of the first occurrence of this query term in this index field. numTerms configuration
fieldTermMatch(name,n).occurrences 0 The number of occurrences of this query term in this index field
matchCount(name) 0 Returns the number of times any term in the query matches this index/attribute field.
matches(name) 0 Returns 1 if the index/attribute field with the given name is matched by the query.
matches(name,n) 0 Returns 1 if the index/attribute field with the given name is matched by the query term with position n.
termDistance(name,x,y).forward 1000000 The minimum distance between the occurrences of term x and term y in this index field. Term x occurs before term y.
termDistance(name,x,y).forwardTermPosition 1000000 The position of the occurrence of term x in this index field used for the forward distance.
termDistance(name,x,y).reverse 1000000 The minimum distance between the occurrences of term y and term x in this index field. Term y occurs before term x.
termDistance(name,x,y).reverseTermPosition 1000000 The position of the occurrence of term y in this index field used for the reverse distance.
Features for indexed multi-value string fields
elementCompleteness(name).completeness 0 A weighted combination of fieldCompleteness and queryCompleteness for the element in the field that produces the highest value for this output after the elements weight is factored in. The weighting can be adjusted using elementCompleteness(name).fieldCompletenessImportance.
elementCompleteness(name).fieldCompleteness 0

The field completeness of the best matching element. This is calculated as:

min( (number of query terms matched in the element) / (element size), 1.0).
elementCompleteness(name).queryCompleteness 0

The query completeness of the best matching element. This is calculated as:

(sum of weight for query terms matched in the element) / (sum of weight for query terms searching the field).
elementCompleteness(name).elementWeight 0 The weight of the best matching element.
elementSimilarity(name) 0 Aggregated similarity between the query and individual field elements. The same sub-scores used by the textSimilarity feature are calculated for each individual element in the field. The final output is calculated as the maximum of the combined element similarity measures (similarity measures are combined the same way as the default output of the textSimilarity feature) multiplied with the element weight. This is a very flexible feature; how sub-scores are combined for each element and how element scores are aggregated may be configured. You may also add additional outputs if you want to capture multiple signals from a single field. Use elementSimilarity to customize this feature.
Attribute match features
- normalized
attributeMatch(name) 0 A normalized measure of the degree to which this query and field matched. This is currently the same as completeness. Note that depending on what the attribute is used for, this may or may not be a suitable metric. If the attribute is a weighted set representing counts of items (like tags), normalizedWeight is probably a better metric.
attributeMatch(name).completeness 0

The normalized total completeness, where field completeness is more important:

queryCompleteness * ( 1 - fieldCompletenessImportance ) + fieldCompletenessImportance * fieldCompleteness

attributeMatch(name).queryCompleteness 0

The query completeness for this attribute:

matches/the number of query terms searching this attribute
attributeMatch(name).fieldCompleteness 0 The normalized ratio of field tokens that were matched by the query. For arrays: matches/array length. For weighted sets: sum of weights of matched terms/sum of weights of the entire set. This is relatively expensive to calculate for large weighted sets.
attributeMatch(name).normalizedWeight 0 A number which is close to 1 if the attribute weights of most matches in a weighted set are high (relative to maxWeight), 0 otherwise
attributeMatch(name).normalizedWeightedWeight 0 A number which is close to 1 if the attribute weights of most matches in a weighted set are high (relative to maxWeight), and where highly weighted query terms has more impact, 0 otherwise
closeness(name) 0

A number which is close to 1 if the position in attribute name is close to the query position compared to maxDistance:

max(1-distance(name)/maxDistance , 0)

Scales linearly with distance, see closeness plot.

closeness(name).logscale 0

A logarithmic-shaped closeness; like normal closeness it goes from 1 to 0, but looks like closeness plot. The function is a logarithmic fall-off based on log(distance + scale) and is calculated as:

$$closeness(name).logscale = \frac{\log(maxDistance + scale) - \log(distance(name) + scale)}{\log(maxDistance + scale) - \log(scale)}$$

where scale is defined using halfResponse and maxDistance:

$$scale = \frac{halfResponse^2}{(maxDistance - 2 × halfResponse)}$$

When distance(name) == halfResponse the function output is 0.5; halfResponse should be less than maxDistance/2 since that means adding a certain distance when you're close matters more than adding the same distance when you're already far away. (Using configuration variable scaleDistance to specify scale explicitly is deprecated).
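A Python sketch of this fall-off (the function name is hypothetical). It illustrates that the output is 1 at zero distance, 0 at maxDistance, and 0.5 at halfResponse:

```python
import math

def logscale_closeness(distance, max_distance, half_response):
    """Logarithmic closeness fall-off; assumes half_response < max_distance / 2
    so that scale stays positive."""
    scale = half_response ** 2 / (max_distance - 2 * half_response)
    return ((math.log(max_distance + scale) - math.log(distance + scale))
            / (math.log(max_distance + scale) - math.log(scale)))

print(logscale_closeness(0.0, 1000.0, 100.0))    # 1.0 at zero distance
print(logscale_closeness(100.0, 1000.0, 100.0))  # ≈ 0.5 at halfResponse
```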

freshness(name) 0

A number which is close to 1 if the timestamp in attribute name is recent relative to the current time, measured against maxAge:

max( 1-age(name)/maxAge , 0 )

Scales linearly with age, see freshness plot.

freshness(name).logscale 0

A logarithmic-shaped freshness; also goes from 1 to 0, but looks like freshness plot. The function is based on -log(age(name) + scale) and is calculated as:

$$\frac{log(maxAge + scale) - log(age(name) + scale)}{log(maxAge + scale) - log(scale)}$$

where scale is defined using halfResponse and maxAge:

$$scale = \frac{-halfResponse^2}{2 × halfResponse - maxAge}$$

When age(name) == halfResponse the function output is 0.5.

- normalized and relative to the whole query
attributeMatch(name).weight 0 This has the same semantics as fieldMatch(name).weight.
attributeMatch(name).significance 0 This has the same semantics as fieldMatch(name).significance.
attributeMatch(name).importance 0 Returns the average of significance and weight. This has the same properties as those metrics.
- not normalized
attributeMatch(name).matches 0 The number of query terms that were matched in this attribute
attributeMatch(name).totalWeight 0 The sum of the weights of the attribute keys matched in a weighted set attribute
attributeMatch(name).averageWeight 0 totalWeight/matches
distance(name) 6400M The Euclidean distance from the query position to the given position attribute, in millionths of degrees (about 10 cm)
distanceToPath(name).distance 6400M The Euclidean distance from a path through 2d space given in the query to the given position attribute, in millionths of degrees. This is useful e.g. for finding the locations closest to a given road. The query path is set in the rankproperty.distanceToPath(name).path query parameter, using the syntax "(x1,y1,x2,y2,..)", also in millionths of degrees; see the distance to path example. The closest point along the path is referred to as the intersection. NOTE: For documents with multiple locations, only the closest location is used for ranking purposes.
distanceToPath(name).traveled 1 The normalized distance along the query path traveled before intersection (0.0 indicates start of path, 0.5 is middle, and 1.0 is end of path).
distanceToPath(name).product 0 The cross-product of the intersected path segment and the intersection-to-document vector. Given that the document was found to lie closest to the path element A->B, the intersected path segment vector is [ B.x - A.x, B.y - A.y ]. Furthermore, given that the intersection of that path element occurred at point I for document location D, the intersection-to-document vector is [ I.x - D.x, I.y - D.y]. This is useful e.g. for finding what side of a path a document exists by looking at the sign of this value.
age(name) 10B The document age in seconds relative to the unit time value stored in the attribute having this name
Features combining multiple fields and attributes
match 0 A normalized average of the fieldMatch and attributeMatch scores of all the searched fields and attributes, where the contribution of each field and attribute is weighted by its weight setting.
match.totalWeight 0 The sum of the weight settings of all the field and attributes searched by the query
match.weight.name 100 The (search definition) weight setting of a field or attribute
Rank scores
nativeRank 0 A reasonably good rank score which is computed cheaply by Vespa, making it a good candidate first-phase ranking function. The value computed by this function may change between Vespa versions. See the native rank reference for more information.
nativeRank(field,...) 0 Same as nativeRank, but only the given set of fields are used in the calculation.
nativeFieldMatch 0 Captures how well query terms match in index fields. Used by nativeRank. See the native rank reference for more information.
nativeFieldMatch(field,...) 0 Same as nativeFieldMatch, but only the given set of index fields are used in the calculation.
nativeProximity 0 Captures how near matched query terms occur in index fields. Used by nativeRank. See the native rank reference for more information.
nativeProximity(field,...) 0 Same as nativeProximity, but only the given set of index fields are used in the calculation.
nativeAttributeMatch 0 Captures how well query terms match in attribute fields. Used by nativeRank. See the native rank reference for more information.
nativeAttributeMatch(field,...) 0 Same as nativeAttributeMatch, but only the given set of attribute fields are used in the calculation.
nativeDotProduct(field) 0

Calculates the sparse dot product between query term weights and match weights for the given field. Let's say we have a weighted set string field X with the content

<item weight=10>x</item><item weight=20>y</item><item weight=30>z</item>

for a particular document. If we issue the query (x!2 OR y!4), the nativeDotProduct(X) feature will have the value 100 (10 * 2 + 20 * 4) for that document.

firstPhase 0 The value of the rank score calculated in the first phase (unavailable in first phase rank expressions)
Global features
now n/a Time at which the query is executed in unix-time (seconds since epoch)
random n/a A pseudorandom number in the range [0, 1) which is drawn once per document during rank evaluation. By default, the current time in microseconds is used as a seed value. Users can specify a seed value by setting random.seed in the rank profile. If you need several independent random numbers the feature can be named like this: random(foo), random(bar).
random.match n/a A pseudorandom number in the range [0, 1) that is stable for a given hit. This means that a hit will always receive the same random score (on a single node). If it is required that the scores be different between different queries, specify a seed value dependent upon the query. By default, the seed value is 1024. Users can specify a seed value by adding the query parameter rankproperty.random.match.seed=<value>. If you need several independent random numbers the feature can be named like this: random(foo).match, random(bar).match.
randomNormal(mean,stddev) 0.0,1.0 Same as random, except the random number is drawn from the Gaussian distribution using the supplied mean and stddev parameters. Can be called without parameters; default values are assumed. Seed is set similarly as random. If you need several independent random numbers with the same parameters, the feature can be named like this: randomNormal(0.0,1.0,foo), randomNormal(0.0,1.0,bar). If the parameters to randomNormal are not the same, you do not need to supply an additional name, e.g. randomNormal(0.0, 0.1) and randomNormal(0.0, 0.5) results in two independent values.
Match operator scores (see Raw scores and query item labeling)
rawScore(field) 0 The sum of all raw scores produced by match operators for this field.
itemRawScore(label) 0 The raw score produced by the query item with the given label.
Utility features
foreach(dimension,variable,feature,condition,operation) n/a

foreach iterates over a set of feature output values and performs an operation on them. Only the values where the condition evaluates to true are considered for the operation. The result of this operation is returned.

  • dimension: Specifies what to iterate over. This can be:
    • terms: All query term indices, from 0 and up to maxTerms.
    • fields: All index field names.
    • attributes: All attribute field names.
  • variable: The name of the variable 'storing' each of the items you are iterating over.
  • feature: The name of the feature you want to use the output value from. Use the variable as part of the feature name, and for each item you iterate over this variable is replaced with the actual item. Note that the variable replacement is a simple string replace so you should use a variable name that is not in conflict with the feature name.
  • condition: The condition used on each feature output value to find out if the value should be considered by the operation. The condition can be:
    • >a: Use feature output if greater than number a.
    • <a: Use feature output if less than number a.
    • true: Use all feature output values.
  • operation: The operation you want to perform on the feature output values. This can be:
    • sum: Calculate the sum of the values.
    • product: Calculate the product of the values.
    • average: Calculate the average of the values.
    • max: Find the max of the values.
    • min: Find the min of the values.
    • count: Count the number of values.

Let's say you want to calculate the average score of the fieldMatch feature for all index fields, but only consider the scores larger than 0. Then you can use the following setup of the foreach feature:

foreach(fields,N,fieldMatch(N),">0",average)

Note that when using the conditions >a and <a they must be quoted in order to pass search definition parsing.

You can also specify a ranking expression in the foreach feature by using the rankingExpression feature. The rankingExpression feature takes the expression as its first and only parameter and outputs the result of evaluating this expression. Let's say you want to calculate the average of the squared fieldMatch feature score for all index fields. Then you can use the following setup of the foreach feature:

foreach(fields,N,rankingExpression("fieldMatch(N)*fieldMatch(N)"),true,average)

Note that you must quote the expression passed in to the rankingExpression feature.
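A Python sketch of the foreach semantics (names hypothetical): keep the per-item feature values matching the condition, then apply the operation:

```python
import math

def foreach(values, condition, operation):
    """Filter values with the condition (">a", "<a" or "true"),
    then aggregate with the requested operation."""
    if condition == "true":
        kept = list(values)
    elif condition.startswith(">"):
        kept = [v for v in values if v > float(condition[1:])]
    elif condition.startswith("<"):
        kept = [v for v in values if v < float(condition[1:])]
    else:
        raise ValueError("unsupported condition: " + condition)
    ops = {"sum": sum, "product": math.prod, "count": len,
           "max": max, "min": min,
           "average": lambda vs: sum(vs) / len(vs) if vs else 0.0}
    return ops[operation](kept)

# Average of hypothetical fieldMatch(N) scores greater than 0:
print(foreach([0.0, 0.4, 0.8], ">0", "average"))  # average of 0.4 and 0.8
```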

dotProduct(name,vector) 0

The sparse dot product of the vector represented by the given weighted set attribute and the vector sent down with the query.

You can also do an ordinary full dot product by using arrays instead of weighted sets. This is a lot faster when you have full vectors in the document with more than 5-10% non-zero values, and you are then not limited to integer weights: all the numeric datatypes can be used with arrays, so you have full floating point support. The 32 bit floating point type yields the fastest execution.

  • name: The name of the weighted set string/integer or array of numeric attribute.
  • vector: The name of the vector sent down with the query.
Each unique string/integer in the weighted set corresponds to a dimension, and the associated weight is the vector component for that dimension. The query vector is set in the rankproperty.dotProduct.vector query parameter, using the syntax {d1:c1,d2:c2,…} where d1 and d2 are dimensions matching the strings/integers in the weighted set and c1 and c2 are the vector components (floating point numbers). The number of dimensions in the weighted set and the query vector do not need to be the same: when calculating the dot product, only the dimensions present in both the weighted set and the query vector are used.

When using an array, the dimensions are non-negative integers starting at 0. If the query vector is sparse, all unspecified dimensions are zero; the same goes for dimensions outside the array size in each document.

Let's say we have a weighted set string attribute X with the content <item weight=10>x</item><item weight=20>y</item> for a particular document. The result of using the feature dotProduct(X,Y) with the query vector rankproperty.dotProduct.Y={x:2,y:4} will then be 100 (10 * 2 + 20 * 4) for this document.
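The same example, sketched in Python (the function name is hypothetical): only dimensions present in both the weighted set and the query vector contribute to the sum:

```python
def sparse_dot_product(weighted_set, query_vector):
    """Dot product over the dimensions shared by the attribute and the query."""
    return sum(weight * query_vector[dim]
               for dim, weight in weighted_set.items()
               if dim in query_vector)

# The document above: X = {x:10, y:20}; query vector {x:2, y:4}:
print(sparse_dot_product({"x": 10, "y": 20}, {"x": 2, "y": 4}))  # → 100
```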

Arrays can be passed down as [w1 w2 w3 …] or on sparse form {d1:c1,d2:c2,…} as is already supported for weighted sets.

NOTE: When the query vector ends up being the same as your query, it is better to annotate your query terms with weights (see term weight) and use the nativeDotProduct feature instead. This runs more efficiently and improves the correlation between results produced by the WAND operator and the final relevance score.

Graphs for selected ranking functions

closeness

The plot above shows the possible outputs from the closeness rank feature using the default maxDistance of 1000 km. The linear(x) graph shows the default closeness output while the other graphs are logscale output for various values of the scaleDistance parameter: 9013.305 (1 km), 45066.525 (5 km - the default value), and 901330.5 (100 km). These values correspond to the following values of the halfResponse parameter: 276154.903 (30.64 km), 593861.739 (65.89 km), and 2088044.581 (231.66 km).

freshness

The plot above shows the possible outputs from the freshness rank feature using the default maxAge of 7776000s (90 days). The linear(x) graph shows the default freshness output while the other graphs are logscale output for various values of the halfResponse parameter: 172800s (2 days), 604800s (7 days - the default value), 1209600s (14 days).