Indexing Language Reference

This reference documents the full Vespa indexing language. If more complex processing of input data is required, implement a document processor.

The indexing language is analogous to UNIX pipes: a statement consists of expressions separated by the pipe symbol, where the output of each expression is the input of the next. Statements are terminated by a semicolon and are independent of each other (except when using variables).

Indexing script

An indexing script is a sequence of indexing statements separated by a semicolon (;). A script is executed statement-by-statement, in order, one document at a time.

Vespa derives one indexing script per search cluster based on the search definitions assigned to that cluster. As a document is fed to a search cluster, it passes through the corresponding indexing cluster, which runs the document through its indexing script. Note that this also happens whenever the document is reindexed, so expressions such as now must be thought of as the time the document was (last) indexed, not when it was fed.

You can examine the indexing script generated for a specific search cluster by retrieving the configuration of the indexing document processor.

$ vespa-get-config -i search/cluster.<cluster-name> -n vespa.configdefinition.ilscripts

The current execution value is set to null prior to executing a statement.
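
As a minimal sketch of a script with more than one statement, the following writes two input fields to outputs. The field names title and year are hypothetical and only serve as illustration.

```
# Hypothetical fields: title and year are illustrative names only.
# First statement: lowercase the title, then write it to the index field.
input title | lowercase | index title;
# Second statement: the execution value starts out as null again,
# so nothing carries over from the statement above.
input year | attribute year;
```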

Indexing statement

An indexing statement is a sequence of indexing expressions separated by a pipe (|). A statement is executed expression-by-expression, in order.

Within a statement, the execution value is passed from one expression to the next.

The simplest of statements passes the value of an input field into an attribute:

input year | attribute year;

The above statement consists of two expressions: input year and attribute year. The former sets the execution value to the value of the "year" field of the input document. The latter writes the current execution value into the attribute "year".

Indexing expression

Primitives

A string, a numeric literal, or true/false can be used as an expression to explicitly set the execution value. Examples: "foo", 69, true.
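
A short sketch of a primitive used as the first expression of a statement. The attribute name my_label is hypothetical.

```
# The string literal becomes the execution value,
# which is then written to the (hypothetical) attribute my_label.
"foo" | attribute my_label;
```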

Outputs

An output expression is an expression that writes the current execution value to a document field. These expressions also double as the indicator for the type of field to construct (i.e. attribute, index or summary). Note that you cannot assign different values to the same field in a single document (e.g. attribute | lowercase | index is illegal and will not deploy).

Expression Description
attribute Writes the execution value to the current field. During deployment, this indicates that the field should be stored as an attribute.
index Writes the execution value to the current field. During deployment, this indicates that the field should be stored as an index field.
summary Writes the execution value to the current field. During deployment, this indicates that the field should be included in the document summary.
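
To illustrate the rule about not assigning different values to the same field, here is a sketch that contrasts a legal statement with the illegal ordering mentioned above. The field name title is hypothetical, and the statements are assumed to belong to the indexing script of a single field.

```
# Legal: the same lowercased value is written to both the attribute
# and the index of the current field.
input title | lowercase | attribute | index;

# Illegal (will not deploy): the attribute would receive the original
# value while the index would receive the lowercased one.
# input title | attribute | lowercase | index;
```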

Arithmetics

Indexing statements can contain any combination of arithmetic operations, as long as the operands are numeric values. In case you need to convert from string to numeric, or convert from one numeric type to another, use the applicable converter expression. The supported arithmetic operators are:

Operator Description
<lhs> + <rhs> Sets the execution value to the sum of the execution values of the lhs and rhs expressions.
<lhs> - <rhs> Sets the execution value to the result of subtracting the execution value of the rhs expression from that of the lhs expression.
<lhs> * <rhs> Sets the execution value to the product of the execution values of the lhs and rhs expressions.
<lhs> / <rhs> Sets the execution value to the result of dividing the execution value of the lhs expression by that of the rhs expression.
<lhs> % <rhs> Sets the execution value to the remainder of dividing the execution value of the lhs expression by that of the rhs expression.
<lhs> . <rhs> Sets the execution value to the concatenation of the execution value of the lhs expression with that of the rhs expression. If both lhs and rhs are collection types, this operator will append rhs to lhs (if any operand is null, it is treated as an empty collection). If not, this operator concatenates the string representations of lhs and rhs (if any operand is null, the result is null).

You may use parentheses to declare precedence of execution (e.g. (1 + 2) * 3). This also works for more advanced array concatenation statements such as (input str_a | split ',') . (input str_b | split ',') | index arr.
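
The following sketch shows arithmetic on numeric fields, and on string fields that are converted first. All field names (width, height, area, a_str, b_str, total) are hypothetical.

```
# width and height are assumed to be numeric fields.
input width * input height | attribute area;

# a_str and b_str are assumed to be string fields holding numbers;
# each is converted with to_int inside parentheses before the addition.
(input a_str | to_int) + (input b_str | to_int) | attribute total;
```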

Converters

There are several expressions that allow you to convert from one data type to another. These are often used within a for_each to convert e.g. an array of strings to an array of integers.

Converter Input Output Description
embed String A tensor of the type of the receiving field Invokes an embedder to convert a text to a point in a tensor space.
hash String Any string Converts the input to a hash value (using SipHash). The hash will be int or long depending on the target field.
to_array Any Array<inputType> Converts the execution value to a single-element array.
to_byte Any Byte Converts the execution value to a byte. This will throw a NumberFormatException if the string representation of the execution value does not contain a parseable number.
to_double Any Double Converts the execution value to a double. This will throw a NumberFormatException if the string representation of the execution value does not contain a parseable number.
to_float Any Float Converts the execution value to a float. This will throw a NumberFormatException if the string representation of the execution value does not contain a parseable number.
to_int Any Integer Converts the execution value to an int. This will throw a NumberFormatException if the string representation of the execution value does not contain a parseable number.
to_long Any Long Converts the execution value to a long. This will throw a NumberFormatException if the string representation of the execution value does not contain a parseable number.
to_bool Any Bool Converts the execution value to a boolean type. If the input is a string it will become true if it is not empty. If the input is a number it will become true if it is != 0.
to_pos String Position Converts the execution value to a position struct. The input format must be either a) [N|S]<val>;[E|W]<val>, or b) x;y.
to_string Any String Converts the execution value to a string.
to_uri String Uri Converts the execution value to a URI struct.
to_wset Any WeightedSet<inputType> Converts the execution value to a single-element weighted set with default weight.
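
As a sketch of a converter used within for_each, the statement below turns a comma-separated string into an array of integers. The field names csv_numbers and numbers are hypothetical.

```
# csv_numbers is assumed to hold a string such as "1, 2, 3";
# numbers is assumed to be an attribute of type array<int>.
# split produces an array of strings; for_each then trims and
# converts each element in turn.
input csv_numbers | split ',' | for_each { trim | to_int } | attribute numbers;
```

As described in the table above, to_int throws a NumberFormatException if an element does not contain a parseable number.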

Other expressions

The following are the unclassified expressions available:

Expression Description
_

Returns the current execution value. This is useful e.g. to prepend some other value to the current execution value; see the execution value example.

attribute <fieldName>

Writes the execution value to the named attribute field.

base64decode

If the execution value is a string, it is base-64 decoded to a long integer. If it is not a string, the execution value is set to Long.MIN_VALUE.

base64encode

If the execution value is a long integer, it is base-64 encoded to a string. If it is not a long integer, the execution value is set to null.

echo

Prints the execution value to standard output, for debug purposes.

flatten

Sets the execution value to a new string which is the current value with all linguistic annotations written into the string itself. This is useful for testing various tokenization settings on a field.

for_each { <script> }

Executes the given indexing script for each element in the execution value. Here, element refers to each element in a collection, or each field value in a struct.

get_field <fieldName>

Retrieves the value of the named field from the execution value (which needs to be either a document or a struct), and sets it as the new execution value.

get_var <varName>

Retrieves the value of the named variable from the execution context and sets it as the execution value. Note that variables are scoped to the indexing script of the current field.

hex_decode

If the execution value is a string, it is parsed as a long integer in base-16. If it is not a string, the execution value is set to Long.MIN_VALUE.

hex_encode

If the execution value is a long integer, it is converted to a string representation of an unsigned integer in base-16. If it is not a long integer, the execution value is set to null.

hostname

Sets the execution value to the name of the host computer.

if (<lhs> <cmp> <rhs>) {
    <trueScript>
}
[ else { <falseScript> } ]

Executes the trueScript if the conditional evaluates to true, or the falseScript if it evaluates to false. If either lhs or rhs is null, no expression is executed.

index <fieldName>

Writes the execution value to the named index field.

input <fieldName>

Retrieves the value of the named field from the document and sets it as the execution value. The field name may contain '.' characters to retrieve nested struct fields.

join "<delim>"

Creates a single string by concatenating the string representation of each array element of the execution value. This function is useful for indexing data from a multivalue field into a singlevalue field.

lowercase

Lowercases all the strings in the execution value.

ngram <size>

Adds ngram annotations to all strings in the execution value.

normalize

Normalizes the input data. The corresponding query command for this function is normalize.

now

Outputs the current system clock time as a UNIX timestamp, i.e. seconds since 0 hours, 0 minutes, 0 seconds, January 1, 1970, Coordinated Universal Time (Epoch).

random [ <max> ]

Returns a random integer value. Lowest value is 0 and the highest value is determined either by the argument or, if no argument is given, the execution value.

sub-expression1 || sub-expression2 || ...

Returns the value of the first alternative sub-expression which returns a non-null value. See the choice (||) example.

select_input {
( case <fieldName>: <statement>; )*
}

Performs the statement that corresponds to the first named field that is not empty (see the select_input example).

set_language

Parses the string representation of the execution value as an RFC 3066 language tag and sets that language for the current document. This affects the behavior of the tokenizer. The recommended use is to have one field in the document containing the language code and to make it the first field in the document, as set_language only affects the fields defined after it in the schema. Read linguistics for more information on how language settings are applied.

set_var <varName>

Writes the execution value to the named variable. Note that variables are scoped to the indexing script of the current field.

substring <from> <to>

Replaces all strings in the execution value by a substring of the respective value. The arguments are inclusive-from and exclusive-to. Both arguments are clamped during execution to avoid going out of bounds.

split <regex>

Splits the string representation of the execution value into a string array using the given regex pattern. This function is useful for creating multivalue fields such as an integer array out of a string of comma-separated numbers.

summary <fieldName>

Writes the execution value to the named summary field. Summary fields of type string are limited to 64 kB. If a larger string is stored, the indexer will issue a warning and truncate the value to 64 kB.

switch {
( case '<value>': <caseStatement>; )*
[ default: <defaultStatement>; ]
}

Performs the statement of the case whose value matches the string representation of the execution value (see the switch example).

tokenize [ normalize ] [ stem ]

Adds linguistic annotations to all strings in the execution value. Read linguistics for more information.

trim

Removes leading and trailing whitespace from all strings in the execution value.

uri

Converts all strings in the execution value to a URI struct. If a string could not be converted, it is removed.
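
A few of the expressions above are sketched below. All field and variable names are hypothetical, and the statements are assumed to belong to the indexing script of a single field, since variables are scoped to that script.

```
# Keep at most the first 100 characters of a string (from 0 inclusive
# to 100 exclusive). The arguments are clamped, so shorter strings
# do not go out of bounds.
input body | substring 0 100 | summary teaser;

# Join an array of strings into one space-separated string.
input tags | join " " | index tags_text;

# Store an intermediate value in a variable and reuse it in later statements.
input title | trim | lowercase | set_var clean_title;
get_var clean_title | attribute title_sort;
get_var clean_title | summary title_clean;
```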

Execution value example

Accessing the execution value (the value passed into this expression) explicitly is useful when it is to be used as part of an expression such as concatenation. In this example we have a document with a title and an array of sentences, and we prepend the document title (and a space) to each sentence before converting the result to a set of embedding vectors (represented by a 2d mixed tensor).

  input mySentenceArray | for_each { input title . " " . _ } | embed | attribute my2dTensor | index my2dTensor

Choice (||) example

The choice expression is used to provide alternatives if an expression may return null.

  (input myField1 || "") . " " . (input myField2 || "") | embed | attribute | index

In this example two fields are concatenated, but if one of the fields is empty, the empty string is used instead. If the empty string alternatives are not provided, no embedding will be produced if either input field is missing.
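
Alternatives can also be chained. The sketch below falls back from one field to another and finally to a fixed string. The field names nickname, full_name and display_name are hypothetical, and the parentheses are used to group the choice before the pipe, in the same way as in the example above.

```
# Use nickname if it returns a non-null value, otherwise full_name,
# otherwise the literal "unknown".
(input nickname || input full_name || "unknown") | attribute display_name;
```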

select_input example

The select_input expression is used to choose a statement to execute based on which fields are non-empty in the input document:

select_input {
    CX:   input CX | set_var CX;
    CA:   input CA . " " . input CB | set_var CX;
}

This statement executes input CX | set_var CX; if CX is not empty. Otherwise, it executes input CA . " " . input CB | set_var CX; if CA is not empty.

Switch example

The switch-expression behaves similarly to the switch-statement in other programming languages. Each case in the switch-expression consists of a string and a statement. The execution value is compared to each string, and if there is a match, the corresponding statement is executed. An optional default operation (designated by default:) can be added to the end of the switch:

input mt | switch {
    case "audio": input fa | index;
    case "video": input fv | index;
    default: 0 | index;
};

Indexing statements example

Using indexing statements, multiple document fields can be used to produce one index structure field. For example, the index statement:

input field1 . input field2 | attribute field2;

combines field1 and field2 into the attribute named field2. When partially updating documents whose indexing statements combine multiple fields, the following rules apply:

  • Only attributes where all the source values are available in the source document update will be updated
  • The document update will fail when indexed (only) if no attributes end up being updated when applying the rule above

Example: If a schema has the indexing statements

input field1 | attribute field1;
input field1 . input field2 | attribute field2;

the following will happen for the different partial updates:

Partial update contains    Result
field1                     field1 is updated
field2                     The update fails
field1 and field2          field1 and field2 are updated