Document selector language

This document describes the document selector language, used to select a subset of documents when feeding, dumping and garbage collecting data. It defines a text string format that can be parsed to build a parse tree, which in turn can answer whether a given document is contained within the subset or not.

Examples

music.author and music.length <= 1000
This expression selects all documents with the above two conditions set. The first condition states that the documents should be of type music, and the author field must exist. The second states that the field length must be set, and be less than 1000.
book.author = "*John*Doe\n" or not book.author
This expression selects all documents where either of the subexpressions are true. The first one states that the author field should include the name John Doe, with anything in between or in front. The \n escape is converted to a newline before the field comparison is done. Thus requiring the field to end with Doe and a newline for a match to be true. The second expression selects all books where no author is defined.
not (music.length > 1000) and (false or music.test)
Here is an example of how parentheses are used to group expressions. Also, a constant value false has been used. Note that the (false or music.test) sub-expression could be exchanged with just music.test without altering the result of the selection. The sub-expression within the not clause selects all documents where the size field is above 1000 and the test field is defined. The not clause inverts the selection, thus selecting all documents with size less than or equal to 1000 or the test field undefined.
Other examples:
  • music.version() == 3 and (music.givenname + " " + music.surname).lowercase() = "bruce springsteen"
  • id.user.hash().abs() % 300 % 7 = 1
  • music.wavstream.hash() == music.checksum
  • music.size / music.length > 10
  • music.expire > now() - 7200

Case sensitiveness

The identifiers used in this language (and or not true false null id scheme namespace specific user group) are not case sensitive. It is recommended to use lower cased identifiers for consistency with the documentation.

Branch operators / precendence

The branch operators are used to combine other nodes in the parse tree generated from the text format. The different branch nodes existing is listed in the table below in order of precedence. Operators listed in order of precedence:

OperatorDescription
NOT Unary prefix operator inverting the selection of the child node
AND Binary infix operator, which is true if all its children are
OR Binary infix operator, which is true if any of its children are
Use parentheses to define own precedence. a and b or c and d is equivalent to (a and b) or (c and d) since and has higher precedence than or. The expression a and (b or c) and d is not equivalent to the previous two, since parentheses have been used to force the or expression to be evaluated first.

Parentheses can also be used in value calculations. Where modulo % has the highest precedence, multiplication * and division / next, addition + and subtractions - have lowest precedence.

Primitives

PrimitiveDescription
Boolean constant The boolean constants true and false can be used to match all/nothing
Document type A document type can be used as a primitive to select a given type of documents
Document field specification A document field specification (doctype.field) can be used as a primitive to select all documents that have field set - a shorter form of doctype.field != null
Comparison The comparison is a primitive used to compare two values

Comparison

Comparisons operators compares two values using an operator. All the operators are infix and take two arguments.

OperatorDescription
> This is true if the left argument is greater than the right one. Operators using greater than or less than notations only makes sense where both arguments are either numbers or strings. In case of strings, they are ordered by their binary (byte-wise) representation, with the first character being the most significant and the last character the least significant. If the argument is of mixed type or one of the arguments are not a number or a string, the comparison will be invalid and not match.
< Matches if left argument is less than the right one
<= Matches if the left argument is less than or equal to the right one
>= Matches if the left argument is greater than or equal to the right one
== Matches if both arguments are exactly the same. Both arguments must be of the same type for a match
!= Matches if both arguments are not the same
= String matching using a glob pattern. Matches only if the pattern given as the right argument matches the whole string given by the left argument. Asterisk * can be used to match zero or more of any character. Question mark ? can be used to match any one character. The pattern matching operators, regex =~ and glob =, only makes sense if both arguments are strings. The regex operator will never match anything else. The glob operator will revert to the behaviour of == if both arguments are not strings.
=~ String matching using a regular expression. Matches if the regular expression given as the right argument matches the string given as the left argument. Regex notation is like perl. Use '^' to indicate start of value, '$' to indicate end of value

Locale / Character sets

The language currently does not support character sets other than ASCII. Glob and regex matching of single characters are not guaranteed to match exactly one character, but might match a part of a character represented by multiple byte values.

Values

The comparison operator compares two values. A value can be any of the following:

Document field specification

Syntax: <doctype>.<fieldpath>

Documents have a set of fields defined, depending on the document type. The field name is the identifier used for the field. This expression returns the value of the field, which can be an integer, a floating point number, a string, an array, or a map of these types.

For multivalues, we support only the equals operator for comparison. The semantics is that the array returned by the fieldvalue must contain at least one element that matches the other side of the comparison. For maps, there must exist a key matching the comparison.

The simplest use of the fieldpath is to specify a field, but for complex types please refer to the field path syntax documentation.

Id

Syntax: id.[scheme|namespace|type|specific|user|group|order(W,d)]

Each document has a Document Id, uniquely identifying that document within a Vespa installation. The id operator returns the string identifier, or if an optional argument is given, a part of the id.

  • scheme (id)
  • namespace (to separate different users' data)
  • type (specified in the id scheme)
  • specific (User specified part to distinguish documents within a namespace)
  • user (The number specified in document ids using the n= modifier)
  • group (The string group specified in document ids using the g= modifier)

null

The value null can be given to specify nothingness. For instance, a field specification for a document not containing the field will evaluate to null, so the comparison 'music.artist == null' will select all documents that don't have the artist field set. 'id.user == null' will match all documents that don't use the n= document id scheme.

Number

A value can be a number, either an integer or a floating point number. Type of number is insignificant. You don't have to use the same type of number on both sides of a comparison. For instance '3.0 < 4' will match, and '3.0 == 3' will probably match (operator == is generally not advised for floating point numbers due to rounding issues). Numbers can be written in multiple ways - examples:

  1234  -234  +53  +534.34  543.34e4  -534E-3  0.2343e-8

Strings

A string value is given quoted with double quotes (ie. "mystring"). The string is interpreted as an ASCII string. that is, only ASCII values 32 to 126 can be used unescaped, apart from the characters \ and " which also needs to be escaped. Escape common special characters like:

CharacterEscaped character
Newline\n
Carriage return\r
Tab\t
Form feed\f
"\"
Any other character \x## (where ## is a two digit hexadecimal number specifying the ASCII value.

Value arithmetics

In addition to single values you can also do arithmetics on values. The common arithmetics operators addition +, subtraction -, multiplication *, division / and modulo % are supported.

Functions

Functions are called on something and returns a value that can be used in comparison expressions:

Value functions A value function takes a value, does something with it and returns a value which can be of any type.
  • abs() Called on a numeric type, returns the absolute value of that numeric type. That is -3 returns 3 and -4.3 returns 4.3.
  • hash() Calculates an md5 hash of whatever value it is called on. The result is a signed 64 bit integer. (Use abs() after if you want to only get positive hash values).
  • lowercase() Called on a string value to turn upper case characters into lower case ones. NOTE: This only works for the characters 'a' through 'z', no locale support.
Document type functions Some functions can take a document type instead of a value, and return a value based on the type.
  • version() The version() function returns the version number of a document type.

Now function

Document selection provides a now() function, which returns the current date timestamp. Use this to filter documents by timestamp - example: Add a field called inserttimestamp in the search definition, and let the selection string filter out older documents:

music.inserttimestamp > now() - 7200

now() can only be used in a subset of the document selection language. All uses of now() must be on the form:

<doctype>.<field> > now() [ - <seconds> ]

now() can be used with the <documents> tag for search, or in a selection expression for a visitor. Often used in garbage collection selection.

Constraints

Language identifiers restrict what can be used as document type names. The following values are not valid document type names: true, false, and, or, not, id, null

Grammar - EBNF of the language

To simplify, double casing of strings has not been included. The identifiers "null", "true", "false" etc can be written in any case, including mixed case.

nil              = "null" ;
bool             = "true" | "false" ;
posdigit         = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
digit            = '0' | posdigit ;
hexdigit         = digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
					 	   | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' ;
integer          = [ '-' | '+' ], posdigit, { digit } ;
float            = [ '-' | '+' ], digit, { digit },
      			[ '.' , { digit }, [ ('e' | 'E'), posdigit, { digit } ] ] ;
number           = float | integer ;
stdchars         = ? All ASCII chars except '\\', '"', 0 - 31 and 127 - 255 ? ;
alpha            = ? ASCII characters in the range a-z and A-Z ? ;
alphanum         = alpha | digit ;
space            = ( ' ' | '\t' | '\f' | '\r' | '\n' ) ;
string           = '"', { stdchars | ( '\\', ( 't' | 'n' | 'f' | 'r' | '"' ) )
							| ( "\\x", hexdigit, hexdigit ) }, '"' ;
doctype          = alpha, { alphanum } ;
fieldname        = { alphanum '{' |'}' | '[' | ']' '.' } ;
function         = alpha, { alphanum } ;
idarg            = "scheme" | "namespace" | "type" | "specific" | "user" | "group" | "order(W,d)" ;
searchcolumnarg  = integer ;
operator         = ">=" | ">" | "==" | "=~" | "=" | "<=" | "<" | "!=" ;
idspec           = "id", [ '.', idarg ] ;
searchcolumnspec = "searchcolumn", [ '.', searchcolumnarg ] ;
fieldspec        = doctype, ( function | ('.', fieldname) ) ;
value            = ( valuegroup | nil | number | string | idspec | searchcolumnspec | fieldspec ),
				   { function } ;
valuefuncmod     = ( valuegroup | value ), '%',
				   ( valuefuncmod | valuegroup | value ) ;
valuefuncmul     = ( valuefuncmod | valuegroup | value ), ( '*' | '/' ),
				   ( valuefuncmul | valuefuncmod | valuegroup | value ) ;
valuefuncadd     = ( valuefuncmul | valuefuncmod | valuegroup | value ),
				   ( '+' | '-' ),
				   ( valuefuncadd | valuefuncmul | valuefuncmod | valuegroup
				   | value ) ;
valuegroup       = '(', arithmvalue, ')' ;
arithmvalue      = ( valuefuncadd | valuefuncmul | valuefuncmod | valuegroup
				   | value ) ;
comparison       = arithmvalue, { space }, operator, { space },
				   arithmvalue ;
leaf             = bool | comparison | fieldspec | doctype ;
not              = "not", { space }, ( group | leaf ) ;
and              = ( not | group | leaf ), { space }, "and", { space },
				   ( and | not | group | leaf ) ;
or               = ( and | not | group | leaf ), { space }, "or", { space },
				   ( or | and | not | group | leaf ) ;
group            = '(', { space }, ( or | and | not | group | leaf ),
				   { space }, ')' ;
expression       = ( or | and | not | group | leaf ) ;