Query Language - YQL

Vespa accepts unstructured human input and structured queries for application logic separately, then combines them into a single data structure for executing on the search nodes. Human input is parsed heuristically, while application queries are formulated in YQL. Storing the YQL query in query profiles allows for simple URLs executing complex queries.

Vespa search queries are not database queries. The basic assumption is that, ideally, a single document should be returned, and that document should be an optimal answer to the query. Therefore, a query with no terms, for example, does not return all documents; it is an error. A query with only a negative condition, e.g. field not contains "term", is also not possible, because the base state is not filtering away from the total corpus but matching from as small a subset as possible. As a result, YQL in Vespa returns errors for many technically valid YQL queries to avoid confusing search results.

YQL 101

A simple search URL looks like:

http://myhost.mydomain.com:8080/search/?yql=select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22%3B
In other words, yql contains:
select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22%3B
This is a search for all documents where the field named text contains the word blues. The match type is determined by the search definition.

Note that the terminating semicolon must be URL-quoted.
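
As an illustration of the quoting described above, the following is a minimal Python sketch that percent-encodes a YQL statement and appends it to the search URL. The host name is the placeholder from the example above, and the helper names encode_yql and search_url are hypothetical, not part of any Vespa client library.

```python
from urllib.parse import quote

# Placeholder endpoint copied from the example URL above.
SEARCH_ENDPOINT = "http://myhost.mydomain.com:8080/search/"


def encode_yql(yql: str) -> str:
    """Percent-encode a YQL string for use as the value of the yql parameter.

    safe="" makes quote() escape every reserved character, including
    '*', '"' and the terminating ';'.
    """
    return quote(yql, safe="")


def search_url(yql: str) -> str:
    """Build a complete search URL for the given YQL statement."""
    return f"{SEARCH_ENDPOINT}?yql={encode_yql(yql)}"


url = search_url('select * from sources * where text contains "blues";')
print(url)
```

Writing the statement unencoded and quoting it in one place keeps the YQL readable in application code.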

SELECT

SELECT is the list of requested summary fields (a field with the "summary" index attribute). Vespa will hide other fields in the matching documents.

select%20price,isbn%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%3B
The above explicitly requests the fields "price" and "isbn" (from all sources). The matching documents may contain other fields, but they will not be shown. To request all fields, use an asterisk as field selection:
select%20*%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%3B
The above will show all fields for the fetched documents. No overriding of normal Vespa document summary fetching will take place.

FROM SOURCES

FROM specifies which document sources to search. It is handled in a similar way to Vespa's "sources" parameter. Example:

select%20%2A%20from%20music%20where%20title%20contains%20%22madonna%22%3B
searches in the source "music" for the documents matching the "where" filter expression.

Searching in all sources is done with select … from sources * where …, searching in a set of sources is done with select … from sources source1, source2 where …, and searching a single source is done with select … from source1 where …

In other words, the "sources" keyword is used for querying all sources (and then in conjunction with an asterisk), or when explicitly querying more than one source. If only a single source is requested, the "sources" keyword is dropped.
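
The three forms can be captured in a small helper. This is only a sketch of the rule stated above; the function name from_clause is hypothetical.

```python
from typing import Sequence


def from_clause(sources: Sequence[str] = ()) -> str:
    """Return the FROM part of a YQL statement for the given source names."""
    if not sources:
        # No explicit sources: search everything.
        return "from sources *"
    if len(sources) == 1:
        # A single source drops the "sources" keyword.
        return f"from {sources[0]}"
    # Several sources: keep the keyword and list them.
    return "from sources " + ", ".join(sources)


print(from_clause())
print(from_clause(["music"]))
print(from_clause(["source1", "source2"]))
```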

CONTAINS

CONTAINS is the basic building block for text search in Vespa. How the actual search will be done, depends on the field settings in the given application.

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%3B
The field, in our case "title", must be an indexed field or attribute. Attempting to search an unknown field, or a field not configured to be searchable, generates an error message.

The string is searched for in the field exactly as written, though stemming, accent removal, etc. may change its contents. Whether these take effect may be controlled using annotations, as in:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%28%5B%7B%22stem%22%3A%20false%7D%5D%22madonna%22%29%3B
Note the use of parentheses to force the correct precedence.
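
A minimal sketch of how such an annotated term could be assembled in Python follows. It assumes that JSON quoting is an acceptable way to write the annotation map and a plain ASCII term, and the helper name annotated_term is hypothetical.

```python
import json
from urllib.parse import quote


def annotated_term(term: str, **annotations) -> str:
    """Wrap a term and its annotation map in parentheses, as in the example above."""
    # json.dumps renders Python False as the lowercase 'false' used in the map.
    annotation_map = json.dumps(annotations)
    # Assumption: JSON string quoting is sufficient for simple ASCII terms.
    return f"([{annotation_map}]{json.dumps(term)})"


where = "title contains " + annotated_term("madonna", stem=False)
yql = f"select * from sources * where {where};"
print(yql)
print(quote(yql, safe=""))
```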

The "not contains" operation is not implemented, as it would be error prone and have confusing semantics for a search platform.

OR

The OR statement accepts other OR statements, AND statements, userQuery - and contains statements as arguments:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20or%20title%20contains%20%22saint%22%3B

AND

The AND statement accepts other AND statements, OR statements, userQuery, logically inverted statements - and contains statements as arguments:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20and%20title%20contains%20%22saint%22%3B

AND NOT

Since Vespa does recall, as opposed to filtering, the only excluding operator in Vespa is "ANDNOT". In YQL, this means that the right-hand side argument of the AND operator, and only the right-hand side, may be a logically inverted expression, i.e. one using the ! operator:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20and%20%21%28title%20contains%20%22saint%22%29%3B
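
The constraint that a negation needs a positive left-hand side can be made explicit in a query-building helper. The sketch below is an assumption about how an application might do this; and_not is a hypothetical name.

```python
def and_not(positive: str, negative: str) -> str:
    """Combine a positive expression with an excluded one.

    The negated expression is only ever placed on the right-hand side
    of AND, which is the only position described as valid above.
    """
    if not positive.strip() or not negative.strip():
        raise ValueError("both a positive and a negative expression are required")
    return f"{positive} and !({negative})"


where = and_not('title contains "madonna"', 'title contains "saint"')
print(f"select * from sources * where {where};")
```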

ORDER BY

Sorting the result according to attribute vectors in Vespa is supported using order by. Adding asc or desc after the name of an attribute specifies sorting in ascending or descending order, respectively. If no order is explicitly specified, ascending order is implied.

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20order%20by%20price%20asc%2C%20releasedate%20desc%3B
Sorting function, locale and strength are defined using the annotations "function", "locale" and "strength", as in:
select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20order%20by%20%5B%7B%22function%22%3A%20%22uca%22%2C%20%22locale%22%3A%20%22en_US%22%2C%20%22strength%22%3A%20%22IDENTICAL%22%7D%5Dother%20desc%2C%20%5B%7B%22function%22%3A%20%22lowercase%22%7D%5Dsomething%3B
Note: match-phase is enabled when sorting - refer to the sorting reference.

Numeric Operators

The following numeric operators are available: =, <, >, <=, >=

select%20%2A%20from%20sources%20%2A%20where%20500%20%3E%3D%20price%3B
In addition, range(field, lower bound, upper bound) allows explicit matching of numeric ranges.
select%20%2A%20from%20sources%20%2A%20where%20range%28field%2C%200%2C%20500%29%3B
The interval is by default a closed interval. If the lower bound is exclusive, set the annotation "bounds" to "leftOpen". If the upper bound is exclusive, set the same annotation to "rightOpen". If both bounds are exclusive, set the annotation to "open".

These numeric operations support an extra annotation, the integer "hitLimit", which is used for capped range search. An alternative to using negative and positive values for "hitLimit" is to always use a positive number of hits (a negative number of hits does not make much sense) and combine it with either of the boolean annotations "ascending" and "descending" (but not both). Then "[{"hitLimit": 38, "descending": true}]" is equivalent to setting "hitLimit" to -38, i.e. populate with only 38 hits, starting from the upper boundary in descending order.

Numbers must be in the signed 32-bit range. To input 64-bit signed numbers, use "L" as suffix, e.g.:

... where%20range%28field%2C%200%2C%205000000000L%29%3B
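
The rules for the "L" suffix and the "bounds" annotation can be sketched as a small Python helper. The names yql_number and range_expr are hypothetical, and placing the annotation directly in front of range() is an assumption based on how other functions are annotated in this document.

```python
import json
from typing import Optional

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def yql_number(n: int) -> str:
    """Render an integer, adding the L suffix outside the signed 32-bit range."""
    return str(n) if INT32_MIN <= n <= INT32_MAX else f"{n}L"


def range_expr(field: str, lower: int, upper: int, bounds: Optional[str] = None) -> str:
    """Build a range() expression; bounds=None keeps the default closed interval."""
    expr = f"range({field}, {yql_number(lower)}, {yql_number(upper)})"
    if bounds is None:
        return expr
    if bounds not in ("leftOpen", "rightOpen", "open"):
        raise ValueError(f"unknown bounds value: {bounds}")
    return f"[{json.dumps({'bounds': bounds})}]{expr}"


print(range_expr("field", 0, 500))
print(range_expr("field", 0, 5000000000))
print(range_expr("field", 0, 500, bounds="leftOpen"))
```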

Grouping

Group / aggregate results using a grouping expression:

select%20*%20from%20sources%20*%20where%20sddocname%20contains%20%27purchase%27%20%7C%20all(group(customer)%20each(output(sum(price))))%3B
Read more in the grouping documentation.

Pagination

To limit the number of hits returned, use limit and offset to specify a slice:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20limit%2031%20offset%2029%3B
The above will return two hits (if there are sufficiently many hits matching the query), skipping the first 29 documents.
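
As the example suggests, limit acts as the end position of the slice rather than as a hit count: limit 31 with offset 29 yields 31 - 29 = 2 hits. A minimal sketch of the arithmetic, with hypothetical helper names and zero-based page numbers as an assumption:

```python
def slice_clause(offset: int, hits: int) -> str:
    """Build the limit/offset clause returning `hits` hits after skipping `offset`."""
    if offset < 0 or hits < 0:
        raise ValueError("offset and hits must be non-negative")
    # limit is the end of the slice, so it is offset + hits.
    return f"limit {offset + hits} offset {offset}"


def page_clause(page: int, hits_per_page: int) -> str:
    """Clause for a zero-based page number."""
    return slice_clause(page * hits_per_page, hits_per_page)


print(slice_clause(29, 2))
print(page_clause(2, 10))
```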

paged and next are not supported.

TIMEOUT

Set query timeout in milliseconds using timeout:

select%20%2A%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%20timeout%2070%3B
Only literal numbers are allowed, e.g. setting another unit is not supported.

Regular expressions

Limited support for searching with regular expressions is provided. By replacing contains with matches, Vespa will do a regex search instead. This example becomes a substring search:

select%20%2A%20from%20sources%20%2A%20where%20title%20matches%20%22madonna%22%3B
This example matches madona, madonna, and any other spelling with one or more ns:
select%20%2A%20from%20sources%20%2A%20where%20title%20matches%20%22mado%5Bn%5D%2Ba%22%3B
Note that regular expression characters such as [, ] and + must be URL-encoded when using HTTP, as in the example above (%5B, %5D and %2B).
Here you match any string starting with mad:
select%20%2A%20from%20sources%20%2A%20where%20title%20matches%20%22%5Emad%22%3B
Vespa supports regular expressions with POSIX extended syntax, with the limitation that matching is case insensitive.

Note: Only attribute fields in documents that have mode="index" are supported. It is also not optimized. Having a prefix using the ^ will be faster than not having one.
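
The three patterns above can be tried locally before sending a query. The sketch below uses Python's re module as a stand-in; this is an assumption, since Python's syntax is not identical to POSIX extended syntax, although it agrees for simple patterns like these. The function name matches is hypothetical.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Approximate the matches operator: unanchored, case-insensitive search."""
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


# Without anchors, a pattern behaves as a substring search.
print(matches("madonna", "The Madonna Collection"))
# One or more n characters between "mado" and "a".
print(matches("mado[n]+a", "madona"), matches("mado[n]+a", "madoa"))
# Anchored at the start of the value.
print(matches("^mad", "Madness"), matches("^mad", "nomad"))
```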

Search specific operators

The right hand side argument of the contains operator is either a string literal, or a phrase. YQL has no native definition of e.g. phrase matching. Here the Vespa integration uses a function:

select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20phrase%28%22st%22%2C%20%22louis%22%2C%20%22blues%22%29%3B
Other Vespa-specific search operators, like wand, are expressed using functions as well.

It may be necessary to pass along extra information about a search term, for instance when specifying a term should not be stemmed before matching. This is done by using YQL annotations:

select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%28%5B%7B%22stem%22%3A%20false%7D%5D%22blues%22%29%3B

Near and Ordered Near

near() and onear() (ordered near) are operators which match if all argument terms occur close to each other in the same document. onear() additionally requires the terms in the document to have the same order as given in the function (i.e. it is a phrase allowing other words interleaved).

near() and onear() support the annotation distance, which controls how many words are allowed to separate the argument terms. The default value is 2.

select%20%2A%20from%20sources%20%2A%20where%20description%20contains%20%28%5B%20%7B%22distance%22%3A%20100%7D%20%5Donear%28%22a%22%2C%20%22b%22%29%29%3B

Term Equivalence

If two terms in the same field should give exactly the same behavior when matched, use the equiv() operator, which behaves like a special case of "or".

select%20%2A%20from%20sources%20%2A%20where%20fieldName%20contains%20equiv%28%22A%22%2C%22B%22%29%3B

Term Annotations

Terms and phrases can be annotated to manipulate the precise behavior of the search platform.

Annotations supported by strings

These annotations are supported by the string arguments to functions like phrase() and near(), and also by the string argument to the "contains" operator.

"nfkc": true|false NFKC normalization. Default on.
"implicitTransforms": true|false Implicit term transformations (field defaults), default on. If implicitTransforms is active, the settings for the field in the search definition will be honored in term transforms, e.g. if the field has stemming, this term will be stemmed. If implicitTransforms are turned off, the search backend will receive the term exactly as written in the initial YQL expression. This is in other words a top level switch to turn off all other stemming, accent removal, Unicode normalizations and so on.
"annotations": {"string": "string"} Custom term annotations. This is by default empty.
"origin": {"original": "string", "offset": int, "length": int} The (sub-)string which produced this term. Default unset.
"usePositionData": true|false Use position data for ranking algorithm. Default true. This is term position, not to be confused with geo searches
"stem": true|false Stem this term if it is the setting for this field, default on.
"normalizeCase": true|false Normalize casing of this term if it is the setting for this field, default on.
"accentDrop": true|false Remove accents from this term if it is the setting for this field, default on.
"andSegmenting": true|false Force phrase or AND operator if re-segmenting (e.g. in stemming) this term results in multiple terms. Default is choosing from language settings.
"prefix": true|false Do prefix matching for this word. Default false. ("Search for "word*".")
"suffix": true|false Do suffix matching for this word. Default false. ("Search for "*word".")
"substring": true|false Do substring matching for this word if available in the index. Default false. ("Search for "*word*".") Only supported for streaming search.

Annotations supported by strings and functions

These annotations are supported by strings and by the functions which are handled like leaf nodes internally in the query tree: phrase(), near(), onear(), range(), equiv(), weightedSet(), weakAnd() and wand().

"id": int Unique ID used for e.g. connectivity.
"connectivity": {"id": int, "weight": double} Map with the ID and weight of the explicit connectivity of this item.
"significance": double Significance value for ranking.
"annotations": {"string": "string"} Custom annotations. No special semantics inside the YQL layer.
"filter": true|false Regard this term as a "filter" term. Default false.
"ranked": true|false Include this term for ranking calculation. Default true.
"label": "string" Label for referring to this term during ranking.
"weight": int Term weight, used in some ranking calculations.

Annotations of sub-expressions

Consider the following query:

select%20%2A%20from%20sources%20%2A%20where%20%28%5B%7B%22stem%22%3A%20false%7D%5D%28foo%20contains%20%22a%22%20and%20bar%20contains%20%22b%22%29%29%20or%20foo%20contains%20%28%5B%7B%22stem%22%3A%20false%7D%5D%22c%22%29%3B
The "stem" annotation controls whether a given term should be stemmed if its field is configured as a stemmed field (default is "true"). The "AND" operator itself has no internal API for whether its operands should be stemmed or not, but we can still annotate as such, because when the value of a given annotation is determined, the expression tree is followed from the term in question and up through its ancestors. Traversing the tree stops when a value is found (or there is nothing more to traverse). In other words, none of the terms in this example will be stemmed.

How annotations behave may be easier to understand by expressing a boolean query in the style of an S-expression:

(AND term1 term2 (OR term3 term4) (OR term5 (AND term6 term7)))
The annotation scopes would then be as follows, i.e. annotations on which elements will be checked when determining the settings for a given term:
term1: term1 itself, and the first AND.
term2: term2 itself, and the first AND.
term3: term3 itself, the first OR and the first AND.
term4: term4 itself, the first OR and the first AND.
term5: term5 itself, the second OR and the first AND.
term6: term6 itself, the second AND, the second OR and the first AND.
term7: term7 itself, the second AND, the second OR and the first AND.
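
The lookup rule (start at the term, walk up through its ancestors, stop at the first value found) can be modelled in a few lines of Python. This is a toy model of the description above, not Vespa's implementation; Node and resolve are hypothetical names, and term d is added only to show the default.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """A query tree item with its own annotations and a link to its parent."""
    name: str
    annotations: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        return child


def resolve(node: Node, key: str, default: Any = None) -> Any:
    """Return the nearest value for key, searching the node and then its ancestors."""
    current: Optional[Node] = node
    while current is not None:
        if key in current.annotations:
            return current.annotations[key]
        current = current.parent
    return default


# Tree for: ([{"stem": false}](foo contains "a" and bar contains "b"))
#           or foo contains ([{"stem": false}]"c")
root_or = Node("OR")
and_node = root_or.add(Node("AND", {"stem": False}))
term_a = and_node.add(Node("a"))
term_b = and_node.add(Node("b"))
term_c = root_or.add(Node("c", {"stem": False}))
term_d = root_or.add(Node("d"))  # extra, unannotated term for comparison

for term in (term_a, term_b, term_c, term_d):
    print(term.name, resolve(term, "stem", default=True))
```

Terms a and b pick up the value from the enclosing AND, c carries its own annotation, and the unannotated d falls back to the default.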

Query properties

The YQL variable syntax may be used to initialize words in phrases and as single terms. This alleviates the need for caring about quoting a term in YQL as well as URL quoting. The term will be used exactly as it is in the URL.

As an example, look at a query with a YQL argument, and the properties animal and syntaxExample:

yql=select%20%2A%20from%20sources%20%2A%20where%20foo%20contains%20%40animal%20and%20foo%20contains%20phrase%28%40animal%2C%20%40syntaxExample%2C%20%40animal%29%3B&animal=panda&syntaxExample=syntactic
This YQL expression will then access the query properties animal and syntaxExample and evaluate to:
select%20%2A%20from%20sources%20%2A%20where%20%28foo%20contains%20%22panda%22%20AND%20foo%20contains%20phrase%28%22panda%22%2C%20%22syntactic%22%2C%20%22panda%22%29%29%3B
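
A sketch of how the request parameters above might be assembled in Python follows. Only the standard library is used; the parameter values are the ones from the example, and the variable names are arbitrary.

```python
from urllib.parse import quote, urlencode

# The YQL statement references the properties by name with @,
# so the values themselves need no YQL quoting.
params = {
    "yql": "select * from sources * where foo contains @animal "
           "and foo contains phrase(@animal, @syntaxExample, @animal);",
    "animal": "panda",
    "syntaxExample": "syntactic",
}

# quote_via=quote encodes spaces as %20 instead of '+'.
query_string = urlencode(params, quote_via=quote)
print(query_string)
```

Because the terms travel as separate parameters, only URL quoting applies to them, which urlencode handles for every value.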

userInput()

userInput() is a robust way of mixing user input and a formal query. It allows controlling whether the user input is to be stemmed, lowercased, etc, but it also allows for controlling whether it should be treated as a raw string, whether it should simply be segmented or parsed as a query.

yql=select%20%2A%20from%20sources%20%2A%20where%20userInput%28%40animal%29%3B&animal=panda
Here, the userInput() function will access the query property "animal", and parse the property value as an "ALL" query, resulting in the following expression:
select%20%2A%20from%20sources%20%2A%20where%20default%20contains%20%22panda%22%3B
Now, if we changed the value of "animal" without changing the rest of the expression:
yql=select%20%2A%20from%20sources%20%2A%20where%20userInput%28%40animal%29%3B&animal=panda%20smokey
The result would be:
select%20%2A%20from%20sources%20%2A%20where%20default%20contains%20%22panda%22%20and%20default%20contains%20%22smokey%22%3B
Now, let's assume we want to combine multiple query properties and have a more complex expression as well:
yql=select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28userInput%28%40animal%29%20or%20userInput%28%40teddy%29%29%3B&animal=panda&teddy=bear%20roosevelt
The resulting YQL expression will be:
select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28default%20contains%20%22panda%22%20or%20%28default%20contains%20%22bear%22%20and%20default%20contains%20%22roosevelt%22%29%29%3B
Now, suppose we do not want the value of the "teddy" property to be treated as its own query segment; it should only be segmented with the linguistic libraries to get recall. We can do this by adding a "grammar" annotation to the userInput() call:
yql=select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28userInput%28%40animal%29%20or%20%5B%7B%22grammar%22%3A%20%22segment%22%7D%5DuserInput%28%40teddy%29%29%3B&animal=panda&teddy=bear%20roosevelt
Then, the linguistic library will split on space, and the resulting expression is:
select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28default%20contains%20%22panda%22%20or%20default%20contains%20phrase%28%22bear%22%2C%20%22roosevelt%22%29%29%3B
Instead of a variable reference, the userInput() function also accepts raw strings as arguments, but this would obviously not be suited for parametrizing the query from a query profile. It is mostly intended as a test feature.
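
The expansions shown in this section can be reproduced with a toy model. The sketch below only mimics the examples: splitting on whitespace stands in for the real query parser and linguistic libraries, which is a strong simplifying assumption, and expand_user_input is a hypothetical name.

```python
def expand_user_input(value: str, grammar: str = "all",
                      default_index: str = "default") -> str:
    """Toy expansion of userInput() for the grammars raw, segment and all."""
    if not value.strip():
        # Mirrors the default behavior of failing on empty input.
        raise ValueError("empty user input")
    if grammar == "raw":
        # The input is matched as one string, without any processing.
        return f'{default_index} contains "{value}"'
    terms = value.split()  # assumption: whitespace tokenization
    if grammar == "segment":
        if len(terms) == 1:
            return f'{default_index} contains "{terms[0]}"'
        quoted = ", ".join(f'"{t}"' for t in terms)
        return f"{default_index} contains phrase({quoted})"
    if grammar == "all":
        return " and ".join(f'{default_index} contains "{t}"' for t in terms)
    raise ValueError(f"grammar not covered by this sketch: {grammar}")


print(expand_user_input("panda"))
print(expand_user_input("panda smokey"))
print(expand_user_input("bear roosevelt", grammar="segment"))
```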

userInput() control annotations

Name: grammar
Default value: all
Values: raw, segment and all values accepted for the model.type argument in the search API.
Effect: How to parse the user input. "raw" will treat the user input as a string to be matched without any processing, "segment" will do a first pass through the linguistic libraries, while the rest of the values will treat the string as a query to be parsed. If query parsing fails, an error message will be returned.

Name: defaultIndex
Default value: default
Values: Any searchable field in the system's search definition.
Effect: Same as model.defaultIndex in the search API. If "grammar" is set to "raw" or "segment", this will be the field searched.

Name: language
Default value: Autodetect
Values: RFC 3066 language code
Effect: Language setting for the linguistics treatment of this userInput() call, also see model.language in the search API reference.

Name: allowEmpty
Default value: false
Values: Boolean true or false.
Effect: Whether to allow empty input for query parsing and search terms. If this is true, a NullItem instance is inserted in the proper place in the query tree. If "allowEmpty" is false, the query will fail if the user provided data can not be parsed or is empty.

In addition, other annotations, like stem or ranked, will take effect as normal.

Limitations of annotation inheritance when treating userInput() queries

The query parsing mechanism currently has certain limitations for propagating annotations. Therefore, for any value of grammar other than raw or segment, only the following annotations will take effect:

  • ranked
  • filter
  • stem
  • normalizeCase
  • accentDrop
  • usePositionData

userQuery()

userQuery() evaluates to the parsed user query, i.e. the HTTP API parameter named query (including the filter part, if this is available). The function userQuery represents where the heuristically parsed query is to be inserted as a sub-tree into the YQL query. In other words, this is not a string substitution: the user query is first parsed with any of the supported grammars, then the resulting tree is inserted into the corresponding place in the YQL query tree:

http://myhost.mydomain.com:8080/search/?query=abc%20def%20-ghi&type=all&yql=select%20%2A%20from%20sources%20%2A%20where%20vendor%20contains%20%22brick%20and%20mortar%22%20AND%20price%20%3C%2050%20AND%20userQuery%28%29%3B
Breakdown:
query abc def -ghi
type all
yql select * from sources * where vendor contains "brick and mortar" AND price < 50 AND userQuery();
The above example will in other words evaluate to a query where the numeric field price must have a value lower than 50, vendor must match the term brick and mortar, and the default index must contain the two terms abc and def while not containing the term ghi. The spaces in the vendor term will not be used to split this into several new terms by YQL. The string specified by the search will be used. Query transformers may convert the string at a later stage, but it is not necessary to do anything "special" to create a search term containing arbitrary characters.

Convert user queries to YQL

Run user queries through a container instance while setting tracelevel to 2 or higher. The parsed query expressed as YQL will be available in the trace info. It is also possible to do this programmatically, using the instance method com.yahoo.search.Query.yqlRepresentation() in the search API of the container.

YQL in query profiles

YQL requires quoting to be included in a URL. Since YQL is well suited to application logic, while not being intended for end users, a solution to this is storing the application's YQL queries into different query profiles. To add a default query profile, add search/query-profiles/default.xml to the application package:

<query-profile id="default">
  <field name="yql">select * from sources * where default contains "latest" or userQuery();</field>
</query-profile>
This will add latest as an OR term to all queries not having an explicit query profile parameter. Note that it is not necessary to URL-quote anything in the query profile files, as they operate independently of the HTTP parsing.

rank()

The first, and only the first, argument of the rank() function determines whether a document is a match, but all arguments are used for calculating rank score.

select%20%2A%20from%20sources%20%2A%20where%20rank%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B

Advanced functions

These advanced functions have recall behavior similar to OR, but only return a limited number of documents. For more information on these functions take a look at Advanced Search Operators.

Functions operating on a single field

wand(), dotProduct() and weightedSet() model very similar operations in the search core. The main function to use is wand(). These functions specify a weighted set in their argument which is matched against a single field in the backend.

wand()

wand() (aka Parallel Wand) implements the "Weak AND"/"Weighted AND" algorithm and is used for matching the weighted set in its argument against a single weighted set field. The best "targetNumHits" hits (according to the calculated dot product score) are returned from this function:

select%20%2A%20from%20sources%20%2A%20where%20%5B%20%7B%22scoreThreshold%22%3A%2013%2C%20%22targetNumHits%22%3A%207%7D%20%5Dwand%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B
wand() also allows numeric arguments. The search argument is then an array of arrays of length two. In each pair, the first number is the search term and the second is its weight:
select%20%2A%20from%20sources%20%2A%20where%20wand%28description%2C%20%5B%5B11%2C1%5D%2C%20%5B37%2C2%5D%5D%29%3B
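
Both argument forms can be generated from a Python dict of weights. The following is a sketch under the assumption that whitespace inside the map and array is not significant; wand_expr is a hypothetical helper name.

```python
import json
from typing import Mapping, Optional, Union


def wand_expr(field: str, weights: Mapping[Union[str, int], int],
              target_num_hits: Optional[int] = None,
              score_threshold: Optional[int] = None) -> str:
    """Build a wand() expression from string or integer tokens and their weights."""
    if all(isinstance(token, str) for token in weights):
        # String tokens are written as a map: {"a": 1, "b": 2}
        argument = json.dumps(dict(weights))
    elif all(isinstance(token, int) for token in weights):
        # Numeric tokens are written as [term, weight] pairs: [[11, 1], [37, 2]]
        argument = json.dumps([[token, w] for token, w in weights.items()])
    else:
        raise ValueError("tokens must be all strings or all integers")

    annotations = {}
    if score_threshold is not None:
        annotations["scoreThreshold"] = score_threshold
    if target_num_hits is not None:
        annotations["targetNumHits"] = target_num_hits
    prefix = f"[{json.dumps(annotations)}]" if annotations else ""
    return f"{prefix}wand({field}, {argument})"


print(wand_expr("description", {"a": 1, "b": 2}, target_num_hits=7, score_threshold=13))
print(wand_expr("description", {11: 1, 37: 2}))
```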

dotProduct()

dotProduct() calculates the dot product between the weighted set in its argument and a weighted set field in the document as its rank score contribution:

select%20%2A%20from%20sources%20%2A%20where%20dotProduct%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B

weightedSet()

When using weightedSet() to search a field, all tokens present in the searched field will be matched against the weighted set in the query. This means that using a weighted set to search a single-value attribute field will have similar semantics to using a normal term to search a weighted set field. The low-level matching information resulting from matching a document with a weighted set in the query will contain the weights of all the matched tokens in descending order. Each matched weight will be represented as a standard occurrence on position 0 in element 0.

select%20%2A%20from%20sources%20%2A%20where%20weightedSet%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B

Functions operating across multiple fields

weakAnd()

weakAnd() (aka Vespa Wand) also implements the "Weak AND"/"Weighted AND" algorithm, but unlike wand(), it accepts arbitrary word matches (across arbitrary fields) as arguments. Only a limited number of documents are returned for ranking (default is 100), but it does not guarantee to return the best k hits. This function can be seen as an optimized OR:

select%20%2A%20from%20sources%20%2A%20where%20weakAnd%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B

Optional Arguments

Both wand() and weakAnd() support the annotations "scoreThreshold", which is an integer giving the minimum rank score for hits to include, and "targetNumHits", which is the desired number of hits to produce from the function in question:

select%20%2A%20from%20sources%20%2A%20where%20%5B%7B%22scoreThreshold%22%3A%2041%2C%20%22targetNumHits%22%3A%207%7D%5DweakAnd%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B

nonEmpty()

nonEmpty takes as its only argument an arbitrary search expression. It will then perform a set of checks on that expression. If all the checks pass, the result is the same expression, otherwise the query will fail. The checks are as follows:

  1. No empty search term.
  2. No empty operators, like phrases without terms.
  3. No null markers (NullItem) from e.g. failed query parsing.
yql=select%20%2A%20from%20sources%20%2A%20where%20bar%20contains%20%22a%22%20and%20nonEmpty%28bar%20contains%20%22bar%22%20and%20foo%20contains%20%40foo%29%3B&foo=
Note how "foo" is empty in this case, which will force the query to fail. If "foo" contained a searchable term, the query would not have failed.

predicate()

predicate() is used to specify a predicate query - see predicate fields. It takes three arguments: the predicate field to search, a map of attributes, and a map of range attributes:

select%20*%20from%20sources%20*%20where%20predicate(predicate_field%2C%7B%22gender%22%3A%22Female%22%7D%2C%7B%22age%22%3A20L%7D)%3B
Due to a quirk in YQL parsing, one cannot specify an empty map. Use the number 0 instead:
select%20*%20from%20sources%20*%20where%20predicate(predicate_field%2C0%2C%7B%22age%22%3A20L%7D)%3B

Known issues

Query rewriting in searcher plug-ins: In applications where query rewriting is done using a searcher plug-in in the container, one must annotate searchers. The yql query property is handled by a searcher as well, so if annotating the custom searcher with @Before("ExternalYql") the data in the YQL query parameter will never be present. Conversely, to ensure always getting the query defined in yql, annotate with @After("ExternalYql").

Search in struct fields: Search in struct fields works in streaming search only. YQL does not support dots '.' in field names. Use a fieldset as a workaround, for example:

search mydoc {
    document mydoc {

        struct Recipient {
            field name type string { }
            field smtp type string { }
        }

        field from type Recipient {
            struct-field name { indexing: index | summary }
            struct-field smtp { indexing: index | summary }
        }
    }

    fieldset fromname {
        fields: from.name
    }
}
Then use fromname instead of from.name in queries, like:
select%20%2A%20from%20sources%20%2A%20where%20fromname%20contains%20%22thename%22%3B