Query Language Reference
Vespa accepts unstructured human input and structured queries for application logic separately, then combines them into a single data structure for execution. Human input is parsed heuristically, while application queries are formulated in YQL.
A query URL looks like:
http://myhost.mydomain.com:8080/search/?yql=select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22%3B
In other words, yql contains:
select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22%3B
This matches all documents where the field named text contains the word blues.
Quote (") and backslash (\) characters in text values must be escaped by a backslash.
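As a rough illustration of the escaping rule and the URL quoting used in the example above, the following Python sketch rebuilds that query URL. The helper names `escape_yql_string` and `build_query_url` are hypothetical and only serve this example; the host name and the YQL statement are taken from the text.

```python
from urllib.parse import quote


def escape_yql_string(value: str) -> str:
    """Backslash-escape backslashes and double quotes in a YQL text value."""
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query_url(base_url: str, field: str, text: str) -> str:
    """Return a search URL whose yql parameter matches `text` in `field`."""
    yql = f'select * from sources * where {field} contains "{escape_yql_string(text)}";'
    # safe="" percent-encodes every reserved character, including "*" and ";".
    return f"{base_url}?yql={quote(yql, safe='')}"


url = build_query_url("http://myhost.mydomain.com:8080/search/", "text", "blues")
print(url)
```

Escaping is applied to the text value before URL quoting, so a value containing a quote character cannot terminate the string literal early.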
select
select is the list of summary fields requested (a field with the "summary" index attribute). Vespa will hide other fields in the matching documents.
select%20price,isbn%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%3B
The above explicitly requests the fields "price" and "isbn" (from all sources). To request all fields, use an asterisk as field selection:
select%20*%20from%20sources%20%2A%20where%20title%20contains%20%22madonna%22%3B
from sources
from sources specifies which document sources to search. Example:
select%20%2A%20from%20music%20where%20title%20contains%20%22madonna%22%3B
searches in music documents. Search in:
all sources | select … from sources * where … |
a set of sources | select … from sources source1, source2 where … |
a single source | select … from source1 where … |
where
The where clause is a tree of operators:
numeric |
The following numeric operators are available: =, <, >, <=, >=, range(field, lower bound, upper bound)
where%20500%20%3E%3D%20price%3B
where%20range%28fieldname%2C%200%2C%205000000000L%29%3B
Numbers must be in the signed 32-bit range, or the string "Infinity"/"-Infinity". Input 64-bit signed numbers using L as suffix.
The interval is by default a closed interval. If the lower bound is exclusive, set the annotation "bounds" to "leftOpen". If the upper bound is exclusive, set the same annotation to "rightOpen". If both bounds are exclusive, set the annotation to "open".
The number operations support an extra annotation, the integer "hitLimit", which is used for capped range search. An alternative to using negative and positive values for "hitLimit" is to always use a positive number of hits (as a negative number of hits does not make much sense) and combine this with either of the boolean annotations "ascending" and "descending" (but not both). Then "[{"hitLimit": 38, "descending": true}]" is equivalent to setting it to -38, i.e. only populate with 38 hits, starting from the upper boundary in descending order.
Note that hitLimit limits the number of documents that are considered, so it is dangerous to use together with other filters. This is a very powerful optimisation that must be used with care. The set of documents to be considered is limited upfront, for further query evaluation, by selecting only the N best according to the range query and the hitLimit annotation. The hitLimit is not exact, but 'at least'. In addition, the optimisation only kicks in if the attribute has fast-search. It looks up the upper or lower bound of the range in the dictionary, then scans in ascending or descending order and selects entries until hitLimit is satisfied. You will get all documents for all the dictionary entries selected.
The weightedset field does not support filtering on weight. Solve this using the map type and sameElement query operator - see example. |
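The equivalence between a signed "hitLimit" and a positive "hitLimit" combined with "ascending" or "descending" can be sketched in Python as follows. The function name `hit_limit_annotation` is hypothetical, and rejecting zero is an assumption of this sketch, not documented behaviour.

```python
import json


def hit_limit_annotation(signed_hit_limit: int) -> str:
    """Render a signed hitLimit as a positive hitLimit plus a direction flag."""
    if signed_hit_limit == 0:
        # Assumption of this sketch: zero is treated as invalid input.
        raise ValueError("hitLimit must be non-zero")
    direction = "descending" if signed_hit_limit < 0 else "ascending"
    annotation = {"hitLimit": abs(signed_hit_limit), direction: True}
    # json.dumps renders Python's True as the lowercase "true" used in annotations.
    return f"[{json.dumps(annotation)}]"


print(hit_limit_annotation(-38))
```

For -38 this produces the annotation quoted in the text, and a positive value yields the same annotation with "ascending" set instead.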
boolean |
The boolean operator is: =
where%20alive%20%3D%20true%3B |
contains |
The right hand side argument of the contains operator is either a string literal, or a function, like phrase. contains is the basic building block for text matching. The kind of matching to be done depends on the field settings in the schema.
where%20title%20contains%20%22madonna%22%3B
The matched field must be an indexed field or attribute.
Fields inside structs are referenced using dot notation.
By default, the string will be tokenized to match the field(s) searched. Explicitly control tokenization by using annotations:
where%20title%20contains%20%28%5B%7B%22stem%22%3A%20false%7D%5D%22madonna%22%29%3B
Note the use of parentheses to control precedence. |
matches |
Regular expressions are supported using POSIX extended syntax, with the limitation that matching is case insensitive.
where%20title%20matches%20%22madonna%22%3B
This example matches madona, madonna, and any other spelling with one or more n's:
where%20title%20matches%20%22mado%5Bn%5D%2Ba%22%3B
Here you match any string starting with mad:
where%20title%20matches%20%22^mad%22%3B
Note: Only attribute
fields in documents that have | ||||||||||||||||||||
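To get a feel for the two patterns above, they can be tried out locally. This is only an approximation: Python's `re` module is not a POSIX extended implementation, although these particular patterns mean the same in both, and `re.IGNORECASE` stands in for the case insensitivity described above. The helper `matches` is a hypothetical name, and the unanchored search is an assumption suggested by the ^mad example.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Case-insensitive, unanchored regular expression match."""
    return re.search(pattern, value, re.IGNORECASE) is not None


# One or more "n" between "mado" and "a".
print(matches("mado[n]+a", "madona"))
print(matches("mado[n]+a", "Madonnna"))
# Anchored at the start of the string.
print(matches("^mad", "Madness"))
print(matches("^mad", "nomad"))
```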
userInput |
userInput() is a robust way of mixing user input and a formal query. It allows controlling whether the user input is to be stemmed, lowercased, etc., and also whether it should be treated as a raw string, simply be segmented, or be parsed as a query.
yql=select%20%2A%20from%20sources%20%2A%20where%20userInput%28%40animal%29%3B&animal=panda
Here, the userInput() function will access the query property "animal", and parse the property value as an "ALL" query, resulting in the following expression:
select%20%2A%20from%20sources%20%2A%20where%20default%20contains%20%22panda%22%3B
Now, if we changed the value of "animal" without changing the rest of the expression:
yql=select%20%2A%20from%20sources%20%2A%20where%20userInput%28%40animal%29%3B&animal=panda%20smokey
The result would be:
select%20%2A%20from%20sources%20%2A%20where%20default%20contains%20%22panda%22%20and%20default%20contains%20%22smokey%22%3B
Now, let's assume we want to combine multiple query properties and have a more complex expression as well:
yql=select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28userInput%28%40animal%29%20or%20userInput%28%40teddy%29%29%3B&animal=panda&teddy=bear%20roosevelt
The resulting YQL expression will be:
select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28default%20contains%20%22panda%22%20or%20%28default%20contains%20%22bear%22%20and%20default%20contains%20%22roosevelt%22%29%29%3B
Now, consider that we do not want the "teddy" property to be treated as its own query segment; it should only be segmented with the linguistic libraries to get recall. We can do this by adding a "grammar" annotation to the userInput() call:
yql=select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28userInput%28%40animal%29%20or%20%5B%7B%22grammar%22%3A%20%22segment%22%7D%5DuserInput%28%40teddy%29%29%3B&animal=panda&teddy=bear%20roosevelt
Then, the linguistic library will split on space, and the resulting expression is:
select%20%2A%20from%20sources%20%2A%20where%20range%28year%2C%201963%2C%202014%29%20and%20%28default%20contains%20%22panda%22%20or%20default%20contains%20phrase%28%22bear%22%2C%20%22roosevelt%22%29%29%3B
Instead of a variable reference, the userInput() function also accepts raw strings as arguments, but this is not suited for parametrizing the query from a query profile. It is mostly intended as a test feature.
userInput() control annotations:
In addition, other annotations, like stem or ranked, will take effect as normal. The query parsing mechanism currently has certain limitations for propagating annotations; therefore, for any value of grammar other than raw or segment, only the following annotations will take effect: |
userQuery |
userQuery() reads from model.queryString and parses the query using simple query language. If set, model.filter is combined with model.queryString before the parsing. The user query is first parsed, then the resulting tree is inserted into the corresponding place in the YQL query tree. Example:
query=abc def -ghi&
type=all&
yql=select * from sources * where vendor contains "brick and mortar" AND price < 50 AND userQuery();
This evaluates to a query where: |
rank |
The first, and only the first, argument of the rank() function determines whether a document is a match, but all arguments are used for calculating rank score.
where%20rank%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B |
dotProduct |
dotProduct calculates the dot product between the weighted set in the query and a weighted set field in the document as its rank score contribution:
where%20dotProduct%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B
The result is stored as a raw score. A normal use case is a collection of weighted tokens produced by an algorithm, to match against a corpus containing weighted tokens produced by another algorithm in order to implement personalized content exploration. Refer to multivalue query operators for a discussion of usage and examples. |
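The arithmetic behind dotProduct can be illustrated with a few lines of Python. This is a minimal sketch of the calculation only, not of how the search backend implements it; the function name and the document weights are made up for the example, while the query weights are the ones from the dotProduct example.

```python
def dot_product(query_weights: dict, field_weights: dict) -> int:
    """Sum the products of weights for tokens present on both sides."""
    return sum(
        weight * field_weights[token]
        for token, weight in query_weights.items()
        if token in field_weights
    )


query = {"a": 1, "b": 2}                 # weighted set from the query example
document = {"a": 10, "b": 20, "c": 30}   # hypothetical weighted set field value
score = dot_product(query, document)     # 1*10 + 2*20
print(score)
```

Tokens that occur on only one side contribute nothing, which is why "c" does not affect the score here.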
weightedSet |
When using weightedSet to search a field, all tokens present in the searched field will be matched against the weighted set in the query. This means that using a weighted set to search a single-value attribute field will have similar semantics to using a normal term to search a weighted set field. The low-level matching information resulting from matching a document with a weighted set in the query will contain the weights of all the matched tokens in descending order. Each matched weight will be represented as a standard occurrence on position 0 in element 0.
where%20weightedSet%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B
weightedSet has similar semantics to equiv, as it acts as a single term in the query. However, the restriction dictating that it contains a collection of weighted tokens directly enables specific back-end optimizations that improve performance for large sets of tokens compared to using the generic equiv or or operators. Refer to multivalue query operators for a discussion of usage and examples. |
wand |
wand can be used to search for documents where weighted tokens in a field match a subset of weighted tokens in the query. At the same time, it internally calculates the dot product between token weights in the query and the field. wand is guaranteed to return the top-k hits according to its internal dot product rank score. It is an operator that scales adaptively from or to and. wand optimizes the performance of using multiple threads per search in the backend, and is also called Parallel Wand.
wand also allows numeric arguments, in which case the search argument is an array of arrays of length two. In each pair, the first number is the search term and the second its weight:
where%20wand%28description%2C%20%5B%5B11%2C1%5D%2C%20%5B37%2C2%5D%5D%29%3B
Both wand and weakAnd support the annotation scoreThreshold, which is a double for wand and an integer for weakAnd. This threshold specifies the minimum rank score for hits to include. The targetHits annotation sets the wanted number of hits exposed to the real first-phase ranking function per content node. [Note: this parameter was previously named targetNumHits - the old variant still works for backwards compatibility until Vespa 8.] Both the wand and weakAnd operators expose the candidates that were evaluated to the first-phase, not only the top-k. By default, targetHits is 100. Note that total hit count becomes inaccurate when using wand/weakAnd. If additional second phase ranking with rerank-count is used, do not set targetHits less than the configured rank-profile's rerank-count.
where%20%5B%20%7B%22scoreThreshold%22%3A%200.13%2C%20%22targetHits%22%3A%207%7D%20%5Dwand%28description%2C%20%7B%22a%22%3A1%2C%20%22b%22%3A2%7D%29%3B
Refer to using wand for usage and examples. |
weakAnd |
weakAnd is sometimes called Vespa Wand. Unlike wand, it accepts arbitrary word matches (across arbitrary fields) as arguments. Only a limited number of documents are returned for ranking (default is 100), but it does not guarantee to return the best k hits. This function can be seen as an optimized or:
where%20weakAnd%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B
Both wand and weakAnd support the annotation scoreThreshold, which is a double for wand and an integer for weakAnd. This threshold specifies the minimum rank score for hits to include. The targetHits annotation sets the wanted number of hits exposed to the real first-phase ranking function per content node. [Note: this parameter was previously named targetNumHits - the old variant still works for backwards compatibility until Vespa 8.] Both the wand and weakAnd operators expose the candidates that were evaluated to the first-phase, not only the top-k. By default, targetHits is 100. Note that total hit count becomes inaccurate when using wand/weakAnd.
where%20%5B%7B%22scoreThreshold%22%3A%200%2C%20%22targetHits%22%3A%207%7D%5DweakAnd%28a%20contains%20%22A%22%2C%20b%20contains%20%22B%22%29%3B
Unlike wand, weakAnd can be used to search across several fields of various types, but it does NOT guarantee to return the top-k best hits. It can however be combined with any ranking expression. Keep in mind that this expression should correlate with its simple internal ranking score, which uses query term weight and inverse document frequency for matching terms. Refer to using wand for usage and examples. |
geoLocation |
geoLocation matches a position inside a geographical circle, specified as latitude, longitude, and a maximum distance (radius). Example:
where%20geoLocation%28myfieldname%2C%2063.5%2C%2010.5%2C%20%22200%20km%22%29%3B
In this example we search for documents near 63.5° north, 10.5° east, and within a 200 km radius. So a document with a "myfieldname" position in Trondheim, Norway at N63°25'47;E10°23'36 would match.
The first parameter is the name of the attribute field. The second parameter is the latitude (positive for north, negative for south). The third parameter is the longitude (positive for east, negative for west). The fourth parameter must be a string specifying the radius and its units, where the supported units include "km", "m" (for meters), "miles", and "deg" for degrees (so "deg" gives radius the same units as latitude). Any negative number for radius (e.g. "-1 m") is interpreted as an "infinite" radius, letting any geographical position at all match the geoLocation operator.
The position attribute in the schema could look like:
field myfieldname type position { indexing: attribute | summary }
Arrays of positions are also possible:
field myfieldname type array<position>
Only the "label" annotation is currently supported for geoLocation. |
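The claim that the Trondheim position falls inside the 200 km circle of the geoLocation example can be checked with a small Python sketch. The haversine formula and the mean Earth radius used here are assumptions of this sketch; the distance calculation used by the search backend may differ. The helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def dms_to_degrees(degrees: int, minutes: int, seconds: int) -> float:
    """Convert degrees/minutes/seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Trondheim at N63°25'47 E10°23'36, query centre at 63.5 north, 10.5 east.
trondheim_lat = dms_to_degrees(63, 25, 47)
trondheim_lon = dms_to_degrees(10, 23, 36)
distance_km = haversine_km(63.5, 10.5, trondheim_lat, trondheim_lon)
within_radius = distance_km <= 200
print(round(distance_km, 1), within_radius)
```

Under these assumptions the distance comes out at roughly ten kilometres, well inside the 200 km radius.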
nearestNeighbor |
nearestNeighbor matches the top-k nearest neighbors in a multi-dimensional vector space. Points in the vector space are specified as tensors with one indexed dimension, where the size of that dimension is equal to the dimensionality of the vector space. The document positions are stored in a tensor attribute, and the query position is sent with the query request. Euclidean distance is used as the default distance metric and the exact nearest neighbors are returned. If an HNSW index is specified on the tensor, the approximate nearest neighbors are returned instead. Example:
where%20%5B%7B%22targetHits%22%3A%2010%7D%5D%20nearestNeighbor%28doc_vector%2C%20query_vector%29%3B&ranking.features.query%28query_vector%29=%5B3%2C5%2C7%5D
In this example we search for the top 10 nearest neighbors in a 3-dimensional vector space. targetHits specifies the wanted top-k nearest neighbors to find. This parameter is required.
The first parameter of nearestNeighbor is the name of the tensor attribute containing the document positions (doc_vector). The second parameter is the name of the tensor sent with the query request (query_vector). Specifying query_vector as the name means the query request must set this tensor as ranking.features.query(query_vector). The document tensor attribute is defined as follows:
field doc_vector type tensor<float>(x[3]) { indexing: attribute | summary }
The last part of the YQL example specifies the query tensor, see defining query feature types. This must have the same type as the document tensor. See Nearest Neighbor Search and Approximate Nearest Neighbor Search using HNSW Index for more detailed examples.
These annotations are supported: |
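What "exact nearest neighbors by Euclidean distance" means for the nearestNeighbor operator can be sketched as a brute-force search in Python. This only illustrates the semantics under the default distance metric; it says nothing about how the backend or an HNSW index computes the result. The function name and the document vectors are hypothetical, while the query vector [3, 5, 7] comes from the example.

```python
import heapq
import math


def nearest_neighbors(query_vector, doc_vectors, target_hits):
    """Return the ids of the target_hits documents closest to query_vector."""
    def distance(item):
        _doc_id, vector = item
        return math.dist(query_vector, vector)  # Euclidean distance

    closest = heapq.nsmallest(target_hits, doc_vectors.items(), key=distance)
    return [doc_id for doc_id, _vector in closest]


# Hypothetical 3-dimensional document vectors, matching tensor<float>(x[3]).
docs = {
    "doc1": [3.0, 5.0, 7.0],
    "doc2": [0.0, 0.0, 0.0],
    "doc3": [3.0, 5.0, 8.0],
}
top = nearest_neighbors([3.0, 5.0, 7.0], docs, target_hits=2)
print(top)
```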
nonEmpty |
nonEmpty takes as its only argument an arbitrary search expression. It will then perform a set of checks on that expression. If all the checks pass, the result is the same expression, otherwise the query will fail. The checks are as follows:
yql=select%20%2A%20from%20sources%20%2A%20where%20bar%20contains%20%22a%22%20and%20nonEmpty%28bar%20contains%20%22bar%22%20and%20foo%20contains%20%40foo%29&foo=
Note how "foo" is empty in this case, which will force the query to fail. If "foo" contained a searchable term, the query would not have failed. |
predicate |
predicate() specifies a predicate query - see predicate fields. It takes three arguments: the predicate field to search, a map of attributes, and a map of range attributes:
where%20predicate(predicate_field%2C%7B%22gender%22%3A%22Female%22%7D%2C%7B%22age%22%3A20L%7D)%3B
Due to a quirk in YQL parsing, one cannot specify an empty map; use the number 0 instead.
where%20predicate(predicate_field%2C0%2C%7B%22age%22%3A20L%7D)%3B |
order by
Sort using order by. Add asc or desc after the name of an attribute to set sort order - ascending order is default.
where%20title%20contains%20%22madonna%22%20order%20by%20price%20asc%2C%20releasedate%20desc%3B
Sorting function, locale and strength are defined using the annotations "function", "locale" and "strength", as in:
where%20title%20contains%20%22madonna%22%20order%20by%20%5B%7B%22function%22%3A%20%22uca%22%2C%20%22locale%22%3A%20%22en_US%22%2C%20%22strength%22%3A%20%22IDENTICAL%22%7D%5Dother%20desc%2C%20%5B%7B%22function%22%3A%20%22lowercase%22%7D%5Dsomething%3B
Note: match-phase is enabled when sorting - refer to the sorting reference.
limit / offset
To specify a slice, limit the number of hits returned, or do pagination, use limit and/or offset:
where%20title%20contains%20%22madonna%22%20limit%2031%20offset%2029%3B
The above will return two hits (if there are sufficiently many hits matching the query), skipping the 29 first documents.
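Since the example above returns two hits for limit 31 and offset 29, limit behaves like an end position rather than a hit count. A minimal Python sketch of that slicing, with a hypothetical `page` helper and integers standing in for hits:

```python
def page(hits, limit, offset=0):
    """Return the hits from position offset up to, but not including, limit."""
    return hits[offset:limit]


ranked_hits = list(range(100))  # stand-in for 100 ranked hits, best first
print(page(ranked_hits, limit=31, offset=29))
```

With fewer than 31 matching hits the slice simply returns whatever remains after the offset.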
timeout
Set query timeout in milliseconds using timeout:
where%20title%20contains%20%22madonna%22%20timeout%2070%3B
Only literal numbers are valid, i.e. setting another unit is not supported.
Annotations
Terms and phrases can be annotated to manipulate the behavior.
Add an annotation using [], like:
where%20text%20contains%20%28%5B%20%7B%22distance%22%3A%205%7D%20%5Dnear%28%22a%22%2C%20%22b%22%29%29%3B
Annotations supported by strings
These annotations are supported by the string arguments to functions like phrase() and near(), and also by the string argument to the "contains" operator.
"nfkc": true|false | NFKC normalization. Default on. |
"implicitTransforms": true|false | Implicit term transformations (field defaults), default on. If implicitTransforms is active, the settings for the field in the schema will be honored in term transforms, e.g. if the field has stemming, this term will be stemmed. If implicitTransforms are turned off, the search backend will receive the term exactly as written in the initial YQL expression. This is in other words a top level switch to turn off all other stemming, accent removal, Unicode normalizations and so on. |
"annotations": { "string": "string" } | Custom term annotations. This is by default empty. |
"origin": { "original": "string", "offset": int, "length": int } | The (sub-)string which produced this term. Default unset. |
"usePositionData": true|false | Use position data for ranking algorithm. Default true. This is term position, not to be confused with geo searches |
"stem": true|false | Stem this term if it is the setting for this field, default on. |
"normalizeCase": true|false | Normalize casing of this term if it is the setting for this field, default on. |
"accentDrop": true|false | Remove accents from this term if it is the setting for this field, default on. |
"andSegmenting": true|false | Force phrase or AND operator if re-segmenting (e.g. in stemming) this term results in multiple terms. Default is choosing from language settings. |
"prefix": true|false | Do prefix matching for this word. Default false. ("Search for "word*".") |
"suffix": true|false | Do suffix matching for this word. Default false. ("Search for "*word".") |
"substring": true|false | Do substring matching for this word if available in the index. Default false. ("Search for "*word*".") Only supported for streaming search. |
Annotations supported by strings and functions
These annotations are supported by strings and by the functions which are handled like leaf nodes internally in the query tree: phrase(), near(), onear(), range(), equiv(), dotProduct(), weightedSet(), weakAnd(), wand() and nearestNeighbor().
"id": int | Unique ID used for e.g. connectivity. |
"connectivity": { "id": int, "weight": double } | Map with the ID and weight of the explicit connectivity of this item. |
"significance": double | Significance value for ranking. |
"annotations": { "string": "string" } | Custom annotations. No special semantics inside the YQL layer. |
"filter": true|false | Regard this term as a "filter" term. Default false. |
"ranked": true|false | Include this term for ranking calculation. Default true. |
"label": "string" | Label for referring to this term during ranking. |
"weight": int | Term weight (default 100), used in some ranking calculations.
where%20title%20contains%20(%5B%7B"weight"%3A200%7D%5D"heads")%20and%20album%20contains%20"tails"%3B |
Annotations of sub-expressions
Consider the following query:
select%20%2A%20from%20sources%20%2A%20where%20%28%5B%7B%22stem%22%3A%20false%7D%5D%28foo%20contains%20%22a%22%20and%20bar%20contains%20%22b%22%29%29%20or%20foo%20contains%20%28%5B%7B%22stem%22%3A%20false%7D%5D%22c%22%29%3B
The "stem" annotation controls whether a given term should be stemmed if its field is configured as a stemmed field (default is "true"). The "AND" operator itself has no internal API for whether its operands should be stemmed or not, but we can still annotate as such, because when the value of a given annotation is determined, the expression tree is followed from the term in question and up through its ancestors. Traversing the tree stops when a value is found (or there is nothing more to traverse). In other words, none of the terms in this example will be stemmed.
How annotations behave may be easier to understand by expressing a boolean query in the style of an S-expression:
(AND term1 term2 (OR term3 term4) (OR term5 (AND term6 term7)))
The annotation scopes would then be as follows, i.e. the annotations on these elements will be checked when determining the settings for a given term:
term1 | term1 itself, and the first AND |
term2 | term2 itself, and the first AND |
term3 | term3 itself, the first OR and the first AND |
term4 | term4 itself, the first OR and the first AND |
term5 | term5 itself, the second OR and the first AND |
term6 | term6 itself, the second AND, the second OR and the first AND |
term7 | term7 itself, the second AND, the second OR and the first AND |
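The lookup described above, starting at the term and walking up through its ancestors until a value is found, can be sketched in Python. The `Node` class, the `resolve` function and the annotation values placed on the tree are all hypothetical and only mirror the second OR branch of the S-expression; this is not the query tree API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    annotations: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        """Attach child below this node and return the child."""
        child.parent = self
        return child


def resolve(node: Node, key: str, default):
    """Walk from node up through its ancestors; the first value found wins."""
    current = node
    while current is not None:
        if key in current.annotations:
            return current.annotations[key]
        current = current.parent
    return default


# (AND ... (OR term5 (AND term6 term7))) with made-up "stem" annotations.
first_and = Node("first AND", {"stem": False})
second_or = first_and.add(Node("second OR"))
term5 = second_or.add(Node("term5"))
second_and = second_or.add(Node("second AND", {"stem": True}))
term6 = second_and.add(Node("term6"))

print(resolve(term5, "stem", True))  # found on the first AND
print(resolve(term6, "stem", True))  # found on the second AND, which is closer
```

The closest annotated ancestor decides, so an annotation on an inner operator overrides one set further up the tree.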
Query properties
Use YQL variable syntax to initialize words in phrases and as single terms. This removes the need for caring about quoting a term in YQL, as well as URL quoting. The term will be used exactly as it is in the URL. As an example, look at a query with a YQL argument, and the properties animal and syntaxExample:
yql=select%20%2A%20from%20sources%20%2A%20where%20foo%20contains%20%40animal%20and%20foo%20contains%20phrase%28%40animal%2C%20%40syntaxExample%2C%20%40animal%29%3B&animal=panda&syntaxExample=syntactic
This YQL expression will then access the query properties animal and syntaxExample and evaluate to:
select%20%2A%20from%20sources%20%2A%20where%20%28foo%20contains%20%22panda%22%20AND%20foo%20contains%20phrase%28%22panda%22%2C%20%22syntactic%22%2C%20%22panda%22%29%29%3B
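A request like the one above can be assembled with Python's standard library. This is a sketch of the client side only; the variable names are arbitrary, and `quote_via=quote` is chosen so that spaces are encoded as %20 as in the examples in this document.

```python
from urllib.parse import quote, urlencode

params = {
    "yql": "select * from sources * where foo contains @animal "
           "and foo contains phrase(@animal, @syntaxExample, @animal);",
    # Property values are sent as separate parameters and need no YQL quoting.
    "animal": "panda",
    "syntaxExample": "syntactic",
}
query_string = urlencode(params, quote_via=quote)
print(query_string)
```

Changing the term only requires changing the animal parameter; the YQL statement itself stays untouched.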
YQL in query profiles
YQL requires quoting to be included in a URL. Since YQL is well suited to application logic, while not being intended for end users, a solution to this is storing the application's YQL queries in different query profiles. To add a default query profile, add search/query-profiles/default.xml to the application package:
<query-profile id="default">
  <field name="yql">select * from sources * where default contains "latest" or userQuery();</field>
</query-profile>
This will add latest as an OR term to all queries not having an explicit query profile parameter. The important thing to note is that it is not necessary to URL-quote anything in the query profile files. They operate independently of the HTTP parsing as such.
Query rewriting in Searchers
Searchers which modify the textual YQL statement (not recommended) should be annotated with @Before("ExternalYql"). Searchers modifying the query tree produced from an input YQL statement should be annotated with @After("ExternalYql").
Grouping
Group / aggregate results by adding a grouping expression after a | (pipe):
select%20*%20from%20sources%20*%20where%20sddocname%20contains%20%27purchase%27%20%7C%20all(group(customer)%20each(output(sum(price))))%3B