Vespa accepts unstructured human input and structured queries for application logic separately, then combines them into a single data structure for execution. Human input is parsed heuristically, while application queries are formulated in YQL.
A query URL looks like:
http://myhost.mydomain.com:8080/search/?yql=select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22
In other words, yql contains:
select * from sources * where text contains "blues"
This matches all documents where the field named text contains the word blues.
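As a minimal sketch of how such a URL can be produced, the following Python snippet percent-encodes a YQL string with the standard library. The helper name `build_query_url` is hypothetical; the host and YQL statement are taken from the example above.

```python
from urllib.parse import quote


def build_query_url(base_url: str, yql: str) -> str:
    """Return a search URL with the YQL statement percent-encoded."""
    # safe="" forces encoding of every reserved character,
    # including spaces, '*' and double quotes.
    return f"{base_url}?yql={quote(yql, safe='')}"


url = build_query_url(
    "http://myhost.mydomain.com:8080/search/",
    'select * from sources * where text contains "blues"',
)
print(url)
```

Encoding with `safe=""` keeps the statement intact regardless of which characters it contains, at the cost of a less readable URL.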
Quote (") and backslash (\) characters in text values must be escaped by a backslash; also see how backslash escapes work.
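The escaping rule can be sketched in Python as below. This is an illustration under the assumption that only the two characters named above need escaping; the helper names `escape_yql_string` and `contains_clause` are hypothetical and not part of any Vespa client library.

```python
def escape_yql_string(value: str) -> str:
    """Escape backslashes and double quotes for use inside a YQL string literal."""
    # Backslashes are escaped first, so the backslashes added
    # in front of quotes are not escaped a second time.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def contains_clause(field: str, value: str) -> str:
    """Build a `field contains "value"` expression with the value escaped."""
    return f'{field} contains "{escape_yql_string(value)}"'


# The raw value is: say "hi" \ bye
clause = contains_clause("text", 'say "hi" \\ bye')
print(clause)
```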
Searching for null or NaN values is not supported. Work around this by using a "magic" value (like MAXINT) that is not normally used in the documents.
Since Vespa 7.520.3, YQL queries do not require a semicolon at the end.
select is the list of summary fields requested (a field with the summary index attribute). Vespa will hide other fields in the matching documents.
select price,isbn from sources * where title contains "madonna"
The above explicitly requests the fields "price" and "isbn" (from all sources). To request all fields, use an asterisk as field selection:
select * from sources * where title contains "madonna"
from sources specifies which content sources to query. Example:
select * from music where title contains "madonna"
queries all document types in the music content cluster or federation source. Query in:
all sources | select … from sources * where … |
a set of sources | select … from sources source1, source2 where … |
a single source | select … from source1 where … |
In other words, sources is used for querying some/all sources. If only a single source is queried, the sources keyword is dropped. To restrict the query to only one schema (aka document type) use the model.restrict URL parameter. Also see federation.
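The three forms listed above can be generated from a list of source names. The following is a small sketch; the function name `from_clause` is hypothetical.

```python
from typing import Sequence


def from_clause(sources: Sequence[str] = ()) -> str:
    """Return the YQL from-clause for zero, one or several named sources."""
    if not sources:
        # No explicit sources: query all of them.
        return "from sources *"
    if len(sources) == 1:
        # A single source drops the `sources` keyword.
        return f"from {sources[0]}"
    return "from sources " + ", ".join(sources)


print(from_clause())
print(from_clause(["music"]))
print(from_clause(["source1", "source2"]))
```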
The where clause is a tree of operators:
numeric |
The following numeric operators are available:
where 500 >= price
where range(fieldname, 0, 5000000000L)
Numbers must be in the signed 32-bit range.
Input 64-bit signed numbers using the L suffix, as in 5000000000L above.
For the range operator, Infinity can be used as a bound: where (range(year, 2000, Infinity))
The weightedset field does not support filtering on weight. Solve this using the map type and sameElement query operator - see example. | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boolean |
The boolean operator is: where alive = true | ||||||||||||||||||||||
contains |
The right-hand side argument of the contains operator is either a string literal,
or a function, like
where title contains "madonna"
The matched field must be an indexed field or attribute.
Fields inside structs are referenced using dot notation -
e.g |
||||||||||||||||||||||
and |
where title contains "madonna" and title contains "saint" |
||||||||||||||||||||||
or |
where title contains "madonna" or title contains "saint" |
||||||||||||||||||||||
not |
Use the ! operator to negate an expression: where title contains "madonna" and !(title contains "saint") |
||||||||||||||||||||||
phrase |
Phrases are expressed as a function: where text contains phrase("st", "louis", "blues") |
||||||||||||||||||||||
near |
|
||||||||||||||||||||||
onear |
|
||||||||||||||||||||||
sameElement |
sameElement() is an operator that requires the terms to match within the same struct element in an array or a map field. Example:
struct person {
    field first_name type string {}
    field last_name type string {}
    field year_of_birth type int {}
}
field persons type array<person> {
    indexing: summary
    struct-field first_name { indexing: attribute }
    struct-field last_name { indexing: attribute }
    struct-field year_of_birth { indexing: attribute }
}
field identities type map<string, person> {
    indexing: summary
    struct-field key { indexing: attribute }
    struct-field value.first_name { indexing: attribute }
    struct-field value.last_name { indexing: attribute }
    struct-field value.year_of_birth { indexing: attribute }
}
With normal AND, the query terms may be matched by different elements of the array. sameElement restricts the match to a single element:
where persons contains sameElement(first_name contains 'Joe', last_name contains 'Smith', year_of_birth < 1940)
The above returns all documents containing Joe Smith born before 1940 in the persons array.
Searching in a map is similar to searching in an array of struct. The difference is that there is an extra synthetic struct with the field members key and value. The above example with the identities map looks like this:
where identities contains sameElement(key contains 'father', value.first_name contains 'Joe', value.last_name contains 'Smith', value.year_of_birth < 1940)
The above returns all documents that have tagged Joe Smith born before 1940 as a 'father'. The important point is using the indirection of key and value to address the keys and the values of the map. |
||||||||||||||||||||||
equiv |
If two terms in the same field should give exactly the same behavior when matched,
the where fieldName contains equiv("A","B") In many cases, the OR operator will give the same results as an EQUIV. The matching logic is exactly the same, and an OR does not have the limitations that EQUIV does (below). The difference is in how matches are visible to ranking functions. All words that are children of an OR count for ranking. When using an EQUIV however, it looks like a single word:
Limitations on how
Learn how to use equiv. |
||||||||||||||||||||||
uri |
Used to search for urls indexed using the uri field type. where myUrlField contains uri("vespa.ai/foo") Various subfields are supported to search components of the URL, see the field type definition.
|
||||||||||||||||||||||
fuzzy |
Levenshtein edit distance search within a string attribute. where myStringAttribute contains ({prefixLength:1, maxEditDistance:2}fuzzy("parantesis")) Annotations below are configuring
Important:
Only string attribute
fields in documents are supported (single, array or weightedset).
It is also not optimized, but setting prefixLength greater than 0
would narrow the match for the fast-search,
so it won't run a full scan.
|
||||||||||||||||||||||
matches |
Regular expressions are supported using POSIX extended syntax, with the limitation that matching is case insensitive.
Replace This example becomes a substring search: where attribute_field matches "madonna" This example matches both where attribute_field matches "mado[n]+a" Here you match any string starting with where attribute_field matches "^mad" |
||||||||||||||||||||||
userInput |
userInput() is a robust way of mixing user input and a formal query. It allows controlling whether the user input is to be stemmed, lowercased, etc., and also whether it should be treated as a raw string, simply segmented, or parsed as a query.
yql=select * from sources * where userInput(@animal)&animal=panda
Here, the userInput() function will access the query property "animal", and parse the property value as a weakAnd query, resulting in the following expression:
select * from sources * where weakAnd(default contains "panda")
Find a full example in the query API guide. Instead of parameter substitution, the userInput() function also accepts raw strings as arguments, but this is not suited for parametrizing the query from a query profile. It is mostly intended as a test feature.
In addition, other annotations, like stem or ranked, will take effect as normal. |
||||||||||||||||||||||
userQuery |
userQuery() reads from model.queryString and parses the query using simple query language. If set, model.filter is combined with model.queryString before the parsing. The user query is first parsed, then the resulting tree is inserted into the corresponding place in the YQL query tree. Example: query=abc def -ghi type=all yql=select * from sources * where vendor contains "brick and mortar" AND price < 50 AND userQuery() This evaluates to a query where:
|
||||||||||||||||||||||
rank |
The first, and only the first, argument of the rank() function determines whether a document is a match, but all arguments are used for calculating rank score. where rank(a contains "A", b contains "B") |
||||||||||||||||||||||
dotProduct |
dotProduct calculates the dot product between the weighted set in the query and a weighted set field in the document as its rank score contribution: where dotProduct(description, {"a":1, "b":2}) The result is stored as a raw score. A normal use case is a collection of weighted tokens produced by an algorithm, to match against a corpus containing weighted tokens produced by another algorithm in order to implement personalized content exploration. See example usage of dotProduct in practical performance guide . Refer to multivalue query operators for a discussion of usage and examples.
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
|
||||||||||||||||||||||
weightedSet |
When using weightedSet to search a field, all tokens present in the searched field will be matched against the weighted set in the query. This means that using a weighted set to search a single-value attribute field will have similar semantics to using a normal term to search a weighted set field. The low-level matching information resulting from matching a document with a weighted set in the query will contain the weights of all the matched tokens in descending order. Each matched weight will be represented as a standard occurrence on position 0 in element 0.
where weightedSet(description, {"a":1, "b":2})
weightedSet has similar semantics to equiv, as it acts as a single term in the query. However, the restriction that it contains a collection of weighted tokens directly enables specific back-end optimizations that improve performance for large sets of tokens compared to using the generic equiv or or operators. Refer to multivalue query operators for a discussion of usage and examples. Also see multi-lookup set filtering.
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
|
||||||||||||||||||||||
wand |
Note that total hit count becomes inaccurate when using wand.
where wand(description, [[11,1], [37,2]])
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
where ({scoreThreshold: 0.13, targetHits: 7}wand(description, {"a":1, "b":2})) Refer to using wand for introduction to the WAND algorithm and example usage of wand in practical performance guide .
|
||||||||||||||||||||||
weakAnd |
where weakAnd(a contains "A", b contains "B")
where ({scoreThreshold: 0, targetHits: 7}weakAnd(a contains "A", b contains "B"))
Unlike wand, Refer to using wand for a usage and examples.
|
||||||||||||||||||||||
geoLocation |
where geoLocation(myfieldname, 63.5, 10.5, "200 km")
In this example we search for documents near 63.5° north, 10.5° east, and within a 200 km radius. So a document with a "myfieldname" position in Trondheim, Norway at N63°25'47;E10°23'36 would match. The first parameter is the name of the attribute field. The second parameter is the latitude (positive for north, negative for south). The third parameter is the longitude (positive for east, negative for west). The fourth parameter must be a string specifying the radius and its units, where the supported units include "km", "m" (for meters), "miles", and "deg" for degrees (so "deg" gives radius the same units as latitude). Any negative number for radius (e.g. "-1 m") is interpreted as an "infinite" radius, letting any geographical position at all match the geoLocation operator.
The position attribute in the schema could look like:
field myfieldname type position { indexing: attribute | summary }
Arrays of positions are also possible:
field myfieldname type array<position> { indexing: attribute }
Properties:
|
||||||||||||||||||||||
nearestNeighbor |
Euclidean distance is used as the default distance metric and the exact nearest neighbors are returned. When storing multiple vectors per document, the vector that is closest to the query vector is used when calculating the distance between the document and the query. If a HNSW index is specified on the tensor field, the approximate nearest neighbors are returned instead. Example: where ({targetHits: 10}nearestNeighbor(doc_vector, query_vector))&input.query(query_vector)=[3,5,7] In this example we search for the top 10 nearest neighbors in a 3-dimensional vector space. targetHits specifies the wanted top-k nearest neighbors to find. This parameter is required. The first parameter of nearestNeighbor is the name of the tensor field attribute containing the document vectors (doc_vector). The second parameter is the name of the tensor sent with the query request (query_vector). Specifying query_vector as the name means the query request must set this tensor as input.query(query_vector). The document tensor field attribute is defined as follows: field doc_vector type tensor<float>(x[3]) { indexing: attribute | summary } The last part of the YQL example specifies the query tensor, see defining query feature types This must have the same type as the document tensor. See Nearest Neighbor Search, Approximate Nearest Neighbor Search using HNSW Index and Nearest Neighbor Search Guide for more detailed examples.
Properties:
|
||||||||||||||||||||||
nonEmpty |
nonEmpty takes as its only argument an arbitrary search expression. It will then perform a set of checks on that expression. If all the checks pass, the result is the same expression, otherwise the query will fail. The checks are as follows:
yql=select * from sources * where bar contains "a" and nonEmpty(bar contains "bar" and foo contains @foo)&foo= Note how "foo" is empty in this case, which will force the query to fail. If "foo" contained a searchable term, the query would not have failed. |
||||||||||||||||||||||
predicate |
predicate() specifies a predicate query - see predicate fields. It takes three arguments: the predicate field to search, a map of attributes, and a map of range attributes: where predicate(predicate_field,{"gender":"Female"},{"age":20L}) Due to a quirk in YQL-parsing, one cannot specify an empty map, use the number 0 instead. where predicate(predicate_field,0,{"age":20L}) |
||||||||||||||||||||||
true |
Matches all documents of any type. Care must be taken when using this since processing all documents as matches is expensive. At minimum, consider restricting to only one schema where you know the corpus isn't too big, see the model.restrict URL parameter. |
||||||||||||||||||||||
false |
Does not match any document at all. Not useful in itself, but could potentially be used as a placeholder in the query tree. |
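To make the sameElement semantics from the operator table concrete, the following Python sketch contrasts per-field AND matching with same-element matching over an array of structs. It is a conceptual model only, not how Vespa evaluates queries; the sample data and function names are hypothetical.

```python
# A document with a `persons` array of two struct elements.
persons = [
    {"first_name": "Joe", "last_name": "Brown", "year_of_birth": 1930},
    {"first_name": "Ann", "last_name": "Smith", "year_of_birth": 1950},
]


def matches_plain_and(elements) -> bool:
    """Each condition may be satisfied by a different element."""
    return (
        any(e["first_name"] == "Joe" for e in elements)
        and any(e["last_name"] == "Smith" for e in elements)
        and any(e["year_of_birth"] < 1940 for e in elements)
    )


def matches_same_element(elements) -> bool:
    """All conditions must be satisfied by one and the same element."""
    return any(
        e["first_name"] == "Joe"
        and e["last_name"] == "Smith"
        and e["year_of_birth"] < 1940
        for e in elements
    )


print(matches_plain_and(persons))     # conditions met across elements
print(matches_same_element(persons))  # no single element meets all three
```

In this model the document is a false positive for the plain AND, since there is no Joe Smith born before 1940 in the array, while the same-element check rejects it.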
Sort using order by. Add asc or desc after the name of an attribute to set sort order - ascending order is default. Add more sorting attributes to get a secondary sort that acts as a tiebreaker for the primary ordering attribute. This is typically used to get a predictable ordering when the primary ordering attribute has the same value for multiple documents.
where title contains "madonna" order by price asc, releasedate desc
Sorting function, locale and strength are defined using the annotations "function", "locale" and "strength", as in:
where title contains "madonna" order by {function: "uca", locale: "en_US", strength: "IDENTICAL"}other desc, {function: "lowercase"}something
The rank profile determines the rank score each document will get. Results are ordered by that value by default, but order by overrides that ordering.
Vespa does not optimize away the rank score computation in this case; it is still executed, even if the model score is thrown away. Use the built-in rank-profile unranked for optimal performance of sorting queries.
To do a primary ordering on the rank score, and a secondary sort on an attribute, use '[relevance]' as the first order by attribute. See Special sorting attributes for more details.
Annotation | Effect |
---|---|
function | Sort function, default UCA. |
locale | Locale identifier for the UCA sort function. |
strength | Strength setting for the UCA sort function. |
To specify a slice / limit the number of hits returned / do pagination, use limit and/or offset. This can also be controlled by using native execution parameters.
Example: This returns two hits (if there are sufficiently many hits matching the query), skipping the first 29 documents:
where title contains "madonna" limit 31 offset 29
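The example implies that limit is the end position of the slice rather than the number of hits, so the hit count is limit minus offset. A small sketch of deriving both values from an offset and a page size follows; the helper name `page_clause` is hypothetical.

```python
def page_clause(offset: int, hits: int) -> str:
    """Return the YQL `limit ... offset ...` suffix for a slice of results.

    `limit` is computed as offset + hits, matching the example above
    where `limit 31 offset 29` yields two hits.
    """
    if offset < 0 or hits < 0:
        raise ValueError("offset and hits must be non-negative")
    return f"limit {offset + hits} offset {offset}"


print(page_clause(offset=29, hits=2))
```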
Set query timeout in milliseconds using timeout. This can also be controlled by using the native execution parameter timeout. Values specified in YQL take precedence:
where title contains "madonna" timeout 70
Only literal numbers are valid, i.e. setting another unit is not supported.
The query operators dotProduct(field, value), weightedSet(field, value) and wand(field, value) support parameter substitution for the value parameter - example of equivalent queries:
... where weightedSet(field, {"a":1, "b":2})
... where weightedSet(field, @myset)&myset={a:1,b:2}
The value string can be passed in one of these formats:
[[key, value], ...]
{key: value, ...}
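As a sketch of the map format, the Python snippet below serializes a dictionary into the `{key:value,...}` form and builds the request parameters. It assumes keys are simple tokens without commas, colons or braces; the function name `weighted_set_param` and the parameter name `myset` (from the example above) are illustrative.

```python
from typing import Mapping
from urllib.parse import urlencode


def weighted_set_param(weights: Mapping[str, int]) -> str:
    """Serialize a token-to-weight mapping as {key:value,...} without quoting.

    Assumption: keys contain no ',', ':' or braces.
    """
    return "{" + ",".join(f"{key}:{weight}" for key, weight in weights.items()) + "}"


params = {
    "yql": "select * from sources * where weightedSet(field, @myset)",
    "myset": weighted_set_param({"a": 1, "b": 2}),
}
# urlencode takes care of percent-encoding both parameters.
query_string = urlencode(params)
print(query_string)
```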
The query operator userInput(value) supports parameter substitution for the value parameter:
... where userInput(@userinput)&userinput=free+text
Use this to submit the user data unchanged for parsing in Vespa, without risk of corrupting the YQL query.
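A minimal sketch of this pattern in Python: the YQL statement stays fixed and the user text travels in its own request parameter, so quotes or operators typed by the user never become part of the YQL string. The function name `user_input_params` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def user_input_params(user_text: str) -> str:
    """Build a query string that passes user text via parameter substitution."""
    return urlencode(
        {
            "yql": "select * from sources * where userInput(@userinput)",
            "userinput": user_text,
        }
    )


query_string = user_input_params('free text" or true')
# The user text round-trips unchanged and is never spliced into the YQL.
decoded = parse_qs(query_string)
print(decoded["userinput"][0])
```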
Terms and phrases can be annotated to manipulate the behavior.
Add an annotation using {}:
where text contains ({distance: 5}near("a", "b")) and text contains ({distance:2}near("c", "d"))
Note that the annotation is enclosed by parentheses to scope the annotation to the operator.
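The annotation syntax can be generated with a small helper, sketched below. The function names `format_value` and `annotate` are hypothetical, and string values are assumed not to contain quotes or backslashes.

```python
def format_value(value) -> str:
    """Render an annotation value as it would appear in YQL."""
    # bool must be checked before other types, since bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        # Assumption: the string needs no escaping.
        return '"' + value + '"'
    return str(value)


def annotate(annotations: dict, expression: str) -> str:
    """Wrap an expression in parentheses with an annotation map in front."""
    body = ", ".join(f"{key}: {format_value(val)}" for key, val in annotations.items())
    return "({" + body + "}" + expression + ")"


print(annotate({"distance": 5}, 'near("a", "b")'))
print(annotate({"stem": False}, '(foo contains "a")'))
```

The surrounding parentheses are always emitted so that the annotation is scoped to the operator it precedes.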
All annotations are supported by the string arguments to functions like phrase() and near(), and also by the string argument to the "contains" operator. Some annotations are also supported by the functions which are handled like leaf nodes internally in the query tree: phrase(), near(), onear(), range(), equiv(), dotProduct(), weightedSet(), weakAnd(), wand() and nearestNeighbor().
Refer to SelectTestCase.java for sample usage.
Annotation | Default | Values | Description | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
accentDrop | true | boolean | Remove accents from this term if it is the setting for this field. Refer to linguistics. |
||||||||
allowEmpty | false | boolean |
Whether to allow empty input for query parsing and query terms in userInput.
If |
||||||||
andSegmenting | true|false |
Force phrase or AND operator if re-segmenting (e.g. in stemming) this term results in multiple terms. Default is choosing from language settings. |
|||||||||
annotations | map |
Map of annotations : {cox: "another"} |
|||||||||
approximate | boolean |
Used in nearestNeighbor.
The optional approximate annotation may be set to |
|||||||||
ascending | boolean |
Ascending hit order. Used by hitLimit. |
|||||||||
bounds | closed |
enum |
A numeric interval is by default a closed interval.
If the lower bound is exclusive, set to where ({bounds:"rightOpen"}range(year, 2000, 2018)) |
||||||||
connectivity | map |
Map of connectivity: {id: 4, weight: 0.8} |
|||||||||
descending | boolean |
Descending hit order. Used by hitLimit. |
|||||||||
defaultIndex | default |
Any searchable field in the schema. |
Used by userInput.
Same as model.defaultIndex in the query API.
If grammar is set to |
||||||||
distance | 2 | int |
The distance-annotation sets the maximum position difference to count as a match, see near / onear. The default distance is 2, meaning match if the words have up to one separating word. where text contains ({distance: 5}near("a", "b")) |
||||||||
distanceThreshold | double |
Used in nearestNeighbor.
The |
||||||||
endAnchor | true | boolean |
For example, where myUrlField.hostname contains uri("vespa.ai") will match vespa.ai and docs.vespa.ai, while where myUrlField.hostname contains ({startAnchor: true}uri("vespa.ai")) will only match vespa.ai. |
||||||||
filter | false | boolean | Regard this term as a "filter" term and not a term from the end user. Terms that are annotated with "filter:true" are not bolded. See also model.filter. Bolding of terms is controlled by schema:bolding. |
||||||||
function |
Default sort function for strings is Numeric fields are numerically sorted.
|
||||||||||
grammar | weakAnd |
raw , segment and all values accepted for the
model.type argument in the query API. |
How to parse userInput.
|
||||||||
hitLimit | int |
Numeric operations support
Note that
See the practical-search-performance-guide for an example. |
|||||||||
hnsw.exploreAdditionalHits |
Used in nearestNeighbor.
When using an HNSW index,
the optional |
||||||||||
id | int | Unique ID used for e.g. connectivity. |
|||||||||
implicitTransforms | true | boolean |
Implicit term transformations (field defaults).
If |
||||||||
label | string |
Used by geoLocation and nearestNeighbor. Label for referring to this term during ranking. |
|||||||||
language | RFC 3066 language code |
Language setting for the linguistics handling of userInput, also see model.language in the query API reference. |
|||||||||
locale |
Used by the UCA sort function.
An identifier following
unicode locale identifiers, e.g. |
||||||||||
maxEditDistance | 2 | int |
Used in fuzzy. An inclusive upper bound of edit distance between query and string attribute. |
||||||||
nfkc | true | boolean | NFKC normalization. |
||||||||
normalizeCase | true | boolean | Normalize casing of this term if it is the setting for this field. |
||||||||
origin | map |
Map of origin: {original: "abc", offset: 1, length: 2} |
|||||||||
prefix | false | boolean | Do prefix matching for this term, e.g. search for "word*". |
||||||||
prefixLength | 0 | int | Used in fuzzy. Number of characters that are considered frozen, so the fuzzy match will be performed with the suffix left. |
||||||||
ranked | true | boolean |
Include this term for ranking calculation. Setting ranked to false can speed up query evaluation. Read more about schema reference. Example |
||||||||
scoreThreshold | double / integer |
Both wand and weakAnd supports |
|||||||||
significance | double |
Significance value for text ranking features - see text matching and ranking. |
|||||||||
startAnchor | false | boolean | See endAnchor. |
||||||||
stem | true | boolean | Stem this term if it is the setting for this field. |
||||||||
strength | PRIMARY |
|
Used by the UCA sort function.
Default is |
||||||||
suffix | false | boolean | Do suffix matching for this term, e.g. search for "*word". |
||||||||
targetHits | 100 | int |
Used by wand, weakAnd
and nearestNeighbor.
It sets the wanted number of hits exposed to the real first-phase ranking function per content node.
If additional second phase ranking with rerank-count is used,
do not set |
||||||||
usePositionData | true | boolean |
Use term position data for text ranking features such as nativeRank. This is term position, not to be confused with geo searches. Setting "usePositionData:false" can improve query performance. |
||||||||
weight | 100 | int |
Term weight, used in some text ranking features - see text matching and ranking. where title contains ({weight:200}"heads") |
Consider the following query:
select * from sources * where ({stem: false}(foo contains "a" and bar contains "b")) or foo contains {stem: false}"c"
The "stem" annotation controls whether a given term should be stemmed if its field is configured as a stemmed field (default is "true"). The "AND" operator itself has no internal API for whether its operands should be stemmed or not, but we can still annotate as such, because when the value of a given annotation is determined, the expression tree is followed from the term in question and up through its ancestors. Traversing the tree stops when a value is found (or there is nothing more to traverse). In other words, none of the terms in this example will be stemmed.
How annotations behave may be easier to understand by expressing a boolean query in the style of an S-expression:
(AND term1 term2 (OR term3 term4) (OR term5 (AND term6 term7)))
The annotation scopes would then be as follows, i.e. annotations on which elements will be checked when determining the settings for a given term:
term1 | term1 itself, and the first AND |
term2 | term2 itself, and the first AND |
term3 | term3 itself, the first OR and the first AND |
term4 | term4 itself, the first OR and the first AND |
term5 | term5 itself, the second OR and the first AND |
term6 | term6 itself, the second AND, the second OR and the first AND |
term7 | term7 itself, the second AND, the second OR and the first AND |
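The lookup described above, from a term up through its ancestors until a value is found, can be modeled in a few lines of Python. This is a conceptual sketch, not Vespa's implementation; the `Node` class and `resolve` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """A query tree node with its own annotations and a link to its parent."""
    name: str
    annotations: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        return child


def resolve(node: Node, key: str, default: Any = None) -> Any:
    """Walk from the node towards the root; the first value found wins."""
    current: Optional[Node] = node
    while current is not None:
        if key in current.annotations:
            return current.annotations[key]
        current = current.parent
    return default


# ({stem: false}(foo contains "a" and bar contains "b")) or foo contains {stem: false}"c"
root = Node("OR")
and_node = root.add(Node("AND", {"stem": False}))
term_a = and_node.add(Node("a"))
term_b = and_node.add(Node("b"))
term_c = root.add(Node("c", {"stem": False}))
# A hypothetical extra term without any annotation in its scope.
term_d = root.add(Node("d"))

for term in (term_a, term_b, term_c, term_d):
    print(term.name, resolve(term, "stem", default=True))
```

Terms a and b inherit the setting from the annotated AND, c carries its own, and the unannotated d falls back to the default.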
Use YQL variable syntax to initialize words in phrases and as single terms. This removes the need for caring about quoting a term in YQL, as well as URL quoting. The term will be used exactly as it is in the URL. As an example, look at a query with a YQL argument, and the properties animal and syntaxExample:
yql=select * from sources * where foo contains @animal and foo contains phrase(@animal, @syntaxExample, @animal)&animal=panda&syntaxExample=syntactic
This YQL expression will then access the query properties animal and syntaxExample and evaluate to:
select * from sources * where (foo contains "panda" AND foo contains phrase("panda", "syntactic", "panda"))
YQL requires quoting to be included in a URL. Since YQL is well suited to application logic, while not being intended for end users, a solution to this is storing the application's YQL queries into different query profiles. To add a default query profile, add search/query-profiles/default.xml to the application package:
<query-profile id="default"> <field name="yql">select * from sources * where default contains "latest" or userQuery()</field> </query-profile>
This will add latest as an OR term to all queries not having an explicit query profile parameter. Note that it is not necessary to URL-quote anything in the query profile files; they operate independently of the HTTP parsing.
Searchers which modify the textual YQL statement (not recommended) should be annotated with @Before("ExternalYql").
Searchers modifying the query tree produced from an input YQL statement should be annotated with @After("ExternalYql").
Group / aggregate results by adding a grouping expression after a | - read more.
select * from sources * where sddocname contains 'purchase' | all(group(customer) each(output(sum(price))))