Vespa accepts unstructured human input and structured queries for application logic separately, then combines them into a single data structure for execution. Human input is parsed heuristically, while application queries are formulated in YQL.
A query URL looks like:
http://myhost.mydomain.com:8080/search/?yql=select%20%2A%20from%20sources%20%2A%20where%20text%20contains%20%22blues%22
In other words, yql contains:
select * from sources * where text contains "blues"
This matches all documents where the field named text contains the word blues.
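As a minimal sketch of how the URL above could be produced, the following Python snippet percent-encodes the YQL string with the standard library. The host and path are copied from the example; the helper name build_query_url is hypothetical and not part of any Vespa client.

```python
from urllib.parse import quote, urlencode


def build_query_url(base_url: str, yql: str) -> str:
    """Return base_url with the YQL statement percent-encoded as the yql parameter."""
    # quote_via=quote encodes spaces as %20 (rather than '+') and '*' as %2A.
    return base_url + "?" + urlencode({"yql": yql}, quote_via=quote)


url = build_query_url(
    "http://myhost.mydomain.com:8080/search/",
    'select * from sources * where text contains "blues"',
)
print(url)
```

The printed URL should be identical to the example query URL shown earlier.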
Quote " and backslash \ characters in text values must be escaped by a backslash; also see how backslash escapes work.
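The escaping rule can be sketched in Python as follows. The helper names escape_yql_string and contains_clause are hypothetical, and the sketch only covers the two characters named above.

```python
def escape_yql_string(text: str) -> str:
    """Escape backslashes and double quotes for a double-quoted YQL string."""
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def contains_clause(field: str, text: str) -> str:
    """Build a 'field contains "text"' clause with the text value escaped."""
    return f'{field} contains "{escape_yql_string(text)}"'


print(contains_clause("title", 'say "hi"'))
```

The order of the two replacements matters: escaping quotes first would cause the backslashes it adds to be escaped a second time.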
Querying for null or NaN is not supported.
Work around using a "magic" value (like MAXINT) that is not normally used in the documents.
Since Vespa 7.520.3, YQL queries do not require a semicolon at the end.
select is the list of summary fields requested (a field with the summary index attribute). Vespa will hide other fields in the matching documents.
select price,isbn from sources * where title contains "madonna"
The above explicitly requests the fields "price" and "isbn" (from all sources). To request all fields, use an asterisk as field selection:
select * from sources * where title contains "madonna"
from sources specifies which content sources to query. Example:
select * from music where title contains "madonna"
queries all document types in the music content cluster or federation source. Query in:
all sources | select … from sources * where … |
a set of sources | select … from sources source1, source2 where … |
a single source | select … from source1 where … |
In other words, sources is used for querying some/all sources. If only a single source is queried, the sources keyword is dropped. To restrict the query to only one schema (aka document type) use the model.restrict URL parameter. Also see federation.
The where clause is a tree of operators:
numeric |
The following numeric operators are available:
where 500 >= price
where range(fieldname, 0, 5000000000L)
Numbers must be in the signed 32-bit range.
Input 64-bit signed numbers using an L suffix, as in 5000000000L.
For the range operator, an open-ended interval can be written as: where (range(year, 2000, Infinity))
The weightedset field does not support filtering on weight. Solve this using the map type and sameElement query operator - see example. | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boolean |
The boolean operator is: where alive = true | ||||||||||||||||||||||
contains |
The right-hand side argument of the contains operator is either a string literal or a function such as phrase().
where title contains "madonna"
The matched field must be an indexed field or attribute.
Fields inside structs are referenced using dot notation -
e.g |
||||||||||||||||||||||
and |
where title contains "madonna" and title contains "saint" |
||||||||||||||||||||||
or |
where title contains "madonna" or title contains "saint" |
||||||||||||||||||||||
not |
Use the ! operator: where title contains "madonna" and !(title contains "saint") |
||||||||||||||||||||||
phrase |
Phrases are expressed as a function: where text contains phrase("st", "louis", "blues") |
||||||||||||||||||||||
near |
|
||||||||||||||||||||||
onear |
|
||||||||||||||||||||||
sameElement |
sameElement() is an operator that requires the terms to match within the same struct element in an array or a map field. Example:
struct person {
    field first_name type string {}
    field last_name type string {}
    field year_of_birth type int {}
}
field persons type array<person> {
    indexing: summary
    struct-field first_name { indexing: attribute }
    struct-field last_name { indexing: attribute }
    struct-field year_of_birth { indexing: attribute }
}
field identities type map<string, person> {
    indexing: summary
    struct-field key { indexing: attribute }
    struct-field value.first_name { indexing: attribute }
    struct-field value.last_name { indexing: attribute }
    struct-field value.year_of_birth { indexing: attribute }
}
Unlike a normal AND, sameElement requires all of its conditions to match within the same element:
where persons contains sameElement(first_name contains 'Joe', last_name contains 'Smith', year_of_birth < 1940)
The above returns all documents containing Joe Smith born before 1940 in the persons array.
Searching in a map is similar to searching in an array of struct. The difference is that there is an extra synthetic struct with the field members key and value. The above example with the identities map looks like this:
where identities contains sameElement(key contains 'father', value.first_name contains 'Joe', value.last_name contains 'Smith', value.year_of_birth < 1940)
The above returns all documents that have tagged Joe Smith born before 1940 as a 'father'. The important point is to use the indirection of key and value to address the keys and the values of the map. |
||||||||||||||||||||||
equiv |
If two terms in the same field should give exactly the same behavior when matched, use the equiv operator:
where fieldName contains equiv("A","B")
In many cases, the OR operator will give the same results as an EQUIV. The matching logic is exactly the same, and an OR does not have the limitations that EQUIV does (below). The difference is in how matches are visible to ranking functions. All words that are children of an OR count for ranking. When using an EQUIV however, it looks like a single word:
Limitations on how
Learn how to use equiv. |
||||||||||||||||||||||
uri |
Used to search for urls indexed using the uri field type. where myUrlField contains uri("vespa.ai/foo") Various subfields are supported to search components of the URL, see the field type definition.
|
||||||||||||||||||||||
fuzzy |
Levenshtein edit distance search within a string attribute. where myStringAttribute contains ({prefixLength:1, maxEditDistance:2}fuzzy("parantesis")) Annotations below are configuring
Find an example in text matching.
Important:
Only string attribute
fields in documents are supported (single, array or weightedset).
Matching is optimized internally when
maxEditDistance is 1 or 2.
Setting prefixLength greater than 0
narrows the match for the fast-search,
greatly reducing the number of terms that must be considered.
|
||||||||||||||||||||||
matches |
Regular expression matching is supported using POSIX extended syntax, with the limitation that it is case insensitive. Example:
where attribute_field matches "mado[n]+a"
Find more examples in the text matching guide. |
||||||||||||||||||||||
userInput |
userInput() is a robust way of mixing user input and a formal query. It allows controlling whether the user input is to be stemmed, lowercased, etc., and also whether it should be treated as a raw string, simply segmented, or parsed as a query.
yql=select * from sources * where userInput(@animal)&animal=panda
Here, the userInput() function accesses the query property "animal" and parses the property value as a weakAnd query, resulting in the following expression:
select * from sources * where weakAnd(default contains "panda")
Find a full example in the query API guide. Instead of parameter substitution, the userInput() function also accepts raw strings as arguments, but this is not suited for parametrizing the query from a query profile. It is mostly intended as a test feature.
In addition, other annotations, like stem or ranked, will take effect as normal. |
||||||||||||||||||||||
userQuery |
userQuery() reads from model.queryString and parses the query using simple query language. If set, model.filter is combined with model.queryString before the parsing. The user query is first parsed, then the resulting tree is inserted into the corresponding place in the YQL query tree. Example:
$ vespa query 'select * from sources * where vendor contains "brick and mortar" AND price < 50 AND userQuery()' \
  query="abc def -ghi" \
  type=all
This evaluates to a query where:
Use model.defaultIndex to specify a field or fieldset if not using default - see example. |
||||||||||||||||||||||
rank |
The first, and only the first, argument of the rank() function
determines whether a document is a match,
but all arguments are used for calculating rank features. Example:
where rank(a contains "A", b contains "B", c contains "C")
It's also useful in hybrid search use cases. See blog post for usage examples. For example, retrieve using the nearestNeighbor query operator as the first argument and have matching features calculated for the other arguments.
where rank(nearestNeighbor(field, queryVector), a contains "A", b contains "B", c contains "C") |
||||||||||||||||||||||
in |
The in operator is used to match a set of values in an integer or string field. A document is considered a match when at least one of the values matches the content of the field. This is an optimized shorthand for multiple OR conditions, and is similar to the IN operator in SQL. Available since Vespa 8.293.15. Example:
where integer_field in (10, 20, 30)
where string_field in ('germany', 'france', 'norway')
Here, string_field is a field with match:word.
There is no linguistic processing like tokenization or stemming
of the string values used in the in operator except lowercasing. See string match.
field string_field type string {
    indexing: summary | index # or attribute
    match: word
    rank: filter
    attribute: fast-search # if attribute
}
The argument before in is the name of the field or fieldset to search. The argument after in is a comma-separated list of values, enclosed in parentheses. String values must be single or double-quoted if passed inline in YQL. For faster query parsing, use parameter substitution to submit the values as an additional request parameter; quoting of string values is then optional. Example:
where integer_field in (@integer_values)&integer_values=10,20,30
where string_field in (@string_values)&string_values=germany,france,norway
The in operator acts as a single term in the query tree, and does not provide any match information for text ranking features. For a discussion of usage and examples refer to:
Important:
When using the in operator with an attribute field,
set fast-search and rank: filter
for best possible performance. Always use
match:word for string fields.
|
||||||||||||||||||||||
dotProduct |
dotProduct calculates the dot product between the weighted set in the query and a weighted set field in the document as its rank score contribution:
where dotProduct(description, {"a":1, "b":2})
The result is stored as a raw score. A normal use case is a collection of weighted tokens produced by an algorithm, to match against a corpus containing weighted tokens produced by another algorithm in order to implement personalized content exploration. See example usage of dotProduct in practical performance guide. Refer to multivalue query operators for a discussion of usage and examples.
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
|
||||||||||||||||||||||
weightedSet |
When using weightedSet to search a field, all tokens present in the searched field will be matched against the weighted set in the query. This means that using a weighted set to search a single-value attribute field will have similar semantics to using a normal term to search a weighted set field. The low-level matching information resulting from matching a document with a weighted set in the query will contain the weights of all the matched tokens in descending order. Each matched weight will be represented as a standard occurrence on position 0 in element 0.
where weightedSet(description, {"a":1, "b":2})
weightedSet has similar semantics to equiv, as it acts as a single term in the query. However, the restriction dictating that it contains a collection of weighted tokens directly enables specific back-end optimizations that improve performance for large sets of tokens compared to using the generic equiv or or operators.
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
|
||||||||||||||||||||||
wand |
Note that total hit count becomes inaccurate when using wand.
where wand(description, [[11,1], [37,2]])
Keys must be single or double-quoted if passed inline in YQL -
alternatively, use parameter substitution
to submit the weighted set with a simple format for faster query parsing -
example:
where ({scoreThreshold: 0.13, targetHits: 7}wand(description, {"a":1, "b":2}))
Refer to using wand for an introduction to the WAND algorithm and example usage of wand in practical performance guide.
|
||||||||||||||||||||||
weakAnd |
where weakAnd(a contains "A", b contains "B")
where ({targetHits: 7}weakAnd(a contains "A", b contains "B"))
Refer to using wand for usage and examples.
|
||||||||||||||||||||||
geoLocation |
where geoLocation(myfieldname, 63.5, 10.5, "200 km")
In this example we search for documents near 63.5° north, 10.5° east, and within a 200 km radius. So a document with a "myfieldname" position in Trondheim, Norway at N63°25'47;E10°23'36 would match. The first parameter is the name of the attribute field. The second parameter is the latitude (positive for north, negative for south). The third parameter is the longitude (positive for east, negative for west). The fourth parameter must be a string specifying the radius and its units, where the supported units include "km", "m" (for meters), "miles", and "deg" for degrees (so "deg" gives radius the same units as latitude). Any negative number for radius (e.g. "-1 m") is interpreted as an "infinite" radius, letting any geographical position at all match the geoLocation operator. The position attribute in the schema could look like:
field myfieldname type position { indexing: attribute | summary }
Arrays of positions are also possible:
field myfieldname type array<position> { indexing: attribute }
Properties:
|
||||||||||||||||||||||
nearestNeighbor |
The document vectors are stored in a tensor field attribute, and the query vector is sent with the query request. The following tensor field types are supported:
Euclidean distance is used as the default distance metric and the exact nearest neighbors are returned. When storing multiple vectors per document, the vector that is closest to the query vector is used when calculating the distance between the document and the query. If an HNSW index is specified on the tensor field, the approximate nearest neighbors are returned. Example: where ({targetHits: 10}nearestNeighbor(doc_vector, query_vector))&input.query(query_vector)=[3,5,7]&ranking=semantic
In this example we search for the top 10 nearest neighbors in a 3-dimensional vector space.
targetHits specifies the top-k nearest neighbors to expose to a user-defined ranking function. The second parameter is the name of the tensor sent with the query request (query_vector). Specifying query_vector as the name means the query request must set this tensor as input.query(query_vector) - see the reference. The tensor type of the input query vector must be defined in the rank profile:
rank-profile semantic {
    inputs {
        query(query_vector) tensor<float>(x[3])
    }
    first-phase: closeness(field, doc_vector)
}
Also see defining query feature types. Failure to define the query input tensor in the schema will fail the request:
Expected 'query(query_vector)' to be a tensor, but it is the string '[3,5,7]'
The document tensor field attribute is defined as follows:
field doc_vector type tensor<float>(x[3]) { indexing: attribute | summary }
The above example does not define an HNSW index. See Nearest Neighbor Search, Approximate Nearest Neighbor Search using HNSW Index and Nearest Neighbor Search Guide for more detailed examples.
Properties:
|
||||||||||||||||||||||
nonEmpty |
nonEmpty takes as its only argument an arbitrary search expression. It will then perform a set of checks on that expression. If all the checks pass, the result is the same expression, otherwise the query will fail. The checks are as follows:
yql=select * from sources * where bar contains "a" and nonEmpty(bar contains "bar" and foo contains @foo)&foo= Note how "foo" is empty in this case, which will force the query to fail. If "foo" contained a searchable term, the query would not have failed. |
||||||||||||||||||||||
predicate |
predicate() specifies a predicate query - see predicate fields. It takes three arguments: the predicate field to search, a map of attributes, and a map of range attributes:
where predicate(predicate_field,{"gender":"Female"},{"age":20L})
Due to a quirk in YQL parsing, one cannot specify an empty map; use the number 0 instead:
where predicate(predicate_field,0,{"age":20L}) |
||||||||||||||||||||||
true |
Matches all documents of any type. Care must be taken when using this since processing all documents as matches is expensive. At minimum, consider restricting to only one schema where you know the corpus isn't too big, see the model.restrict URL parameter. |
||||||||||||||||||||||
false |
Does not match any document at all. Not useful in itself, but could potentially be used as a placeholder in the query tree. |
Sort using order by. Add asc or desc after the name of an attribute to set the sort order; ascending order is the default. Add additional sorting attributes to get a secondary sort that acts as a tiebreaker for the primary ordering attribute. This is typically used to get a predictable ordering when the primary ordering attribute has the same value for multiple documents.
where title contains "madonna" order by price asc, releasedate desc
Sorting function, locale and strength are defined using the annotations "function", "locale" and "strength", as in:
where title contains "madonna" order by {function: "uca", locale: "en_US", strength: "IDENTICAL"}other desc, {function: "lowercase"}something
The rank profile determines the rank score each document will get. Results are ordered by that value by default, but order by overrides that ordering. Vespa does not optimize away the rank score computation in this case; it is still executed, even if the model score is thrown away. Use the built-in rank-profile unranked for optimal performance of sorting queries.
To do a primary ordering on the rank score, and a secondary sort on an attribute, use '[relevance]'
as the first order by attribute.
See Special sorting attributes for more details.
Annotation | Effect |
---|---|
function | Sort function, default UCA. |
locale | Locale identifier for the UCA sort function. |
strength | Strength setting for the UCA sort function. |
To specify a slice / limit the number of hits returned / do pagination, use limit and/or offset. This can also be controlled by using native execution parameters. limit 100 overrides <field name="hits" overridable="false">50</field>. Limited by maxHits (default 400) and maxOffset (default 1000) - these can be configured in a queryProfile.
Example: This returns two hits (if there are sufficiently many hits matching the query), skipping the first 29 documents:
where title contains "madonna" limit 31 offset 29
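The example shows that limit is the end position of the slice rather than a hit count: limit 31 with offset 29 yields 31 - 29 = 2 hits. A small Python sketch of that arithmetic follows; the helper name slice_clause is hypothetical.

```python
def slice_clause(offset: int, hits: int) -> str:
    """Return the 'limit ... offset ...' suffix that skips offset hits and returns hits."""
    if offset < 0 or hits < 1:
        raise ValueError("offset must be >= 0 and hits must be >= 1")
    # limit marks where the slice ends, so it is offset + hits.
    limit = offset + hits
    return f"limit {limit} offset {offset}"


print('where title contains "madonna" ' + slice_clause(offset=29, hits=2))
```

For page-based navigation, offset would typically be the page number multiplied by the page size, subject to the maxHits and maxOffset limits mentioned above.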
Set query timeout in milliseconds using timeout. This can also be controlled by using the native execution parameter timeout. YQL-specified values take precedence:
where title contains "madonna" timeout 70
Only literal numbers are valid, i.e. setting another unit is not supported.
Use parameter substitution to separate the YQL string from user input values. E.g., the userInput(value) query operator supports parameter substitution for the value parameter:
... where userInput(@userinput)&userinput=free+text
The query operators field in (value), dotProduct(field, value), weightedSet(field, value) and wand(field, value) support parameter substitution for the value parameter. The value string can be passed in one of the following forms (quotes can be skipped unless the keys contain , or :):
value, ... (for the in operator only)
[[key, value], ...] (for dotProduct, weightedSet and wand)
{key: value, ...} (for dotProduct, weightedSet and wand)
See the query API guide for examples.
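As an illustration of the comma-separated form for the in operator, the following Python sketch builds the request parameters for a query like the integer_field example shown earlier. The helper name in_query_params is hypothetical, and the sketch assumes values that contain neither , nor :, so no quoting is applied.

```python
from urllib.parse import urlencode


def in_query_params(field: str, values: list, param: str) -> dict:
    """Build request parameters that pass the values of an in operator by substitution."""
    yql = f"select * from sources * where {field} in (@{param})"
    # The values travel as a separate request parameter in the 'value, ...' form.
    return {"yql": yql, param: ",".join(str(v) for v in values)}


params = in_query_params("integer_field", [10, 20, 30], "integer_values")
print(urlencode(params))
```

Keeping the values out of the YQL string means the statement itself stays constant across requests, with only the extra parameter changing.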
Terms and phrases can be annotated to manipulate the behavior. Add an annotation using {}:
where text contains ({distance: 5}near("a", "b")) and text contains ({distance:2}near("c", "d"))
Note that the annotation is enclosed by parentheses to scope the annotation to the operator.
All annotations are supported by the string arguments to functions like phrase() and near(), and also by the string argument to the "contains" operator. Some annotations are also supported by the functions which are handled like leaf nodes internally in the query tree: phrase(), near(), onear(), range(), equiv(), dotProduct(), weightedSet(), weakAnd(), wand() and nearestNeighbor().
Refer to SelectTestCase.java for sample usage.
Annotation | Default | Values | Description | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
accentDrop | true | boolean | Remove accents from this term if it is the setting for this field. Refer to linguistics. |
||||||||
allowEmpty | false | boolean |
Whether to allow empty input for query parsing and query terms in userInput.
If |
||||||||
andSegmenting | true|false |
Force phrase or AND operator if re-segmenting (e.g. in stemming) this term results in multiple terms. Default is choosing from language settings. |
|||||||||
annotations | map |
Map of annotations : {cox: "another"} |
|||||||||
approximate | boolean |
Used in nearestNeighbor.
The optional approximate annotation may be set to |
|||||||||
ascending | boolean |
Ascending hit order. Used by hitLimit. |
|||||||||
bounds | closed |
enum |
A numeric interval is by default a closed interval.
If the lower bound is exclusive, set to where ({bounds:"rightOpen"}range(year, 2000, 2018)) |
||||||||
connectivity | map |
Map of connectivity: {id: 4, weight: 0.8} |
|||||||||
descending | boolean |
Descending hit order. Used by hitLimit. |
|||||||||
defaultIndex | default |
Any searchable field in the schema. |
Used by userInput.
Same as model.defaultIndex in the query API.
If grammar is set to |
||||||||
distance | 2 | int |
The distance-annotation sets the maximum position difference to count as a match, see near / onear. The default distance is 2, meaning match if the words have up to one separating word. where text contains ({distance: 5}near("a", "b")) |
||||||||
distanceThreshold | +infinity | double |
Used in nearestNeighbor.
The |
||||||||
endAnchor | true | boolean |
The where myUrlField.hostname contains uri("vespa.ai") will match vespa.ai and docs.vespa.ai, while where myUrlField.hostname contains ({startAnchor: true}uri("vespa.ai")) will only match vespa.ai. |
||||||||
filter | false | boolean | Regard this term as a "filter" term and not a term from the end user. Terms that are annotated with "filter:true" are not bolded. See also model.filter. Bolding of terms is controlled by schema:bolding. |
||||||||
function |
Default sort function for strings is Numeric fields are numerically sorted.
|
||||||||||
grammar | weakAnd |
raw , segment and all values accepted for the
model.type argument in the query API. |
How to parse userInput.
|
||||||||
hitLimit | int |
Numeric operations support
Note that
See the practical-search-performance-guide for an example. |
|||||||||
hnsw.exploreAdditionalHits |
Used in nearestNeighbor.
When using an HNSW index,
the optional |
||||||||||
id | int | Unique ID used for e.g. connectivity. |
|||||||||
implicitTransforms | true | boolean |
Implicit term transformations (field defaults).
If |
||||||||
label | string |
Used by geoLocation and nearestNeighbor. Label for referring to this term/operator during ranking. |
|||||||||
language | RFC 3066 language code |
Language setting for the linguistics handling of userInput, also see model.language in the query API reference. |
|||||||||
locale |
Used by the UCA sort function.
An identifier following
unicode locale identifiers, e.g. |
||||||||||
maxEditDistance | 2 | int |
Used in fuzzy. An inclusive upper bound of edit distance between query and string attribute. |
||||||||
nfkc | true | boolean | NFKC normalization. |
||||||||
normalizeCase | true | boolean | Normalize casing of this term if it is the setting for this field. |
||||||||
origin | map |
Map of origin: {original: "abc", offset: 1, length: 2} |
|||||||||
prefix | false | boolean | Do prefix matching for this term, e.g. search for "word*". |
||||||||
substring | false | boolean | Do substring matching for this word if available in the index, e.g. search for "*word*". Only supported for streaming search. | ||||||||
prefixLength | 0 | int | Used in fuzzy. Number of characters that are considered frozen, so the fuzzy match will be performed with the suffix left. |
||||||||
ranked | true | boolean |
Include this term for ranking calculation. Setting ranked to false can speed up query evaluation. Read more about schema reference. Example |
||||||||
scoreThreshold | double |
A threshold in wand for the minimum score of hits to include as matches. |
|||||||||
significance | double |
Significance value for text ranking features - see text matching and ranking. |
|||||||||
startAnchor | false | boolean | See endAnchor. |
||||||||
stem | true | boolean | Stem this term if it is the setting for this field. |
||||||||
strength | PRIMARY |
|
Used by the UCA sort function.
Default is |
||||||||
suffix | false | boolean | Do suffix matching for this term, e.g. search for "*word". |
||||||||
targetHits | 100 | int |
Used by wand and weakAnd, where the default is 100. It is also used with nearestNeighbor, where it has no default - it must always be set, see examples in nearest neighbor search.
It sets the wanted number of hits exposed to the real first-phase ranking function per content node.
If additional second phase ranking with rerank-count is used,
do not set |
||||||||
usePositionData | true | boolean |
Use term position data for text ranking features such as nativeRank. This is term position, not to be confused with geo searches. Setting "usePositionData:false" can improve query performance. |
||||||||
weight | 100 | int |
Term weight, used in some text ranking features - see text matching and ranking. where title contains ({weight:200}"heads") |
Consider the following query:
select * from sources * where ({stem: false}(foo contains "a" and bar contains "b")) or foo contains {stem: false}"c"
The "stem" annotation controls whether a given term should be stemmed if its field is configured as a stemmed field (default is "true"). The "AND" operator itself has no internal API for whether its operands should be stemmed or not, but we can still annotate as such, because when the value of a given annotation is determined, the expression tree is followed from the term in question and up through its ancestors. Traversing the tree stops when a value is found (or there is nothing more to traverse). In other words, none of the terms in this example will be stemmed.
How annotations behave may be easier to understand when expressing a boolean query in the style of an S-expression:
(AND term1 term2 (OR term3 term4) (OR term5 (AND term6 term7)))
The annotation scopes would then be as follows, i.e. annotations on which elements will be checked when determining the settings for a given term:
term1 | term1 itself, and the first AND |
term2 | term2 itself, and the first AND |
term3 | term3 itself, the first OR and the first AND |
term4 | term4 itself, the first OR and the first AND |
term5 | term5 itself, the second OR and the first AND |
term6 | term6 itself, the second AND, the second OR and the first AND |
term7 | term7 itself, the second AND, the second OR and the first AND |
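The lookup described above (start at the term, walk up through its ancestors, stop at the first node that sets the annotation) can be sketched in Python. The Node class and resolve function are hypothetical illustrations and not Vespa's internal API; the annotations placed on the second OR and on term6 are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(eq=False)
class Node:
    """A term or operator in a simplified, hypothetical query tree."""
    name: str
    annotations: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        # Link each child back to this node so lookups can walk upwards.
        for child in self.children:
            child.parent = self


def resolve(term: Node, key: str, default: Any) -> Any:
    """Return the nearest value of key, searching the term and then its ancestors."""
    node: Optional[Node] = term
    while node is not None:
        if key in node.annotations:
            return node.annotations[key]
        node = node.parent
    return default


# (AND term1 term2 (OR term3 term4) (OR term5 (AND term6 term7)))
terms = {f"term{i}": Node(f"term{i}") for i in range(1, 8)}
second_and = Node("AND", {}, [terms["term6"], terms["term7"]])
first_or = Node("OR", {}, [terms["term3"], terms["term4"]])
second_or = Node("OR", {"stem": False}, [terms["term5"], second_and])
root = Node("AND", {}, [terms["term1"], terms["term2"], first_or, second_or])
terms["term6"].annotations["stem"] = True

for name in ("term1", "term5", "term6", "term7"):
    print(name, resolve(terms[name], "stem", True))
```

In this sketch term5 and term7 inherit stem: false from the second OR, term6 keeps its own setting, and term1 falls back to the default because nothing in its scope sets the annotation.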
Use YQL variable syntax to initialize words in phrases and as single terms. This removes the need for caring about quoting a term in YQL, as well as URL quoting. The term will be used exactly as it is in the URL. As an example, look at a query with a YQL argument, and the properties animal and syntaxExample:
yql=select * from sources * where foo contains @animal and foo contains phrase(@animal, @syntaxExample, @animal)&animal=panda&syntaxExample=syntactic
This YQL expression will then access the query properties animal and syntaxExample and evaluate to:
select * from sources * where (foo contains "panda" AND foo contains phrase("panda", "syntactic", "panda"))
YQL requires quoting to be included in a URL. Since YQL is well suited to application logic, while not being intended for end users, a solution to this is storing the application's YQL queries into different query profiles. To add a default query profile, add search/query-profiles/default.xml to the application package:
<query-profile id="default">
    <field name="yql">select * from sources * where default contains "latest" or userQuery()</field>
</query-profile>
This will add latest as an OR term to all queries not having an explicit query profile parameter. The important thing to note is that it is not necessary to URL-quote anything in the query profile files. They operate independently of the HTTP parsing as such.
Searchers that modify the textual YQL statement (not recommended) should be annotated with @Before("ExternalYql"). Searchers modifying the query tree produced from an input YQL statement should be annotated with @After("ExternalYql").
Group / aggregate results by adding a grouping expression after a | (pipe) - read more.
select * from sources * where sddocname contains 'purchase' | all(group(customer) each(output(sum(price))))