This is the Vespa grouping reference. Read the Vespa grouping guide first for an introduction to grouping and examples.
Also note that using a multivalued attribute (such as an array of doubles) in a grouping expression affects performance. Such operations can hit a memory bandwidth bottleneck, particularly if the set of hits to be processed is large, as more data is evaluated.
Group query results using a custom expression (using the group clause). The expression can be a function over another expression (xorbit, md5, cat, xor, and, or, add, sub, mul, div, mod) or any other expression.
add(<array>) adds all elements together to produce a scalar.
add(<arrayA>, <arrayB>) adds each element together, producing a new array whose size is max(|<arrayA>|, |<arrayB>|).
Groups can contain subgroups (by using each and group operations), and may be nested to any level.
Multiple sub-groupings or outputs can be created under the same group level, using multiple parallel each or all clauses, and each one may be labelled using as(mylabel).
When grouping results, groups that contain outputs, group lists and hit lists are generated. Group lists contain subgroups, and hit lists contain hits that are part of the owning group.
The identity of a group is held by its id. Scalar identities such as long, double and string, are directly available from the id, whereas range identities used for bucket aggregation are separated into the sub-nodes from and to. Refer to the result format reference.
A multivalue attribute is a weighted set, array or map. Most grouping functions will just handle the elements of multivalued attributes separately, as if they were all individual values in separate documents. If you are grouping over an array of structs or a map, scoping will be used to preserve structure. Each entry in the array/map will be treated as a separate sub-document. The following syntax can be used when grouping on map attribute fields.
Group on map keys:
all( group(mymap.key) each(output(count())) )
Group on map keys then on map values:
all( group(mymap.key) each( group(mymap.value) each(output(count())) ))
Group on values for key my_key:
all( group(my_map{"my_key"}) each(output(count())) )
Group on struct field my_field referenced in map element my_key:
all( group(my_map{"my_key"}.my_field) each(output(count())) )
The key can either be specified directly (above) or indirectly via a key source attribute. The key is retrieved from the key source attribute for each document. Note that the key source attribute must be single value and have the same data type as the key type of the map:
all( group(my_map{attribute(my_key_source)}) each(output(count())) )
Group on array of integers field:
all( group(my_array) each(output(count())) )
Group on struct field my_field in the my_array array of structs:
all( group(my_array.my_field) each(output(count())) )
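The statement that elements of a multivalued attribute are handled as if they were individual values in separate documents can be illustrated with a small simulation. This is only a sketch of the semantics, not Vespa code: the documents and the helper group_count are hypothetical, and the model corresponds loosely to all( group(my_array) each(output(count())) ).

```python
from collections import Counter

# Hypothetical documents; "my_array" mirrors the array-of-integers field above.
docs = [
    {"id": 1, "my_array": [1, 2]},
    {"id": 2, "my_array": [2, 3]},
    {"id": 3, "my_array": [2]},
]


def group_count(documents, field):
    """Count group members, treating every array element as its own value."""
    counts = Counter()
    for doc in documents:
        for value in doc[field]:
            counts[value] += 1
    return dict(counts)


counts = group_count(docs, "my_array")
print(counts)
```

A document with several elements therefore contributes to several groups, which is worth keeping in mind when reading counts from such a grouping.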
Tensors cannot be used in grouping.
Each level of grouping may specify how to order its groups (using order).
Limit the number of groups returned for each level using max, returning only the first n groups as specified by order:
order changes the ordering of groups after a merge operation for the following aggregators: count, avg and sum.
order will not change the ordering of groups after a merge operation when max or min is used.
The default order, -max(relevance()), does not require use of precision.
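How order and max interact at a single grouping level can be sketched as "sort the groups, then keep the first n". The snippet below is a simplified model under that assumption; order_and_limit and the sample groups are hypothetical and roughly mirror all( group(a) max(2) order(-count()) each(output(count())) ).

```python
def order_and_limit(groups, order_key, descending=True, max_groups=None):
    """Sort groups by the order expression, then keep only the first max_groups."""
    ordered = sorted(groups, key=order_key, reverse=descending)
    return ordered if max_groups is None else ordered[:max_groups]


# Hypothetical groups with an already aggregated count.
groups = [
    {"id": "a", "count": 3},
    {"id": "b", "count": 7},
    {"id": "c", "count": 5},
]

top_two = order_and_limit(groups, order_key=lambda g: g["count"], max_groups=2)
print([g["id"] for g in top_two])
```

The model ignores the distributed merge step, which is where the notes above about count, avg and sum versus max and min become relevant.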
Pagination of grouping results is managed by continuations. These are opaque objects that can be combined and re-submitted using the continuations annotation on the grouping step of the query to move to the previous or next page in a result list.
All root groups contain a single this continuation per select. That continuation represents the current view, and if submitted as the sole continuation, it will reproduce the exact same result as the one that contained it.
There is zero or one prev/next continuation per group list and hit list. Submit any number of these to retrieve the next/previous pages of the corresponding lists.
Any number of continuations can be combined in a query, but the first must always be the this continuation. E.g. one may simultaneously move both to the next page of one list, and the previous page of another.
If working programmatically with grouping, find the Continuation objects within RootGroup, GroupList and HitList result objects. These can then be added back into the continuation list of the GroupingRequest to paginate.
Refer to the grouping guide for an example.
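The rule that the this continuation must come first when several continuations are combined can be sketched as follows. The helpers combine_continuations and annotate_grouping are hypothetical, the token strings are placeholders (real continuations are opaque values taken from a previous result), and the annotation layout shown is an assumption that should be checked against the grouping guide.

```python
def combine_continuations(this_token, *page_tokens):
    """Return the continuation list with the 'this' continuation first."""
    if not this_token:
        raise ValueError("the 'this' continuation is required and must come first")
    return [this_token, *page_tokens]


def annotate_grouping(grouping_expression, continuations):
    """Prefix a grouping expression with a continuations annotation (assumed layout)."""
    quoted = ", ".join(f"'{token}'" for token in continuations)
    return f"{{ 'continuations': [{quoted}] }}{grouping_expression}"


# Placeholder tokens: next page of one list, previous page of another.
tokens = combine_continuations("THIS_TOKEN", "NEXT_TOKEN_LIST_A", "PREV_TOKEN_LIST_B")
step = annotate_grouping("all(group(a) max(5) each(output(count())))", tokens)
print(step)
```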
Lists created using the each keyword can be assigned a label using the construct each(...) as(mylabel). The outputs created by that each clause will be identified by this label.
Grouping expressions can be tagged with an alias. An alias allows the expression to be reused without having to repeat the expression verbatim.
all(group(a) alias(myalias, count()) each(output($myalias))) is equivalent to
all(group(a) each(output(count()))).
all(group(a) order($myalias=count()) each(output($myalias))) is equivalent to
all(group(a) order(count()) each(output(count()))).
Precision is the number of intermediate groups returned from each content node during expression evaluation, giving the container node more data to consider when selecting the groups that are to be evaluated further:
each(...) precision(1000)
A higher number costs more bandwidth, but leads to higher accuracy in some cases.
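Why a higher precision can improve accuracy is easier to see with a toy model of two content nodes. This is a deliberately simplified sketch, not Vespa's merge logic: each hypothetical node returns only its top groups by count, and the container merges what it receives and keeps the best group.

```python
from collections import Counter


def node_top_groups(counts, precision):
    """Return the 'precision' largest groups a single content node would report."""
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return dict(ranked[:precision])


def merged_top(node_counts, precision, max_groups):
    """Merge the per-node partial results and keep the top max_groups groups."""
    merged = Counter()
    for counts in node_counts:
        merged.update(node_top_groups(counts, precision))
    return merged.most_common(max_groups)


# Hypothetical per-node counts; group "b" is the true winner with 9 + 8 = 17.
nodes = [
    {"a": 10, "b": 9, "c": 1},
    {"c": 9, "b": 8, "a": 1},
]

low_precision = merged_top(nodes, precision=1, max_groups=1)
higher_precision = merged_top(nodes, precision=2, max_groups=1)
print(low_precision, higher_precision)
```

With precision 1 the container never sees group b at all, so it picks a with an undercounted total. With precision 2 both nodes report b and the merged result is correct.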
The following query parameters are relevant for grouping. See the Query API Reference for description.
When max is specified in the grouping expression, it might cause inaccuracies in aggregated values such as count. It is recommended to test whether this is an issue and, if so, to adjust the precision parameter to still get correct counts.
The session cache stores intermediate grouping results in the content nodes when using multi-level grouping expressions, in order to speed up grouping at a potential loss of accuracy. This causes the query and grouping expression to be run only once.
When having multi-level grouping expressions, the search query is normally re-run for each level. The drawback of this is, with an expensive ranking function, the query will take more time than strictly necessary.
Each level of grouping specifies a set of aggregates to collect for all documents that belong to that group (using the output operation).
When all arguments are numeric, the result type is resolved by looking at the argument types. If all arguments are longs, the result is a long. If at least one argument is a double, the result is a double.
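The type resolution rule for numeric arguments can be written down as a small function. This is a sketch of the rule as stated above only; resolve_numeric_result_type is a hypothetical helper, and the type names are plain strings chosen for illustration.

```python
def resolve_numeric_result_type(argument_types):
    """Return 'long' if every argument is a long, 'double' if any is a double."""
    if not argument_types:
        raise ValueError("at least one argument type is required")
    unknown = set(argument_types) - {"long", "double"}
    if unknown:
        raise ValueError(f"non-numeric argument types: {sorted(unknown)}")
    return "double" if "double" in argument_types else "long"


print(resolve_numeric_result_type(["long", "long"]))
print(resolve_numeric_result_type(["long", "double"]))
```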
When using order, aggregators can also be used in expressions in order to get increased control over group sorting. This does not work with expressions that take attributes as an argument, unless the expression is enclosed within an aggregator.
Using sum, max on a multivalued attribute:
Doing an operation such as output(sum(myarray)) will run the sum over each element value in each document. The result is the sum of sums of values. Similarly max(myarray) will yield the maximal element over all elements in all documents, and so on.
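The "sum of sums" behaviour can be checked with a few lines of arithmetic. The documents below are hypothetical and the snippet only models the aggregation described above.

```python
# Hypothetical documents in one group, each with a multivalued "myarray" field.
docs = [
    {"myarray": [1, 2, 3]},
    {"myarray": [10, 20]},
]

# sum(myarray): sum the elements of each document, then sum those per-document sums.
total = sum(sum(doc["myarray"]) for doc in docs)

# max(myarray): the largest element over all elements in all documents.
largest = max(max(doc["myarray"]) for doc in docs)

print(total, largest)
```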
Multivalue fields such as maps and arrays can be used for grouping.
However, using aggregation functions such as sum() on such fields can give misleading results.
Assume a map from strings to integers (map<string, int>), where the strings are some sort of key to use for grouping.
The following expression will provide the sum of the values for all keys:
all( group(mymap.key) each(output(sum(mymap.value))) )
and not the sum of the values within each key, as one would expect. It is still, however, possible to run the following expression to get the sum of values within a specific key:
all( group("my_group") each(output(sum(mymap{"foo"}))) )
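The difference between the two expressions can be modelled as follows. This is one reading of the behaviour described above, not Vespa code: the documents and both helper functions are hypothetical, and the first helper assumes that every key group of a document receives the sum of all map values in that document.

```python
from collections import defaultdict

# Hypothetical documents with a map<string, int> field named "mymap".
docs = [
    {"mymap": {"a": 1, "b": 10}},
    {"mymap": {"a": 2}},
]


def sum_all_values_per_key_group(documents):
    """Model of group(mymap.key) each(output(sum(mymap.value)))."""
    totals = defaultdict(int)
    for doc in documents:
        doc_total = sum(doc["mymap"].values())  # values for all keys in the document
        for key in doc["mymap"]:
            totals[key] += doc_total
    return dict(totals)


def sum_for_key(documents, key):
    """Model of sum(mymap{"key"}): only the values stored under one specific key."""
    return sum(doc["mymap"][key] for doc in documents if key in doc["mymap"])


misleading = sum_all_values_per_key_group(docs)
within_key_a = sum_for_key(docs, "a")
print(misleading, within_key_a)
```

In this model the group for key a reports 13 rather than the 3 that the values under a actually add up to, which is the kind of surprise the text warns about.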
Refer to the system test for grouping on struct and map types for more examples.
Group list aggregators | |||
Name | Description | Arguments | Result |
---|---|---|---|
count | Counts the number of unique groups (as produced by group). Note that count operates independently of max and that this count is an estimate using HyperLogLog++, which is an algorithm for the count-distinct problem | None | Long |
Group aggregators | |||
Name | Description | Arguments | Result |
count | Increments a long counter every time it is invoked | None | Long |
sum | Sums the argument over all selected documents | Numeric | Numeric |
avg | Computes the average over all selected documents | Numeric | Numeric |
min | Keeps the minimum value of selected documents | Numeric | Numeric |
max | Keeps the maximum value of selected documents | Numeric | Numeric |
xor | XOR the values (their least significant 64 bits) of all selected documents | Any | Long |
stddev | Computes the population standard deviation over all selected documents | Numeric | Double |
Hit aggregators | |||
Name | Description | Arguments | Result |
summary | Produces a summary of the requested summary class | Name of summary class | Summary |
Arithmetic expressions | |||
Name | Description | Arguments | Result |
---|---|---|---|
add | Add the arguments together | Numeric+ | Numeric |
+ | Add left and right argument | Numeric, Numeric | Numeric |
mul | Multiply the arguments together | Numeric+ | Numeric |
* | Multiply left and right argument | Numeric, Numeric | Numeric |
sub | Subtract second argument from first, third from result, etc | Numeric+ | Numeric |
- | Subtract right argument from left | Numeric, Numeric | Numeric |
div | Divide first argument by second, result by third, etc | Numeric+ | Numeric |
/ | Divide left argument by right | Numeric, Numeric | Numeric |
mod | Modulo first argument by second, result by third, etc | Numeric+ | Numeric |
% | Modulo left argument by right | Numeric, Numeric | Numeric |
neg | Negate argument | Numeric | Numeric |
- | Negate right argument | Numeric | Numeric |
Bitwise expressions | |||
Name | Description | Arguments | Result |
and | AND the arguments in order | Long+ | Long |
or | OR the arguments in order | Long+ | Long |
xor | XOR the arguments in order | Long+ | Long |
String expressions | |||
Name | Description | Arguments | Result |
strlen | Count the number of bytes in argument | String | Long |
strcat | Concatenate arguments in order | String+ | String |
Type conversion expressions | |||
Name | Description | Arguments | Result |
todouble | Convert argument to double | Any | Double |
tolong | Convert argument to long | Any | Long |
tostring | Convert argument to string | Any | String |
toraw | Convert argument to raw | Any | Raw |
Raw data expressions | |||
Name | Description | Arguments | Result |
cat | Cat the binary representation of the arguments together | Any+ | Raw |
md5 | Does an MD5 over the binary representation of the argument, and keeps the lowest 'width' bits | Any, Numeric(width) | Raw |
xorbit | Does an XOR of 'width' bits over the binary representation of the argument. Width is rounded up to a multiple of 8 | Any, Numeric(width) | Raw |
Accessor expressions | |||
Name | Description | Arguments | Result |
relevance | Return the computed rank of a document | None | Double |
<attribute-name> | Return the value of the named attribute | None | Any |
array.at | Array element access. The expression array.at(myarray, idx) returns one value per document by evaluating the idx expression and using it as an index into the array. The expression can then be used to build bigger expressions such as output(sum(array.at(myarray, 0))), which will sum the first element in the array of each document. | Array, Numeric | Any |
interpolatedlookup | Counts elements in a sorted array that are less than an expression, with linear interpolation if the expression is between element values. When the lookup argument's value is between two consecutive array element values, the returned position will be a linear interpolation between their respective indexes. The return value is always in the range of possible array indexes. | Array, Numeric | Numeric |
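The interpolation performed by interpolatedlookup can be sketched in a few lines. This is a minimal illustration of the description in the table, not the Vespa implementation: interpolated_lookup is a hypothetical helper, and clamping to the first and last index for values outside the array is an assumption.

```python
def interpolated_lookup(sorted_values, lookup):
    """Return a fractional index position for lookup in an ascending array."""
    if not sorted_values or lookup < sorted_values[0]:
        return 0.0  # assumption: clamp to the first index
    for i in range(1, len(sorted_values)):
        upper = sorted_values[i]
        if lookup < upper:
            lower = sorted_values[i - 1]
            # Linear interpolation between index i - 1 and index i.
            return (i - 1) + (lookup - lower) / (upper - lower)
    return float(len(sorted_values) - 1)  # assumption: clamp to the last index


myarray = [0.0, 10.0, 20.0]
print(interpolated_lookup(myarray, 5.0))
print(interpolated_lookup(myarray, 25.0))
```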
Bucket expressions | |||
Name | Description | Arguments | Result |
fixedwidth | Maps the value of the first argument into consecutive buckets whose width equals the second argument | Any, Numeric | NumericBucketList |
predefined | Maps the value of the first argument into the given buckets. | Any, Bucket+ | BucketList |
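Mapping a value into consecutive fixed-width buckets can be pictured with the sketch below. It is an illustration only: fixedwidth_bucket is a hypothetical helper, and it assumes half-open buckets [from, to) aligned to multiples of the width, which the table above does not state.

```python
import math


def fixedwidth_bucket(value, width):
    """Return the (from, to) bounds of the fixed-width bucket containing value."""
    if width <= 0:
        raise ValueError("width must be positive")
    start = math.floor(value / width) * width  # assumption: buckets aligned at 0
    return (start, start + width)


print(fixedwidth_bucket(25, 10))
print(fixedwidth_bucket(-5, 10))
```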
Time expressions: the field must be a long, with second resolution (unix timestamp/epoch) - examples. | |||
Name | Description | Arguments | Result |
time.dayofmonth | Returns the day of month (1-31) for the given timestamp | Long | Long |
time.dayofweek | Returns the day of week (0-6) for the given timestamp, Monday being 0 | Long | Long |
time.dayofyear | Returns the day of year (0-365) for the given timestamp | Long | Long |
time.hourofday | Returns the hour of day (0-23) for the given timestamp | Long | Long |
time.minuteofhour | Returns the minute of hour (0-59) for the given timestamp | Long | Long |
time.monthofyear | Returns the month of year (1-12) for the given timestamp | Long | Long |
time.secondofminute | Returns the second of minute (0-59) for the given timestamp | Long | Long |
time.year | Returns the full year (e.g. 2009) of the given timestamp | Long | Long |
time.date | Returns the date (e.g. 2009-01-10) of the given timestamp | Long | Long |
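The value ranges in the table can be reproduced from an epoch timestamp with the Python standard library. This is only a sketch for checking expectations: time_parts is a hypothetical helper, and interpreting the timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone


def time_parts(epoch_seconds):
    """Split a unix timestamp (seconds) into the components listed above."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)  # assumption: UTC
    return {
        "time.year": t.year,
        "time.monthofyear": t.month,                  # 1-12
        "time.dayofmonth": t.day,                     # 1-31
        "time.dayofyear": t.timetuple().tm_yday - 1,  # shifted to start at 0
        "time.dayofweek": t.weekday(),                # 0-6, Monday being 0
        "time.hourofday": t.hour,
        "time.minuteofhour": t.minute,
        "time.secondofminute": t.second,
    }


parts = time_parts(1231545600)  # 2009-01-10 00:00:00 UTC, a Saturday
print(parts)
```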
List expressions | |||
Name | Description | Arguments | Result |
size | Return the number of elements in the argument if it is a list. If not return 1 | Any | Long |
sort | Sort the elements in the argument in ascending order if the argument is a list. If not, it is a NOP | Any | Any |
reverse | Reverse the elements in the argument if the argument is a list. If not, it is a NOP | Any | Any |
Other expressions | |||
Name | Description | Arguments | Result |
zcurve.x | Returns the X component of the given zcurve encoded 2d point. All fields of type "position" have an accompanying "<fieldName>_zcurve" attribute that can be decoded using this expression, e.g. zcurve.x(foo_zcurve) | Long | Long |
zcurve.y | Returns the Y component of the given zcurve encoded 2d point | Long | Long |
uca | Converts the attribute string using unicode collation algorithm. Groups are sorted using locale aware sorting, with the default and primary strength values, respectively: all( group(s) order(max(uca(s, "sv"))) each(output(count())) ) and all( group(s) order(max(uca(s, "sv", "PRIMARY"))) each(output(count())) ) | Any, Locale(String), Strength(String) | Raw |
Single argument standard mathematical expressions. These are the standard mathematical functions as found in the Java Math class. | |||
Name | Description | Arguments | Result |
math.exp | | Double | Double |
math.log | | Double | Double |
math.log1p | | Double | Double |
math.log10 | | Double | Double |
math.sqrt | | Double | Double |
math.cbrt | | Double | Double |
math.sin | | Double | Double |
math.cos | | Double | Double |
math.tan | | Double | Double |
math.asin | | Double | Double |
math.acos | | Double | Double |
math.atan | | Double | Double |
math.sinh | | Double | Double |
math.cosh | | Double | Double |
math.tanh | | Double | Double |
math.asinh | | Double | Double |
math.acosh | | Double | Double |
math.atanh | | Double | Double |
Dual argument standard mathematical expressions | |||
Name | Description | Arguments | Result |
math.pow | Return X^Y. | Double, Double | Double |
math.hypot | Return the length of the hypotenuse given X and Y: sqrt(X^2 + Y^2) | Double, Double | Double |
request    ::= group [ "where" "(" ( "true" | "$query" ) ")" ]
group      ::= ( "all" | "each") "(" operations ")" [ "as" "(" identifier ")" ]
operations ::= [ "group" "(" expression ")" ]
               ( ( "alias" "(" identifier "," expression ")" ) |
                 ( "max" "(" ( number | "inf" ) ")" ) |
                 ( "order" "(" expList | aggrList ")" ) |
                 ( "output" "(" aggrList ")" ) |
                 ( "precision" "(" number ")" ) )*
               group*
aggrList   ::= aggr ( "," aggr )*
aggr       ::= ( ( "count" "(" ")" ) |
                 ( "sum" "(" exp ")" ) |
                 ( "avg" "(" exp ")" ) |
                 ( "max" "(" exp ")" ) |
                 ( "min" "(" exp ")" ) |
                 ( "xor" "(" exp ")" ) |
                 ( "stddev" "(" exp ")" ) |
                 ( "summary" "(" [ identifier ] ")" ) )
               [ "as" "(" identifier ")" ]
expList    ::= exp ( "," exp )*
exp        ::= ( "+" | "-") ( "$" identifier [ "=" math ] ) | ( math ) | ( aggr )
math       ::= value [ ( "+" | "-" | "*" | "/" | "%" ) value ]
value      ::= ( "(" exp ")" ) |
               ( "add" "(" expList ")" ) |
               ( "and" "(" expList ")" ) |
               ( "cat" "(" expList ")" ) |
               ( "div" "(" expList ")" ) |
               ( "docidnsspecific" "(" ")" ) |
               ( "fixedwidth" "(" exp "," number ")" ) |
               ( "interpolatedlookup" "(" attributeName "," exp ")") |
               ( "math" "." ( ( "exp" | "log" | "log1p" | "log10" | "sqrt" | "cbrt" |
                               "sin" | "cos" | "tan" | "asin" | "acos" | "atan" |
                               "sinh" | "cosh" | "tanh" | "asinh" | "acosh" | "atanh" ) "(" exp ")" |
                             ( "pow" | "hypot" ) "(" exp "," exp ")" )) |
               ( "max" "(" expList ")" ) |
               ( "md5" "(" exp "," number "," number ")" ) |
               ( "min" "(" expList ")" ) |
               ( "mod" "(" expList ")" ) |
               ( "mul" "(" expList ")" ) |
               ( "or" "(" expList ")" ) |
               ( "predefined" "(" exp "," "(" bucket ( "," bucket )* ")" ")" ) |
               ( "reverse" "(" exp ")" ) |
               ( "relevance" "(" ")" ) |
               ( "sort" "(" exp ")" ) |
               ( "strcat" "(" expList ")" ) |
               ( "strlen" "(" exp ")" ) |
               ( "size" "(" exp")" ) |
               ( "sub" "(" expList ")" ) |
               ( "time" "." ( "date" | "year" | "monthofyear" | "dayofmonth" | "dayofyear" |
                              "dayofweek" | "hourofday" | "minuteofhour" | "secondofminute" ) "(" exp ")" ) |
               ( "todouble" "(" exp ")" ) |
               ( "tolong" "(" exp ")" ) |
               ( "tostring" "(" exp ")" ) |
               ( "toraw" "(" exp ")" ) |
               ( "uca" "(" exp "," string [ "," string ] ")" ) |
               ( "xor" "(" expList ")" ) |
               ( "xorbit" "(" exp "," number ")" ) |
               ( "zcurve" "." ( "x" | "y" ) "(" exp ")" ) |
               ( attributeName "." "at" "(" number ")") |
               ( attributeName )
bucket     ::= "bucket" ( "(" | "[" | "<" )
               ( "-inf" | rawvalue | number | string )
               [ "," ( "inf" | rawvalue | number | string ) ]
               ( ")" | "]" | ">" )
rawvalue   ::= "{" ( ( string | number ) "," )* "}"