Query result grouping reference

Refer to the grouping guide for an introduction to Vespa's grouping feature.

  • Group query results using a custom expression (using the group clause):
    • A numerical constant.
    • A document attribute.
    • A function over another expression (xorbit, md5, cat, xor, and, or, add, sub, mul, div, mod) or any other expression.
    • The data type of an expression is resolved on a best-effort basis, similar to how common programming languages resolve arithmetic on operands of different data types.
    • The results of any expression are either scalar or single dimension arrays.
      • add(<array>) adds all elements together to produce a scalar.
      • add(<arrayA>, <arrayB>) adds each element together producing a new array whose size is max(|<arrayA>|, |<arrayB>|).
  • Groups can contain subgroups (by using each and group operations), and may be nested to any level.
  • Each level of grouping specifies a set of aggregates to collect for all documents that belong to that group (using the output operation):
    • The documents in a group, retrieved using a specified summary class.
    • The count of documents in a group.
    • The sum, average, min, max, xor or standard deviation of an expression.
  • Each level of grouping may specify how to order its groups (using the order operation):
    • Ordering can be done using any of the available aggregates.
    • Multi-level grouping allows strict ordering where primary aggregates may be equal.
    • Ordering is either ascending or descending, specified per level of ordering.
  • You may limit the number of groups returned for each level (using the max operation), returning only the first n groups as specified by the order operation.
  • You may count the number of unique groups for a level using the count aggregator. Note that count operates independently of the max clause.
  • You may paginate through group- and hit-lists using the "continuations" query parameter.
  • You may group on multivalued attributes. Most grouping functions will just handle the elements of multivalued attributes separately, as if they were all individual values in separate documents.
  • The interpolatedlookup function will count elements in a sorted array that are less than an expression, with linear interpolation if the expression is between element values.
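
To illustrate the multivalued-attribute behaviour described in the list above, here is a minimal Python sketch that counts group membership when each element of a multivalued attribute is treated as a separate value. The document layout and the attribute name "tags" are hypothetical, and the sketch only mimics the described semantics; it is not how Vespa implements grouping.

```python
from collections import Counter


def count_by_multivalue(docs, attribute):
    """Count hits per group, treating every element of a multivalued
    attribute as if it were a single value in a separate document."""
    counts = Counter()
    for doc in docs:
        for element in doc.get(attribute, []):
            counts[element] += 1
    return dict(counts)


# Hypothetical documents with an array attribute named "tags".
docs = [
    {"tags": ["a", "b"]},
    {"tags": ["b"]},
    {"tags": ["b", "c"]},
]
group_counts = count_by_multivalue(docs, "tags")
print(group_counts)
```

Note that a document with two elements contributes to two groups, so the per-group counts can add up to more than the number of documents.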

Select parameter language grammar

request    ::= group [ "where" "(" ( "true" | "$query" ) ")" ]
group      ::= ( "all" | "each") "(" operations ")" [ "as" "(" identifier ")" ]
operations ::= [ "group" "(" expression ")" ]
               ( ( "alias" "(" identifier "," expression ")" ) |
                 ( "max"   "(" number ")" ) |
                 ( "order" "(" expList | aggrList ")" ) |
                 ( "output" "(" aggrList ")" ) |
                 ( "precision" "(" number ")" ) )*
aggrList   ::= aggr ( "," aggr )*
aggr       ::= ( ( "count" "(" ")" ) |
                 ( "sum" "(" exp ")" ) |
                 ( "avg" "(" exp ")" ) |
                 ( "max" "(" exp ")" ) |
                 ( "min" "(" exp ")" ) |
                 ( "xor" "(" exp ")" ) |
                 ( "stddev" "(" exp ")" ) |
                 ( "summary" "(" [ identifier ] ")" ) )
               [ "as" "(" identifier ")" ]
expList    ::= exp ( "," exp )*
exp        ::= ( "+" | "-") ( "$" identifier [ "=" math ] ) | ( math ) | ( aggr )
math       ::= value [ ( "+" | "-" | "*" | "/" | "%" ) value ]
value      ::= ( "(" exp ")" ) |
               ( "add" "(" expList ")" ) |
               ( "and" "(" expList ")" ) |
               ( "cat" "(" expList ")" ) |
               ( "div" "(" expList ")" ) |
               ( "docidnsspecific" "(" ")" ) |
               ( "fixedwidth" "(" exp "," number ")" ) |
               ( "interpolatedlookup" "(" attributeName "," exp ")") |
               ( "math" "." (
                                "exp" | "log" | "log1p" | "log10" | "sqrt" | "cbrt" |
                                "sin" | "cos" | "tan" | "asin" | "acos" | "atan" |
                                "sinh" | "cosh" | "tanh" | "asinh" | "acosh" | "atanh"
                              ) "(" exp ")" |
                              ( "pow" | "hypot" ) "(" exp "," exp ")"
                            )) |
               ( "max" "(" expList ")" ) |
               ( "md5" "(" exp "," number "," number ")" ) |
               ( "min" "(" expList ")" ) |
               ( "mod" "(" expList ")" ) |
               ( "mul" "(" expList ")" ) |
               ( "or" "(" expList ")" ) |
               ( "predefined" "(" exp "," "(" bucket ( "," bucket )* ")" ")" ) |
               ( "reverse" "(" exp ")" ) |
               ( "relevance" "(" ")" ) |
               ( "sort" "(" exp ")" ) |
               ( "strcat" "(" expList ")" ) |
               ( "strlen" "(" exp ")" ) |
               ( "size" "(" exp")" ) |
               ( "sub" "(" expList ")" ) |
               ( "time" "." ( "year" | "monthofyear" | "dayofmonth" | "dayofyear" | "dayofweek" |
                              "hourofday" | "minuteofhour" | "secondofminute" ) "(" exp ")" ) |
               ( "todouble" "(" exp ")" ) |
               ( "tolong" "(" exp ")" ) |
               ( "tostring" "(" exp ")" ) |
               ( "toraw" "(" exp ")" ) |
               ( "uca" "(" exp "," string [ "," string ] ")" ) |
               ( "xor" "(" expList ")" ) |
               ( "xorbit" "(" exp "," number ")" ) |
               ( "ymum" "(" ")" ) |
               ( "zcurve" "." ( "x" | "y" ) "(" exp ")" ) |
               ( attributeName "." "at" "(" number ")") |
               ( attributeName )
bucket     ::= "bucket" ( "(" | "[" | "<" )
                        ( "-inf" | rawvalue | number | string )
                        [ "," ( "inf" | rawvalue | number | string ) ]
                        ( ")" | "]" | ">" )
rawvalue   ::= "{" ( ( string | number ) "," )* "}"

Output format

When grouping results, groups that contain outputs, group lists, and hit lists are generated. Group lists contain sub-groups, and hit lists contain hits that are part of the owning group.

The identity of a group is held by its id. Scalar identities such as long, double and string, are directly available from the id, whereas range identities used for bucket aggregation are separated into the sub-nodes from and to. Refer to the result format reference.

Continue parameter

Pagination of grouping results is managed by "continuations". These are opaque objects that can be combined and re-submitted using the "continuations" annotation on the grouping step of the query to move to the previous or next page in a result list.

All root groups contain a single "this" continuation. That continuation represents the current view, and if submitted as the sole continuation it will reproduce the exact same result as the one that contained it. Other named continuations are available in the result, and these can be appended to the "this" continuation to perform the corresponding pagination operation. E.g. the "next" continuation of a group list can be used to move to the next page of groups in that list.

Any number of continuations can be combined in a query, but the first must always be the "this" continuation. E.g. you may simultaneously move both to the next page of one list, and the previous page of another.

If more than one continuation object is provided for the same group- or hit-list, the one given last is the one that takes effect. This is because continuations are processed in the order given, and they replace whatever continuations they collide with.

If working programmatically with grouping, you will find the Continuation objects within RootGroup, GroupList and HitList result objects. These can then be added back into the continuation list of the GroupingRequest to paginate.

Here is an example of a query that provides a continuation to the grouping statement:

/search/?yql=select (…) | [{ 'continuations':['BGAAABEBCA'] }]all(…);
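
As a sketch of how a client might assemble the annotation shown above, the following Python helper puts the "this" continuation first and appends any further continuations in the order given. The helper name and the second continuation token are hypothetical; real tokens are opaque strings copied from a previous result.

```python
def continuations_annotation(this_continuation, *others):
    """Build the continuations annotation for a grouping step.

    The "this" continuation must come first; further continuations
    (for example a "next" taken from a group list) follow in order.
    """
    if not this_continuation:
        raise ValueError('the "this" continuation is required and must come first')
    tokens = [this_continuation, *others]
    quoted = ", ".join("'" + token + "'" for token in tokens)
    return "[{ 'continuations':[" + quoted + "] }]"


# "BGAAABEBCA" is the token from the example above; the second token is
# a made-up placeholder for a "next" continuation.
annotation = continuations_annotation("BGAAABEBCA", "BGAAABEBEBC")
print(annotation)
```

Because later continuations replace earlier ones for the same list, the order of the extra arguments matters when two of them refer to the same group- or hit-list.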


Group list aggregators

count: Counts the number of unique groups (as produced by the group clause). Arguments: None. Result: Long.

Group aggregators

count: Increments a long counter every time it is invoked. Arguments: None. Result: Long.
sum: Sums the argument over all selected documents. Arguments: Numeric. Result: Numeric.
avg: Computes the average over all selected documents. Arguments: Numeric. Result: Numeric.
min: Keeps the minimum value of selected documents. Arguments: Numeric. Result: Numeric.
max: Keeps the maximum value of selected documents. Arguments: Numeric. Result: Numeric.
xor: XOR the values (their least significant 64 bits) of all selected documents. Arguments: Any. Result: Long.
stddev: Computes the population standard deviation over all selected documents. Arguments: Numeric. Result: Double.

Hit aggregators

summary: Produces a summary of the requested summary class. Arguments: Name of summary class. Result: Summary.

When all arguments are numeric, the result type is resolved by looking at the argument types. If all arguments are longs, the result is a long. If at least one argument is a double, the result is a double.

When using order(), aggregators can also be used in expressions, in order to get increased control over group sorting. This does not work with expressions that take attributes as an argument, unless the expression is enclosed within an aggregator.
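
The type rule above can be written down as a small Python sketch. The function name and the string labels "long" and "double" are illustrative choices only, not part of any Vespa API.

```python
def resolve_numeric_result_type(argument_types):
    """Return "double" if any argument is a double, otherwise "long"."""
    allowed = {"long", "double"}
    unknown = set(argument_types) - allowed
    if unknown:
        raise ValueError("not a numeric type: " + ", ".join(sorted(unknown)))
    return "double" if "double" in argument_types else "long"


print(resolve_numeric_result_type(["long", "long"]))
print(resolve_numeric_result_type(["long", "double", "long"]))
```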


Arithmetic expressions

add: Add the arguments together. Arguments: Numeric+. Result: Numeric.
+: Add left and right argument. Arguments: Numeric, Numeric. Result: Numeric.
mul: Multiply the arguments together. Arguments: Numeric+. Result: Numeric.
*: Multiply left and right argument. Arguments: Numeric, Numeric. Result: Numeric.
sub: Subtract second argument from first, third from result, etc. Arguments: Numeric+. Result: Numeric.
-: Subtract right argument from left. Arguments: Numeric, Numeric. Result: Numeric.
div: Divide first argument by second, result by third, etc. Arguments: Numeric+. Result: Numeric.
/: Divide left argument by right. Arguments: Numeric, Numeric. Result: Numeric.
mod: Modulo first argument by second, result by third, etc. Arguments: Numeric+. Result: Numeric.
%: Modulo left argument by right. Arguments: Numeric, Numeric. Result: Numeric.
neg: Negate argument. Arguments: Numeric. Result: Numeric.
-: Negate right argument. Arguments: Numeric. Result: Numeric.
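
The "first by second, result by third" wording for sub, div and mod describes a left fold over the argument list. A minimal Python sketch of that evaluation order follows. It uses Python's own operators, so details such as integer versus floating-point division and the sign of modulo results for negative operands are assumptions here and may differ from Vespa.

```python
from functools import reduce
import operator


def sub(*args):
    """((a - b) - c) - ..."""
    return reduce(operator.sub, args)


def div(*args):
    """((a / b) / c) / ... using floating-point division."""
    return reduce(operator.truediv, args)


def mod(*args):
    """((a % b) % c) % ..."""
    return reduce(operator.mod, args)


print(sub(10, 3, 2))   # (10 - 3) - 2
print(div(100, 5, 2))  # (100 / 5) / 2
print(mod(17, 5, 2))   # (17 % 5) % 2
```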

Bitwise expressions

and: AND the arguments in order. Arguments: Long+. Result: Long.
or: OR the arguments in order. Arguments: Long+. Result: Long.
xor: XOR the arguments in order. Arguments: Long+. Result: Long.

String expressions

strlen: Count the number of bytes in argument. Arguments: String. Result: Long.
strcat: Concatenate arguments in order. Arguments: String+. Result: String.

Type conversion expressions

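todouble: Convert argument to double. Arguments: Any. Result: Double.
tolong: Convert argument to long. Arguments: Any. Result: Long.
tostring: Convert argument to string. Arguments: Any. Result: String.
toraw: Convert argument to raw. Arguments: Any. Result: Raw.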

Raw data expressions

cat: Concatenate the binary representation of the arguments. Arguments: Any+. Result: Raw.
md5: Does an md5 over the binary representation of the argument, and keeps the lowest 'width' bits. Arguments: Any, Numeric(width). Result: Raw.
xorbit: Does an xor of 'width' bits over the binary representation of the argument. Width is rounded up to a multiple of 8. Arguments: Any, Numeric(width). Result: Raw.

Accessor expressions

relevance: Return the computed rank of a document. Arguments: None. Result: Double.
docidnsspecific: Return the docid without namespace. Applies only to streaming search. Arguments: None. Result: String.
ymum: Return the ymum part of docid. Applies only to streaming search. Arguments: None. Result: Long.
<attribute-name>: Return the value of the named attribute. Arguments: None. Result: Any.

Bucket expressions

fixedwidth: Maps the value of the first argument into consecutive buckets whose width equals the second argument. Arguments: Any, Numeric. Result: NumericBucketList.
predefined: Maps the value of the first argument into the given buckets. Arguments: Any, Bucket+. Result: BucketList.
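
As a rough model of fixedwidth, the Python sketch below maps a value to the bucket of the given width that contains it. It assumes that buckets are aligned at zero and that a bucket includes its lower bound but not its upper bound; neither assumption is stated in this reference, so treat the sketch as an illustration only.

```python
import math


def fixedwidth_bucket(value, width):
    """Return the (from, to) pair of the width-sized bucket holding value.

    Assumes buckets start at multiples of width and are half-open,
    that is from <= value < to.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    start = math.floor(value / width) * width
    return (start, start + width)


for n in (0, 2, 3, 7):
    print(n, fixedwidth_bucket(n, 3))
```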

Time expressions

Use the query parameter "timezone" to set the timezone to use when running these expressions, e.g. &timezone=GMT-1. See Sun's documentation on TimeZone for format reference.

time.dayofmonth: Returns the day of month (1-31) for the given timestamp. Arguments: Long. Result: Long.
time.dayofweek: Returns the day of week (0-6) for the given timestamp, Monday being 0. Arguments: Long. Result: Long.
time.dayofyear: Returns the day of year (0-365) for the given timestamp. Arguments: Long. Result: Long.
time.hourofday: Returns the hour of day (0-23) for the given timestamp. Arguments: Long. Result: Long.
time.minuteofhour: Returns the minute of hour (0-59) for the given timestamp. Arguments: Long. Result: Long.
time.monthofyear: Returns the month of year (1-12) for the given timestamp. Arguments: Long. Result: Long.
time.secondofminute: Returns the second of minute (0-59) for the given timestamp. Arguments: Long. Result: Long.
time.year: Returns the full year (e.g. 2009) of the given timestamp. Arguments: Long. Result: Long.
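
To make the value ranges above concrete, this Python sketch derives the same components from a timestamp. It assumes the timestamp is in seconds since the Unix epoch and models the timezone as a fixed UTC offset in hours; both are assumptions for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def time_parts(timestamp_seconds, utc_offset_hours=0):
    """Split a timestamp into the components listed above."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    t = datetime.fromtimestamp(timestamp_seconds, tz)
    return {
        "year": t.year,
        "monthofyear": t.month,                  # 1-12
        "dayofmonth": t.day,                     # 1-31
        "dayofyear": t.timetuple().tm_yday - 1,  # shifted to start at 0
        "dayofweek": t.weekday(),                # Monday is 0
        "hourofday": t.hour,
        "minuteofhour": t.minute,
        "secondofminute": t.second,
    }


# 1234567890 is Friday 13 February 2009, 23:31:30 UTC.
parts = time_parts(1234567890)
print(parts)
```

Passing utc_offset_hours=-1 shifts the same instant to 22:31:30, which is why the timezone parameter can change which bucket a document lands in.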

List expressions

size: Return the number of elements in the argument if it is a list. If not, return 1. Arguments: Any. Result: Long.
sort: Sort the elements in the argument in ascending order if the argument is a list. If not, it is a NOP. Arguments: Any. Result: Any.
reverse: Reverse the elements in the argument if the argument is a list. If not, it is a NOP. Arguments: Any. Result: Any.

Other expressions

zcurve.x: Returns the X component of the given zcurve encoded 2d point. All fields of type "position" have an accompanying "<fieldName>_zcurve" attribute that can be decoded using this expression, e.g. zcurve.x(foo_zcurve). Arguments: Long. Result: Long.
zcurve.y: Returns the Y component of the given zcurve encoded 2d point. Arguments: Long. Result: Long.
uca: Converts the attribute string using the Unicode collation algorithm, useful for sorting. Arguments: Any, Locale(String), Strength(String). Result: Raw.

Single argument standard mathematical expressions

These are the standard mathematical functions as found in the Java Math class. Each takes one Double argument and returns a Double.

math.exp, math.log, math.log1p, math.log10, math.sqrt, math.cbrt
math.sin, math.cos, math.tan, math.asin, math.acos, math.atan
math.sinh, math.cosh, math.tanh, math.asinh, math.acosh, math.atanh

Dual argument standard mathematical expressions

A few other convenient expressions are also implemented. math.hypot is particularly useful for geometrical distance calculations.

math.pow: Return X^Y. Arguments: Double, Double. Result: Double.
math.hypot: Return the length of the hypotenuse given X and Y, i.e. sqrt(X^2 + Y^2). Arguments: Double, Double. Result: Double.


TopN / Full corpus

Simple grouping where you count the number of documents in each group:

/search/?yql=select (…) | all(group(a) each(output(count())));

Two parallel groupings:

/search/?yql=select (…) | all(all(group(a) each(output(count())))
                              all(group(b) each(output(count()))));

Only the 1000 best hits will be grouped at each backend node. This gives lower accuracy but higher speed:

/search/?yql=select (…) | all(max(1000) all(group(a) each(output(count()))));
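
The accuracy trade-off of grouping only the best hits can be seen in a small Python sketch. The hit structure and the "relevance" field name are hypothetical, and the sketch models a single node; it only illustrates why counts can come out lower when a max is applied before grouping.

```python
from collections import Counter


def group_top_hits(hits, attribute, max_hits):
    """Count hits per group, looking only at the max_hits best hits."""
    best = sorted(hits, key=lambda h: h["relevance"], reverse=True)[:max_hits]
    return dict(Counter(h[attribute] for h in best))


# Hypothetical hits with attribute "a" and a relevance score.
hits = [
    {"a": "x", "relevance": 0.9},
    {"a": "y", "relevance": 0.8},
    {"a": "x", "relevance": 0.4},
    {"a": "y", "relevance": 0.1},
]
full = group_top_hits(hits, "a", len(hits))
approx = group_top_hits(hits, "a", 3)
print(full, approx)
```

With the limit set to 3, the lowest-ranked hit is never seen, so group "y" is counted once instead of twice.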

In streaming search you may also group all searched documents by adding a where(true) clause:

/search/?yql=select (…) | all(group(a) each(output(count()))) where(true);

Selecting groups

Perform a modulo 5 operation before selecting the group you want:

/search/?yql=select (…) | all(group(a % 5) each(output(count())));

Perform a + b * c before selecting the group you want:

/search/?yql=select (…) | all(group(a + b * c) each(output(count())));

Grouping on maps

For streaming search, the field path syntax may also be used when searching, which enables structs and maps to be searched. The following syntax can be used when grouping maps, and will create a group for values whose keys match "foo":

/search/?yql=select (…) | all(group(mymap{"foo"}) each(output(count())));

Locale aware sorting

Groups are sorted using locale aware sorting, with the default and primary strength values, respectively:

/search/?yql=select (…) | all(group(s) order(max(uca(s, "sv"))));
/search/?yql=select (…) | all(group(s) order(max(uca(s, "sv", "PRIMARY"))));

Ordering groups

Perform a modulo 5 operation before selecting the group you want. The groups are then ordered by their aggregated sum of attribute "b":

/search/?yql=select (…) | all(group(a % 5) order(sum(b)));

Perform a + b * c before selecting the group you want. Ordering is given by the maximum value of attribute "d" in each group:

/search/?yql=select (…) | all(group(a + b * c) order(max(d)));

Order groups by the average relevance of their hits multiplied by the number of hits in the group, which gives a cumulative relevance per group:

/search/?yql=select (…) | all(group(a) order(avg(relevance()) * count()));

You cannot, however, directly reference an attribute in the order clause. The following is not valid:

/search/?yql=select (…) | all(group(a) order(attr * count()));

You can, however, enclose the attribute in an aggregator:

/search/?yql=select (…) | all(group(a) order(max(attr) * count()));

Collecting aggregates

Simple grouping where you count the number of documents in each group and return the best hit in each group:

/search/?yql=select (…) |
             all(group(a) each(max(1) output(count()) each(output(summary()))));

Also return the sum of attribute "b":

/search/?yql=select (…) |
             all(group(a) each(max(1) output(count(), sum(b)) each(output(summary()))));

Also return an xor of the 64 most significant bits of an md5 over the concatenation of attributes "a", "b" and "c":

/search/?yql=select (…) |
             all(group(a) each(max(1) output(count(), sum(b), xor(md5(cat(a, b, c), 64))) each(output(summary()))));

Predefined buckets

Group on predefined buckets for a raw attribute, and use infinity to make sure the buckets cover the whole possible range:

/search/?yql=select (…) |
             all(group(predefined(r, bucket(-inf, {0, 'a', 3}), bucket({1, 'u', 4}, inf))));

Standard mathematical start and end specifiers may be used to define the width of a bucket. The "(" and ")" evaluate to "[" and ">" by default. Here, one bucket is defined so that it matches only one exact value, and the other buckets use different width specifiers:

/search/?yql=select (…) |
             all(group(predefined(r, bucket[-inf, "bar">, bucket["bar"], bucket<"bar", inf])));
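
The start and end specifiers can be read as inclusive ("[" and "]") or exclusive ("<" and ">") bounds. The Python sketch below models bucket membership under that reading, with the default matching the "[" and ">" behaviour described above. The function name is hypothetical, and numbers are used instead of strings to keep the infinity bounds simple.

```python
INF = float("inf")


def in_bucket(value, low, high, low_inclusive=True, high_inclusive=False):
    """Return True if value falls inside the bucket.

    The defaults model bucket(low, high), read as [low, high>.
    """
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below


# Numeric analogue of bucket[-inf, 5>, bucket[5], bucket<5, inf].
buckets = [
    ("below", lambda v: in_bucket(v, -INF, 5)),
    ("exact", lambda v: in_bucket(v, 5, 5, high_inclusive=True)),
    ("above", lambda v: in_bucket(v, 5, INF, low_inclusive=False, high_inclusive=True)),
]


def bucket_of(value):
    """Return the name of the first bucket that contains value."""
    for name, contains in buckets:
        if contains(value):
            return name
    return None


print(bucket_of(4.9), bucket_of(5), bucket_of(5.1))
```

Because the three buckets use complementary bounds, every value falls into exactly one of them.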


Single level grouping on "a" attribute, returning at most 5 groups with full hit count as well as the 69 best hits.

/search/?yql=select (…) |
             all(group(a) max(5) each(max(69) output(count()) each(output(summary()))));

Two level grouping on "a" and "b" attribute:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(group(b) max(5) each(max(69) output(count()) each(output(summary()))))));

Three level grouping on "a", "b" and "c" attribute:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(group(b) max(5) each(output(count())
                     all(group(c) max(5) each(max(69) output(count()) each(output(summary()))))))));

As above example, but also collect best hit in level 2:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(group(b) max(5) each(output(count())
                     all(max(1) each(output(summary())))
                     all(group(c) max(5) each(max(69) output(count()) each(output(summary()))))))));

As above example, but also collect best hit in level 1:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(max(1) each(output(summary())))
                 all(group(b) max(5) each(output(count())
                     all(max(1) each(output(summary())))
                     all(group(c) max(5) each(max(69) output(count()) each(output(summary()))))))));

As above example, but using different document summaries on each level:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(max(1) each(output(summary(complexsummary))))
                 all(group(b) max(5) each(output(count())
                     all(max(1) each(output(summary(simplesummary))))
                     all(group(c) max(5) each(max(69) output(count()) each(output(summary()))))))));

Group on fixed width buckets for numeric attribute, then on "a" attribute, count hits in leaf nodes:

/search/?yql=select (…) |
             all(group(fixedwidth(n, 3)) each(group(a) max(2) each(output(count()))));

As above example, but limiting groups in level 1, and returning hits from level 2:

/search/?yql=select (…) |
             all(group(fixedwidth(n, 3)) max(5) each(group(a) max(2) each(output(count()) each(output(summary())))));

Deep grouping with counting and hit collection on all levels:

/search/?yql=select (…) |
             all(group(a) max(5) each(output(count())
                 all(max(1) each(output(summary())))
                 all(group(b) each(output(count())
                     all(max(1) each(output(summary())))
                     all(group(c) each(output(count())
                         all(max(1) each(output(summary())))))))));

Time aware grouping

Group by year:

/search/?yql=select (…) |
             all(group(time.year(a)) each(output(count())));

Group by year, then by month:

/search/?yql=select (…) |
             all(group(time.year(a)) each(output(count())
                 all(group(time.monthofyear(a)) each(output(count())))));

Group by year, then by month, then day, then by hour:

/search/?yql=select (…) |
             all(group(time.year(a)) each(output(count())
                 all(group(time.monthofyear(a)) each(output(count())
                     all(group(time.dayofmonth(a)) each(output(count())
                         all(group(time.hourofday(a)) each(output(count())))))))));

Group into today, yesterday, last week, and last month using the predefined expression, and group each day within each of these separately:

/search/?yql=select (…) |
             all(group(predefined((now() - a) / (60 * 60 * 24),
                                  bucket(0,1), bucket(1,2),
                                  bucket(3,7), bucket(8,31))) each(output(count())
                 all(max(2) each(output(summary())))
                     all(group((now() - a) / (60 * 60 * 24)) each(output(count())
                         all(max(2) each(output(summary())))))));

Counting unique groups

The count aggregator can be applied to a list of groups to determine the number of unique groups without having to explicitly retrieve all groups. Another use case for this aggregator is counting the number of unique instances matching a given expression.

The following query outputs the number of groups, which is equivalent to the number of unique values for attribute "a".

/search/?yql=select (…) |
             all(group(a) output(count()))

The following query outputs the number of unique string lengths for the attribute "name".

/search/?yql=select (…) |
             all(group(strlen(name)) output(count()))

The following query outputs the sum of the "b" attribute for each group in addition to the overall group count.

/search/?yql=select (…) |
             all(group(a) output(count()) each(output(sum(b))))

The max clause is used to restrict the number of groups returned. The query outputs the sum for the 3 best groups. The count clause outputs the actual number of groups (potentially >3).

/search/?yql=select (…) |
             all(group(a) max(3) output(count()) each(output(sum(b))))

The following query outputs the number of top level groups, and for the 10 best groups, outputs the number of unique values for attribute "b".

/search/?yql=select (…) |
             all(group(a) max(10) output(count()) each(group(b) output(count())))

Using the grouping session cache

With multi-level grouping expressions, the search query is normally re-run for each level. The drawback is that with an expensive ranking function, the query takes more time than strictly necessary.

To avoid this, you can set the groupingSessionCache query flag. This causes the query and grouping expression to be performed only once.

However, the flag is only useful if the following conditions are met.

  • The grouping expression does not have the order() clause.
  • The grouping expression has at least two levels (i.e. group(a) each(group(b) each()….)).

The drawback of using this flag is that when max() is specified in the grouping expression, it might cause inaccuracies in aggregated values such as count(). We therefore recommend that you test whether or not this is an issue for your queries, and (if it is an issue) adjust the precision parameter to still get correct counts.

Grouping of multivalue attributes

Some grouping operators may be used with multivalue attributes. Note that using a multivalued attribute (such as an array of doubles) in a grouping expression is likely to have a large, adverse impact on performance, particularly if the set of hits to be processed is large, since it means a large amount of data is streamed through the CPU. Such operations are therefore likely to hit a bottleneck on memory bandwidth.


For streaming search, multi-value fields such as maps, arrays etc. can be used for grouping. However, using aggregation functions such as sum() on such fields can give misleading results. Assume a map from strings to integers, where the strings are some sort of key you wish to use for grouping. The following expression will provide the sum of the values for all keys:

/search/?yql=select (…) | all(group(mymap.key) each(output(sum(mymap.value))));

and not the sum of the values within each key, as one would expect. It is still, however, possible to run the following expression to get the sum of values within a specific key:

/search/?yql=select (…) | all(group(mymap{"foo"}) each(output(sum(mymap.value))));

This syntax is map-specific. It does NOT apply to weighted sets.

Using sum, max, etc on a multivalued attribute

Doing an operation such as output(sum(myarray)) will run the sum over each element value in each document. The result is the sum of sums of values. Similarly max(myarray) will yield the maximal element over all elements in all documents, and so on.
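
A minimal Python sketch of the "sum of sums" behaviour described above, using hypothetical documents with an array attribute named "myarray":

```python
def sum_multivalue(docs, attribute):
    """Sum every element of the attribute across all documents."""
    return sum(sum(doc[attribute]) for doc in docs)


def max_multivalue(docs, attribute):
    """Largest single element of the attribute across all documents."""
    return max(max(doc[attribute]) for doc in docs if doc[attribute])


# Hypothetical documents; the values are made up for illustration.
docs = [
    {"myarray": [1, 2, 3]},
    {"myarray": [10, 20]},
]
print(sum_multivalue(docs, "myarray"))
print(max_multivalue(docs, "myarray"))
```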

Array at: element access

The expression array.at(myarray, idx) will yield one value per document by evaluating the idx expression and using it as an index into the given array. The expression will be capped to the range [0, size(myarray)-1]. If it's larger than the array size you always get the last element, while if it's smaller than zero you always get the first element. This expression can then be used to build bigger expressions such as output(sum(array.at(myarray, 0))) which will sum the first element in the array of each document.
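
The index capping described above can be sketched in Python as follows. The function name is illustrative, and the behaviour for an empty array is not specified in this reference, so raising an error there is an assumption.

```python
def array_at(array, idx):
    """Return array[idx] with idx capped to the range [0, len(array) - 1]."""
    if not array:
        # Assumption: the reference does not say what happens here.
        raise ValueError("array.at is undefined for an empty array in this sketch")
    clamped = max(0, min(idx, len(array) - 1))
    return array[clamped]


values = [10, 20, 30]
print(array_at(values, 1))    # a regular index
print(array_at(values, 99))   # too large, yields the last element
print(array_at(values, -5))   # below zero, yields the first element
```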

Interpolated lookup (BETA)

The operation interpolatedlookup(myarray, expr) is intended for generic graph/function lookup. The data in myarray should be numerical values sorted in ascending order. The operation will then scan from the start of the array to find the position where the element values become equal to (or greater than) the value of the expr lookup argument, and return the index of that position. When the lookup argument's value is between two consecutive array element values, the returned position will be a linear interpolation between their respective indexes. The return value is always in the range [0, size(myarray)-1] of the legal index values for an array.

Consider an example where myarray is a sorted array of type array<double> in each document. The expression interpolatedlookup(myarray, 4.2) is then a per-document expression that first evaluates the lookup argument, here the constant expression 4.2, and then looks at the contents of myarray in the document. The scan starts at the first element and proceeds until it hits an element value equal to or greater than 4.2 in the array. This means that:

  • If the first element in the array is greater than 4.2, the expression returns 0.
  • If the first element in the array is exactly 4.2, the expression still returns 0.
  • If the first element in the array is 1.7 while the second element value is exactly 4.2, the expression returns 1.0 – the index of the second element.
  • If all the elements in the array are less than 4.2, the last valid array index size(myarray)-1 is returned.
  • If the 5 first elements in the array have values smaller than the lookup argument, and the lookup argument is halfway between the fifth and sixth element, a value of 4.5 is returned – halfway between the array indexes of the fifth and sixth elements.
  • Similarly, if the elements in the array are {0, 1, 2, 4, 8} then passing a lookup argument of "5" would return 3.25 (linear interpolation between indexOf(4)==3 and indexOf(8)==4).
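
The cases in the list above can be reproduced with a short Python sketch of the lookup. This is a re-implementation written from the description in this section, not Vespa's code, and the handling of an empty array is an assumption.

```python
def interpolated_lookup(array, lookup):
    """Return the (possibly fractional) index where lookup fits in a
    sorted array, clamped to the range [0, len(array) - 1]."""
    if not array:
        raise ValueError("empty array")  # assumption: not covered by the text
    if lookup <= array[0]:
        return 0.0
    for i in range(1, len(array)):
        if array[i] >= lookup:
            low, high = array[i - 1], array[i]
            # low < lookup <= high holds here, so high - low is never zero.
            return (i - 1) + (lookup - low) / (high - low)
    return float(len(array) - 1)


print(interpolated_lookup([5.0, 6.0], 4.2))
print(interpolated_lookup([1.7, 4.2], 4.2))
print(interpolated_lookup([1.0, 2.0, 3.0], 4.2))
print(interpolated_lookup([1, 2, 3, 4, 5, 7], 6))
print(interpolated_lookup([0, 1, 2, 4, 8], 5))
```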

Use case: Impression counting

If you have the impression logs for a specific user, you can make a function that maps from rankscore to the number of impressions an advertisement would get. So you would have a table like this:

Score   Integer (# impressions for this user)
0.200   0
0.210   1
0.220   2
0.240   3
0.320   4
0.420   5
0.560   6
0.700   7
0.800   8
0.880   9
0.920  10
0.940  11
0.950  12

Storing just the first column (the rank scores, including a rank score for 0 impressions) in an array attribute named "impressions", we could then use the grouping operation interpolatedlookup(impressions, relevance()) to figure out how many times a given advertisement would have been shown to this particular user. So if the rankscore is 0.420 for a specific user/ad/bid combination, then interpolatedlookup(impressions, relevance()) would return 5.0. If the bid is increased so the rankscore gets to 0.490, it would get 5.5 as the return value instead. In this context a count of 5.5 isn't meaningful for the past of a single user, but it gives more information that may be used as a forecast. Summing this across many different users may then be used to forecast the total of future impressions for the advertisement.
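
The numbers in this use case can be checked with a small Python sketch. The lookup function is re-implemented here from the description of interpolatedlookup, and the second user's score array is made up purely to show the summing step.

```python
from bisect import bisect_left


def interpolated_lookup(array, lookup):
    """Fractional index of lookup in a sorted array, clamped to valid indexes."""
    i = bisect_left(array, lookup)  # first index with array[i] >= lookup
    if i == 0:
        return 0.0
    if i == len(array):
        return float(len(array) - 1)
    low, high = array[i - 1], array[i]
    return (i - 1) + (lookup - low) / (high - low)


# First column of the table above: rank scores for 0..12 impressions.
impressions = [0.200, 0.210, 0.220, 0.240, 0.320, 0.420, 0.560,
               0.700, 0.800, 0.880, 0.920, 0.940, 0.950]

at_current_bid = interpolated_lookup(impressions, 0.420)
at_higher_bid = interpolated_lookup(impressions, 0.490)
print(at_current_bid, at_higher_bid)

# Hypothetical second user, to illustrate summing a forecast across users.
other_user = [0.100, 0.300, 0.500, 0.900]
forecast = sum(interpolated_lookup(a, 0.490) for a in (impressions, other_user))
print(forecast)
```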