# Grouping Information in Results

This document describes what grouping is and gives a few examples on how to use it. Refer to the Query API reference for how to set the select parameter, and the Grouping reference for details.

Fields used in grouping needs to be defined as attribute in the document schema.

The Vespa grouping language is a list processing language which describes how the query hits should be grouped, aggregated and presented in results. A grouping statement takes the list of all matches to a query as input and groups/aggregates it, possibly in multiple nested and parallel ways to produce the desired output. Of course, this is a logical specification, and does not indicate how it is executed, as instantiating the list of all matches to the query somewhere would be too expensive, and execution is distributed instead.

## The grouping language structure

The operations defining the structure of a grouping are:

• all(statement): Perform the nested statement once on the input list as a whole.
• each(statement): Perform the nested statement on each element of the input list.
• group(specification): Turn the input list into a list of list according to the grouping specification.
• output: Output some value(s) at the current location in the structure.

The parallel and nested collection of these operations defines both the structure of the computation and of the result it produces. For example, all(group(customer) each(output(count()))) will take all matches, group them by customer id, and for each group output the count of hits in the group.

Vespa distributes and executes the grouping program on content nodes, and merges results on container nodes - in multiple phases, as needed. As realizing such programs over a distributed data set requires more network round-trips than a regular search query, these queries may be more expensive than regular queries.

Grouping supports continuation objects for pagination.

## Grouping by example

For the entirety of this document, assume an index of engine part purchases:

Date Price Tax Item Customer
2006-09-06 09:00:00 $1 000 0.24 Intake valve Smith 2006-09-07 10:00:00$1 000 0.12 Rocker arm Smith
2006-09-07 11:00:00 $2 000 0.24 Spring Smith 2006-09-08 12:00:00$3 000 0.12 Valve cover Jones
2006-09-08 10:00:00 $5 000 0.24 Intake port Jones 2006-09-08 11:00:00$8 000 0.12 Head Brown
2006-09-09 12:00:00 $1 300 0.24 Coolant Smith 2006-09-09 10:00:00$2 100 0.12 Engine block Jones
2006-09-09 11:00:00 $3 400 0.24 Oil pan Brown 2006-09-09 12:00:00$5 500 0.12 Oil sump Smith
2006-09-10 10:00:00 $8 900 0.24 Camshaft Jones 2006-09-10 11:00:00$1 440 0.12 Exhaust valve Brown
2006-09-10 12:00:00 $2 330 0.24 Rocker arm Brown 2006-09-10 10:00:00$3 770 0.12 Spring Brown
2006-09-10 11:00:00 $6 100 0.24 Spark plug Smith 2006-09-11 12:00:00$9 870 0.12 Exhaust port Jones
2006-09-11 10:00:00 $1 597 0.24 Piston Brown 2006-09-11 11:00:00$2 584 0.12 Connection rod Smith
2006-09-11 12:00:00 $4 181 0.24 Rod bearing Jones 2006-09-11 13:00:00$6 765 0.12 Crankshaft Jones

## Basic Grouping

Example: Return the total sum of purchases per customer - steps:

1. Select all documents:
/search/?yql=select * from sources * where true
2. Take the list of all hits:
all(...)
3. Turn it into a list of lists of all hits having the same customer id:
group(customer)
4. For each of those lists of same-customer hits: each(...)
5. Output the sum (an aggregator) of the price over all items in that list of hits:
output(sum(price))

Final query, producing the sum of the price of all purchases for each customer:

/search/?yql=select * from sources * where true limit 0 |
all( group(customer) each(output(sum(price))) );


Here, limit is set to zero to get the grouping output only. URL encoded equivalent:

/search/?yql=select%20%2A%20from%20sources%20%2A%20where%20true%20limit%200%20%7C%20
all%28%20group%28customer%29%20each%28output%28sum%28price%29%29%29%20%29%3B


Result:

GroupId Sum(price)
Brown $20 537 Jones$39 816
Smith $19 484 Example: Sum price of purchases per date: select (…) | all(group(time.date(date)) each(output(sum(price))));  GroupId Sum(price) 2006-09-06$1 000
2006-09-07 $3 000 2006-09-08$16 000
2006-09-09 $12 300 2006-09-10$22 540
Smith $19 484 ## Hits per Group Use summary to print the fields for a hit, and max to limit the number of hits per group. Example: Return the three most expensive parts per customer: /search/?yql=select * from sources * where true | all(group(customer) each(max(3) each(output(summary())))); &ranking=pricerank  GroupId Brown Date Price Tax Item Customer 2006-09-08 11:00$8 000 0.12 Head Brown
2006-09-10 10:00 $3 770 0.12 Spring Brown 2006-09-09 11:00$3 400 0.24 Oil pan Brown
Jones
Date Price Tax Item Customer
2006-09-11 12:00 $9 870 0.12 Exhaust port Jones 2006-09-10 10:00$8 900 0.24 Camshaft Jones
2006-09-11 13:00 $6 765 0.12 Crankshaft Jones Smith Date Price Tax Item Customer 2006-09-10 11:00$6 100 0.24 Spark plug Smith
2006-09-09 12:00 $5 500 0.12 Oil sump Smith 2006-09-11 11:00$2 584 0.12 Connection rod Smith

Notes on ordering in the example above:

• The order clause is a directive for group ordering, not hit ordering. Here, there is no order clause on the groups, hence default ordering -max(relevance()) is used. The - denotes the sorting order, - means descending (higher score first). In this case, the query is "all documents", hence all groups are equally relevant and the group order is random.
• To order hits inside groups, use ranking. Add ranking=pricerank to the query to use the pricerank rank profile to rank by price:
rank-profile pricerank inherits default {
first-phase {
expression: attribute(price)
}
}

## Nested Groups

Groups can be nested. This offers great drilling capabilities, as there are no limits to nesting depth or presented information on any level. Example: How much each customer has spent per day by grouping on customer, then date:

select (…) | all(group(customer) each(group(time.date(date)) each(output(sum(price)))));

GroupId
Brown
GroupId Sum(price)
2006-09-08 $8 000 2006-09-09$3 400
2006-09-10 $7 540 2006-09-11$1 597
Jones
GroupId Sum(price)
2006-09-08 $8 000 2006-09-09$2 100
2006-09-10 $8 900 2006-09-11$20 816
Smith
GroupId Sum(price)
2006-09-06 $1 000 2006-09-07$3 000
2006-09-09 $6 800 2006-09-10$6 100
2006-09-11 $2 584 Use this to query for all items on a per-customer basis, displaying the most expensive hit for each customer, with subgroups of purchases on a per-date basis. Use the summary clause to show hits inside any group at any nesting level. Include the sum price for each customer, both as a grand total and broken down on a per-day basis: /search/?yql=select * from sources * where true| all(group(customer) each(max(1) output(sum(price)) each(output(summary()))) each(group(time.date(date)) each(max(10) output(sum(price)) each(output(summary()))))); &ranking=pricerank  GroupId sum(price) Brown$20 537
Date Price Tax Item Customer
2006-09-08 11:00 $8 000 0.12 Head Brown GroupId Sum(price) 2006-09-08$8 000
Date Price Tax Item Customer
2006-09-08 11:00 $8 000 0.12 Head Brown 2006-09-09$3 400
Date Price Tax Item Customer
2006-09-09 11:00 $3 400 0.12 Oil pan Brown 2006-09-10$7 540
Date Price Tax Item Customer
2006-09-10 10:00 $3 770 0.12 Spring Brown 2006-09-10 12:00$2 330 0.24 Rocker arm Brown
2006-09-10 11:00 $1 440 0.12 Exhaust valve Brown 2006-09-11$1 597
Date Price Tax Item Customer
2006-09-11 10:00 $1 597 0.24 Piston Brown Jones$39 816
Date Price Tax Item Customer
2006-09-11 12:00 $9 870 0.12 Exhaust port Jones GroupId Sum(price) 2006-09-08$8 000
Date Price Tax Item Customer
2006-09-08 10:00 $5 000 0.24 Intake port Jones 2006-09-08 12:00$3 000 0.12 Valve cover Jones
2006-09-09 $2 100 Date Price Tax Item Customer 2006-09-09 10:00$2 100 0,12 Engine block Jones
2006-09-10 $8 900 Date Price Tax Item Customer 2006-09-10 10:00$8 900 0.24 Camshaft Jones
2006-09-11 $20 816 Date Price Tax Item Customer 2006-09-11 12:00$9 870 0.12 Exhaust port Jones
2006-09-11 13:00 $6 765 0.12 Crankshaft Jones 2006-09-11 12:00$4 181 0.24 Rod bearing Jones
Smith $19 484 Date Price Tax Item Customer 2006-09-10 11:00$6 100 0.24 Spark plug Smith
GroupId Sum(price)
2006-09-06 $1 000 Date Price Tax Item Customer 2006-09-06 09:00$1 000 0.24 Intake valve Smith
2006-09-07 $3 000 Date Price Tax Item Customer 2006-09-07 11:00$2 000 0.24 Spring Smith
2006-09-07 10:00 $1 000 0.12 Rocker arm Smith 2006-09-09$6 800
Date Price Tax Item Customer
2006-09-09 12:00 $5 500 0.12 Oil sump Smith 2006-09-09 12:00$1 300 0.24 Coolant Smith
2006-09-10 $6 100 Date Price Tax Item Customer 2006-09-10 11:00$6 100 0.24 Spark plug Smith
2006-09-11 $2 584 Date Price Tax Item Customer 2006-09-11 11:00$2 584 0.12 Connection rod Smith

## Range grouping

In examples above, results are grouped on distinct values, like customer or date. To group on price:

select (…) | all(group(price) each(each(output(summary()))));


This gives one group per price. To group on price ranges, one could compress the price range. This gives prices in $0 -$999 in bucket 0, $1 000 -$2 000 in bucket 1 and so on:

select (…) | all(group(price/1000) each(each(output(summary()))));


An alternative is using bucket expressions - think of a bucket as the range per group. Group on price, make groups have a width of 1000:

select (…) | all(group(fixedwidth(price,1000)) each(each(output(summary()))));


Use predefined to configure group sizes individually (the two below are equivalent):

select (…) | all(group(predefined(price, bucket(0,1000), bucket(1000,2000), bucket(2000,5000), bucket(5000,inf))) each(each(output(summary()))));
select (…) | all(group(predefined(price, bucket[0,1000>, bucket[1000,2000>, bucket[2000,5000>, bucket[5000,inf>)) each(each(output(summary()))));


This works with strings as well - put Jones and Smith in the second group:

select (…) | all(group(predefined(customer, bucket(-inf,"Jones"), bucket("Jones", inf))) each(each(output(summary()))));


... or have Jones in his own group:

select (…) | all(group(predefined(customer, bucket<-inf,"Jones">, bucket["Jones"], bucket<"Jones", inf>)) each(each(output(summary()))));


## Pagination

Grouping supports continuation objects that are passed as annotations to the grouping statement. The continuations annotation is a list of zero or more continuation strings, returned in the grouping result. For example, given the result:

{
"root": {
"children": [
{
"children": [
{
"children": [
{
"fields": {
"count()": 7
},
"value": "Jones",
"id": "group:string:Jones",
"relevance": 1.0
}
],
"continuation": {
"next": "BGAAABEBEBC",
"prev": "BGAAABEABC"
},
"id": "grouplist:customer",
"label": "customer",
"relevance": 1.0
}
],
"continuation": {
"this": "BGAAABEBCA"
},
"id": "group:root:0",
"relevance": 1.0
}
],
"fields": {
"totalCount": 20
},
"id": "toplevel",
"relevance": 1.0
}
}


reproduce the same result by passing the this continuation along the original select:

select (…) | [{ 'continuations':['BGAAABEBCA'] }]all(…);


To display the next page of customers, pass the this continuation of the root group, and the next continuation of the customer list:

select (…) | [{ 'continuations':['BGAAABEBCA', 'BGAAABEBEBC'] }]all(…);


To display the previous page of customers, pass the this continuation of the root group, and the prev continuation of the customer list:

select (…) | [{ 'continuations':['BGAAABEBCA', 'BGAAABEABC'] }]all(…);


The continuations annotation is an ordered list of continuation strings. These are combined by replacement, so that a continuation given later will replace any shared state with a continuation given before. Also, when using the continuations annotation, always pass the this continuation as its first element.

## Expressions

Instead of just grouping on some attribute value, the group clause may contain arbitrarily complex expressions - see group in the grouping reference for an exhaustive list. Examples:

• Select the minimum or maximum of sub-expressions
• Addition, subtraction, multiplication, division, and even modulo of sub-expressions
• Bitwise operations on sub-expressions
• Concatenation of the results of sub-expressions

Sum the prices of purchases on per-hour-of-day basis:

select (…) | all(group(mod(div(date,mul(60,60)),24)) each(output(sum(price))));

GroupId sum(price)
09:00 $1 000 10:00$22 367
11:00 $23 524 12:00$26 181
13:00 $6 765 These types of expressions may also be used inside output operations, so instead of simply calculating the sum price of the grouped purchases, calculate the sum income after taxes per customer: select (…) | all(group(customer) each(output(sum(mul(price,sub(1,tax))))));  GroupId sum(mul(price,sub(1,tax))) Brown$17 193
Jones $32 868 Smith$15 897

Note that the validity of an expression depends on the current nesting level. E.g. while sum(price) would be a valid expression for a group of hits, price would not. As a general rule, each operator within an expression either applies to a single hit or aggregates values across a group.

## Search Container API

As an alternative to a textual representation, one can use the programmatic API to execute grouping requests. This allows multiple grouping requests to run in parallel, and does not collide with the yql parameter - example:

@Override
public Result search(Query query, Execution execution) {
// Create grouping request.
GroupingRequest request = GroupingRequest.newInstance(query);
request.setRootOperation(new AllOperation()
.setGroupBy(new AttributeValue(&quot;foo&quot;))

// Perform grouping request.
Result result = execution.search(query);

// Process grouping result.
Group root = request.getResultGroup(result);
GroupList foo = root.getGroupList(&quot;foo&quot;);
for (Hit hit : foo) {
Group group = (Group)hit;
Long count = (Long)group.getField(&quot;count&quot;);
// TODO: Process group and count.
}

// Pass results back to calling searcher.
return result;
}

Refer to the API documentation for the complete reference.

## More examples

### TopN / Full corpus

Simple grouping, count the number of documents in each group:

all( group(a) each(output(count())) );

Two parallel groupings:

all( all(group(a) each(output(count())))
all(group(b) each(output(count()))) );

Only the 1000 best hits will be grouped at each content node. Lower accuracy, but higher speed:

all( max(1000) all(group(a) each(output(count()))) );

In streaming search one can group all searched documents by adding a where(true) clause:

all( where(true) all(group(myfield) each(output(count()))) );

### Selecting groups

Do a modulo 5 operation before selecting the group:

all( group(a % 5) each(output(count())) );

Do a + b * c before selecting the group:

all( group(a + b * c) each(output(count())) );

### Ordering groups

Do a modulo 5 operation before selecting the group - the groups are then ordered by their aggregated sum of attribute "b":

all( group(a % 5) order(sum(b)) each(output(count())) );

Do a + b * c before selecting the group. Ordering is given by the maximum value of attribute "d" in each group:

all( group(a + b * c) order(max(d)) each(output(count())) );

Take the average relevance of the groups and multiply it with the number of groups to get a cumulative count:

all( group(a) order(avg(relevance()) * count()) each(output(count())) );

One can not directly reference an attribute in the order clause, as this:

all(group(a) order(attr * count()) each(output(count())) );

However, one can do this:

all(group(a) order(max(attr) * count()) each(output(count())) );

### Collecting aggregates

Simple grouping to count number of documents in each group and return the best hit in each group:

all( group(a) each(max(1) each(output(summary()))) );

Also return the sum of attribute "b":

all( group(a) each(max(1) output(count(), sum(b)) each(output(summary()))) );

Also return an XOR of the 64 most significant bits of an MD5 over the concatenation of attributes "a", "b" and "c":

all(group(a) each(max(1) output(count(), sum(b), xor(md5(cat(a, b, c), 64)))
each(output(summary()))));

### Grouping

Single level grouping on "a" attribute, returning at most 5 groups with full hit count as well as the 69 best hits.

all( group(a) max(5) each(max(69) output(count()) each(output(summary()))) );

Two level grouping on "a" and "b" attribute:

all( group(a) max(5) each(output(count())
all(group(b) max(5) each(max(69) output(count())
each(output(summary()))))) );

Three level grouping on "a", "b" and "c" attribute:

all( group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );

As above, but also collect best hit in level 2:

all( group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );

As above, but also collect best hit in level 1:

all( group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );

As above, but using different document summaries on each level:

all( group(a) max(5) each(output(count())
all(max(1) each(output(summary(complexsummary))))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary(simplesummary))))
all(group(c) max(5) each(max(69) output(count())
each(output(summary(fastsummary)))))) );

Deep grouping with counting and hit collection on all levels:

all( group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) each(output(count())
all(max(1) each(output(summary())))
all(group(c) each(output(count())
all(max(1) each(output(summary())))))))) );

### Time and date

Group by year:

all( group(time.year(a)) each(output(count())) );

Group by year, then by month:

all( group(time.year(a)) each(output(count())
all(group(time.month(a)) each(output(count())))) );

Group by year, then by month, then day, then by hour:

all( group(time.year(a)) each(output(count())
all(group(time.monthofyear(a)) each(output(count())
all(group(time.dayofmonth(a)) each(output(count())
all(group(time.hourofday(a)) each(output(count())))))))) );

Groups today, yesterday, lastweek, and lastmonth using predefined aggregator, and groups each day within each of these separately:

all( group(predefined((now() - a) / (60 * 60 * 24),
bucket(0,1), bucket(1,2), bucket(3,7), bucket(8,31)))
each(output(count())
all(max(2) each(output(summary())))
all(group((now() - a) / (60 * 60 * 24)) each(output(count())
all(max(2) each(output(summary())))))) );

### Counting unique groups

The count aggregator can be applied on list of groups to determine the number of unique groups without having to explicitly retrieve all groups. Note that this count is an estimate using HyperLogLog++ which is an algorithm for the count-distinct problem. To get an accurate count one needs to explicitly retrieve all groups and count them in a custom component or in the middle tier calling out to Vespa. This is network intensive and might not be feasible in cases with many unique groups.

Another use case for this aggregator is counting the number of unique instances matching a given expression.

Output an estimate of the number of groups, which is equivalent to the number of unique values for attribute "a":

all( group(a) output(count()) )

Output an estimate of the number of unique string lengths for the attribute "name":

all( group(strlen(name)) output(count()) )

Output the sum of the "b" attribute for each group in addition to the accurate count of the overall number of unique groups as the inner each causes all groups to be returned.

all( group(a) output(count()) each(output(sum(b))) )

The max clause is used to restrict the number of groups returned. The query outputs the sum for the 3 best groups. The count clause outputs the estimated number of groups (potentially >3). The count becomes an estimate here as the number of groups is limited by max, while in the above example it's not limited by max:

all( group(a) max(3) output(count()) each(output(sum(b))) )

Output the number of top level groups, and for the 10 best groups, output the number of unique values for attribute "b":

all( group(a) max(10) output(count()) each(group(b) output(count())) )

### Impression forecasting

Using impression logs for a given user, one can make a function that maps from rank score to the number of impressions an advertisement would get - example:

Score   Integer (# impressions for this user)
0.200   0
0.210   1
0.220   2
0.240   3
0.320   4
0.420   5
0.560   6
0.700   7
0.800   8
0.880   9
0.920  10
0.940  11
0.950  12


Storing just the first column (the rank scores, including a rank score for 0 impressions) in an array attribute named impressions, the grouping operation interpolatedlookup(impressions, relevance()) can be used to figure out how many times a given advertisement would have been shown to this particular user.

So if the rank score is 0.420 for a specific user/ad/bid combination, then interpolatedlookup(impressions,relevance()) would return 5.0. If the bid is increased so the rankscore gets to 0.490, it would get 5.5 as the return value instead.

In this context a count of 5.5 isn't meaningful for the past of a single user, but it gives more information that may be used as a forecast. Summing this across more, different users may then be used to forecast the total of future impressions for the advertisement.