Grouping Information in Results

This document describes what grouping is and gives a few examples on how to use it. Refer to the Search API for how to set the select parameter, and the grouping reference.

The advanced grouping feature of Vespa offers a flexible language in which to describe how your query hits should be grouped, aggregated and presented to the user. The grouping parameter can be thought of as a high-level program. The core operations are all (process a list of elements as a whole), each (process each element in a list separately), group (group the elements of a list into sub-lists) and output (present something to the user). These operations may be nested to describe a great variety of behavior.

The engine will carry out the desired program by a distributed execution against the nodes containing the data. As realizing such programs over a distributed data set requires more network roundtrips than a regular search query, these queries are more expensive than regular queries. However, executing such programs are always much more efficient than achieving the same end result by issuing multiple independent queries to the engine.

For the entirety of this document, we assume that you have an index of engine part purchases that contains the following data:

Date Price Tax Item Customer
2006-09-06 09:00:00 $1 000 0.24 Intake valve Smith
2006-09-07 10:00:00 $1 000 0.12 Rocker arm Smith
2006-09-07 11:00:00 $2 000 0.24 Spring Smith
2006-09-08 12:00:00 $3 000 0.12 Valve cover Jones
2006-09-08 10:00:00 $5 000 0.24 Intake port Jones
2006-09-08 11:00:00 $8 000 0.12 Head Brown
2006-09-09 12:00:00 $1 300 0.24 Coolant Smith
2006-09-09 10:00:00 $2 100 0.12 Engine block Jones
2006-09-09 11:00:00 $3 400 0.24 Oil pan Brown
2006-09-09 12:00:00 $5 500 0.12 Oil sump Smith
2006-09-10 10:00:00 $8 900 0.24 Camshaft Jones
2006-09-10 11:00:00 $1 440 0.12 Exhaust valve Brown
2006-09-10 12:00:00 $2 330 0.24 Rocker arm Brown
2006-09-10 10:00:00 $3 770 0.12 Spring Brown
2006-09-10 11:00:00 $6 100 0.24 Spark plug Smith
2006-09-11 12:00:00 $9 870 0.12 Exhaust port Jones
2006-09-11 10:00:00 $1 597 0.24 Piston Brown
2006-09-11 11:00:00 $2 584 0.12 Connection rod Smith
2006-09-11 12:00:00 $4 181 0.24 Rod bearing Jones
2006-09-11 13:00:00 $6 765 0.12 Crankshaft Jones

Basic Grouping

Say you wanted to return the sum price of purchases that have been recorded on a per-customer basis. First of all you would need to include every document available, so you query by document type /search/?yql=select * from sources * where sddocname contains 'purchase';. Note that you can (and should) swap sddocname contains "purchase" for a valid Vespa query - we are only querying all of the documents for the sake of completeness in this example. E.g. /search/?yql=select * from sources * where customer container 'smith'; would only produce groups related to Mr. Smith (and it would also be a faster query, because there would be fewer hits to process).

Second, you need to specify an expression on which to perform grouping. In our case, we would use group(customer).

Next, you need to specify what to present to the user. If you wanted to present the sum of the purchase price for each customer, you would add an each(output(sum(price))) clause. The each in this case would refer to each of the groups produced by group(customer).

Finally, seeing that you do not need the regular hits returned, you may append hits=0 to reduce latency. This amounts to the following complete query:

/search/?hits=0&yql=select * from sources * where sddocname contains 'purchase' |
                    all(group(customer) each(output(sum(price))));

In order to execute this query it needs to have all its special characters escaped. For the sake of readability, our queries are left unescaped in this article. For your reference, however, this is how the above query comes out when escaped:
/search/?hits=0& yql=select%20%2A%20from%20sources%20%2A%20where%20sddocname%20contains%20%27purchase%27%20%7C%20 all(group(customer)%20each(output(sum(price))))%3B

Result:

GroupId Sum(price)
Brown $20 537
Jones $39 816
Smith $19 484

Or you may choose to query for the sum price of purchases on a per-date basis;

select (…) | all(group(time.date(date)) each(output(sum(price))));

which produces:

GroupId Sum(price)
2006-09-06 $1 000
2006-09-07 $3 000
2006-09-08 $16 000
2006-09-09 $12 300
2006-09-10 $22 540
2006-09-11 $24 997

Search Container API

As well as adding a pipeline step to your YQL+ query, you can use the programmatic API to construct and run a grouping request. This allows multiple grouping requests to run in parallel, and does not collide with the yql parameter. To programmatically run a grouping request, do:

@Override
public Result search(Query query, Execution execution) {
    // Create grouping request.
    GroupingRequest request = GroupingRequest.newInstance(query);
    request.setRootOperation(new AllOperation()
            .setGroupBy(new AttributeValue("foo"))
            .addChild(new EachOperation()
                .addOutput(new CountAggregator().setLabel("count"))));

    // Perform grouping request.
    Result result = execution.search(query);

    // Process grouping result.
    Group root = request.getResultGroup(result);
    GroupList foo = root.getGroupList("foo");
    for (Hit hit : foo) {
        Group group = (Group)hit;
        Long count = (Long)group.getField("count");
        // TODO: Process group and count.
    }

    // Pass results back to calling searcher.
    return result;
}

Please see the API documentation for the complete reference on this.

Expressions

Instead of just grouping on some attribute value, the group clause may contain arbitrarily complex expressions - see group in the syntax reference for an exhaustive list. You may do things like:

  • Selecting the minimum or maximum of sub-expressions,
  • addition, subtraction, multiplication, division, and even modulo of sub-expressions,
  • bitwise operations on sub-expressions,
  • concatenation of the results of sub-expressions,
  • and more.

If you wanted to sum the prices of purchases on per-hour-of-day basis, do:

select (…) | all(group(mod(div(date,mul(60,60)),24)) each(output(sum(price))));

which produces:

GroupId sum(price)
09:00 $1 000
10:00 $22 367
11:00 $23 524
12:00 $26 181
13:00 $6 765

These types of expressions may also be used inside output operations, so instead of simply calculating the sum price of the grouped purchases, you could calculate the sum income after taxes per customer by doing:

select (…) | all(group(customer) each(output(sum(mul(price,sub(1,tax))))));

which produces:

GroupId sum(mul(price,sub(1,tax)))
Brown $17 193
Jones $32 868
Smith $15 897

Note that the validity of an expression depends on the current level of nesting. For example; while sum(price) would be a valid expression for a group of hits, price would not. As a general rule, each operator within an expression either applies to a single hit or aggregates values across a group.

Ordering and Limiting Groups

In most realistic scenarios you may be producing a large collection of groups, perhaps even too large to display or process. This is handled by ordering your groups according to some criteria, and then limiting the number of groups to return. However, these are decoupled features, and you may of course choose to do only one or the other.

The order clause accepts a list of one or more expressions. Each of the arguments to order is prefixed by either a plus for ascending order, or a minus for descending order.

Limiting the number of groups is done by using a constraint. The one used here is max which represent an upper limit for the number of groups presented in the result. However max can not be mentioned without its ugly brother precision. As finding a correct global maximum would require us to fetch all groups from all nodes and merging them together. This would result in a huge network bandwidth usage and will not scale. As a compromise precision was introduced. In most cases the groups that are the best on one node would be among the best on another node too. If you want the 2 globally best groups you should make an educated guess on how many samples you need to fetch from each node in order to get a correct answer. This educated guess you should use as precision. An initial factor of 3 has proven to be quite good in most usecases. However there is one important exception. If you do not give an explicit order constraint you do not need precision. Then local ordering will be the same as global ordering. Ordering will not change after a merge operation.

Say you wanted to see the two customers who had the most number of purchases, presenting the sum price for each customer;

select (…) | all(group(customer) max(2) order(-count())
                 each(output(sum(price))));

which produces:

GroupId sum(price)
Jones $39 816
Smith $19 484

If however the data for customer 'Jones' was spread on 3 different backend nodes. 'Jones' might be among the 2 best on only one node. But based on the distribution of the data we have concluded by earlier tests that if we fetch 5.67 as many groups as we need we will have a correct answer with at least 99.999% confidence. So then we just use 6 times as much groups when doing the merge. And that would look like this.

select (…) | all(group(customer) max(2) precision(12) order(-count())
                 each(output(sum(price))));

Important rules for max and precision

As max and precision are not intuitive right away we will give a few simple guidelines:

  • order clause changes ordering of groups after a merge operation for the following aggregators: count, avg and sum.
  • order clause will not change ordering of groups after a merge operation when max or min is used.
  • default order, which is max(relevancy()), does not require the use of precision.

Limiting and pagination

Instead of offering an offset clause to accompany the max clause to support pagination, grouping supports cookie-esque "continuation" objects that are passed as annotations to the grouping statement. The "continuations" annotation is a list of zero or more continuation strings. These strings can be found inside the grouping result itself;

  • A single "this" continuation per select. This can be regarded as the session identifier, as submitting this will reproduce the exact same result from which it was taken.
  • Zero or one "prev" continuation per group- and hit list. Submit any number of these to retrieve the next pages of the corresponding lists.
  • Zero or one "next" continuation per group- and hit list. Submit any number of these to retrieve the previous pages of the corresponding lists.

For example, given the result:

{
    "root": {
        "children": [
            {
                "children": [
                    {
                        "children": [
                            {
                                "fields": {
                                    "count()": 7
                                },
                                "value": "Jones",
                                "id": "group:string:Jones",
                                "relevance": 1.0
                            }
                        ],
                        "continuation": {
                            "next": "BGAAABEBEBC",
                            "prev": "BGAAABEABC"
                        },
                        "id": "grouplist:customer",
                        "label": "customer",
                        "relevance": 1.0
                    }
                ],
                "continuation": {
                    "this": "BGAAABEBCA"
                },
                "id": "group:root:0",
                "relevance": 1.0
            }
        ],
        "fields": {
            "totalCount": 20
        },
        "id": "toplevel",
        "relevance": 1.0
    }
}

One may now reproduce the same result by passing the "this" continuation along the orginal select:

select (…) | [{ 'continuations':['BGAAABEBCA'] }]all(…);

To display the next page of customers, pass the "this" continuation of the root group, and the "next" continuation of the customer list:

select (…) | [{ 'continuations':['BGAAABEBCA', 'BGAAABEBEBC'] }]all(…);

And finally, to display the previous page of customers, pass the "this" continuation of the root group, and the "prev" continuation of the customer list:

select (…) | [{ 'continuations':['BGAAABEBCA', BGAAABEABC'] }]all(…);

The "continuations" annotation is an ordered list of continuation strings. These are combined by replacement, so that a continuation given later will replace any shared state with a continuation given before. Also, when using the "continuations" annotation, one must always pass the "this" continuation as its first element.

Presenting Hits per Group

A special expression that can only be used inside the output operation is summary. This will result in the presentation of actual hits within the group. Constraints may be used to limit the number of hits presented.

To see the three most expensive parts sold to each registered customer, do:

/search/?yql=select * from sources * where sddocname contains 'purchase' |
             all(group(customer) each(max(3) each(output(summary()))));
        &ranking=pricerank

which produces:

GroupId
Brown
Date Price Tax Item Customer
2006-09-08 11:00 $8 000 0.12 Head Brown
2006-09-10 10:00 $3 770 0.12 Spring Brown
2006-09-09 11:00 $3 400 0.24 Oil pan Brown
Jones
Date Price Tax Item Customer
2006-09-11 12:00 $9 870 0.12 Exhaust port Jones
2006-09-10 10:00 $8 900 0.24 Camshaft Jones
2006-09-11 13:00 $6 765 0.12 Crankshaft Jones
Smith
Date Price Tax Item Customer
2006-09-10 11:00 $6 100 0.24 Spark plug Smith
2006-09-09 12:00 $5 500 0.12 Oil sump Smith
2006-09-11 11:00 $2 584 0.12 Connection rod Smith

Notice how we must append a ranking=pricerank parameter to the query in order to get ordering of the hits. The ranking parameter points a rank profile that is used for ranking the hits. The following ranking expression will rank the hits by their price attribute:

rank-profile pricerank inherits default {
    first-phase {
        expression: attribute(price)
    }
}

The order clause available in the select parameter is a directive for group ordering, not hit ordering.

Nested Groups

One of the greatest features that advanced grouping offers, is the ability to do nested groups. This offers unprecedented drilling capabilities, because there are no limits to nesting depth or presented information on any level.

To find out how much each customer has spent per day, do

select (…) | all(group(customer) each(group(time.date(date)) each(output(sum(price)))));
GroupId
Brown
GroupId Sum(price)
2006-09-08 $8 000
2006-09-09 $3 400
2006-09-10 $7 540
2006-09-11 $1 597
Jones
GroupId Sum(price)
2006-09-08 $8 000
2006-09-09 $2 100
2006-09-10 $8 900
2006-09-11 $20 816
Smith
GroupId Sum(price)
2006-09-06 $1 000
2006-09-07 $3 000
2006-09-09 $6 800
2006-09-10 $6 100
2006-09-11 $2 584

The summary expression described earlier may be used to present hits inside any group at any nesting level. Let us use this to query for all items on a per-customer basis, displaying the most expensive hit for each customer, with subgroups of purchases on a per-date basis. We also want to include the sum price for each customer, both as a grand total and broken down on a per-day basis. The query

/search/?yql=select * from sources * where sddocname contains 'purchase' |
             all(group(customer)
                 each(max(1) output(sum(price)) each(output(summary())))
                      each(group(time.date(date))
                      each(max(10) output(sum(price)) each(output(summary())))));
        &ranking=pricerank

produces the following:

GroupId sum(price)
Brown $20 537
Date Price Tax Item Customer
2006-09-08 11:00 $8 000 0.12 Head Brown
GroupId Sum(price)
2006-09-08 $8 000
Date Price Tax Item Customer
2006-09-08 11:00 $8 000 0.12 Head Brown
2006-09-09 $3 400
Date Price Tax Item Customer
2006-09-09 11:00 $3 400 0.12 Oil pan Brown
2006-09-10 $7 540
Date Price Tax Item Customer
2006-09-10 10:00 $3 770 0.12 Spring Brown
2006-09-10 12:00 $2 330 0.24 Rocker arm Brown
2006-09-10 11:00 $1 440 0.12 Exhaust valve Brown
2006-09-11 $1 597
Date Price Tax Item Customer
2006-09-11 10:00 $1 597 0.24 Piston Brown
Jones $39 816
Date Price Tax Item Customer
2006-09-11 12:00 $9 870 0.12 Exhaust port Jones
GroupId Sum(price)
2006-09-08 $8 000
Date Price Tax Item Customer
2006-09-08 10:00 $5 000 0.24 Intake port Jones
2006-09-08 12:00 $3 000 0.12 Valve cover Jones
2006-09-09 $2 100
Date Price Tax Item Customer
2006-09-09 10:00 $2 100 0,12 Engine block Jones
2006-09-10 $8 900
Date Price Tax Item Customer
2006-09-10 10:00 $8 900 0.24 Camshaft Jones
2006-09-11 $20 816
Date Price Tax Item Customer
2006-09-11 12:00 $9 870 0.12 Exhaust port Jones
2006-09-11 13:00 $6 765 0.12 Crankshaft Jones
2006-09-11 12:00 $4 181 0.24 Rod bearing Jones
Smith $19 484
Date Price Tax Item Customer
2006-09-10 11:00 $6 100 0.24 Spark plug Smith
GroupId Sum(price)
2006-09-06 $1 000
Date Price Tax Item Customer
2006-09-06 09:00 $1 000 0.24 Intake valve Smith
2006-09-07 $3 000
Date Price Tax Item Customer
2006-09-07 11:00 $2 000 0.24 Spring Smith
2006-09-07 10:00 $1 000 0.12 Rocker arm Smith
2006-09-09 $6 800
Date Price Tax Item Customer
2006-09-09 12:00 $5 500 0.12 Oil sump Smith
2006-09-09 12:00 $1 300 0.24 Coolant Smith
2006-09-10 $6 100
Date Price Tax Item Customer
2006-09-10 11:00 $6 100 0.24 Spark plug Smith
2006-09-11 $2 584
Date Price Tax Item Customer
2006-09-11 11:00 $2 584 0.12 Connection rod Smith