This is the third part of the tutorial series for setting up a Vespa application for personalized news recommendations. The parts are:
In the previous part, we converted the Microsoft News Dataset (MIND) to Vespa, and fed it to our application. In this part, we'll issue searches in this content and look at sorting, grouping, and ranking the results.
For reference, the final state of this tutorial can be found in the app-3-searching sub-directory of the news sample application.
Conceptually, Vespa has two stages when determining the exact result to return. The first is "matching", where all the documents that match the query are found. This is a binary decision; either the document matches or it doesn't. For instance, when searching for a word, all documents that contain it are selected as candidates in this stage.
The next stage determines the ordering of the results. We can think of the results being ordered either by:
Ordering by an attribute is called sorting. For instance, we can sort by decreasing date.
Grouping also works on attributes. An example is to group the results by a category attribute.
Calculating a score to order by is generally called "ranking". As these scores are usually dependent upon both query and document, they can also be called relevance. Such expressions can be arbitrarily complex, but in general, require some form of computation to find this score. Ranking can be divided into multiple rank phases as well.
We'll start by looking at attribute-based sorting and grouping before moving on to ranking.
We saw multiple examples of attributes in the news.sd schema, for instance:
field date type int {
    indexing: summary | attribute
    attribute: fast-search
}
Note that this date field has been defined as an int here, and when feeding documents, we convert the date to the format YYYYMMDD.
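The conversion script is not shown in this part, so the following is only a minimal sketch of what such an encoding could look like. The helper name to_yyyymmdd is hypothetical. It shows why an int in YYYYMMDD form is convenient: integer order matches chronological order, so the field can be used for range searches and sorting.

```python
from datetime import date


def to_yyyymmdd(d: date) -> int:
    """Encode a calendar date as an int such as 20191110 (hypothetical helper)."""
    return d.year * 10000 + d.month * 100 + d.day


encoded = to_yyyymmdd(date(2019, 11, 10))
print(encoded)  # 20191110

# Integer comparison follows chronological order.
print(to_yyyymmdd(date(2019, 11, 8)) < encoded)  # True
```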
An attribute is an in-memory field - this is different from index fields, which may be moved to a disk-based index as more documents are added and the index grows. Since attributes are kept in memory, they are excellent for fields that require fast access for many documents, e.g. fields used for sorting, ranking or grouping query results. The downside is higher memory usage.
In the above field definition we have included an additional property, attribute: fast-search, which will inform Vespa that we want to build inverted index structures (dictionary and posting lists) for fast matching in the field. See more about when to use fast-search in the performance feature tuning section.
$ vespa query -v 'yql=select * from news where default contains "20191110"'
This is a single-term query for the term 20191110 in the default fieldset. In the schema, the field date is not included in the default fieldset, so no results are found.
Instead, we search using =, which can be used for numeric and bool fields:
$ vespa query -v 'yql=select * from news where date=20191110'
To get documents that were created 10 November 2019, and whose date field is 20191110, replace default with date in the YQL query string.
$ vespa query -v 'yql=select * from news where date=20191110 and default contains "weather"'
This is a query with two terms: a search in the default fieldset for the term "weather", combined with a search in the date field for the value 20191110.
The examples above searched over date just as any other field, and requested documents where the value was exactly 20191110. Since the field is of type int, however, we can use it for range searches as well, using the "less than" and "greater than" operators (< and >). The query:
$ vespa query -v 'yql=select * from news where date < 20191110'
finds all documents where the value of date is less than 20191110, i.e. all documents from before 10 November 2019, while
$ vespa query -v 'yql=select * from news where date <= 20191110 AND date >= 20191108'
finds all news articles from 8 November 2019 to 10 November 2019, inclusive.
The first feature we will look at is how an attribute can be used to change the hit order. By now, you have probably noticed that hits are returned in order of descending relevance, i.e. how well the document matches the query — if not, take a moment to verify this. You might ask how Vespa does this since we haven't even touched upon ranking yet. The answer is that Vespa uses its nativeRank score unless anything else is defined in the schema. We'll get back to defining custom ranking later on.
Now send the following query to Vespa, and look at the order of the hits:
$ vespa query -v 'yql=select date from news where default contains phrase("music","festival") order by date'
By default, sorting is done in ascending order. This can also be specified by appending asc after the sort attribute name. Use desc to sort the results in descending order:
$ vespa query -v 'yql=select date from news where default contains phrase("music","festival") order by date desc'
Attempting to sort on a field that is not defined as an attribute in the schema results in an error.
Grouping is the concept of looking through all matching documents at query-time and then performing operations with specified fields across all the documents — some common use cases include:
Displaying such groups and their sizes (in terms of matching documents per group) on a search result page,
with a link to each such group,
is a common way to let users refine searches.
For now, we will only do a simple grouping query to get a list of unique values for category, ordered by the number of documents they occur in, showing only the top 3:
$ vespa query -v 'yql=select * from news where true limit 0 | all(group(category) max(3) order(-count())each(output(count())))'
Note the expression after the pipe (|): this is the grouping expression that determines how grouping will be performed. You can read more about the grouping syntax in the grouping reference documentation. limit 0 is an alternative syntax for the native hits parameter; in this case we are only interested in the group counts, so we set limit to 0.
For this query, you will get something like the following:
{
  "root": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "fields": {
                  "count()": 9115
                },
                "id": "group:string:news",
                "relevance": 1.0,
                "value": "news"
              },
              {
                "fields": {
                  "count()": 6765
                },
                "id": "group:string:sports",
                "relevance": 0.6666666666666666,
                "value": "sports"
              },
              {
                "fields": {
                  "count()": 1886
                },
                "id": "group:string:finance",
                "relevance": 0.3333333333333333,
                "value": "finance"
              }
            ],
            "continuation": {
              "next": "BGAAABEBGBC"
            },
            "id": "grouplist:category",
            "label": "category",
            "relevance": 1.0
          }
        ],
        "continuation": {
          "this": ""
        },
        "id": "group:root:0",
        "relevance": 1.0
      }
    ],
    "coverage": {
      "coverage": 100,
      "documents": 28603,
      "full": true,
      "nodes": 1,
      "results": 1,
      "resultsFull": 1
    },
    "fields": {
      "totalCount": 28603
    },
    "id": "toplevel",
    "relevance": 1.0
  }
}
So, the three most common unique values of category among the indexed documents (for the demo data set) are:
news with 9115 articles
sports with 6765 articles
finance with 1886 articles
Try to change the filter part of the YQL expression (the where clause) to a text match of "weather", or restrict date to be less than 20191110, and see how the list of unique values changes as the set of matching documents for your query changes.
If you try to search for a single term that is not present in the document set,
you will see that the list of groups is empty as no documents have been matched.
Vespa grouping is only applied over the documents which matched the query.
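When consuming the grouping response programmatically, the nested children arrays can be flattened into a simple mapping. The sketch below assumes the response has the same shape as the example output above (a query with limit 0, so the first child of root is the group root); the function name group_counts is hypothetical, and the embedded JSON is a trimmed copy of that output.

```python
import json

# Trimmed copy of the example response (coverage and continuation omitted).
response = json.loads("""
{
  "root": {
    "id": "toplevel",
    "children": [
      {
        "id": "group:root:0",
        "children": [
          {
            "id": "grouplist:category",
            "label": "category",
            "children": [
              {"id": "group:string:news", "value": "news", "fields": {"count()": 9115}},
              {"id": "group:string:sports", "value": "sports", "fields": {"count()": 6765}},
              {"id": "group:string:finance", "value": "finance", "fields": {"count()": 1886}}
            ]
          }
        ]
      }
    ]
  }
}
""")


def group_counts(result: dict) -> dict:
    """Return {group value: count} for the first group list in the result.

    Assumes no regular hits are present, so root.children[0] is the group root.
    """
    group_root = result["root"]["children"][0]
    group_list = group_root["children"][0]
    return {g["value"]: g["fields"]["count()"] for g in group_list["children"]}


print(group_counts(response))  # {'news': 9115, 'sports': 6765, 'finance': 1886}
```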
In the following example we use the select parameter to pass the grouping specification:
$ vespa query -v 'yql=select * from news where userQuery() limit 0' \
    'select=all(group(category) max(2) each(max(2)each(output(summary()))))' \
    'query=drinks'
This request searches for drinks, groups by category, and for each unique category outputs the 2 top-ranking hits (according to the rank profile used). By default, groups are sorted by the maximum relevance in the group. Notice that we also set an upper limit on the number of unique groups by the outermost max. This is important in cases with many unique values. See also Result diversification using Vespa result grouping.
Please refer to the grouping guide for more information and examples using Vespa grouping. As with sorting, attempting to group on a field that is not defined as an attribute in the schema results in an error.
Before we move on to ranking, it's important to know some of the differences between index and attribute.
Consider the title field from our schema, and the document for the article with title "A little snow causes a big mess, more than 100 crashes on Minnesota roads". In the original input, the value for title is a string built up of 14 words, with a single white space character between them. How should we be able to search this field?
For string fields with index, which defaults to match:text, Vespa performs linguistic processing of the string. This includes tokenization, normalization and language-dependent stemming of the string.
In our example, this means that the string above is split into the 14 tokens, enabling Vespa to match this document for:
This is how we all have come to expect normal free text search to work.
However, string fields with indexing: attribute do not support match:text, only exact matching or prefix matching. Exact matching is the default, and, as the name implies, it requires you to search for the exact contents of the field in order to get a match. See supported match modes and the differences in support between attribute and index.
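The difference between the two match modes can be illustrated with a toy model. This is not how Vespa implements matching: the tokenizer below only lowercases and strips punctuation (no stemming or other linguistic processing), and the function names are hypothetical. It only shows the idea that text matching works on tokens, while exact matching compares the whole field value.

```python
title = "A little snow causes a big mess, more than 100 crashes on Minnesota roads"


def tokenize(text: str) -> list[str]:
    """Rough stand-in for linguistic processing: split, lowercase, strip punctuation."""
    return [word.strip(".,!?").lower() for word in text.split()]


def text_match(field_value: str, term: str) -> bool:
    """Token-based matching, in the spirit of match:text on an index field."""
    return term.lower() in tokenize(field_value)


def exact_match(field_value: str, query: str) -> bool:
    """Whole-value matching, in the spirit of exact matching on an attribute.

    Casing is normalized here for simplicity; real behavior depends on the
    field's match settings.
    """
    return field_value.lower() == query.lower()


print(len(tokenize(title)))             # 14
print(text_match(title, "Minnesota"))   # True
print(exact_match(title, "Minnesota"))  # False
print(exact_match(title, title))        # True
```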
Attributes are stored in memory, as opposed to fields with index, where the data is mostly kept on disk but paged in on demand and cached by the OS buffer cache. Even with large flavor types, one will notice that it is not practical to define all the document type fields as attributes, as it will heavily restrict the number of documents per search node. Some Vespa applications have more than 1 billion documents per node; having megabytes of text per document in memory might not be cost-effective.
There are both advantages and drawbacks of using attributes: they enable sorting, ranking and grouping, but require more memory and do not support match:text capabilities. Attribute fields do support at least an order of magnitude higher update throughput than regular index fields, see partial updates with Vespa.
When to use attributes depends on the application; in general, use attributes for:
Finally, all numeric and tensor fields used in ranking must be defined with attribute.
field category type string {
    indexing: summary | attribute | index
}
Combining both index and attribute for the same field is supported. In this case, we can sort and group on the category, while search or matching will be using index matching with match:text, which will tokenize and stem the contents of the field.
Ranking and relevance were briefly mentioned above; what is really the relevance of a hit? How can one change the relevance calculations? It is time to introduce rank profiles and ranking expressions — simple, yet powerful methods for tuning the relevance.
Relevance is a measure of how well a given document matches a query. The default relevance is calculated by a formula that takes several matching factors into consideration. It computes, in essence, how well the document matches the terms in the query. The default Vespa ranking function and its limitations are described in ranking with nativeRank.
Ranking signals that might be useful, like freshness (the age of the document compared to the time of the query) or any other document or query features, are not a part of the nativeRank calculation. These need to be added to the ranking function depending on application specifics.
Some use cases for tweaking the relevance calculations:
Vespa allows creating any number of rank profiles:
named collections of ranking and relevance calculations that one can choose from at query time.
A number of built-in functions and expressions are available to create highly
specialized ranking expressions and users can define their own functions in the schema.
During the conversion of the news dataset, the conversion script counted both the number of times a news article was shown (impressions) and how many clicks it received. A high number of clicks relative to impressions indicates that the news article was generally popular. Since both clicks and impressions are attribute fields, these fields can be updated at scale with very high throughput.
We can use this signal in our ranking by including a popularity rank profile, as defined below at the bottom of schemas/news.sd. Note that rank profiles are defined outside the document block:
schema news {
    document news {
        field news_id type string {
            indexing: summary | attribute
            attribute: fast-search
        }
        field category type string {
            indexing: summary | attribute
        }
        field subcategory type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field abstract type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
        }
        field date type int {
            indexing: summary | attribute
            attribute: fast-search
        }
        field clicks type int {
            indexing: summary | attribute
        }
        field impressions type int {
            indexing: summary | attribute
        }
    }

    fieldset default {
        fields: title, abstract, body
    }

    rank-profile popularity inherits default {
        function popularity() {
            expression: if (attribute(impressions) > 0, attribute(clicks) / attribute(impressions), 0)
        }
        first-phase {
            expression: nativeRank(title, abstract) + 10 * popularity
        }
    }
}
rank-profile popularity inherits default
This configures Vespa to create a new rank profile named popularity, which inherits all the properties of the default rank-profile; only properties that are explicitly defined, or overridden, will differ from those of the default rank-profile.
first-phase
Relevance calculations in Vespa are two-phased. The calculations done in the
first phase are performed on every single document matching your query,
while the second phase calculations are only done on the top n documents, as determined by the calculations done in the first phase.
See phased ranking.
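The idea behind phased ranking can be sketched in a few lines. This is a simplified illustration, not Vespa's implementation: the function and field names are hypothetical, and the sketch simply keeps the re-ranked documents ahead of the rest. Note that the popularity profile in this tutorial only defines a first phase.

```python
def rank_two_phase(docs, first_phase, second_phase, rerank_count):
    """Score every doc with a cheap function, then re-score only the best few."""
    by_first = sorted(docs, key=first_phase, reverse=True)
    top, rest = by_first[:rerank_count], by_first[rerank_count:]
    # Only the top candidates pay for the more expensive second-phase function.
    reranked = sorted(top, key=second_phase, reverse=True)
    return reranked + rest


# Made-up scores for illustration only.
docs = [
    {"id": "a", "cheap": 0.9, "costly": 0.2},
    {"id": "b", "cheap": 0.8, "costly": 0.7},
    {"id": "c", "cheap": 0.1, "costly": 0.99},
]

result = rank_two_phase(
    docs,
    first_phase=lambda d: d["cheap"],
    second_phase=lambda d: d["costly"],
    rerank_count=2,
)
print([d["id"] for d in result])  # ['b', 'a', 'c']
```

Document c has the best second-phase score but is never re-scored, because it did not make the first-phase cut. This is the trade-off phased ranking makes to keep expensive computations bounded.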
function popularity()
This sets up a function that can be called from other expressions. This function calculates the number of clicks divided by impressions for indicating popularity. However, this isn't really the best way of calculating this as an article with a low number of impressions can score high on such a value, even though uncertainty is high.
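To make the uncertainty problem concrete, the sketch below mirrors the schema function in Python and compares it with one possible alternative, additive smoothing towards a prior click rate. The smoothed variant is an assumption for illustration, not part of the tutorial's schema, and the prior_rate and prior_weight values are arbitrary.

```python
def popularity(clicks: int, impressions: int) -> float:
    """Same logic as the schema function: clicks / impressions, or 0 without impressions."""
    return clicks / impressions if impressions > 0 else 0.0


def smoothed_popularity(clicks: int, impressions: int,
                        prior_rate: float = 0.05, prior_weight: float = 100) -> float:
    """Assumed alternative: pull the ratio towards prior_rate when impressions are few."""
    return (clicks + prior_rate * prior_weight) / (impressions + prior_weight)


# One click on one impression beats 500 clicks on 1000 impressions.
print(popularity(1, 1))       # 1.0
print(popularity(500, 1000))  # 0.5

# With smoothing, the well-observed article ranks higher.
print(smoothed_popularity(1, 1) < smoothed_popularity(500, 1000))  # True
```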
expression: nativeRank(title, abstract) + 10 * popularity
This expression is used to rank documents. Here, the nativeRank of the title and abstract fields is included so that the text match with the query still influences the score, while the second term calls the popularity function. The weighted sum of these two terms is the final relevance for each document. Note that the weight here, 10, is set by observation. A better approach would be to learn such values using machine learning.
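As a rough illustration of how the weighted sum behaves, the sketch below recomputes the first-phase score in Python for two hypothetical hits. The nativeRank values and click counts are made up; only the formula and the weight of 10 come from the rank profile.

```python
POPULARITY_WEIGHT = 10  # the hand-picked weight from the rank profile


def first_phase(native_rank: float, clicks: int, impressions: int) -> float:
    """Text match score plus weighted click-through ratio."""
    popularity = clicks / impressions if impressions > 0 else 0.0
    return native_rank + POPULARITY_WEIGHT * popularity


# (nativeRank, clicks, impressions) per hit; all values are invented.
hits = {
    "text-heavy": (0.30, 1, 100),
    "popular": (0.15, 10, 100),
}

ranked = sorted(hits, key=lambda name: first_phase(*hits[name]), reverse=True)
print(ranked)  # ['popular', 'text-heavy']
```

With a weight of 10, a modest difference in click-through ratio outweighs a better text match, which is why the weight deserves tuning rather than guessing.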
More information can be found in the schema reference.
Deploy the popularity rank profile:
$ vespa deploy --wait 300 my-app
Run a query:
$ vespa query -v \
    'yql=select * from news where default contains "music"' \
    'ranking=popularity'
and find documents with high popularity values at the top. Note that we must specify the rank profile to use with the run-time ranking parameter.
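The same request can also be sent to the HTTP query API directly. The sketch below only builds the request URL, so it runs without a Vespa instance; the endpoint http://localhost:8080 is an assumption for a local deployment, and build_query_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(endpoint: str, yql: str, rank_profile: str) -> str:
    """Build a search URL carrying the YQL string and the rank profile name."""
    params = {"yql": yql, "ranking": rank_profile}
    return f"{endpoint}/search/?{urlencode(params)}"


url = build_query_url(
    "http://localhost:8080",  # assumed local endpoint
    'select * from news where default contains "music"',
    "popularity",
)
print(url)

# Decoding the query string gives back the original parameters.
print(parse_qs(urlsplit(url).query)["ranking"])  # ['popularity']
```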
After completing this part of the tutorial, you should now have a basic understanding of how you can load data into Vespa and effectively search for content. In the next part of the tutorial, we'll start with the basics for transforming this search app into a recommendation system.