Vespa tutorial pt. 1: Blog searching

Introduction

This is the first of a series of tutorials where data from WordPress.com (WP) will be used to highlight how Vespa can be used to store, search and recommend blog posts. The data was made available during a Kaggle challenge to predict which blog posts someone would like based on their past behavior. It contains many ingredients that are necessary to showcase needs, challenges and possible solutions that are useful for those interested in building and deploying such applications in production.

At any given time, Vespa will store a set of documents (also called a content pool), which in this case is formed by the blog posts available. Our end goal with this series of tutorials is to build an application where:

  1. Users will be able to search and manipulate the pool of blog posts available.
  2. Users will get blog post recommendations from the content pool based on their interest.

This tutorial will address:

  • A description of the dataset used, together with any related information relevant to this tutorial.
  • How to set up a basic blog post search engine using Vespa.

The next tutorial will show how to extend this basic search engine application with machine learned models to create a blog recommendation engine.

Dataset

The dataset used throughout these tutorials contains blog posts written by WP bloggers and actions, in this case ‘likes’, performed by WP readers on blog posts they have interacted with. The dataset is publicly available at Kaggle and was released during a challenge to develop algorithms that predict which blog posts users would most likely ‘like’ if they were exposed to them. The following is a short description of the data that we will use.

From the content side, the data includes these fields per blog post:

post_id: unique numerical id identifying the blog post
date_gmt: string representing date of blog post creation in GMT format yyyy-mm-dd hh:mm:ss
author: unique numerical id identifying the author of the blog post
url: blog post URL
title: blog post title
blog: unique numerical id identifying the blog that the blog post belongs to
tags: array of strings representing the tags of the blog posts
content: body text of the blog post, in html format
categories: array of strings representing the categories the blog post was assigned to

For the user actions:

post_id: unique numerical id identifying the blog post
uid: unique numerical id identifying the user that liked post_id
dt: date of the interaction in GMT format yyyy-mm-dd hh:mm:ss
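
To make the record layout concrete, here is a minimal Python sketch that parses one post record and pulls out a few of the fields listed above. It assumes each post is stored as one JSON object per line; the sample values and the helper name summarize_post are made up for illustration and are not part of the dataset or the tutorial code.

```python
import json

# Made-up record that uses the post field names listed above.
# The values are illustrative only; ids are shown as strings, which is an assumption.
sample_line = json.dumps({
    "post_id": "1750271",
    "date_gmt": "2012-04-26 10:15:00",
    "author": "42",
    "url": "http://example.wordpress.com/2012/04/26/a-post/",
    "title": "A post about music",
    "blog": "7",
    "tags": ["music", "festival"],
    "content": "<p>Body text in html format</p>",
    "categories": ["Uncategorized"],
})


def summarize_post(line):
    """Parse one JSON line (hypothetical helper) and return a compact summary."""
    post = json.loads(line)
    return {
        "post_id": post["post_id"],
        "title": post["title"],
        "created": post["date_gmt"],
        "num_tags": len(post["tags"]),
        "num_categories": len(post["categories"]),
    }


summary = summarize_post(sample_line)
print(summary)
```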

Downloading raw data

For the purposes of this tutorial, it is sufficient to use the first release of training data that consists of 5 weeks of posts as well as all the “like” actions that occurred during those 5 weeks.

This first release of training data is available here, but only for logged-in users of Kaggle; you will need to make an account or log in using your account on one of the supported digital identity platforms (Facebook, Google, Yahoo) to be able to download the file.

Once you have the zip file downloaded, unzip it. The 1,196,111 line trainPosts.json will be our practice document data. This file is around 5GB in size.

Dataset & Resource usage

Indexing the full data set requires 23GB disk space. These tutorials have been tested with a Docker container with 10GB RAM. We used similar settings as described in the vespa quick start guide. As in the guide, we assume that the $VESPA_SAMPLE_APPS env variable points to the directory with your local clone of the vespa sample apps.

$ docker run -m 10G --detach --name vespa --hostname vespa-tutorial --privileged \
    --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa

Searching blog posts

This tutorial provides an overview of the major features of Vespa. The objective is to build a Vespa based blog post search engine application. Functional specification:

  • Blog post title, content, tags and categories must all be searchable
  • Allow blog posts to be sorted by both relevance and date
  • Allow grouping of search results by tag or category

In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a blog post, a Flickr photo, or a Yahoo News article. Each document type must be defined in your Vespa configuration through a search definition. You can think of a search definition as being similar to a table definition in a relational database; it consists of a set of fields, each with a given name, a specific type, and some optional properties.

As an example, for this simple blog post search application, we could create the document type blog_post with the following fields:

url
of type uri
title
of type string
content
of type string (string fields can be of any length)
date_gmt
of type string (to store the creation date in GMT format)

The data fed into Vespa must match the structure of the search definition, and the hits returned when searching will be in this format as well.

Application Packages

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. A search definition, e.g., blog_post.sd, is a required part of an application package — the other required files are services.xml and hosts.xml.

The sample application blog search creates a simple but functional blog post search engine. The application package is found in src/main/application.

Services Specification

services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service:

<?xml version='1.0' encoding='UTF-8'?>
<services version='1.0'>

  <container id='default' version='1.0'>
    <search/>
    <document-api/>
    <nodes>
      <node hostalias='node1'/>
    </nodes>
  </container>

  <content id='blog_post' version='1.0'>
    <search>
      <visibility-delay>1.0</visibility-delay>
    </search>
    <redundancy>1</redundancy>
    <documents>
      <document mode='index' type='blog_post'/>
    </documents>
    <nodes>
      <node hostalias='node1'/>
    </nodes>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
      </proton>
    </engine>
  </content>

</services>

<container> defines the container cluster for document, query and result processing:
<search> sets up the search endpoint for Vespa queries.
<document-api> sets up the document endpoint for feeding. The default port for both endpoints is 8080.
<nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)

<content> defines how documents are stored and searched within Vespa:
<redundancy> denotes how many copies to keep of each document.
<documents> assigns the document types in the search definition — setting the mode attribute to index for the document type enables indexed search (as opposed to streaming search). The content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)
<nodes> defines the hosts for the content cluster.

Deployment Specification

hosts.xml contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. This tutorial uses a single node:

<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>

Search Definition

The blog_post document type mentioned in src/main/application/services.xml is defined in a search definition. src/main/application/searchdefinitions/blog_post.sd contains the search definition for a document of type blog_post:

search blog_post {

    document blog_post {

        field date_gmt type string {
            indexing: summary
        }

        field language type string {
            indexing: summary
        }

        field author type string {
            indexing: summary
        }

        field url type string {
            indexing: summary
        }

        field title type string {
            indexing: summary | index
        }

        field blog type string {
            indexing: summary
        }

        field post_id type string {
            indexing: summary
        }

        field tags type array<string> {
            indexing: summary
        }

        field blogname type string {
            indexing: summary
        }

        field content type string {
            indexing: summary | index
        }

        field categories type array<string> {
            indexing: summary
        }

        field date type int {
            indexing: summary | attribute
        }

    }


    fieldset default {
        fields: title, content
    }


    rank-profile post inherits default {

        first-phase {
            expression:nativeRank(title, content)
        }

    }

}

document is wrapped inside another element called search. The name following these elements, here blog_post, must be exactly the same for both.

The field property indexing configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing (see indexing language). Each part of the indexing pipeline is separated by the pipe character ‘|’. The two keywords index and summary are the most common ones: index creates a search index for the field, and summary makes the field part of the document summary returned with each hit.

Deploy the Application Package

Once done with the application package, deploy the Vespa application: build and start Vespa as in the quick start. We assume that the local clone of the vespa sample apps is mounted at /vespa-sample-apps as in the quick start guide. Deploy the application:

$ cd /vespa-sample-apps/blog-search
$ vespa-deploy prepare src/main/application && vespa-deploy activate

This prints that the application was activated successfully, along with the checksum, timestamp and generation for this deployment (more on that later). Pointing a browser to http://localhost:8080/ApplicationStatus returns JSON-formatted information about the active application, including its checksum, timestamp and generation; these should match the values printed when vespa-deploy activate was run. The generation increases by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.

The Vespa node is now configured and ready for use, so it is time to feed it some data.

Feeding Data

As mentioned before, the data fed to Vespa must match the search definition for the document type. The data downloaded from Kaggle, contained in trainPosts.json, must be converted to a valid Vespa document format before it can be fed to Vespa. Find a parser in the utility repository for this tutorial. Since the full data set is unnecessarily large for the purposes of this first part of the tutorial, we use only the first 10,000 lines of it, but feel free to load all 1.1M entries if you prefer:

$ head -10000 trainPosts.json > trainPostsSmall.json
$ python parse.py trainPostsSmall.json > tutorial_feed.json
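
The conversion step can be pictured with the following sketch. It is not the actual parse.py from the utility repository; it only illustrates, under stated assumptions, how a raw post could be wrapped in a Vespa JSON put operation. The function name to_vespa_put is hypothetical, the namespace blog-search is inferred from the Document API URL used later in this tutorial, and deriving the int date field from date_gmt is an assumption based on the YYYYMMDD format described for that field.

```python
import json


def to_vespa_put(post, namespace="blog-search"):
    """Wrap one raw post (a dict) in a Vespa JSON 'put' operation.

    Hypothetical helper for illustration, not the tutorial's parse.py.
    """
    # date_gmt looks like 'yyyy-mm-dd hh:mm:ss'; keep the date part as int YYYYMMDD.
    date_int = int(post["date_gmt"][:10].replace("-", ""))
    return {
        # Document id layout: id:<namespace>:<document type>::<user-specified id>
        "put": "id:{}:blog_post::{}".format(namespace, post["post_id"]),
        "fields": {
            "post_id": str(post["post_id"]),
            "title": post["title"],
            "content": post["content"],
            "url": post["url"],
            "date_gmt": post["date_gmt"],
            "date": date_int,
            "tags": list(post.get("tags", [])),
            "categories": list(post.get("categories", [])),
        },
    }


# Made-up input record for illustration.
raw_post = {
    "post_id": "1750271",
    "date_gmt": "2012-04-26 10:15:00",
    "url": "http://example.wordpress.com/2012/04/26/a-post/",
    "title": "A post about music",
    "content": "<p>Body text</p>",
    "tags": ["music"],
    "categories": ["Uncategorized"],
}
operation = to_vespa_put(raw_post)
print(json.dumps(operation, indent=2))
```

The fields emitted here are a subset of those in blog_post.sd; a real converter would need to produce every field the application relies on.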

With Vespa-compatible data, send this to Vespa using one of the tools Vespa provides for feeding. In this part of the tutorial, the Java feeding API is used, which is suitable for most applications requiring high throughput.

$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --verbose --file tutorial_feed.json --host localhost --port 8080

Note that in the sample-apps/blog-search directory, there is a file with sample data. You may also feed this file using this method.

Track feeding progress

Use the Metrics API to track number of documents indexed:

$ curl -s 'http://localhost:19112/state/v1/metrics' | tr ',' '\n' | grep -A 2 proton.doctypes.blog_post.numdocs

You can also inspect the search node state with:

$ vespa-proton-cmd --local getState  

Fetch documents

Although searching is the most useful way to access the documents, one can fetch documents by document id using the Document API:

$ curl -s 'http://localhost:8080/document/v1/blog-search/blog_post/docid/1750271' | python -m json.tool
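
For scripted access, the same URL can be assembled programmatically. The sketch below only builds the URL and does not contact Vespa; the helper name document_url and its defaults are illustrative assumptions that mirror the curl example above.

```python
from urllib.parse import quote


def document_url(doc_id, namespace="blog-search", doc_type="blog_post",
                 host="http://localhost:8080"):
    """Build a Document API URL for one document (hypothetical helper)."""
    # Percent-encode each path segment so unusual ids cannot break the URL.
    parts = [quote(str(p), safe="") for p in (namespace, doc_type, doc_id)]
    return "{}/document/v1/{}/{}/docid/{}".format(host, parts[0], parts[1], parts[2])


url = document_url(1750271)
print(url)
```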

The first query

Searching with Vespa is done using a simple HTTP interface, with basic GET requests. The general form of an unstructured search request is:

<host>/<templatename>?<param1=value1>&<param2=value2>...

The template name is optional. The only mandatory parameter is the query itself, given with query=<query string> for the simple query language, or with yql=<yql query> when using the advanced query syntax.

  • The simple query language is intended to be usable directly by end users, and provides a somewhat simplistic interface to Vespa.
  • The advanced query syntax is intended for programmatic use, and is the syntax we use in these tutorials. It uses the YQL query language.

More details can be found in the Search API.
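
Because YQL statements contain spaces, quotes and semicolons, they have to be URL-encoded before being sent. The following sketch shows one way to build such a request URL with Python's standard library; the helper name search_url is hypothetical, and the host and port are the defaults used elsewhere in this tutorial.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(yql, host="http://localhost:8080", **params):
    """Build a search URL for a YQL query (hypothetical helper)."""
    query = {"yql": yql}
    query.update(params)  # optional extra request parameters
    # urlencode percent-encodes quotes and semicolons and turns spaces into '+'.
    return "{}/search/?{}".format(host, urlencode(query))


yql = 'select * from sources * where default contains "music";'
url = search_url(yql)
print(url)

# Decoding the query string gives back the original YQL statement.
decoded = parse_qs(urlsplit(url).query)["yql"][0]
print(decoded)
```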

Simple query language examples

Given the above search definition, where the fields title and content are part of the field set default, any document containing the word “music” in one or more of these two fields matches our query below:

$ curl -s 'http://localhost:8080/search/?query=music' | python -m json.tool

Looking at the output, please note:

  • The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
  • Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking — more about ranking and ordering later. The hits are sorted by this value.
  • When multiple hits have the same relevance score their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
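
When processing results programmatically rather than reading them by eye, the points above translate into a few lines of code. The sketch below uses a made-up, heavily trimmed response; only the root/children nesting and the relevance and documentid properties are taken from the description above, and real responses contain more fields.

```python
def hits_by_relevance(response):
    """Return (documentid, relevance) pairs from a search response dict."""
    hits = response.get("root", {}).get("children", [])
    return [(hit["fields"]["documentid"], hit["relevance"]) for hit in hits]


# Made-up response for illustration; values are not real query output.
sample_response = {
    "root": {
        "fields": {"totalCount": 2},
        "children": [
            {"relevance": 0.25,
             "fields": {"documentid": "id:blog-search:blog_post::1", "title": "Music festival"}},
            {"relevance": 0.12,
             "fields": {"documentid": "id:blog-search:blog_post::2", "title": "Live music"}},
        ],
    }
}

pairs = hits_by_relevance(sample_response)
relevances = [relevance for _, relevance in pairs]
# Hits arrive sorted by descending relevance.
is_sorted = relevances == sorted(relevances, reverse=True)
print(pairs, is_sorted)
```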

Advanced query syntax examples

If you add &tracelevel=2 to the end of the simple query above, you will see how the query is parsed:

$ curl -s 'http://localhost:8080/search/?query=music&tracelevel=2' | python -m json.tool | grep Query
"message": "Query parsed to: select * from sources * where default contains \"music\";"

which can also be written in YQL as:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool

Other examples

yql=select+title+from+sources+*+where+title+contains+%22music%22%3B

Once more a search for the single term “music”, but this time with the explicit field title. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query than for the previous one.

yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22%3B

This is a query for the two terms “music” and “festival”, combined with an AND operation; it finds documents that match both terms, but not documents that match only one of them.

yql=select+*+from+sources+*+where+sddocname+contains+%22blog_post%22%3B

This is a single-term query in the special field sddocname for the value “blog_post”. This is a common and useful Vespa trick to get the number of indexed documents for a certain document type (search definition): sddocname is a special and reserved field which is always set to the name of the document type for a given document. Our documents are all of type blog_post, and will therefore automatically have the field sddocname set to that value.

This means that the query above really means “Return all documents of type blog_post”, and as such all documents in our index will be returned.

Refer to the Search API for more information.

Relevance and Ranking

Ranking and relevance were briefly mentioned above; what is really the relevance of a hit, and how can one change the relevance calculations? It is time to introduce rank profiles and rank expressions — simple, yet powerful methods for tuning the relevance.

Relevance is a measure of how well a given document matches a query. The default relevance is calculated by a formula that takes several factors into consideration, but it computes, in essence, how well the document matches the terms in the query.

When building specialized applications using Vespa, there are use cases for tweaking the relevance calculations:

  • Personalize search results based on some property; age, nationality, language, friends and friends of friends, and so on.
  • Rank fresh (age) documents higher, while still considering other relevance measures.
  • Rank documents by geographical location, searching for relevant resources nearby.

Vespa allows creating any number of rank profiles: named collections of ranking and relevance calculations that one can choose from at query time. A number of built-in functions and expressions are available to create highly specialized rank expressions.

Blog popularity signal

It is time to include the notion of blog popularity into the ranking function. Do this by including the post_popularity rank profile below at the bottom of src/main/application/searchdefinitions/blog_post.sd, just below the post rank profile.

    rank-profile post_popularity inherits default {

        first-phase {
            expression: nativeRank(title, content) + 10 * if(isNan(attribute(popularity)), 0, attribute(popularity))
        }

    }

Also, add a popularity field at the end of the document definition:

        field popularity type double {
            indexing: summary | attribute
        }

Notes (more information can be found in the search definition reference):

  • rank-profile post_popularity inherits default

    This configures Vespa to create a new rank profile named post_popularity, which inherits all the properties of the default rank-profile; only properties that are explicitly defined, or overridden, will differ from those of the default rank-profile.

  • first-phase

    Relevance calculations in Vespa are two-phased. The calculations done in the first phase are performed on every single document matching your query, while the second phase calculations are only done on the top n documents as determined by the calculations done in the first phase.

  • expression: nativeRank(title, content) + 10 * if(isNan(attribute(popularity)), 0, attribute(popularity))

Still using the basic search relevance for title and content, boosting documents based on some document level popularity signal.

This expression is used to rank documents. Here, the default ranking expression — the nativeRank of the default field set — is included to make the query relevant, while the custom, second term includes the document value attribute(popularity), if this is set. The weighted sum of these two terms is the final relevance for each document.
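
To see how the two terms combine, the expression can be mirrored in plain Python. This is only an arithmetic illustration: native_rank stands in for the value Vespa computes for nativeRank(title, content), and the input numbers are made up.

```python
import math


def post_popularity_score(native_rank, popularity):
    """Mirror: nativeRank(title, content) + 10 * if(isNan(popularity), 0, popularity)."""
    # Documents without a popularity value contribute no boost.
    boost = 0.0 if math.isnan(popularity) else popularity
    return native_rank + 10 * boost


with_popularity = post_popularity_score(0.2, 0.5)
without_popularity = post_popularity_score(0.2, float("nan"))
print(with_popularity, without_popularity)
```

With a weight of 10, even a moderate popularity value can outweigh the text-match term, which is why popular documents rise to the top.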

Deploy the configuration:

$ vespa-deploy prepare src/main/application && vespa-deploy activate

Use parse.py — which has a -p option to calculate and add a popularity field — and then feed the parsed data:

$ python parse.py -p trainPostsSmall.json > tutorial_feed_with_popularity.json
$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --file tutorial_feed_with_popularity.json --host localhost --port 8080

After feeding, query

$ curl -s 'http://localhost:8080/search/?query=music&ranking=post_popularity' | python -m json.tool

and find documents with high popularity values at the top.

Sorting and Grouping

What is an attribute?

An attribute is an in-memory field, which means that Vespa keeps the contents of the field in memory at all times; this behavior is different from that of regular index fields, which may be moved to a disk-based index as more documents are added and the index grows. Since attributes are kept in memory, they are excellent for fields which require fast access, e.g., fields used for sorting or grouping query results. The downside is that they make Vespa use more memory per document. To limit this cost, no index is generated for attributes by default, and search over these defaults to a linear scan. To build an index for an attribute field, include attribute:fast-search in the field definition.

Defining an attribute field

A field with indexing attribute will be present in memory at all times for very fast access; an example is found in blog_post.sd:

field date type int {
    indexing: summary | attribute
}

The data has the format YYYYMMDD, and since the field is an int, it can be used for range searches.

Example queries using attribute field

yql=select+*+from+sources+*+where+default+contains+%2220120426%22%3B

This is a single-term query for the term 20120426 in the default field set. (The strings %22 and %3B are URL encodings for " and ;.) In the search definition, the field date is not included in the default field set. Nevertheless, the string “20120426” is found in the content of many posts, which are then returned as results.

yql=select+*+from+sources+*+where+date+contains+%2220120426%22%3B

To get documents that were created 26 April 2012, and whose date field is 20120426, replace default with date in the YQL query string. Note that since date has not been defined with attribute:fast-search, searching will be done by scanning all documents.

yql=select+*+from+sources+*+where+default+contains+%22recipe%22+AND+date+contains+%2220120426%22%3B

A query with two terms; a search in the default field set for the term “recipe” combined with a search in the date field for the value 20120426. This search will be faster than the previous example, as the term “recipe” is for a field for which there is an index, and for which the search core will evaluate the query first. (This speedup is only noticeable with the full data set!)

Range searches

The examples above searched over date just as any other field, and requested documents where the value was exactly 20120426. Since the field is of type int, however, we can use it for range searches as well, using the “less than” and “greater than” operators (< and >, or %3C and %3E URL encoded). The query

yql=select+*+from+sources+*+where+date+%3C+20120401%3B

finds all documents where the value of date is less than 20120401, i.e., all documents from before April 2012, while

yql=select+*+from+sources+*+where+date+%3C+20120401+AND+date+%3E+20120229%3B

finds exactly the documents from March 2012.
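
These comparisons work because YYYYMMDD integers sort in the same order as the dates they encode. A small Python sketch with made-up date values illustrates the two filters above:

```python
# Made-up values of the date field, in YYYYMMDD form.
dates = [20120229, 20120301, 20120315, 20120331, 20120401, 20120426]

# Equivalent of: where date < 20120401
before_april = [d for d in dates if d < 20120401]

# Equivalent of: where date < 20120401 AND date > 20120229
march_only = [d for d in dates if 20120229 < d < 20120401]

print(before_april)
print(march_only)
```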

Sorting

The first feature we will look at is how an attribute can be used to change the order of the hits that are returned when you do a query. By now, you have probably noticed that hits are returned in order of descending relevance, i.e., how well the document matches the query — if not, take a moment to verify this.

Now try to send the following query to Vespa, and look at the order of the hits:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22+order+by+date%3B' | python -m json.tool

By default, sorting is done in ascending order. This can also be specified by appending asc after the sort attribute name. To sort the result in descending order, add the keyword desc after the sort attribute name.

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22+AND+default+contains+%22festival%22+order+by+date+desc%3B' | python -m json.tool

Query time data grouping

Grouping is the concept of looking through all matching documents at query-time and then performing operations with specified fields across all the documents — some common use cases include:

  • Find all the unique values for a given field, make one group per unique value, and return the count of documents per group.
  • Group documents by time and date in fixed-width or custom-width buckets. An example of fixed-width buckets could be to group all documents by year, while an example of custom buckets could be to sort bug tickets by date of creation into the buckets Today, Past Week, Past Month, Past Year, and Everything else.
  • Calculate the minimum/maximum/average value for a given field.

Displaying such groups and their sizes (in terms of matching documents per group) on a search result page, with a link to each such group, is a common way to let end-users refine and narrow down their search.
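
The first use case in the list above, one group per unique value with a document count, can be pictured with a few lines of Python. This is only a conceptual sketch of what such a grouping computes, using made-up values; it says nothing about how Vespa implements grouping internally.

```python
from collections import Counter


def top_groups(values, max_groups=3):
    """Count documents per unique value and keep the largest groups."""
    # most_common sorts the (value, count) pairs by descending count.
    return Counter(values).most_common(max_groups)


# Made-up date values for ten matching documents.
matching_dates = [
    20120419, 20120424, 20120419, 20120417, 20120419,
    20120424, 20120401, 20120417, 20120424, 20120419,
]
groups = top_groups(matching_dates)
print(groups)
```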

For now we will only do a very simple grouping query: get the list of unique values for date, ordered by the number of documents they occur in, and show the top 3:

$ curl -s 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20sddocname%20contains%20%22blog_post%22%20limit%200%20%7C%20all(group(date)%20max(3)%20order(-count())each(output(count())))%3B' | python -m json.tool

You will get output similar to the following (the exact counts depend on how much data was fed):

{
    "root": {
        "children": [
            {
                "children": [
                    {
                        "children": [
                            {
                                "fields": {
                                    "count()": 43
                                },
                                "id": "group:long:20120419",
                                "relevance": 1.0,
                                "value": "20120419"
                            },
                            {
                                "fields": {
                                    "count()": 40
                                },
                                "id": "group:long:20120424",
                                "relevance": 0.6666666666666666,
                                "value": "20120424"
                            },
                            {
                                "fields": {
                                    "count()": 39
                                },
                                "id": "group:long:20120417",
                                "relevance": 0.3333333333333333,
                                "value": "20120417"
                            }
                        ],
                        "continuation": {
                            "next": "BGAAABEBGBC"
                        },
                        "id": "grouplist:date",
                        "label": "date",
                        "relevance": 1.0
                    }
                ],
                "continuation": {
                    "this": ""
                },
                "id": "group:root:0",
                "relevance": 1.0
            }
        ],
        "coverage": {
            "coverage": 100,
            "documents": 1000,
            "full": true,
            "nodes": 0,
            "results": 1,
            "resultsFull": 1
        },
        "fields": {
            "totalCount": 1000
        },
        "id": "toplevel",
        "relevance": 1.0
    }
}

As you can see, the three most common unique values of date are listed, along with their respective counts.
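
Reading the counts out of this nested structure by hand is tedious. The sketch below walks the same nesting (root, group root, group list, groups) and returns the value and count of each group. The function name group_counts is hypothetical, and the embedded response is a trimmed copy of the output above.

```python
def group_counts(response):
    """Return (value, count) pairs for every group in a grouping response."""
    pairs = []
    for group_root in response["root"].get("children", []):
        for group_list in group_root.get("children", []):
            for group in group_list.get("children", []):
                pairs.append((group["value"], group["fields"]["count()"]))
    return pairs


# Trimmed copy of the grouping output shown above.
grouping_response = {
    "root": {
        "children": [
            {
                "id": "group:root:0",
                "children": [
                    {
                        "id": "grouplist:date",
                        "label": "date",
                        "children": [
                            {"fields": {"count()": 43}, "value": "20120419"},
                            {"fields": {"count()": 40}, "value": "20120424"},
                            {"fields": {"count()": 39}, "value": "20120417"},
                        ],
                    }
                ],
            }
        ],
        "fields": {"totalCount": 1000},
    }
}

counts = group_counts(grouping_response)
print(counts)
```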

Try to change the filter part of the YQL+ expression — the where clause — to a text match of “recipe”, or restrict date to be less than 20120401, and see how the list of unique values changes as the set of matching documents for your query changes. Try to search for the single term “Verizon” as well — a word we know is not present in our document set, and as such will match no documents — and you will see that the list of groups is empty.

Attribute limitations

Memory usage

Attributes are kept in memory at all times, as opposed to normal indexes where the data is mostly kept on disk. Even with large search nodes, one will notice that it is not practical to define all the search definition fields as attributes, as it will heavily restrict the number of documents per search node. Some Vespa installations have more than 1 billion documents per node; having megabytes of text in memory per document is not an option.

Matching

Another limitation is the way matching is done for attributes. Consider the field blogname from our search definition, and the document for the blog called “Thinking about museums”. In our original input, the value for blogname is a string built up of the three words “Thinking”, “about”, and “museums”, with a single whitespace character between them. How should we be able to search this field?

For normal index fields, Vespa does something called tokenization on the string. In our case this means that the string above is split into the three tokens “Thinking”, “about” and “museums”, enabling Vespa to match this document for the single-term queries “Thinking”, “about” and “museums”, for the exact phrase query “Thinking about museums”, and for a query with two or more tokens in either order (e.g. “museums thinking”). This is how we all have come to expect normal free text search to work.

However, there is a limitation in Vespa when it comes to attribute fields and matching; attributes do not support normal token-based matching — only exact matching or prefix matching. Exact matching is the default, and, as the name implies, it requires you to search for the exact contents of the field in order to get a match.
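
The difference between the two matching modes can be illustrated with a small Python sketch. It is a deliberate simplification: tokenization here is just lowercasing and splitting on whitespace, and case-insensitive comparison is an assumption made for the example, so it should not be read as a description of Vespa's actual linguistic processing.

```python
def tokenize(text):
    """Very rough stand-in for tokenization: lowercase and split on whitespace."""
    return text.lower().split()


def index_match(field_value, query):
    """Token-based matching: every query token must occur in the field."""
    tokens = set(tokenize(field_value))
    return all(term in tokens for term in tokenize(query))


def attribute_exact_match(field_value, query):
    """Exact matching: the query must equal the whole field value."""
    return field_value.lower() == query.lower()


def attribute_prefix_match(field_value, query):
    """Prefix matching: the field value must start with the query."""
    return field_value.lower().startswith(query.lower())


blogname = "Thinking about museums"
print(index_match(blogname, "museums thinking"))                   # token match
print(attribute_exact_match(blogname, "museums"))                  # single word only
print(attribute_exact_match(blogname, "Thinking about museums"))   # full value
print(attribute_prefix_match(blogname, "Thinking ab"))             # prefix of value
```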

When to use attributes

There are both advantages and drawbacks to using attributes: they enable sorting and grouping, but require more memory and give limited matching capabilities. When to use attributes depends on the application; in general, use attributes for:

  • fields used for sorting, e.g., a last-update timestamp,
  • fields used for grouping, e.g., problem severity, and
  • fields that are not long string fields.

Finally, all numeric fields should always be attributes.

Clean environment by removing all documents

vespa-remove-index removes all documents:

$ vespa-stop-services
$ vespa-remove-index
$ vespa-start-services

Conclusion

This concludes the basic Vespa tutorial. You should now have a basic understanding of how Vespa can help build your application. In the next part of the tutorial we will show how we can use statistics and machine learning to extend a basic search application into a recommendation system.