News search and recommendation tutorial - applications, feeding and querying

Introduction

This is the second part of the tutorial series for setting up a Vespa application for personalized news recommendations. The parts are:

  1. Getting started
  2. A basic news search application - application packages, feeding, query
  3. News search - sorting, grouping, and ranking
  4. Generating embeddings for users and news articles
  5. News recommendation - partial updates (news embeddings), ANNs, filtering
  6. News recommendation with searchers - custom searchers, doc processors
  7. News recommendation with parent-child - parent-child, tensor ranking
  8. Advanced news recommendation - intermission - training a ranking model
  9. Advanced news recommendation - ML models

In this part, we will build upon the minimal Vespa application in the previous part. First, we’ll take a look at the Microsoft News Dataset (MIND), which we’ll be using throughout the tutorial. We’ll use this to set up the search schema, deploy the application and feed some data. We’ll round off with some basic querying before moving on to the next part of the tutorial: searching for content.

For reference, the final state of this tutorial can be found in the app-2-feed-and-query sub-directory of the news sample application.

The Microsoft News Dataset

During these tutorials, we will use the Microsoft News Dataset (MIND). This is a large-scale dataset for news recommendation research, containing more than 160,000 articles, 15 million impression logs, and 1 million users. We will not use the full dataset in this tutorial. To make the tutorial easier to follow, we will use the much smaller DEMO subset, which contains only 5,000 users. However, readers are free to use the entire dataset at their own discretion.

The MIND dataset description contains an introduction to the contents of this dataset. For this tutorial, there are two pieces of data in particular that we will use:

  • News article content which contains data such as title, abstract, news category, and entities extracted from the title and abstract.
  • Impressions which contain a list of news articles that were shown to a user, labeled with whether or not the user clicked on them.

We’ll start by developing a search application, so we’ll focus on the news content first. We’ll use the impression data when we begin building the recommendation system later in this series.

Let’s start by downloading the data. The news sample app directory will be our starting point. We’ve included a script to download the data for us:

$ git clone https://github.com/vespa-engine/sample-apps.git
$ cd sample-apps/news
$ ./bin/download-mind.sh demo

The argument defines which dataset to download. Here, we download the demo dataset, but small and large are also valid options. Both the training (train) and validation (dev) parts are downloaded to a directory called mind. The demo dataset is around 27 MB.
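Assuming the script unpacks the standard MIND files, the mind directory should now contain the training and validation splits, roughly like this (only the files used in this series are shown):

$ tree mind
mind
├── dev
│   ├── behaviors.tsv
│   └── news.tsv
└── train
    ├── behaviors.tsv
    └── news.tsv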

Taking a look at the data, in mind/train/news.tsv, we see tab-separated lines like the following:

N16680  travel  traveltripideas The Most Beautiful Natural Wonder in Every State        While humans have built some impressive, gravity-defying, and awe-inspiring marvels   here are the most photographed structures in the world   the natural world may have us beat.      https://www.msn.com/en-us/travel/traveltripideas/the-most-beautiful-natural-wonder-in-every-state/ss-AAF8Brj?ocid=chopendata    []      []

Here we see the news article id, a category, a subcategory, the title, an abstract, and a URL to the article’s content. The last two fields contain the identified entities in the title and abstract. This particular news item has no such entities.

Note that the body content of each news article is retrievable via the URL. The dataset repository contains tools to download this content. For the purposes of this tutorial, we won't be using the article bodies, but feel free to download them yourself.

Let’s start building a Vespa application to make this data searchable. We’ll create the directory my-app under the news sample app directory to contain our Vespa application:

$ mkdir my-app

Application Packages

Figure: Vespa's overall architecture

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done and how data will be processed during feeding and indexing. The schema, e.g., news.sd, is a required part of an application package — the other files needed are services.xml and hosts.xml. We mentioned these files in the previous part but didn’t really explain them at the time. We’ll go through them here, starting with the services specification.

Services Specification

The services.xml file defines the services that make up the Vespa application — which services to run and how many nodes per service. Write the following to my-app/services.xml:

<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">

  <container id="default" version="1.0">
    <search />
    <document-api />
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>

  <content id="news" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="news" mode="index" />
    </documents>
    <nodes>
      <node hostalias="node1" distribution-key="0" />
    </nodes>
  </content>

</services>
Quite a lot is set up here:

  • <container> defines the container cluster for document, query and result processing
  • <search> sets up the query endpoint. The default port is 8080.
  • <document-api> sets up the document endpoint for feeding.
  • <nodes> defines the nodes required per service. (See the reference for more on container cluster setup).
  • <content> defines how documents are stored and searched.
  • <redundancy> denotes how many copies to keep of each document.
  • <documents> assigns the document types in the schema — the content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)

Deployment Specification

The hosts.xml file contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. Write the following to my-app/hosts.xml:

<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>

This sets up the alias node1 to represent the localhost. You saw this alias in the services specification above.

Schema

In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a news article, a photo, or a user. Each document type must be defined in the Vespa configuration through a schema. Think of the document type in a schema as similar to a table definition in a relational database - it consists of a set of fields, each with a given name, a specific type, and some optional properties.

The data fed into Vespa must match the structure of the schema, and the results returned when searching will be in this format as well.

The news document type mentioned in the services.xml file above is defined in a schema. Schemas are found under the schemas directory in the application package, and must have the same name as the document type mentioned in services.xml.

Given the MIND dataset described above, we’ll set up the schema as follows. Write the following to my-app/schemas/news.sd:

schema news {
    document news {
        field news_id type string {
            indexing: summary | attribute
            attribute: fast-search
        }
        field category type string {
            indexing: summary | attribute
        }
        field subcategory type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field abstract type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
        }
        field date type int {
            indexing: summary | attribute
        }
        field clicks type int {
            indexing: summary | attribute
        }
        field impressions type int {
            indexing: summary | attribute
        }
    }

    fieldset default {
        fields: title, abstract, body
    }

}

The document is wrapped inside another element called schema. The name following these elements, here news, must be exactly the same for both.

This document contains several fields. Each field has a type, such as string, int, or tensor. Fields also have properties. For instance, the indexing property configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing — see indexing language. Each step in the indexing pipeline is separated by the pipe character ‘|’:

  • index: create a search index for this field.
  • attribute: store this field in memory as an attribute, which allows sorting, querying (filtering), and grouping on it.
  • summary: let this field be part of the document summary in the result set.

Here, we also use the index property, which sets up parameters for how Vespa should index the field. For the title, abstract, and body fields, we instruct Vespa to set up an index compatible with bm25 ranking for text search.
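To give an idea of why enable-bm25 matters: fields indexed this way can later be scored with the bm25 rank feature. A minimal sketch of such a rank-profile, placed inside the schema element (ranking is covered properly in the next part, so this is not part of the application yet):

    rank-profile bm25_sketch inherits default {
        first-phase {
            expression: bm25(title) + bm25(abstract)
        }
    }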

Deploy the Application Package

With the three necessary files above, we are ready to deploy the application package. Make sure it looks like this (use ls if tree is not installed):

$ tree my-app/
my-app/
├── hosts.xml
├── schemas
│   └── news.sd
└── services.xml

Deploy the application package by compressing it and sending it to the configuration server's prepareandactivate endpoint:

$ (cd my-app && zip -r - .) | \
  curl --header Content-Type:application/zip --data-binary @- \
  localhost:19071/application/v2/tenant/default/prepareandactivate

Alternatively, if you run Vespa inside Docker as set up in the getting-started part, start the container if it is not already running, wait for the configuration server to respond, and deploy from inside the container with vespa-deploy:

$ docker run -m 10G --detach --name vespa --hostname vespa-tutorial \
    --volume `pwd`:/app --publish 8080:8080 vespaengine/vespa
$ docker exec vespa bash -c 'curl -s --head http://localhost:19071/ApplicationStatus'
$ docker exec vespa bash -c '/opt/vespa/bin/vespa-deploy prepare /app/my-app && \
    /opt/vespa/bin/vespa-deploy activate'

Verify that the query endpoint responds, and continue after the application is successfully deployed:

$ curl -s --head http://localhost:8080/ApplicationStatus

Feeding data

The data fed to Vespa must match the schema for the document type. The downloaded MIND data must be converted to a valid Vespa document format before it can be fed to Vespa. Again, we have a script to help us with this:

$ python3 src/python/convert_to_vespa_format.py mind

The argument is the directory containing the downloaded data, which was mind. This script creates a new file in that directory called vespa.json, containing all 28,603 news articles in the demo data set in Vespa's JSON feed format.
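Each entry in vespa.json is a document put operation. For the example article shown earlier, an entry will look roughly like the following (the exact fields and values depend on the conversion script; the abstract is truncated here):

{
    "put": "id:news:news::N16680",
    "fields": {
        "news_id": "N16680",
        "category": "travel",
        "subcategory": "traveltripideas",
        "title": "The Most Beautiful Natural Wonder in Every State",
        "abstract": "While humans have built some impressive, gravity-defying, and awe-inspiring marvels ...",
        "url": "https://www.msn.com/en-us/travel/traveltripideas/the-most-beautiful-natural-wonder-in-every-state/ss-AAF8Brj?ocid=chopendata"
    }
}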

This file can now be fed to Vespa using the method described in the previous part, the vespa-http-client:

$ java -jar vespa-http-client-jar-with-dependencies.jar \
  --verbose --file mind/vespa.json --endpoint http://localhost:8080

If Vespa runs inside the Docker container, run the client from inside the container instead:

$ docker exec vespa bash -c 'java -jar /opt/vespa/lib/jars/vespa-http-client-jar-with-dependencies.jar \
    --verbose --file /app/mind/vespa.json --host localhost --port 8080'

You can follow the progress by checking the number of indexed documents through the metrics API:

$ docker exec vespa bash -c 'curl -s http://localhost:19092/metrics/v1/values' | tr "," "\n" | grep content.proton.documentdb.documents.active

You can verify that specific documents are fed by fetching documents by document id using the Document API:

$ curl -s 'http://localhost:8080/document/v1/news/news/docid/N10864' | python -m json.tool
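The response contains the document's stored fields and has roughly this form (values abbreviated; the summary fields follow the schema above):

{
    "pathId": "/document/v1/news/news/docid/N10864",
    "id": "id:news:news::N10864",
    "fields": {
        "news_id": "N10864",
        "category": "...",
        "title": "...",
        "abstract": "...",
        "url": "..."
    }
}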

The first query

Searching with Vespa is done using HTTP GET or HTTP POST requests, like:

<host:port>/search?yql=value1&param2=value2...

or with a JSON-query:

{
	"yql" : value1,
	param2 : value2,
	...
}

The only mandatory parameter is the query, using yql=<yql query>. More details can be found in the Query API.

Consider the query:

select * from sources * where default contains "music";

Given the above schema, where the fields title, abstract and body are part of the fieldset default, any document containing the word “music” in one or more of these fields matches that query. Let’s try that with either a GET query:

$ curl -s 'http://localhost:8080/search/?yql=select+*+from+sources+*+where+default+contains+%22music%22%3B' | python -m json.tool

or a POST JSON query:

$ curl -s -H "Content-Type: application/json" --data '{"yql" : "select * from sources * where default contains \"music\";"}' \
http://localhost:8080/search/ | python -m json.tool

Note that you can use the built-in query builder found at localhost:8080/querybuilder/ which can help you build queries with, for instance, autocompletion of YQL.

Looking at the output (an abbreviated response is sketched after this list), please note:

  • The field documentid in the output and how it matches the value we assigned to each put operation when feeding data to Vespa.
  • Each hit has a property named relevance, which indicates how well the given document matches our query, using a pre-defined default ranking function. You have full control over ranking — more about ranking and ordering later. The hits are sorted by this value.
  • When multiple hits have the same relevance score, their internal ordering is undefined. However, their internal ordering will not change unless the documents are re-indexed.
  • You can add &tracelevel=9 to dump query parsing details.
  • The totalCount field at the top level contains the number of documents that matched the query.
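A sketch of the response structure, with illustrative placeholder values only:

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 1000
        },
        "coverage": { ... },
        "children": [
            {
                "id": "index:...",
                "relevance": 0.254,
                "fields": {
                    "documentid": "id:news:news::N16680",
                    "title": "...",
                    "abstract": "..."
                }
            }
        ]
    }
}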

Other examples

{"yql" : "select title from sources * where title contains \"music\";"}

Again, this is a search for the single term “music”, but this time explicitly in the title field. This means that we only want to match documents that contain the word “music” in the field title. As expected, you will see fewer hits for this query than for the previous one.

{"yql" : "select * from sources * where title contains \"music\" AND default contains \"festival\";"}

This is a query for the two terms “music” and “festival”, combined with an AND operation; it matches only documents that contain both terms, not documents that contain just one of them.

{"yql" : "select * from sources * where sddocname contains \"news\";"}

This is a single-term query in the special field sddocname for the value "news". This is a common and useful Vespa trick to get the number of indexed documents for a certain document type: sddocname is a special and reserved field which is always set to the name of the document type for a given document. The documents are all of type news, and will automatically have the field sddocname set to that value.

This means that the query above really means “return all documents of type news”, and as such, all documents in the index are returned.
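To check this without retrieving any document data, you can for instance set hits to 0 and inspect totalCount, which should equal the number of documents fed (28,603 if the full demo set was fed):

$ curl -s -H "Content-Type: application/json" \
    --data '{"yql" : "select * from sources * where sddocname contains \"news\";", "hits" : 0}' \
    http://localhost:8080/search/ | python -m json.tool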

Conclusion

We now have a Vespa application running with searchable data. In the next part of the tutorial, we’ll explore searching with sorting, grouping, and ranking results.

When you are done with this series and want to shut down and remove the Vespa container, run:

$ docker rm -f vespa