Text Search Tutorial

Introduction

In this tutorial, we will guide you through the setup of a text search application built on top of Vespa. By the end, you will be able to store text documents in Vespa and search them via text queries. The application built here will be the foundation for other tutorials that add further improvements, such as creating ranking functions based on Machine Learning (ML) models.

The main goal here is to set up a text search app based on simple term-match features such as BM25 [1] and nativeRank. We will cover how to create, deploy and feed the Vespa application, going from raw data to a fully functional text search app. In addition, we will showcase how easy it is to switch and experiment with different ranking functions in Vespa.

Preamble

We start by acquiring the scripts and code required to follow this tutorial from our sample apps repository. The first step is to clone the sample-apps repo from GitHub and move into the text-search directory. Start in an empty directory:

$ git clone --depth 1 https://github.com/vespa-engine/sample-apps.git
$ cd sample-apps/text-search

This repository contains a fully fledged Vespa application, including a front-end search UI. This tutorial, however, starts with the basics and develops the application over multiple parts.

Dataset

We use a dataset called MS MARCO throughout this tutorial. MS MARCO is a collection of large-scale datasets released by Microsoft with the intent of advancing deep learning research related to search. There are many tasks associated with the MS MARCO datasets, but here we are interested in building an end-to-end search application capable of returning relevant documents for a text query.

For the purposes of this tutorial we have included a small sample of the dataset under the msmarco/sample directory, which contains only around 1,000 documents. This is sufficient for following along with this tutorial; however, if you want to experiment with the entire dataset of more than 3 million documents, download the data with the following command:

$ ./bin/download-msmarco.sh

It will create a msmarco/download directory within the text-search directory and download the required files. Note that the full dataset currently takes around 21G of disk space, and the conversion scripts below can take a fair amount of time.

The sample or downloaded data needs to be converted to the format expected by Vespa. This includes extracting documents, queries and relevance judgements from the downloaded files and converting them to the Vespa format. If you downloaded the entire dataset, the script samples 1,000 queries and 100,000 documents so that this tutorial can conveniently be followed on a laptop. To convert the data, run the following script (for either the sample or the downloaded dataset):

$ ./bin/convert-msmarco.sh

To adjust the number of queries and documents to sample, edit the convert-msmarco.sh script to your liking. After running the script we end up with a file msmarco/vespa.json containing documents such as the one below:

{
  "put": "id:msmarco:msmarco::D1555982",
  "fields": {
    "id": "D1555982",
    "url": "https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR",
    "title": "The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation  ",
    "body": "Science   Mathematics Physics The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation ... "
  }
}

In addition to vespa.json we also have a test-queries.tsv file containing the sampled queries along with the document id that is relevant to each particular query. Each of those relevant documents is guaranteed to be in the sampled pool of documents included in vespa.json, so that we have a fair chance of retrieving it when sending sample queries to our Vespa application.
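To get a feel for the data before feeding it, you can peek at a few of the sampled queries (the exact contents will depend on the sample you generated):

$ head -3 msmarco/test-queries.tsv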

Create a Vespa Application Package

A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done, and how data will be processed during feeding and indexing. Let's define the minimum set of files required to create our basic text search application: msmarco.sd, services.xml and hosts.xml. All of these files need to be included within the application package directory.

For this tutorial we will create a new Vespa application rather than using the one in the repository, so we create a directory for this application:

$ mkdir application

Search definition

A search definition is a configuration of a document type and how it should be stored, indexed, ranked, searched and presented. For this application we define a document type called msmarco. Write the following to application/searchdefinitions/msmarco.sd:

search msmarco {
    document msmarco {
        field id type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field url type string {
            indexing: summary
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
            summary: dynamic
        }
    }

    document-summary minimal {
        summary id type string {  }
    }

    fieldset default {
        fields: title, body
    }

    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
        }
    }

    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)
        }
    }
}

Here, we define the msmarco search definition, which primarily includes two things: a definition of the fields the msmarco document type should have, and a definition of how Vespa should rank documents given a query.

The document section contains the fields of the document, their types and how Vespa should index them. The field property indexing configures the indexing pipeline for a field. For more information see search definitions - indexing. Note that we are enabling the usage of BM25 for the fields title and body by including index: enable-bm25 in the respective fields. This is a necessary step to allow us to use them in the bm25 ranking profile.

Next, the document summary class minimal is defined. Document summaries are used to control what data is returned for a query. The minimal summary here only returns the document id, which is useful for speeding up relevance testing as less data needs to be returned. The default document summary is defined by which fields are indexed with the summary command, which in this case are all the fields. In addition, we've set up the body field to show a dynamic summary, meaning that Vespa will try to extract the parts of the document most relevant to the query. For more information, refer to the reference documentation on document summaries.

Document summaries can be selected by using the summary query parameter.
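For example, assuming the application is running locally on port 8080 as configured later in this tutorial, the minimal summary defined above can be requested like this:

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod&summary=minimal"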

Fieldsets provide a way to group fields together so that multiple fields can be searched by the same query. That way a query such as

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod"

will match all documents containing the words what, is, dad, and bod in either the title, the body, or both.

Vespa allows creating any number of rank profiles: named collections of ranking and relevance calculations that one can choose from at query time. A number of built-in functions and expressions are available for creating highly specialized rank expressions. In this tutorial we define our default rank profile to be based on nativeRank, which is a linear combination of the normalized scores computed by several term-matching features, as described in the nativeRank documentation. In addition, we create a bm25 rank profile to compare against the one based on nativeRank. BM25 is faster to compute than nativeRank, while still giving better results in some applications. The first-phase keyword indicates that the expression defined in the rank profile will be computed for every document matching the query.

Rank profiles are selected by using the ranking query parameter.
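The ranking and summary parameters can be combined in a single request, for example to use the bm25 profile while returning only the minimal summary:

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod&ranking=bm25&summary=minimal"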

Services Specification

The services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service. Write the following to application/services.xml:

<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">

  <container id="text_search" version="1.0">
    <search></search>
    <document-api></document-api>
    <nodes>
      <node hostalias="node1"></node>
    </nodes>
  </container>

  <content id="msmarco" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="msmarco" mode="index"></document>
      <document-processing cluster="text_search"></document-processing>
    </documents>
    <nodes>
      <node hostalias="node1" distribution-key="0"></node>
    </nodes>
  </content>

</services>
Some notes about the elements above:

  • <container> defines the container cluster for document, query and result processing
  • <search> sets up the search endpoint for Vespa queries. The default port is 8080.
  • <document-api> sets up the document endpoint for feeding.
  • <content> defines how documents are stored and searched
  • <redundancy> denotes how many copies to keep of each document.
  • <documents> assigns the document types in the search definition — the content cluster capacity can be increased by adding node elements — see elastic Vespa. (See also the reference for more on content cluster setup.)
  • <nodes> defines the hosts for the content cluster.

Deployment Specification

hosts.xml contains a list of all the hosts/nodes that are part of the application, with an alias for each of them. This tutorial uses a single node. Write the following to application/hosts.xml:

<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <host name="localhost">
    <alias>node1</alias>
  </host>
</hosts>
Deploy the application package

Once we have finished writing our application package, we can deploy it in a Docker container.

Note that indexing the full data set requires 47GB of disk space. These tutorials have been tested with a Docker container with 12GB RAM. We used settings similar to those described in the Vespa quick start guide.

We will map our working directory into the /app directory inside the Docker container. To start the Vespa container:

$ docker run -m 12G --detach --name vespa-msmarco --hostname vespa-msmarco \
    --privileged --volume `pwd`:/app \
    --publish 8080:8080 --publish 19112:19112 vespaengine/vespa

Make sure that the configuration server is running - signified by a 200 OK response:

$ docker exec vespa-msmarco bash -c 'curl -s --head http://localhost:19071/ApplicationStatus'
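If the configuration server is up, the first line of the response should look something like this:

HTTP/1.1 200 OK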

Now, to deploy the Vespa application:

$ docker exec vespa-msmarco bash -c '/opt/vespa/bin/vespa-deploy prepare /app/application && /opt/vespa/bin/vespa-deploy activate'

(Alternatively, run the equivalent commands from a shell inside the Docker container.) This prints that the application was activated successfully, along with the checksum, timestamp and generation of this deployment. The generation will increase by 1 each time a new application is successfully deployed, and is the easiest way to verify that the correct version is active.

After a short while, querying port 8080 should return a 200 status code, indicating that your application is up and running:

$ curl -s --head http://localhost:8080/ApplicationStatus

Feed data and run a test query

Feeding the data

The data fed to Vespa must match the search definition for the document type. The vespa.json file generated by the convert-msmarco.sh script described in the dataset section already has data in the format expected by Vespa. Feed it to Vespa using one of the tools Vespa provides for feeding, for example the Java feeding API:

$ docker exec vespa-msmarco bash -c 'java -jar /opt/vespa/lib/jars/vespa-http-client-jar-with-dependencies.jar \
    --verbose --file /app/msmarco/vespa.json --host localhost --port 8080'
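To verify that documents have been fed, you can fetch one directly by id through the document API that we enabled in services.xml, for example the document shown in the dataset section (assuming it is part of your sample):

$ curl -s http://localhost:8080/document/v1/msmarco/msmarco/docid/D1555982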
 

Run a test query

Once the data has started feeding, we can send queries to our search app, even before feeding has finished:

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod&summary=minimal"

Following is a partial output of the query above when using the small dataset sample:

{
  "root": {
    "id": "toplevel",
    "relevance": 1.0,
    "fields": {
      "totalCount": 3
    },
    "coverage": {
      "coverage": 100,
      "documents": 1000,
      "full": true,
      "nodes": 1,
      "results": 1,
      "resultsFull": 1
    },
    "children": [
      {
        "id": "index:msmarco/0/59444ddd06537a24953b73e6",
        "relevance": 0.2747543357589305,
        "source": "msmarco",
        "fields": {
          "sddocname": "msmarco",
          "id": "D2977840"
        }
      },
      ...
    ]
  }
}

As we can see, there were 3 documents that matched the query out of the 1,000 available in the corpus. The number of matched documents will be much larger when using the full dataset. The results are ranked by relevance score, which in this case is computed by the nativeRank feature that we defined as the default rank profile in our search definition file.
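If you only want the total match count and not the hits themselves, you can ask Vespa to return no hits by setting the hits query parameter to zero:

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod&hits=0"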

Compare and evaluate different ranking functions

Vespa allows us to easily experiment with different rank profiles. For example, we could use the bm25 rank profile instead of the default by including the ranking parameter in the query:

$ curl -s "http://localhost:8080/search/?query=what+is+dad+bod&ranking=bm25"

In order to align with the guidelines of the MS MARCO competition, we have created an evaluate.py script to compute the mean reciprocal rank (MRR) metric given a file containing test queries. The script loops through the queries, sends them to Vespa, parses the results and computes the reciprocal rank for each query (if the relevant document is returned at position 3, for example, the reciprocal rank for that query is 1/3), logging it to an output file.

$ ./src/python/evaluate.py bm25 msmarco
$ ./src/python/evaluate.py default msmarco

The commands above output the mean reciprocal rank score, as well as generating two output files, test-output-default.tsv and test-output-bm25.tsv, containing the reciprocal rank metric for each query sent. We can then aggregate those values to compute the mean reciprocal rank for each rank profile, or plot them to get a richer comparison between the two ranking functions. For the small dataset in the sample data, the MRR is approximately equal. For the full MS MARCO dataset, on the other hand, we see a different picture:

Looking at the figure, we can see that the faster BM25 feature delivered superior results for this specific application.
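To aggregate the per-query values yourself, a one-liner like the following computes the mean of the reciprocal ranks; this sketch assumes the reciprocal rank is the last tab-separated field on each line of the output file:

$ awk -F'\t' '{ sum += $NF } END { if (NR > 0) print sum / NR }' test-output-bm25.tsv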

To stop and remove the Docker container for this application:

$ docker rm -f vespa-msmarco

1. Robertson, Stephen; Zaragoza, Hugo; et al. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.