In this tutorial, we will guide you through the setup of a text search application built on top of Vespa. By the end, you will be able to store text documents in Vespa and search them via text queries. The application built here will be the foundation for later tutorials that add further improvements, such as ranking functions based on Machine Learning (ML) models.
The main goal here is to set up a text search app based on simple term-match features such as BM25 [1] and nativeRank. We will cover how to create, deploy and feed the Vespa application, going from raw data to a fully functional text search app. In addition, we will showcase how easy it is to switch between and experiment with different ranking functions in Vespa.
Prerequisites:

- Disk space: the vespaengine/vespa container image plus headroom for data requires free disk space. Read more.
- curl

Note: Use 12 GB RAM for Docker if running with the full dataset.
This tutorial uses the Vespa CLI, the official command-line client for Vespa.ai. It is a single binary without any runtime dependencies, available for Linux, macOS and Windows.
$ brew install vespa-cli
We start by acquiring the scripts and code required to follow this tutorial from the sample-apps repository. The first step is to clone the sample-apps repo from GitHub and move into the text-search directory.
Start in an empty directory:
$ vespa clone text-search text-search && cd text-search
The repository contains a fully-fledged Vespa application including a front-end search UI. This tutorial however will start with the basics and develop the application over multiple parts.
We use a dataset called MS MARCO throughout this tutorial. MS MARCO is a collection of large-scale datasets released by Microsoft with the intent of advancing deep learning research related to search. There are many tasks associated with the MS MARCO datasets, but here we are interested in the task of building an end-to-end search application capable of returning relevant documents for a text query.
For the purposes of this tutorial we have included a small sample of the dataset under the msmarco/sample directory, which contains only around 1,000 documents. This is sufficient for following along with this tutorial.
However, if you want to experiment with the entire dataset of more than 3 million documents, download the data. Make sure to accept the terms and conditions the MS MARCO dataset is released under. The following will download the entire MS MARCO Document Ranking collection:
$ ./bin/download-msmarco.sh
This creates a msmarco/download directory within the text-search directory and downloads the required files. Note that the download currently takes around 21 GB of disk space, and the conversion scripts below can take a fair amount of time.
The sample or downloaded data needs to be converted to the format expected by Vespa. This includes extracting documents, queries and relevance judgements from the files we downloaded and then converting to the Vespa format. If you downloaded the entire dataset, we take a small sample of 1,000 queries and 100,000 documents for the convenience of following this tutorial on a laptop. To convert the data, run the following script:
$ ./bin/convert-msmarco.sh
To adjust the number of queries and documents to sample, edit the convert-msmarco.sh script to your liking.
After running the script we end up with a file msmarco/vespa.json containing lines such as the one below:
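As a hedged illustration of the feed format (the document id and field values below are invented; the field names come from the schema defined later in this tutorial), each line is a self-contained JSON document operation with a put id and a fields object:

```python
import json

# Illustrative only: the document id and field values are invented.
# The field names match the msmarco schema defined later in this tutorial,
# and the overall shape follows Vespa's JSON feed format: a "put"
# operation plus a "fields" object.
doc = {
    "put": "id:msmarco:msmarco::D123",
    "fields": {
        "id": "D123",
        "title": "An example document title",
        "url": "https://example.com/some-document",
        "body": "The full body text of the example document ...",
    },
}
print(json.dumps(doc))
```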
In addition to vespa.json, we also have a test-queries.tsv file containing a list of the sampled queries along with the document id that is relevant to each particular query. Each of those relevant documents is guaranteed to be in the sampled pool of documents included in vespa.json, so that we have a fair chance of retrieving it when sending sample queries to our Vespa application.
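The exact column layout of test-queries.tsv is not shown here; assuming each line holds a query and its relevant document id separated by a tab (the sample line below is invented), such a line can be parsed with the standard csv module:

```python
import csv
import io

# Hypothetical sample line: the real test-queries.tsv produced by
# convert-msmarco.sh may order its columns differently.
sample = "what is dad bod\tD123\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
for query, relevant_doc_id in reader:
    print(query, "->", relevant_doc_id)
```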
A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done, and how data will be processed during feeding and indexing. Let's define the minimum set of required files to create our basic text search application: msmarco.sd and services.xml.
For this tutorial we will create a new Vespa application rather than using the one in the repository, so we create a directory for this application:
$ mkdir -p app/schemas
A schema is a configuration of a document type and what we should compute over it. For this application we define a document type called msmarco. Write the following to text-search/app/schemas/msmarco.sd:
schema msmarco {
    document msmarco {
        field id type string {
            indexing: attribute | summary
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index
            index: enable-bm25
        }
    }
    document-summary minimal {
        summary id { }
    }
    fieldset default {
        fields: title, body, url
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body, url)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body) + bm25(url)
        }
    }
}
Here, we define the msmarco schema, which primarily includes two things: a definition of the fields the msmarco document type should have, and a definition of how Vespa should rank documents given a query.
The document section contains the fields of the document, their types, and how Vespa should index them. The field property indexing configures the indexing pipeline for a field. For more information, see schemas - indexing.
Note that we are enabling the use of BM25 for the fields title, body and url by including index: enable-bm25 in the respective fields.
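To make the effect of BM25 concrete, here is a toy implementation of the textbook Okapi BM25 term score following the probabilistic relevance framework cited at the end of this tutorial (Vespa's exact implementation and parameter defaults may differ):

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of a single query term.

    tf: term frequency in the document; df: number of documents
    containing the term. k1 and b are the usual free parameters;
    Vespa's defaults may differ.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# A term appearing twice in a short document outscores the same term
# appearing once in a long document.
short_doc = bm25_term_score(tf=2, df=100, num_docs=100_000,
                            doc_len=50, avg_doc_len=200)
long_doc = bm25_term_score(tf=1, df=100, num_docs=100_000,
                           doc_len=1000, avg_doc_len=200)
print(short_doc > long_doc)  # True
```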
Next, the document summary class minimal is defined. Document summaries are used to control what data is returned for a query. The minimal summary here only returns the document id, which is useful for speeding up relevance testing, as less data needs to be returned. The default document summary is defined by which fields are indexed with the summary command, which in this case is every field except body; we leave body out of the summary to save disk usage. For more information, refer to the document summaries reference. Document summaries can be selected using the summary query API parameter.
Fieldsets provide a way to group fields together to be able to search multiple fields. String fields grouped using fieldsets should share the same match and linguistic processing settings.
Vespa allows creating any number of rank profiles: named collections of ranking and relevance calculations that one can choose from at query time. A number of built-in rank features are available for creating highly specialized ranking expressions. In this tutorial we define our default rank profile to be based on nativeRank, which is a linear combination of the normalized scores computed by several term-matching features, as described in the nativeRank documentation. In addition, we create a bm25 rank profile to compare with the one based on nativeRank. BM25 is faster to compute than nativeRank, while still giving better results in some applications.
The first-phase keyword indicates that the expression defined in the rank profile will be computed for every document matching the query. Vespa ranking also supports phased ranking. Rank profiles are selected at run-time by using the ranking query API parameter.
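Under the hood, such queries are HTTP requests against the container's /search/ endpoint. A sketch of assembling the request URL with Python's standard library (nothing is sent here; the parameter names are Vespa's query API parameters, and the values mirror the queries used later in this tutorial):

```python
from urllib.parse import urlencode

# Build the query string for Vespa's /search/ endpoint. The "ranking"
# parameter selects a rank profile from the schema at query time.
params = {
    "yql": "select title,url,id from msmarco where userQuery()",
    "query": "what is dad bod",
    "ranking": "bm25",
    "type": "weakAnd",
}
url = "http://localhost:8080/search/?" + urlencode(params)
print(url)
```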
The services.xml file defines the services that make up the Vespa application: which services to run and how many nodes per service. Write the following to text-search/app/services.xml:
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="text_search" version="1.0">
        <search />
        <document-processing />
        <document-api />
    </container>
    <content id="msmarco" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="msmarco" mode="index" />
            <document-processing cluster="text_search" />
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1" />
        </nodes>
    </content>
</services>
Some notes about the elements above:

- <container> defines the container cluster for document, query and result processing
- <search> sets up the query endpoint. The default port is 8080.
- <document-api> sets up the document endpoint for feeding.
- <content> defines how documents are stored and searched
- <redundancy> denotes how many copies to keep of each document.
- <documents> assigns the document types in the schema. The content cluster capacity can be increased by adding node elements; see elasticity. (See also the reference for more on content cluster setup.)
- <nodes> defines the hosts for the content cluster.

Once we have finished writing our application package, we can deploy it in a Docker container.
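Before deploying, an optional sanity check is to confirm that services.xml is well-formed XML. Here the file's content is inlined so the snippet is self-contained; you could equally parse the file on disk:

```python
import xml.etree.ElementTree as ET

# The same services.xml as above, inlined so the check is self-contained.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="text_search" version="1.0">
        <search />
        <document-processing />
        <document-api />
    </container>
    <content id="msmarco" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="msmarco" mode="index" />
            <document-processing cluster="text_search" />
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1" />
        </nodes>
    </content>
</services>"""

root = ET.fromstring(xml_text)
print([d.get("type") for d in root.iter("document")])  # ['msmarco']
```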
Note that indexing the full dataset requires 47 GB of disk space. These tutorials have been tested with a Docker container with 12 GB RAM, using settings similar to those described in the Vespa quick start guide. Start the Vespa container:
$ docker run --detach --name vespa-msmarco --hostname vespa-msmarco \
    --publish 8080:8080 --publish 19071:19071 \
    vespaengine/vespa
Configure the Vespa CLI to use the local Docker container:
$ vespa config set target local
Starting the container can take a short while. Before continuing, make sure that the configuration service is running by using vespa status:
$ vespa status deploy --wait 300
Now, deploy the Vespa application that you have created in the app directory:
$ vespa deploy --wait 300 app
The data fed to Vespa must match the document type in the schema.
The file vespa.json generated by the convert-msmarco.sh script described in the dataset section already has data in the appropriate format expected by Vespa:
$ vespa feed -t http://localhost:8080 msmarco/vespa.json
Once the data is fed, send a query to the search app:
$ vespa query \
    'yql=select title,url,id from msmarco where userQuery()' \
    'query=what is dad bod' \
    'type=all'
This query combines YQL userQuery() with Vespa's simple query language. The default query type is all, requiring that all the terms match the document.
Following is a partial output of the query above when using the small dataset sample:
As we can see, three documents matched the query out of the 1,000 available in the corpus.
The number of matched documents will be much larger when using the full dataset.
We can change the retrieval mode from all to any:
$ vespa query \
    'yql=select title,url,id from msmarco where userQuery()' \
    'query=what is dad bod' \
    'type=any'
This will retrieve and rank all documents that match any of the query terms. As can be seen from the result, almost all documents matched the query. These types of queries can be performance-optimized using the Vespa weakAnd query operator:
$ vespa query \
    'yql=select title,url,id from msmarco where userQuery()' \
    'query=what is dad bod' \
    'type=weakAnd'
In this case, a much smaller set of documents was fully ranked, due to using weakAnd instead of any.
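The difference in match counts between the all and any modes can be sketched with plain set operations over toy documents (invented toy data, not MS MARCO):

```python
# Toy illustration of the matching semantics: "all" requires every query
# term to appear in a document, "any" requires at least one. weakAnd
# approximates "any" ranking while fully scoring far fewer candidates.
docs = {
    "d1": {"what", "is", "a", "dad", "bod"},
    "d2": {"the", "dad", "went", "home"},
    "d3": {"gardening", "tips"},
}
query = {"what", "is", "dad", "bod"}

matches_all = [doc for doc, terms in docs.items() if query <= terms]
matches_any = [doc for doc, terms in docs.items() if query & terms]

print(matches_all)  # ['d1']
print(matches_any)  # ['d1', 'd2']
```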
In any case, the retrieved documents are ranked by the relevance score, which here is delivered by the nativeRank rank feature that we defined as the default rank profile in our schema definition file.
Vespa allows us to easily experiment with different rank profiles. For example, we could use the bm25 rank profile instead of the default rank profile by including the ranking parameter in the query:
$ vespa query \
    'yql=select title,url,id from msmarco where userQuery()' \
    'query=what is dad bod' \
    'ranking=bm25' \
    'type=weakAnd'
Note that the relevance score, which is normalized in the range [0,1] for the default rank profile using nativeRank, changes to an un-normalized range when using the bm25 rank feature.
In order to align with the guidelines of the MS MARCO competition, we have created evaluate.py to compute the mean reciprocal rank (MRR) metric given a file containing test queries. The script loops through the queries, sends them to Vespa, parses the results and computes the reciprocal rank for each query, and logs it to an output file:
$ ./src/python/evaluate.py bm25 msmarco
$ ./src/python/evaluate.py default msmarco
The commands above output the mean reciprocal rank score, as well as generating two output files, msmarco/test-output-default.tsv and msmarco/test-output-bm25.tsv, containing the reciprocal rank metric for each query sent. We can then aggregate those values to compute the mean reciprocal rank for each rank profile, or plot them to get a richer comparison between the two ranking functions.
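Reciprocal rank is 1 divided by the position of the first relevant hit, and MRR is its mean over all queries. A minimal sketch of the metric (the real evaluate.py may differ in details such as rank cutoffs):

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / position of the first relevant document, 0 if not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, relevant_id) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

results = [
    (["D1", "D2", "D3"], "D1"),  # relevant doc at rank 1 -> RR = 1.0
    (["D4", "D5", "D6"], "D5"),  # rank 2 -> RR = 0.5
    (["D7", "D8", "D9"], "D0"),  # not retrieved -> RR = 0.0
]
print(mean_reciprocal_rank(results))  # 0.5
```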
For the small dataset in the sample data, the MRR is approximately equal for the two rank profiles. For the full MS MARCO dataset, on the other hand, we see a different picture:
Looking at the figure we can see that the faster BM25 feature has delivered superior results for this specific application.
Stop and remove the Docker container and data:
$ docker rm -f vespa-msmarco
Check out Improving Text Search through ML.
[1] Robertson, Stephen and Zaragoza, Hugo and others, 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.