This tutorial will guide you through setting up a simple text search application. By the end, you will be able to index text documents in Vespa and search them via text queries. The application built here will be the foundation for other tutorials, such as creating ranking functions based on Machine Learning (ML) models.
The main goal is to set up a text search app based on simple text scoring features such as BM25 [1] and nativeRank.
Prerequisites:
- Docker installed and running (the tutorial runs Vespa in the vespaengine/vespa container)
- Disk space for the vespaengine/vespa container image, plus headroom for the data
- curl
This tutorial uses the Vespa CLI to deploy, feed and query Vespa. Below, we use Homebrew to download and install vespa-cli; alternatively, you can download a binary from GitHub for your OS/CPU architecture.
$ brew install vespa-cli
We acquire the scripts to follow this tutorial from the sample-apps repository via vespa clone:
$ vespa clone text-search text-search && cd text-search
The repository contains a fully-fledged Vespa application, but below, we will build it all from scratch for educational purposes.
We use a dataset called MS MARCO throughout this tutorial.
MS MARCO is a collection of large-scale datasets released by Microsoft
with the intent of helping the advance of deep learning research related to search.
Many tasks are associated with MS MARCO datasets,
but we want to build an end-to-end search application that returns relevant documents to a text query.
We have included a small dataset sample for this tutorial under the ext/sample
directory, which contains around 1000 documents.
The sample data must be converted to the Vespa JSON feed format. The following step extracts documents, queries and relevance judgments from the sample files:
$ ./bin/convert-msmarco.sh
After running the script, we end up with a file ext/vespa.json
containing lines such as the one below:
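The exact line depends on the sample, but a document in the Vespa JSON feed format has roughly this shape (the document ID and field values here are illustrative placeholders, not actual dataset content):

{
    "put": "id:msmarco:msmarco::D1234567",
    "fields": {
        "id": "D1234567",
        "title": "Example document title",
        "url": "https://example.com/some/page",
        "body": "Example body text ..."
    }
}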
In addition to vespa.json, we also have a test-queries.tsv file containing a list of the sampled queries along with the document ID relevant to each particular query.
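The exact layout comes from the conversion script, but conceptually each line pairs a query with its relevant document ID, tab-separated, roughly like this (values illustrative):

what is dad bod	D1234567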
The application package consists of two files: msmarco.sd and services.xml. For this tutorial, we will create a new Vespa application rather than using the one in the repository, so we start by creating a directory for it:
$ mkdir -p app/schemas
A schema is a document-type configuration; a single Vespa application can have multiple schemas, each defining a document type.
For this application, we define a schema msmarco, which must be saved in a file named schemas/msmarco.sd.
Write the following to text-search/app/schemas/msmarco.sd:
schema msmarco {
    document msmarco {
        field language type string {
            indexing: "en" | set_language
        }
        field id type string {
            indexing: attribute | summary
            match: word
        }
        field title type string {
            indexing: index | summary
            match: text
            index: enable-bm25
        }
        field body type string {
            indexing: index
            match: text
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, body, url
    }
    document-summary minimal {
        summary id { }
    }
    document-summary url-tokens {
        summary url {}
        summary url-tokens {
            source: url
            tokens
        }
        from-disk
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body, url)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body) + bm25(url)
        }
    }
}
A lot is going on here; let us go through it in detail.
The document
section contains the fields of the document, their types,
and how Vespa should index and match them.
The field property indexing
configures the indexing pipeline for a field.
For more information, see schemas - indexing.
The string data type is used to represent both unstructured and structured texts,
and there are significant differences between index and attribute. The above
schema includes the default match modes for the attribute and index fields for visibility.
Note that we enable BM25 for title, body and url by including index: enable-bm25.
The language field is the only field not in the MS MARCO dataset. We hardcode its value
to "en" since the dataset is English. Using set_language
avoids automatic language detection and uses the value when processing the other
text fields. Read more in linguistics.
A fieldset allows searching across multiple fields. Defining a fieldset does not
add indexing or storage overhead. String fields grouped using fieldsets must share the same
match and linguistic processing settings because
the query processing that searches a field or fieldset uses one type of transformation.
Next, we define two document summaries.
Document summaries control what fields are available in the response; we include the url-tokens
document-summary to
demonstrate later how we can get visibility into how text is converted into searchable tokens.
You can define many rank profiles, which are named collections of score calculations and ranking phases.
In this tutorial, we define our default rank-profile to use nativeRank.
In addition, we have a bm25 rank-profile that uses BM25. Both are examples of
text-scoring rank features in Vespa.
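For reference, the bm25 rank feature follows the classic Okapi BM25 formulation [1]; per field and query it is roughly:

\text{bm25}(\text{field}, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t) \cdot (k_1 + 1)}{\mathrm{tf}(t) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{fieldLength}}{\mathrm{avgFieldLength}}\right)}

where tf(t) is the term frequency in the field, idf(t) the inverse document frequency, and k1 and b are tunable parameters (commonly around 1.2 and 0.75; see the Vespa rank feature reference for the exact defaults used).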
The services.xml defines the services that make up
the Vespa application — which services to run and how many nodes per service.
Write the following to text-search/app/services.xml:
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="text_search" version="1.0">
        <search />
        <document-processing />
        <document-api />
    </container>
    <content id="msmarco" version="1.0">
        <min-redundancy>1</min-redundancy>
        <documents>
            <document type="msmarco" mode="index" />
            <document-processing cluster="text_search" />
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1" />
        </nodes>
    </content>
</services>
Some notes about the elements above:
- <container> defines the container cluster for document, query and result processing.
- <search> sets up the query endpoint. The default port is 8080.
- <document-api> sets up the document endpoint for feeding.
- <content> defines how documents are stored and searched.
- <min-redundancy> denotes how many copies to keep of each document.
- <documents> assigns the document types in the schema to content clusters. The content cluster capacity can be increased by adding node elements; see elasticity. (See also the reference for more on content cluster setup.)
- <nodes> defines the hosts for the content cluster.
Once we have finished writing our application package, we can deploy it. We use settings similar to those in the Vespa quick start guide.
Start the Vespa container:
$ docker run --detach --name vespa-msmarco --hostname vespa-msmarco \
    --publish 8080:8080 --publish 19071:19071 \
    vespaengine/vespa
Notice that we publish two ports: 8080 is the data-plane port, where we write and query documents, and 19071 is the control-plane port, where we deploy the application.
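To verify that both endpoints respond, a quick health check can look like this (a sketch; the /state/v1/health path is assumed, and the 8080 container endpoint only becomes ready after the application is deployed):

$ curl -s http://localhost:19071/state/v1/health   # control-plane (config server)
$ curl -s http://localhost:8080/state/v1/health    # data-plane (container, after deployment)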
Configure the Vespa CLI to use the local container:
$ vespa config set target local
Starting the container can take a short while. Make sure
that the configuration service is running by using vespa status:
$ vespa status deploy --wait 300
Now, deploy the Vespa application from the app
directory:
$ vespa deploy --wait 300 app
The data fed to Vespa must match the document type in the schema.
The file vespa.json
generated by the convert-msmarco.sh
script described in the dataset section
already has data in the appropriate format expected by Vespa:
$ vespa feed -t http://localhost:8080 ext/vespa.json
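As an aside, because services.xml enables document-api, single documents can also be written over HTTP with the /document/v1 API; a sketch, where the namespace and document ID are illustrative:

$ curl -X POST -H "Content-Type: application/json" \
    --data '{"fields": {"id": "D1234567", "title": "Example title", "body": "Example body", "url": "https://example.com"}}' \
    http://localhost:8080/document/v1/msmarco/msmarco/docid/D1234567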
This section demonstrates various ways to search the data using the Vespa query language. All
the examples use the vespa-cli client. The tool uses the HTTP API, and if you pass -v, you will see the equivalent curl API request.
$ vespa query \
    'yql=select * from msmarco where userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
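With -v, vespa-cli prints the request it sends; against the query API it is roughly equivalent to the following curl call (a sketch, exact quoting may differ):

$ curl -s -H "Content-Type: application/json" \
    --data '{"yql": "select * from msmarco where userInput(@user-query)", "user-query": "what is dad bod", "hits": 3, "language": "en"}' \
    http://localhost:8080/search/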
This query uses YQL userInput(), a robust way to combine free-text queries from users with application logic. Similar to set_language
in indexing, we specify
the language of the query using the language API parameter. This ensures
symmetric linguistic processing of both the query and the document text. Automatic language detection is inaccurate
for short query strings and might lead to asymmetric processing of queries and document texts.
Following is a partial output of the query above when using the small dataset sample:
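Trimmed down, the JSON response has roughly this structure (the hit content below is a placeholder, not actual output):

{
    "root": {
        "fields": {
            "totalCount": 562
        },
        "children": [
            {
                "id": "index:msmarco/0/...",
                "relevance": 0.25,
                "fields": {
                    "id": "...",
                    "title": "...",
                    "url": "..."
                }
            }
        ]
    }
}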
As shown, 562 documents matched the query out of 996 in the corpus. The first-phase
ranking expression scores all the matching documents.
A few important observations:
- Query terms that do not specify a field are searched in the fieldset default. In our case, fieldset default { fields: title, body, url }, so the query what is dad bod searches across all those three fields.
- Without the default fieldset in the schema, the above query would return zero hits, as the query did not specify a field.
- relevance holds the score computed by the rank profile. Vespa uses default by default. In our case: rank-profile default { first-phase { expression: nativeRank(title, body, url) } }
We can use query operator annotations for the userInput to control various
matching aspects. The following uses the defaultIndex
to specify which field (or fieldset) to search.
$ vespa query \
    'yql=select * from msmarco where {defaultIndex:"title"}userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
Notice how the query above matches fewer documents (totalCount: 116) because we limited the free-text query to the title field. We can
change the grammar to specify how the user query text is parsed into a query execution plan.
In the following example, we use grammar:"all"
to specify that we only want to retrieve documents where all the query terms match the title field.
$ vespa query \
    'yql=select * from msmarco where {defaultIndex:"title", grammar:"all"}userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
This query, using all
, matches only one document. Notice how the relevance of the hit is the same as in the above example. The difference
between the two types of queries is in the matching specification.
We can use userInput
to build a query that searches multiple fields (or fieldsets):
$ vespa query \
    'yql=select * from msmarco where ({defaultIndex:"title", grammar:"all"}userInput(@user-query)) or ({defaultIndex:"url", grammar:"all"}userInput(@user-query))' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
Sometimes, we want to add a query-time boost if some field matches a query term; the following uses the rank query operator. The rank query operator retrieves using the first operand, while the remaining operands can only impact ranking.
It is important to note that the following approach for query time term boosting is in the context of using the nativeRank
text scoring feature.
$ vespa query \
    'yql=select * from msmarco where rank(userInput(@user-query), url contains ({weight:1000, significance:1.0}"www.answers.com"))' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
The above will match the user query against the default fieldset and produce match features for the second operand. It does not
change the retrieval or matching as the number of documents exposed to ranking is the same as before. The
rank
operator can be used to implement a variety of use cases around boosting.
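As a sketch of that flexibility, the same pattern can carry several boost operands at once (the boost terms below are only illustrative):

$ vespa query \
    'yql=select * from msmarco where rank(userInput(@user-query), url contains "www.answers.com", title contains ({weight:500}"dad"))' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'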
Now, we can combine the userInput
with application logic. We add an application-specific query filter on the url
field
to demonstrate how to combine userInput
with other query time constraints.
We add ranked:false
to tell Vespa that this
specific term should not contribute to the relevance calculation, and filter:true to ensure that the term is not used for bolding/highlighting or dynamic snippeting.
$ vespa query \
    'yql=select * from msmarco where userInput(@user-query) and url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en'
Notice that the relevance
stays the same since we used ranked:false
for the filter.
Let us see what is going on by adding query tracing:
$ vespa query \
    'yql=select * from msmarco where userInput(@user-query) and url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
    'user-query=what is dad bod' \
    'trace.level=3' \
    'language=en'
We can notice the following in the trace output:
query=[AND (WEAKAND(100) default:what default:is default:dad default:bod) |url:'huffingtonpost co uk']
Notice that the userInput
part is converted to a weakAnd query operator and that this operator is
AND'ed with a phrase search ('huffingtonpost co uk') in the url
field. Notice also the field scoping, where the query terms are prefixed with default, and that punctuation characters (.) are removed as part of the tokenization. If filtering on specific strings like this is a common pattern, we should create a separate field to avoid phrase matching, since phrase matching is more expensive than a single-token search.
Query tracing, combined with a summary using tokens, can help debug matching.
$ vespa query \
    'yql=select * from msmarco where url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
    'trace.level=0' \
    'language=en' \
    'summary=url-tokens'
This gives us insight into how the input url
field was tokenized and indexed. Those are the tokens that the query can match.
Notice how punctuation characters like ":", ",", ".", "/", "_" and "-" are removed as part of the text tokenization.
Observations:
- The filter term uses ranked:false, so it does not contribute to the relevance score.
- The url field uses match:text, so its content is tokenized into searchable tokens.
- Since dadbod is a single token in the url, it cannot match dad or bod; it is represented as the single token dadbod.
Let us do a similar example to demonstrate the impact of linguistic stemming:
$ vespa query \
    'yql=select * from msmarco where url contains ({filter:true,ranked:false}"http")' \
    'summary=url-tokens' \
    'language=en'
"url": "http://www.ourbabynamer.com/meaning-of-Anika.html", "url-tokens": [ "http", "www", "ourbabynamer", "com", "meaning", "of", "anika", "html" ]
Notice that a query for https matches http, because https in the query is stemmed to http. If we turn off stemming on the query side and search for https directly, we end up with 0 results.
$ vespa query \
    'yql=select * from msmarco where url contains ({filter:true,ranked:false,stem:false}"https")' \
    'summary=url-tokens' \
    'language=en'
Similarly, if we pass a different language tag, which will not stem https to http, we also get 0 results:
$ vespa query \
    'yql=select * from msmarco where url contains ({filter:true,ranked:false}"https")' \
    'summary=url-tokens' \
    'language=de'
The previous section covered free-text search matching, linguistics, and how to combine business logic with
free-text user queries. All the examples used a default
rank-profile using Vespa's nativeRank text scoring feature.
With free-text search, we can use other text scoring functions, like BM25. All the matching capabilities (or limitations) still apply, and we can use fieldsets or fields; the difference is in the text scoring function, where BM25 differs from nativeRank.
$ vespa query \
    'yql=select * from msmarco where userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en' \
    'ranking=bm25'
While the nativeRank
text score is normalized to the range 0 to 1, BM25 is unbounded, as demonstrated above. When
querying (matching), we can ask Vespa to compute both features in the same query.
Modify the schema and add a new rank-profile combined
:
schema msmarco {
    document msmarco {
        field language type string {
            indexing: "en" | set_language
        }
        field id type string {
            indexing: attribute | summary
            match: word
        }
        field title type string {
            indexing: index | summary
            match: text
            index: enable-bm25
        }
        field body type string {
            indexing: index
            match: text
            index: enable-bm25
        }
        field url type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    fieldset default {
        fields: title, body, url
    }
    document-summary minimal {
        summary id { }
    }
    document-summary url-tokens {
        summary url {}
        summary url-tokens {
            source: url
            tokens
        }
        from-disk
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body, url)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body) + bm25(url)
        }
    }
    rank-profile combined inherits default {
        first-phase {
            expression: bm25(title) + bm25(body) + bm25(url) + nativeRank(title) + nativeRank(body) + nativeRank(url)
        }
        match-features {
            bm25(title)
            bm25(body)
            bm25(url)
            nativeRank(title)
            nativeRank(body)
            nativeRank(url)
        }
    }
}
Then, re-deploy the Vespa application from the app
directory:
$ vespa deploy --wait 300 app
Adding or removing rank profiles is a live-change as it only impacts how we score documents, not how we index or match them.
Run a query with the new rank-profile:
$ vespa query \
    'yql=select * from msmarco where userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en' \
    'ranking=combined'
This will produce a result like the following:
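Trimmed down to a single hit, it has roughly this shape (the numbers are placeholders; the relevance here is simply the sum of the listed features):

{
    "relevance": 10.3,
    "fields": {
        "title": "...",
        "matchfeatures": {
            "bm25(body)": 5.1,
            "bm25(title)": 3.2,
            "bm25(url)": 1.4,
            "nativeRank(body)": 0.1,
            "nativeRank(title)": 0.2,
            "nativeRank(url)": 0.3
        }
    }
}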
Notice the matchfeatures field that is added to the hit when using match-features in the rank-profile. Here, we have all the computed features for the matched document, and the final relevance score is (in this case) the sum of these features.
This query and ranking example demonstrates that, for a single query searching a set of fields via a fieldset, we can compute different types of text-scoring features and combine them.
Now consider the following where we limit matching to the title field:
$ vespa query \
    'yql=select * from msmarco where {defaultIndex:"title"}userInput(@user-query)' \
    'user-query=what is dad bod' \
    'hits=3' \
    'language=en' \
    'ranking=combined'
Now, we do not get features for body
or url
, because they were not matched by the query.
Check out the Improving Text Search through ML tutorial. If you do not want to proceed with the Improving Text Search through ML guide, you can stop and remove the container (and its data):
$ docker rm -f vespa-msmarco
[1] Robertson, Stephen and Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.