This tutorial will guide you through setting up a simple text search application.
By the end, you will be able to index text documents in Vespa and search them via text queries.
The application built here will be the foundation for other tutorials,
such as creating ranking functions based on Machine Learning (ML) models.
The main goal is to set up a text search app based on simple text scoring features
such as BM25 [1] and nativeRank.
Prerequisites:
Linux, macOS or Windows 10 Pro on x86_64 or arm64,
with Podman or Docker installed.
See Docker Containers for system limits and other settings.
For CPUs older than Haswell (2013), see CPU Support.
This tutorial uses Vespa-CLI to deploy, feed and query Vespa. Below, we use Homebrew to install vespa-cli; you can also
download a binary from GitHub for your OS/CPU architecture.
$ brew install vespa-cli
We acquire the scripts to follow this tutorial from the
sample-apps repository via
vespa clone.
$ vespa clone text-search text-search && cd text-search
The repository contains a fully-fledged Vespa application, but below, we will build
it all from scratch for educational purposes.
Dataset
We use a dataset called MS MARCO throughout this tutorial.
MS MARCO is a collection of large-scale datasets released by Microsoft
with the intent of helping the advance of deep learning research related to search.
Many tasks are associated with MS MARCO datasets,
but we want to build an end-to-end search application that returns relevant documents to a text query.
We have included a small dataset sample for this tutorial under the ext/sample directory, which contains around 1000 documents.
The sample data must be converted to Vespa JSON feed format.
The following step includes extracting documents, queries and relevance judgments from the sample files:
$ ./bin/convert-msmarco.sh
After running the script, we end up with a file ext/vespa.json containing lines such as the one below:
{"put":"id:msmarco:msmarco::D1555982","fields":{"id":"D1555982","url":"https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR","title":"The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation","body":"Science Mathematics Physics The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation ... "}}
In addition to vespa.json we also have a test-queries.tsv file containing a list of the sampled queries
along with the document ID relevant to each particular query.
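As a quick sanity check on the converted feed, the sketch below parses one feed line and verifies that it carries the fields used in this tutorial. The helper name check_feed_line is hypothetical, the sample body text is shortened, and the code assumes one JSON put operation per line, as in the example above.

```python
import json

# Fields each document is expected to carry (taken from the example feed line).
EXPECTED_FIELDS = {"id", "url", "title", "body"}

def check_feed_line(line: str) -> dict:
    """Parse one line of the JSON feed and return its fields.

    Raises ValueError if the line is not a put operation or lacks a field.
    """
    op = json.loads(line)
    if "put" not in op:
        raise ValueError("expected a 'put' operation")
    missing = EXPECTED_FIELDS - op["fields"].keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The last part of the document id should equal the 'id' field, e.g. D1555982.
    doc_key = op["put"].rsplit("::", 1)[-1]
    if doc_key != op["fields"]["id"]:
        raise ValueError("document id and 'id' field disagree")
    return op["fields"]

# Shortened copy of the example line shown above.
sample = (
    '{"put":"id:msmarco:msmarco::D1555982","fields":{"id":"D1555982",'
    '"url":"https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR",'
    '"title":"The hot glowing surfaces of stars emit energy in the form of '
    'electromagnetic radiation","body":"Science Mathematics Physics ..."}}'
)
fields = check_feed_line(sample)
print(fields["id"], "-", fields["title"][:30])
```

Running the same check over every line of ext/vespa.json before feeding can catch a broken conversion early.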
Create a Vespa Application Package
A Vespa application package is a set of configuration files and optional Java components that together define the behavior of a Vespa system. Let us define the minimum set of required files to create our basic text search application,
msmarco.sd and services.xml.
For this tutorial, we will create a new Vespa application rather than using the one in the repository,
so we will create a directory for this application:
$ mkdir -p app/schemas
Schema
A schema is a document-type configuration; a single Vespa application can have multiple schemas, each with its own document type.
For this application, we define a schema msmarco which must be saved in a file named schemas/msmarco.sd.
Write the following to text-search/app/schemas/msmarco.sd:
schema msmarco {
document msmarco {
field language type string {
indexing: "en" | set_language
}
field id type string {
indexing: attribute | summary
match: word
}
field title type string {
indexing: index | summary
match: text
index: enable-bm25
}
field body type string {
indexing: index
match: text
index: enable-bm25
}
field url type string {
indexing: index | summary
index: enable-bm25
}
}
fieldset default {
fields: title, body, url
}
document-summary minimal {
summary id { }
}
document-summary url-tokens {
summary url {}
summary url-tokens {
source: url
tokens
}
from-disk
}
rank-profile default {
first-phase {
expression: nativeRank(title, body, url)
}
}
rank-profile bm25 inherits default {
first-phase {
expression: bm25(title) + bm25(body) + bm25(url)
}
}
}
A lot is going on here; let us go through it in detail.
Document type and fields
The document section contains the fields of the document, their types,
and how Vespa should index and match them.
The field property indexing configures the indexing pipeline for a field.
For more information, see schemas - indexing.
The string data type is used to represent both unstructured and structured texts,
and there are significant differences between index and attribute. The above
schema spells out the default match modes for attribute and index fields for visibility.
Note that we enable BM25 for title, body and url
by including index: enable-bm25. The language field is the only field not in the msmarco dataset. We hardcode its value
to "en" since the dataset is English. Using set_language avoids automatic language detection and uses the value when processing the other
text fields. Read more in linguistics.
Fieldset for matching across multiple fields
A fieldset allows searching across multiple fields. Defining a fieldset does not
add indexing or storage overhead. String fields grouped in a fieldset must share the same
match and linguistic processing settings because
the query processing that searches a field or fieldset applies only one type of transformation.
Document summaries to control search response contents
Next, we define two document summaries.
Document summaries control what fields are available in the response; we include the url-tokens document-summary to
demonstrate later how we can get visibility into how text is converted into searchable tokens.
Ranking to determine matched documents ordering
You can define many rank profiles,
which are named collections of score calculations and ranking phases.
In this tutorial, the default rank profile uses nativeRank.
In addition, we have a bm25 rank-profile that uses bm25. Both are examples of
text-scoring rank-features in Vespa.
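To give a feel for what a BM25-style text score computes, here is a minimal sketch of the textbook BM25 formula. It is not Vespa's implementation. The function names and the statistics in the example are hypothetical, and k1 = 1.2 and b = 0.75 are simply commonly used values.

```python
import math

def bm25_term(tf, field_len, avg_field_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score contribution of one query term in one field."""
    # Inverse document frequency: rare terms get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates via k1 and is normalized by field length via b.
    length_norm = 1 - b + b * field_len / avg_field_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

def bm25(term_stats, field_len, avg_field_len, n_docs):
    """Sum the per-term contributions; the sum has no fixed upper bound."""
    return sum(
        bm25_term(tf, field_len, avg_field_len, n_docs, df)
        for tf, df in term_stats
    )

# Hypothetical statistics: (term frequency in the field, documents containing the term).
stats = [(2, 40), (1, 700)]
print(bm25(stats, field_len=12, avg_field_len=10.0, n_docs=996))
```

Because each matching query term adds to the sum, the score grows with the number of matched terms, which is why a BM25 score is not confined to a fixed range.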
Services Specification
The services.xml defines the services that make up
the Vespa application — which services to run and how many nodes per service.
Write the following to text-search/app/services.xml:
<content> defines how documents are stored and searched.
<min-redundancy> denotes how many copies to keep of each document.
<documents> assigns the document types in the schema to content clusters —
the content cluster capacity can be increased by adding node elements —
see elasticity.
(See also the reference for more on content cluster setup.)
<nodes> defines the hosts for the content cluster.
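The sketch below shows what a single-node services.xml using these elements might look like and checks that it is well-formed. Everything not described above is an assumption for illustration: the container section, the ids text_search and msmarco, the redundancy value, and the node attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-node services.xml. The container section, the ids and the
# node attributes are assumptions, not taken from the tutorial text.
SERVICES_XML = """<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="text_search" version="1.0">
        <search />
        <document-processing />
        <document-api />
    </container>
    <content id="msmarco" version="1.0">
        <min-redundancy>1</min-redundancy>
        <documents>
            <document type="msmarco" mode="index" />
            <document-processing cluster="text_search" />
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1" />
        </nodes>
    </content>
</services>
"""

# Parsing fails loudly if the XML is not well-formed.
root = ET.fromstring(SERVICES_XML.encode("utf-8"))
content = root.find("content")
doc_types = [d.get("type") for d in content.iter("document")]
redundancy = content.findtext("min-redundancy")
node_count = len(content.findall("nodes/node"))
print(doc_types, redundancy, node_count)
```

The document type listed under documents must match the schema name, msmarco in this tutorial.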
Deploy the application package
Once we have finished writing our application package, we can deploy it.
We use settings similar to those in the Vespa quick start guide.
Notice that we publish two ports: 8080 is the data-plane port where we write and query documents, and 19071 is
the control-plane port where we deploy the application.
Configure the Vespa CLI to use the local container:
$ vespa config set target local
Starting the container can take a short while. Make sure
that the configuration service is running by using vespa status.
$ vespa status deploy --wait 300
Now, deploy the Vespa application from the app directory:
$ vespa deploy --wait 300 app
Feed the data
The data fed to Vespa must match the document type in the schema.
The file vespa.json generated by the convert-msmarco.sh script described in the dataset section
already has data in the appropriate format expected by Vespa:
This section demonstrates various ways to search the data using the Vespa query language. All
the examples use the vespa-cli client. The tool uses the HTTP API, and if you pass -v, you will see the curl equivalent
API request.
$ vespa query \
'yql=select * from msmarco where userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
This query uses the YQL userInput() function, a robust
way to combine free text queries from users with application logic. Similar to set_language in indexing, we specify
the language of the query using the language API parameter. This ensures
symmetric linguistic processing of both the query and the document text. Automatic language detection is inaccurate
for short query strings and might lead to asymmetric processing of queries and document texts.
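Since the CLI talks to the HTTP API, the same request can be expressed as a URL with query parameters. The sketch below only builds that URL. It assumes the local target with the query API on port 8080 under /search/, and the helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_search_url(user_query: str, hits: int = 3, language: str = "en",
                     endpoint: str = "http://localhost:8080/search/") -> str:
    """Build a query URL with the same parameters as the vespa query command."""
    params = {
        "yql": "select * from msmarco where userInput(@user-query)",
        "user-query": user_query,  # referenced from the YQL as @user-query
        "hits": hits,
        "language": language,      # explicit language avoids auto-detection
    }
    return f"{endpoint}?{urlencode(params)}"

url = build_search_url("what is dad bod")
print(url)

# Round-trip check: decoding the URL gives back the original parameters.
decoded = parse_qs(urlsplit(url).query)
print(decoded["user-query"][0])
```

Keeping the user text in its own parameter, rather than pasting it into the YQL string, means it needs no YQL escaping.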
Following is a partial output of the query above when using the small dataset sample:
{"root":{"id":"toplevel","relevance":1,"fields":{"totalCount":562},"children":[{"id":"id:msmarco:msmarco::D2977840","relevance":0.20676669550322158,"source":"msmarco","fields":{"sddocname":"msmarco","body":"<sep />After The Cut released a piece explaining <hi>what</hi> the <hi>dad</hi> <hi>bod</hi> <hi>is</hi> last week the internet pretty much exploded into debate over the trend <sep />","documentid":"id:msmarco:msmarco::D2977840","id":"D2977840","title":"What Is A Dad Bod An Insight Into The Latest Male Body Craze To Sweep The Internet","url":"http://www.huffingtonpost.co.uk/2015/05/05/what-is-a-dadbod-male-body_n_7212072.html"}}]}}
As shown, 562 documents matched the query out of 996 in the corpus. The first-phase ranking expression scores all the matching documents.
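When working with results programmatically, the match count and the per-hit scores can be read straight from the JSON. The sketch below uses a shortened copy of the response above; the helper name summarize is hypothetical.

```python
import json

# Shortened copy of the response shown above (body and url fields omitted).
response_text = """{"root":{"id":"toplevel","relevance":1,"fields":{"totalCount":562},
"children":[{"id":"id:msmarco:msmarco::D2977840","relevance":0.20676669550322158,
"source":"msmarco","fields":{"id":"D2977840",
"title":"What Is A Dad Bod An Insight Into The Latest Male Body Craze To Sweep The Internet"}}]}}"""

def summarize(result: dict) -> tuple[int, list[tuple[str, float]]]:
    """Return the total match count and (id, relevance) for each returned hit."""
    root = result["root"]
    total = root["fields"]["totalCount"]
    # 'children' is treated as optional here so that an empty result is handled too.
    hits = [(h["fields"]["id"], h["relevance"]) for h in root.get("children", [])]
    return total, hits

total, hits = summarize(json.loads(response_text))
print(f"{total} documents matched; returned {len(hits)} hit(s)")
for doc_id, relevance in hits:
    print(doc_id, round(relevance, 4))
```

Note that totalCount counts every matched document, while children holds only the hits requested with the hits parameter.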
A few important observations:
We did not specify which fields to search in the query. Vespa will, by default, use a field set or field named default when the query terms do not specify a field. In our case:
fieldset default {
fields: title, body, url
}
Our query for what is dad bod searches across all those three fields.
If we did not specify a default fieldset in the schema, the above query would return zero hits as the query did not specify a field.
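The hit relevance holds the score computed by the rank profile. Vespa uses the rank profile named default unless the query specifies another one.
In our case, that profile uses nativeRank(title, body, url) as its first-phase expression.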
We can use query operator annotations for the userInput to control various
matching aspects. The following uses the defaultIndex to specify which field (or fieldset) to search.
$ vespa query \
'yql=select * from msmarco where {defaultIndex:"title"}userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
Notice how the query above matches fewer documents (totalCount: 116) because we limited the free text query to the title field. We can
change the grammar to specify how the user query text is parsed into a query execution plan.
In the following example, we use grammar:"all" to specify that we only want to retrieve documents where all the query terms match the title field.
$ vespa query \
'yql=select * from msmarco where {defaultIndex:"title", grammar:"all"}userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
This query, using all, matches only one document. Notice how the relevance of the hit is the same as in the above example. The difference
between the two types of queries is in the matching specification.
We can use userInput to build a query that searches multiple fields (or fieldsets):
$ vespa query \
'yql=select * from msmarco where ({defaultIndex:"title", grammar:"all"}userInput(@user-query)) or ({defaultIndex:"url", grammar:"all"}userInput(@user-query))' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
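When such queries are generated by application code, the annotated userInput clauses can be assembled from a list of fields. This is a minimal sketch with hypothetical helper names; it reproduces the YQL string used in the command above.

```python
def user_input(default_index=None, grammar=None, param="user-query"):
    """Return a userInput() clause, optionally prefixed with annotations."""
    annotations = []
    if default_index:
        annotations.append(f'defaultIndex:"{default_index}"')
    if grammar:
        annotations.append(f'grammar:"{grammar}"')
    prefix = "{" + ", ".join(annotations) + "}" if annotations else ""
    return f"{prefix}userInput(@{param})"

def any_field_all_terms(fields):
    """Match documents where all query terms occur in at least one of the fields."""
    clauses = [f"({user_input(field, 'all')})" for field in fields]
    return "select * from msmarco where " + " or ".join(clauses)

yql = any_field_all_terms(["title", "url"])
print(yql)
```

Only field names and grammar values chosen by the application are interpolated; the user's text still travels separately in the user-query parameter.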
Boosting by query terms
Sometimes, we want to add a query time boost if some field matches a query term; the following uses the rank query operator.
The rank query operator allows us to retrieve using the first operand, and the remaining operands can only impact ranking.
It is important to note that the following approach for query time term boosting is in the context of using the nativeRank text scoring feature.
$ vespa query \
'yql=select * from msmarco where rank(userInput(@user-query), url contains ({weight:1000, significance:1.0}"www.answers.com"))' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
The above will match the user query against the default fieldset and produce match features for the second operand. It does not
change the retrieval or matching as the number of documents exposed to ranking is the same as before. The
rank operator can be used to implement a variety of use cases around boosting.
Combine free text with filters
Now, we can combine the userInput with application logic. We add an application-specific query filter on the url field
to demonstrate how to combine userInput with other query time constraints.
We add ranked:false to tell Vespa that this
specific term should not contribute to the relevance calculation and filter:true to ensure that the term is not
used for bolding/highlighting or dynamic snippeting.
$ vespa query \
'yql=select * from msmarco where userInput(@user-query) and url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
'user-query=what is dad bod' \
'hits=3' \
'language=en'
Notice that the relevance stays the same since we used ranked:false for the filter.
Let us see what is going on by adding query tracing:
$ vespa query \
'yql=select * from msmarco where userInput(@user-query) and url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
'user-query=what is dad bod' \
'trace.level=3' \
'language=en'
We can notice the following in the trace output:
query=[AND (WEAKAND(100) default:what default:is default:dad default:bod) |url:'huffingtonpost co uk']
Notice that the userInput part is converted to a weakAnd query operator and that this operator is
AND'ed with a phrase search ('huffingtonpost co uk') in the url field. Notice also the field scoping, where the query terms are
prefixed with default, and that punctuation characters (.) are removed as part of the tokenization. Suppose filtering on specific strings like this is a common pattern.
In that case, we should create a separate field to avoid phrase matching, because phrase matching is more expensive than a single token search.
Debugging token string matching
Query tracing, combined with a summary using tokens, can help debug matching.
$ vespa query \
'yql=select * from msmarco where url contains ({filter:true,ranked:false}"huffingtonpost.co.uk")' \
'trace.level=0' \
'language=en' \
'summary=url-tokens'
This gives us insight into how the input url field was tokenized and indexed. Those are the tokens that the query can match.
Notice how punctuation characters like :, ,, ., /, _ and - are removed as part of the text tokenization.
Observations:
Relevance is 0.0, because the term uses ranked:false.
We cannot match "://" because those are not searchable characters with match:text.
dadbod is a token in the url; a query for dad or bod cannot match it because it is represented as the single token dadbod.
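The observations above can be reproduced with a crude approximation of text tokenization. The sketch below simply lowercases and splits on anything that is not a letter or digit. It is not Vespa's linguistic processing (there is no stemming, for example), and both helper names are hypothetical.

```python
import re

def rough_tokens(text: str) -> list[str]:
    """Lowercase and split on every character that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if the phrase tokens occur consecutively and in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

url = "http://www.huffingtonpost.co.uk/2015/05/05/what-is-a-dadbod-male-body_n_7212072.html"
tokens = rough_tokens(url)
print(tokens)

# The filter string becomes a three-token phrase that must match consecutively.
phrase = rough_tokens("huffingtonpost.co.uk")
print(phrase, contains_phrase(tokens, phrase))

# 'dadbod' is one token, so neither 'dad' nor 'bod' is present on its own.
print("dadbod" in tokens, "dad" in tokens, "bod" in tokens)
```

Comparing this rough output with the tokens returned by the url-tokens summary is a quick way to spot where linguistic processing, such as stemming, changes what is indexed.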
Let us do a similar example to demonstrate the impact of linguistic stemming:
$ vespa query \
'yql=select * from msmarco where url contains ({filter:true,ranked:false}"https")' \
'summary=url-tokens' \
'language=en'
Notice that a query for https matches http, because https in the query is stemmed to http.
If we turn off stemming on the query side and search for https directly, we
end up with 0 results.
$ vespa query \
'yql=select * from msmarco where url contains ({filter:true,ranked:false,stem:false}"https")' \
'summary=url-tokens' \
'language=en'
Similarly, if we pass a different language tag, which will not stem https to http, we also get 0 results:
$ vespa query \
'yql=select * from msmarco where url contains ({filter:true,ranked:false}"https")' \
'summary=url-tokens' \
'language=de'
Ranking
The previous section covered free-text search matching, linguistics, and how to combine business logic with
free-text user queries. All the examples used a default rank-profile using Vespa's nativeRank text scoring feature.
With free-text search, we can use other text scoring functions, like BM25. All the matching
capabilities (or limitations) still apply, and we can use fieldsets or fields; the only difference is the text scoring function
used to rank the matched documents.
$ vespa query \
'yql=select * from msmarco where userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en' \
'ranking=bm25'
While the nativeRank text score is normalized to the range 0 to 1, BM25 is unbounded, as demonstrated above. When
querying (matching), we can ask Vespa to compute both features in the same query.
Modify the schema and add a new rank-profile combined:
schema msmarco {
document msmarco {
field language type string {
indexing: "en" | set_language
}
field id type string {
indexing: attribute | summary
match: word
}
field title type string {
indexing: index | summary
match: text
index: enable-bm25
}
field body type string {
indexing: index
match: text
index: enable-bm25
}
field url type string {
indexing: index | summary
index: enable-bm25
}
}
fieldset default {
fields: title, body, url
}
document-summary minimal {
summary id { }
}
document-summary url-tokens {
summary url {}
summary url-tokens {
source: url
tokens
}
from-disk
}
rank-profile default {
first-phase {
expression: nativeRank(title, body, url)
}
}
rank-profile bm25 inherits default {
first-phase {
expression: bm25(title) + bm25(body) + bm25(url)
}
}
rank-profile combined inherits default {
first-phase {
expression: bm25(title) + bm25(body) + bm25(url) + nativeRank(title) + nativeRank(body) + nativeRank(url)
}
match-features {
bm25(title)
bm25(body)
bm25(url)
nativeRank(title)
nativeRank(body)
nativeRank(url)
}
}
}
Then, re-deploy the Vespa application from the app directory:
$ vespa deploy --wait 300 app
Adding or removing rank profiles is a live change, as it only impacts how we score documents, not how we index or match
them.
Run a query with the new rank-profile:
$ vespa query \
'yql=select * from msmarco where userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en' \
'ranking=combined'
Which will produce a result like this:
{"root":{"id":"toplevel","relevance":1,"fields":{"totalCount":562},"children":[{"id":"id:msmarco:msmarco::D2977840","relevance":25.482783473796484,"source":"msmarco","fields":{"matchfeatures":{"bm25(body)":19.51565699523739,"bm25(title)":4.978933753876959,"bm25(url)":0.3678926381724701,"nativeRank(body)":0.3010929113058281,"nativeRank(title)":0.24814575272673867,"nativeRank(url)":0.07106142247709807},"sddocname":"msmarco","documentid":"id:msmarco:msmarco::D2977840","id":"D2977840","title":"What Is A Dad Bod An Insight Into The Latest Male Body Craze To Sweep The Internet","url":"http://www.huffingtonpost.co.uk/2015/05/05/what-is-a-dadbod-male-body_n_7212072.html"}}]}}
Notice the matchfeatures field that is added to the hit when using match-features in the rank-profile. Here, we have all the computed features from the matched document, and the final relevance score is the sum of these features (in this case).
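The claim that the relevance equals the sum of the features can be checked directly against the numbers in the result above. A small sketch, with the values copied from that hit:

```python
import math

# Values copied from the hit shown above.
hit = {
    "relevance": 25.482783473796484,
    "matchfeatures": {
        "bm25(body)": 19.51565699523739,
        "bm25(title)": 4.978933753876959,
        "bm25(url)": 0.3678926381724701,
        "nativeRank(body)": 0.3010929113058281,
        "nativeRank(title)": 0.24814575272673867,
        "nativeRank(url)": 0.07106142247709807,
    },
}

# The combined profile's first-phase expression adds all six features.
feature_sum = math.fsum(hit["matchfeatures"].values())

# Compare with a tolerance, since floating-point sums depend on evaluation order.
matches = math.isclose(feature_sum, hit["relevance"], rel_tol=1e-9)
print(feature_sum, matches)

# Share of the score contributed by the BM25 features.
bm25_sum = math.fsum(v for k, v in hit["matchfeatures"].items() if k.startswith("bm25"))
print(round(bm25_sum / feature_sum, 3))
```

The BM25 features dominate the sum because they are unbounded while each nativeRank feature stays between 0 and 1, which is worth keeping in mind when adding features with different scales.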
This query and ranking example demonstrates that for a single query searching a set of fields via fieldset,
we can compute different types of text scoring features and use combinations.
Now consider the following where we limit matching to the title field:
$ vespa query \
'yql=select * from msmarco where {defaultIndex:"title"}userInput(@user-query)' \
'user-query=what is dad bod' \
'hits=3' \
'language=en' \
'ranking=combined'
Now, we do not get features for body or url, because they were not matched by the query.
[1] Robertson, Stephen and Zaragoza, Hugo and others, 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.