This is a practical Vespa query performance guide. It uses the Last.fm tracks dataset to illustrate Vespa query performance. Latency numbers mentioned in this guide were obtained by running it on an x86 MacBook Pro.
This guide covers query serving performance aspects such as matching and ranking, document summaries, attribute search with fast-search, multivalued query operators, tensor ranking, multithreaded search, top-k and early-termination techniques, and query tracing.
The guide includes step-by-step instructions on how to reproduce the experiments. It is best read after the Vespa Overview documentation.
Prerequisites:
- Docker, with enough disk space: the vespaengine/vespa container image plus headroom for data requires disk space, otherwise feeding can fail with NO_SPACE. Read more.
- curl to download the dataset and run the Vespa health-checks.

This tutorial uses Vespa CLI, the official command-line client for Vespa.ai. It is a single binary without any runtime dependencies, available for Linux, macOS and Windows.
$ brew install vespa-cli
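To verify the installation before continuing, the CLI can print its version (a quick sanity check):

$ vespa version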
This guide uses the Last.fm tracks dataset. Note that the dataset is released under the following terms:
Research only, strictly non-commercial. For details, or if you are unsure, please contact Last.fm. Also, Last.fm has the right to advertise and refer to any work derived from the dataset.
To download the dataset directly (120 MB zip file), run:
$ curl -L -o lastfm_test.zip \
    http://millionsongdataset.com/sites/default/files/lastfm/lastfm_test.zip
$ unzip lastfm_test.zip
The downloaded data needs to be converted to the JSON format expected by Vespa.
This Python script traverses the dataset files and creates a JSONL-formatted feed file of Vespa feed operations. The schema for this feed is introduced in the next sections.
Run the script to create the feed.jsonl file:
$ python3 create-vespa-feed.py lastfm_test > feed.jsonl
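Each line in feed.jsonl should be a self-contained JSON feed operation. A quick sanity check of the first line (json.tool fails loudly on invalid JSON):

$ head -1 feed.jsonl | python3 -m json.tool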
A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system: what functionality to use, the available document types, how ranking will be done, and how data will be processed during feeding and indexing.
The minimum required files to create the basic search application are track.sd and services.xml.
Create directories for the configuration files:
$ mkdir -p app/schemas; mkdir -p app/search/query-profiles/
A Vespa schema is a configuration of a document type plus ranking and compute specifications. This app uses a track schema defined as:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, artist
    }
}
Save the schema as app/schemas/track.sd. Notice that the track_id field has rank: filter and match: word set; the performance impact of these settings is covered in later sections.
The services.xml defines the services that make up the Vespa application — which services to run and how many nodes per service.
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="default" version="1.0">
        <search/>
        <document-api/>
    </container>
    <content id="tracks" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="track" mode="index"></document>
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1"></node>
        </nodes>
    </content>
</services>
Save services.xml in the app directory. The default query profile can be used to override default query API settings for all queries. The following enables presentation.timing and renders weightedset fields as JSON maps; save it as app/search/query-profiles/default.xml:
<query-profile id="default">
    <field name="presentation.timing">true</field>
    <field name="renderer.json.jsonWsets">true</field>
</query-profile>
The application package can now be deployed to a running Vespa instance. See also the Vespa quick start guide.
Start the Vespa container image using Docker:
$ docker run --detach --name vespa --hostname vespa-container \
    --publish 8080:8080 --publish 19071:19071 --publish 19110:19110 \
    vespaengine/vespa
Starting the container can take a short while. Before continuing, make sure that the configuration service is running by using vespa status deploy:
$ vespa config set target local
$ vespa status deploy --wait 300
Once ready, the Vespa application can be deployed using the Vespa CLI:
$ vespa deploy --wait 300 app
Feed the feed file generated in the previous section:
$ vespa feed -t http://localhost:8080 feed.jsonl
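Once feeding completes, a match-all query with hits=0 can verify the document count via totalCount; this previews the query API introduced in the next section:

$ vespa query 'yql=select * from track where true' 'hits=0'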
The following sections use the Vespa query API and formulate queries using the Vespa query language. For readability, all query examples are expressed using the vespa-cli command, which supports running queries against a Vespa instance. The CLI uses the Vespa HTTP search API internally. Use vespa query -v to see the actual HTTP request sent:
$ vespa query -v 'yql=select ..'
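Since the CLI wraps the HTTP search API, the same query can also be issued directly with curl (a sketch; --data-urlencode handles the YQL escaping):

$ curl -s --get \
    --data-urlencode 'yql=select artist, title from track where true' \
    --data-urlencode 'hits=1' \
    http://localhost:8080/search/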
The first query uses where true to match all track documents. It also uses hits to specify how many documents to return in the response:
$ vespa query \
    'yql=select artist, title, track_id, tags from track where true' \
    'hits=1'
The result json output for this query will look something like this:
Observations:
- The query was executed on one content node (coverage.nodes) and the coverage (coverage.coverage) was 100%. See graceful-degradation for more information about the coverage element and Vespa timeout behavior; Vespa's default timeout is 0.5 seconds.
- The query matched all 95,666 documents (totalCount) out of the 95,666 documents available (coverage.documents).
- The response timing has three fields.

A Vespa query is executed in two protocol phases: a matching and ranking phase, followed by a summary fill phase for the globally ordered top-k hits. See also Life of a query in Vespa. The timing in the response measures the time it takes to execute these two phases:
- querytime - Time to execute the first protocol phase/matching phase.
- summaryfetchtime - Time to execute the summary fill protocol phase for the globally ordered top-k hits.
- searchtime - Roughly the sum of the above, and close to what a client will observe (excluding network latency).

All three metrics are reported in seconds.

Moving on, the following query performs a free text query:
$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' \
    'hits=1'
This query request combines YQL userQuery() with Vespa's simple query language. The default query type is all, requiring that all the terms match. The above example thus searches for total AND eclipse AND of AND the AND heart in the fieldset default, which in the schema includes the title and artist fields. Since the request did not specify any ranking parameters, the matched documents were ranked by Vespa's default text rank feature: nativeRank.
The result output for the above query:
This query only matched one document because the query terms were ANDed. Matching can be relaxed to type=any instead, using the query model type parameter:
$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' \
    'hits=1' \
    'type=any'
Now, the query matches 24,053 documents and is considerably slower than the previous all query. Comparing the querytime of the two examples, the one matching the most documents has the highest querytime. In the worst case, a query matches all documents, and without any techniques for early termination or skipping, all matches are exposed to ranking. Query performance is greatly impacted by the number of documents that match the query specification. Generally, type any queries require more query compute resources than type all.
There is an algorithmic optimization available for type=any queries: the weakAnd query operator, which implements the WAND algorithm. See using wand with Vespa for an introduction to the algorithm. Run the same query, but with type=weakAnd instead of type=any:
$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' \
    'hits=1' \
    'type=weakAnd'
Compared to the type any query, which fully ranked 24,053 documents, weakAnd only fully ranks 3,679 documents. Also notice that the faster search returns the same document at the first position. Conceptually, a search query finds the documents matching the query, then scores them using a ranking model. In the worst case, a search query matches all documents, and all of them are exposed to ranking.
The previous examples used the hits=1 query parameter, and summaryfetchtime has been close to constant. The following query requests considerably more hits; the result is piped to head to increase readability:
$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' \
    'hits=200' \
    'type=weakAnd' | head -40
Increasing the number of hits increases summaryfetchtime significantly compared to the previous query examples, while querytime is relatively unchanged. Repeating the query a second time will reduce the summaryfetchtime due to the content node summary cache; see caches in Vespa for details.
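One way to observe the cache effect is to extract just the timing block across repeated runs (a sketch; it assumes the piped vespa-cli output is plain JSON):

$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' 'hits=200' 'type=weakAnd' \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["timing"])'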
There are largely four factors which determine the summaryfetchtime:
- The number of hits requested, which determines how many summaries must be filled.
- The number of content nodes involved, as for querytime. With many content nodes in the group the query was dispatched to, we expect the top-ranking hits to be distributed across the nodes so that each node does less work.
- Document size. Larger documents give a higher summaryfetchtime than smaller docs. Less is more.
- The document summary used. Summaries containing only fields with attribute will be read from memory. For the default summary, or others containing at least one non-attribute field, a fill will potentially access data from summary storage on disk. Read more about in-memory attribute fields.

Creating a dedicated document-summary which only contains the track_id field can improve performance: since track_id is defined in the schema with attribute, any summary fetch using this document summary reads data from memory. In addition, since the summary only contains one field, it saves network time, as less data is transferred during the summary fill phase.
document-summary track_id {
    summary track_id { }
}
The new schema then becomes:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
}
Re-deploy the application:
$ vespa deploy --wait 300 app
Re-execute the query using the track_id document-summary by setting the summary query request parameter:
$ vespa query \
    'yql=select artist, title, track_id from track where userQuery()' \
    'query=total eclipse of the heart' \
    'hits=200' \
    'type=weakAnd' \
    'summary=track_id' | head -40
In this particular case the summaryfetchtime difference is not that large, but for larger numbers of hits and larger documents the difference is significant, especially in single content node deployments.
A note on select field scoping with YQL, e.g. select title, track_id from ...: when using the default summary (no summary parameter), all fields are delivered from the content nodes to the stateless search container in the summary fill phase, regardless of field scoping. The search container removes the fields not selected and renders the result. Hence, select scoping only reduces the amount of data transferred back to the client; it does not optimize the internal communication or avoid summary cache misses. For optimal performance in use cases returning a large number of hits to the client, it is recommended to use dedicated document summaries.
Note also that Vespa limits the maximum number of hits per query to 400 by default; this behavior can be overridden in the default queryProfile.
When requesting large amounts of data with hits, it is recommended to use result compression. Vespa will compress the response if the HTTP client sets the Accept-Encoding HTTP request header:
Accept-Encoding: gzip
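For example, with curl the --compressed flag sends this header and transparently decompresses the response:

$ curl -s --compressed --get \
    --data-urlencode 'yql=select * from track where true' \
    --data-urlencode 'hits=100' \
    http://localhost:8080/search/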
The previous section covered free text searching in a fieldset containing fields with indexing: index. See the indexing reference. Fields of type string are treated differently depending on having index or attribute:
- index integrates with linguistic processing and is matched using match:text.
- attribute does not integrate with linguistic processing and is matched using match:word.
With index, Vespa builds inverted index data structures, roughly consisting of a dictionary over the unique terms and, per term, a posting list of the documents containing it. rank: filter can help guide the decision on what posting list format to use; the bitvector representation is the most compact posting list representation.
representation is the most compacting post list representation.With attribute
, Vespa per default, does not build any inverted index like data structures for
potential faster query evaluation. See Wikipedia:Inverted Index
and Vespa internals.
The reason for this default setting is that Vespa attribute
fields can be used
for many different aspects: ranking, result grouping,
result sorting, and finally searching/matching.
The following section focuses on the tags field, which was defined with attribute. Matching in this field is performed using match:word, the default match mode for string fields with indexing: attribute. The tags field is of type weightedset:
field tags type weightedset<string> {
    indexing: summary | attribute
}
Weightedset is a field type that represents each tag with an integer weight, which can be used for ranking. In this case, there is no inverted index structure, and matching against the tags field is performed as a linear scan. The following scans for documents where tags match rock:
$ vespa query \
    'yql=select track_id, tags from track where tags contains "rock"' \
    'hits=1'
The query matches 8,160 documents. Notice that with match: word, matches can also include whitespace, and generally punctuation characters, which are removed and not searchable when using match:text with string fields that have index:
$ vespa query \
    'yql=select track_id, tags from track where tags contains "classic rock"' \
    'hits=1'
The above query matches exactly tags with "classic rock", not "rock" and also not "classic rock music".
Another query searching for rock or pop:
$ vespa query \
    'yql=select track_id, tags from track where tags contains "rock" or tags contains "pop"' \
    'hits=1'
In all these examples searching the tags field, matching is done by a linear scan through all track documents. The tags search can be combined with regular free text query terms searching fields that do have inverted index structures:
$ vespa query \
    'yql=select track_id, tags from track where tags contains "rock" and userQuery()' \
    'hits=1' \
    'query=total eclipse of the heart'
In this case, the query terms searching the default fieldset restrict the number of documents that need to be scanned for the tags constraint. This query is automatically optimized by the Vespa query planner.
This section adds fast-search to the tags field to speed up searches where no other query filters restrict the search. The schema with fast-search:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
}
Re-deploy the application:
$ vespa deploy --wait 300 app
The above will print a WARNING:
vespa deploy --wait 300 app/
Uploading application package ... done

Success: Deployed app/
WARNING Change(s) between active and new application that require restart:
In cluster 'tracks' of type 'search':
    Restart services of type 'searchnode' because:
    1) Document type 'track': Field 'tags' changed: add attribute 'fast-search'
Waiting up to 300 seconds for query service to become available ...
To enable fast-search, the content node(s) need to be restarted to rebuild the fast-search data structures for the attribute.
The following uses vespa-sentinel-cmd command tool to restart the searchnode process:
$ docker exec vespa vespa-sentinel-cmd restart searchnode
This step requires waiting for the searchnode to come back up; use the health state API:
$ curl -s http://localhost:19110/state/v1/health
Wait for the status code to flip to up before querying again.
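A small polling loop can automate the wait (a sketch; it parses the status code from the health response, and any error while the node restarts simply keeps the loop waiting):

$ until curl -sf http://localhost:19110/state/v1/health \
    | python3 -c 'import json, sys; exit(json.load(sys.stdin)["status"]["code"] != "up")'; \
    do sleep 2; done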
Once up, execute the tags query again:
$ vespa query \
    'yql=select track_id, tags from track where tags contains "rock" or tags contains "pop"' \
    'hits=1'
Now the querytime will be a few milliseconds, since Vespa has built index structures to support fast-search in the attribute. The downside of enabling fast-search is increased memory usage and slightly reduced indexing throughput. See also when to use fast-search for attributes.
For use cases requiring match:text when searching multivalued string field types like weightedset, see searching multi-value fields.
For fields that don't need any match ranking features, it is strongly recommended to use rank: filter:
field availability type int {
    indexing: summary | attribute
    rank: filter
    attribute {
        fast-search
    }
}
With the settings above, bit vector posting list representations are used. This is especially efficient in combination with TAAT (term-at-a-time) query evaluation. For some cases with many query terms, enabling rank: filter can reduce match latency by 75%.
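TAAT evaluation can also be requested per query. A sketch, assuming the ranking.matching.termwiseLimit query parameter; lowering it from the default 1.0 allows more of the query to be evaluated term-at-a-time:

$ vespa query \
    'yql=select track_id from track where tags contains "rock" or tags contains "pop"' \
    'ranking.matching.termwiseLimit=0.01'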
This section covers multi-value query operators and their query performance characteristics. Many real-world search and recommendation use cases involve structured multivalued queries.
Assume a process has learned a sparse user profile representation which, for a given user, based on past interactions with a service, produces a profile such as hard rock, rock, metal and finnish metal: sparse features from a fixed vocabulary/feature space.
Retrieving and ranking with sparse representations can be done using the dot product between the sparse user profile representation and the document representation. In the track example, the tags field is the document-side sparse representation: each document is tagged with multiple tags using a weight, and similarly the sparse user profile representation uses weights.
In the following examples, the dotProduct() and wand() query operators are used. To configure ranking, add a rank-profile to the schema:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
    rank-profile personalized {
        first-phase {
            expression: rawScore(tags)
        }
    }
}
The dotProduct and wand query operators produce a rank feature called rawScore(name). This feature calculates the sparse dot product between the query and document weights.
Deploy the application again:
$ vespa deploy --wait 300 app
The dotProduct query operator accepts a field to match over and supports parameter substitution. Using substitution is recommended for large inputs as it saves compute resources when parsing the YQL input.
The following example assumes a learned sparse representation, with equal weight:
userProfile={"hard rock":1, "rock":1,"metal":1, "finnish metal":1}
This userProfile is referenced as the @userProfile parameter in dotProduct(tags, @userProfile):
$ vespa query \
    'yql=select track_id, title, artist, tags from track where dotProduct(tags, @userProfile)' \
    'userProfile={"hard rock":1, "rock":1,"metal":1, "finnish metal":1}' \
    'hits=1' \
    'ranking=personalized'
The query also specifies the rank-profile personalized; if not specified, ranking would use nativeRank. The above query returns the following response:
Notice that the query above will brute-force rank all tracks where the tags field matches any of the multivalued userProfile features. Due to this, the query ranks 10,323 tracks, as seen by totalCount. Including, for example, pop in the userProfile list increases the number of hits to 13,638. For a large user profile with many learned features/tags, one could easily match and rank the entire document collection. Also notice the relevance score, which is 400 since the document matches all four query input tags (4 x 100 = 400).
To optimize the evaluation, the wand query operator can be used. The wand query operator supports setting a target number of top-ranking hits that get exposed to the first-phase ranking function. Repeat the query from above, replacing dotProduct with wand:
$ vespa query \
    'yql=select track_id, title, artist, tags from track where {targetHits:10}wand(tags, @userProfile)' \
    'userProfile={"hard rock":1, "rock":1,"metal":1, "finnish metal":1}' \
    'hits=1' \
    'ranking=personalized'
The wand query operator retrieves the exact same hit at rank 1, which is the expected behavior. The wand query operator is safe, meaning it returns the same top-k results as the dotProduct query operator. For larger document collections, the wand query operator can significantly improve query performance compared to dotProduct.
wand is a query operator which performs matching and ranking interleaved, skipping documents which cannot make it into the top-k results. See the using wand with Vespa guide for more details on the WAND algorithm.
Finally, these multi-value query operators work on both single-valued fields and array fields, but optimal performance is achieved using the weightedset field type. The weightedset field type only supports integer weights; the next section covers tensors, which support floating point number types.
The previous sections covered matching, where the query matching operators also produced rank features that could be used to influence the order of the hits returned. This section looks at ranking with tensor computations using tensor expressions.
Tensor computations can be used to calculate dense dot products, sparse dot products, matrix multiplications, neural networks and more. Tensor computations are performed on documents retrieved by the query matching operators. The only exception is single-order dense tensors (vectors), where Vespa also supports matching using (approximate) nearest neighbor search.
The track schema was defined with a similar tensor field with one named mapped dimension. Mapped tensors can represent sparse feature representations, similar to the weightedset field, but in a more generic way, here using float to represent the tensor cell values:
field similar type tensor<float>(trackid{}) {
    indexing: summary | attribute
}
Inspect one document using the vespa-cli (which wraps the Vespa document/v1 API):
$ vespa document get id:music:track::TRQIQMT128E0791D9C
Returns:
In the lastfm collection, each track lists similar tracks with a float similarity score. According to this similarity algorithm, the most similar track to this sample document is TRWJIPT128E0791D99, with a similarity score of 1.0. Search for that track using the query API:
$ vespa query \
    'yql=select title, artist from track where track_id contains "TRWJIPT128E0791D99"' \
    'hits=1'
Note that track_id was not defined with fast-search, so searching it without any other query terms makes this query a linear scan over all tracks. The query returns:
Given a single track, one could simply retrieve the document and display the offline-computed similar tracks. But if a user has listened to multiple tracks in a real-time session, one can use a sparse dot product between the user's recent activity and the track similarity fields. For example, listening to the following tracks:
- TRQIQMT128E0791D9C Summer Of '69 by Bryan Adams
- TRWJIPT128E0791D99 Run To You by Bryan Adams
- TRGVORX128F4291DF1 Broken Wings by Mr. Mister

could be represented as a query tensor query(user_liked) and passed with the query request like this:
input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}
Both the document tensor and the query tensor are defined with trackid{} as the named mapped dimension. The sparse tensor dot product can then be expressed in a rank-profile:
rank-profile similar {
    inputs {
        query(user_liked) tensor<float>(trackid{})
    }
    first-phase {
        expression: sum(attribute(similar) * query(user_liked))
    }
}
See the tensor user guide for more on tensor fields and tensor computations with Vespa. Add this rank-profile to the document schema:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
    rank-profile personalized {
        first-phase {
            expression: rawScore(tags)
        }
    }
    rank-profile similar {
        inputs {
            query(user_liked) tensor<float>(trackid{})
        }
        first-phase {
            expression: sum(attribute(similar) * query(user_liked))
        }
    }
}
Deploy the application again:
$ vespa deploy --wait 300 app
The list of recently played (or liked) tracks:
- TRQIQMT128E0791D9C Summer Of '69 by Bryan Adams
- TRWJIPT128E0791D99 Run To You by Bryan Adams
- TRGVORX128F4291DF1 Broken Wings by Mr. Mister

is represented as the query(user_liked) query tensor:
input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}
The first query example runs the tensor computation over all tracks using where true. Notice also ranking=similar; without it, ranking with nativeRank would not take the query tensor into account:
$ vespa query \
    'yql=select title, artist, track_id from track where true' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5'
This query also retrieves some of the previously liked tracks. These can be removed from the result set using the not query operator, represented in YQL as !:
where !(track_id in (@userLiked))
The in query operator is the most efficient multi-value filtering query operator, used either as a positive filter (match if any of the keys match) or as a negative filter with not (remove from the result if any of the keys match). See more examples in feature-tuning set filtering.
Run the query with the not filter:
$ vespa query \
    'yql=select title, artist, track_id from track where !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
Note that the tensor query input format is slightly different from the variable substitution supported by the multivalued query operators wand, in and dotProduct. The above query produces the following result:
This query retrieves 95,663 documents, and the three previously liked tracks were removed from the result.
The following example filters by a tags query, tags:popular, reducing the complexity of the query, as fewer documents get ranked by the tensor ranking expression:
$ vespa query \
    'yql=select title, artist, track_id from track where tags contains "popular" and !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
With fewer matches to score using the tensor expression, latency decreases. In this query case, latency is strictly linear with the number of matches. One could also use a combination of wand for efficient retrieval and tensor computations for ranking. Notice that the querytime of the unconstrained search was around 120 ms, which is on the high side for real-time serving.
The sparse tensor product can be optimized by adding attribute: fast-search to the mapped tensor field. attribute: fast-search is supported for tensor fields using mapped dimensions, or mixed tensors using both mapped and dense dimensions. The cost is increased memory usage. The schema with attribute: fast-search added to the similar tensor field:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
    rank-profile personalized {
        first-phase {
            expression: rawScore(tags)
        }
    }
    rank-profile similar {
        inputs {
            query(user_liked) tensor<float>(trackid{})
        }
        first-phase {
            expression: sum(attribute(similar) * query(user_liked))
        }
    }
}
Deploy the application again:
$ vespa deploy --wait 300 app
Again, adding fast-search requires a restart of the searchnode process:
$ docker exec vespa vespa-sentinel-cmd restart searchnode
Wait for the searchnode to report status code up:
$ curl -s http://localhost:19110/state/v1/health
Re-run the tensor ranking query:
$ vespa query \
    'yql=select title, artist, track_id from track where !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
The querytime drops to about 40 ms, from 120 ms without the fast-search option. See also performance considerations when using tensor expressions.
Vespa supports int8, bfloat16, float and double precision tensor cell types: a tradeoff between speed, accuracy and memory usage.
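For example, the similar field could trade precision for memory; a hypothetical variant of the schema (bfloat16 halves the tensor cell memory footprint compared to float):

field similar type tensor<bfloat16>(trackid{}) {
    indexing: summary | attribute
    attribute: fast-search
}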
So far in this guide, all search queries and ranking computations have been performed using single-threaded execution. To enable multithreaded execution, a setting needs to be added to services.xml. Multithreaded search and ranking can improve query latency significantly and make better use of multi-core CPU architectures.
The following adds a tuning element to services.xml, overriding requestthreads:persearch. The default number of threads per search is one.
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="default" version="1.0">
        <search/>
        <document-api/>
    </container>
    <content id="tracks" version="1.0">
        <engine>
            <proton>
                <tuning>
                    <searchnode>
                        <requestthreads>
                            <persearch>4</persearch>
                        </requestthreads>
                    </searchnode>
                </tuning>
            </proton>
        </engine>
        <redundancy>1</redundancy>
        <documents>
            <document type="track" mode="index"></document>
        </documents>
        <nodes>
            <node distribution-key="0" hostalias="node1"></node>
        </nodes>
    </content>
</services>
Deploy the application again:
$ vespa deploy --wait 300 app
Changing the global threads-per-search setting requires a restart of the searchnode process:
$ docker exec vespa vespa-sentinel-cmd restart searchnode
Wait for the searchnode to start:
$ curl -s localhost:19110/state/v1/health
Then repeat the tensor ranking query:
$ vespa query \
    'yql=select title, artist, track_id from track where !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
Now, the content node(s) will parallelize matching and ranking using multiple search threads, and querytime drops to about 15 ms.
The setting in services.xml sets the global persearch value. It is possible to tune down the number of threads used for a query with rank-profile overrides using num-threads-per-search. Note that the per-rank-profile setting can only lower the number of threads below the global default.
The following adds a new rank-profile similar-t2 using num-threads-per-search: 2 instead of the global setting of 4. It is also possible to set the number of threads in the query request using ranking.matching.numThreadsPerSearch; see the sketch at the end of this section.
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
    rank-profile personalized {
        first-phase {
            expression: rawScore(tags)
        }
    }
    rank-profile similar {
        inputs {
            query(user_liked) tensor<float>(trackid{})
        }
        first-phase {
            expression: sum(attribute(similar) * query(user_liked))
        }
    }
    rank-profile similar-t2 inherits similar {
        num-threads-per-search: 2
    }
}
Deploy the application again:
$ vespa deploy --wait 300 app
Adding a new rank-profile does not require a restart. Repeat the query, now using the similar-t2 profile:
$ vespa query \
    'yql=select title, artist, track_id from track where !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar-t2' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
By using multiple rank profiles like above, developers can find the sweet-spot where latency does not improve much by using more threads. Using more threads per search limits query concurrency as more threads will be occupied per query. Read more in Vespa sizing guide:reduce latency with multithreaded search.
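As referenced above, the thread count can also be capped per request without any schema change (a sketch; the ranking.matching.numThreadsPerSearch query parameter tunes the thread count downwards for that request):

$ vespa query \
    'yql=select title, artist, track_id from track where true' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0}' \
    'ranking=similar' \
    'ranking.matching.numThreadsPerSearch=1' \
    'hits=5'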
Vespa has an advanced query operator that selects the documents with the k largest or smallest values of a fast-search attribute field. To demonstrate this query operator, this guide introduces a popularity field. Since the last.fm dataset does not have a real popularity metric, the number of tags per track is used as a proxy for true track popularity.
The following script runs through the dataset, counts the number of tags per track, and creates a Vespa partial update feed operation per track:
import os
import sys
import json

directory = sys.argv[1]
seen_tracks = set()

def process_file(filename):
    global seen_tracks
    with open(filename) as fp:
        doc = json.load(fp)
        title = doc['title']
        artist = doc['artist']
        hash = title + artist
        # Skip duplicate tracks (same title and artist)
        if hash in seen_tracks:
            return
        else:
            seen_tracks.add(hash)
        track_id = doc['track_id']
        tags = doc['tags']
        tags_dict = dict()
        for t in tags:
            k, v = t[0], int(t[1])
            tags_dict[k] = v
        # Use the number of unique tags as a proxy for popularity
        n = len(tags_dict)
        # Emit a Vespa partial update operation assigning the popularity field
        vespa_doc = {
            "update": "id:music:track::%s" % track_id,
            "fields": {
                "popularity": {
                    "assign": n
                }
            }
        }
        print(json.dumps(vespa_doc))

sorted_files = []
for root, dirs, files in os.walk(directory):
    for filename in files:
        filename = os.path.join(root, filename)
        sorted_files.append(filename)
sorted_files.sort()
for filename in sorted_files:
    process_file(filename)
Run the script over the dataset to create the partial update feed:
$ python3 create-popularity-updates.py lastfm_test > updates.jsonl
Add the popularity field to the track schema; the field is defined with fast-search. Also add a popularity rank-profile; this profile uses one thread per search:
schema track {
    document track {
        field track_id type string {
            indexing: summary | attribute
            rank: filter
            match: word
        }
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field artist type string {
            indexing: summary | index
        }
        field tags type weightedset<string> {
            indexing: summary | attribute
            attribute: fast-search
        }
        field similar type tensor<float>(trackid{}) {
            indexing: summary | attribute
            attribute: fast-search
        }
        field popularity type int {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
    fieldset default {
        fields: title, artist
    }
    document-summary track_id {
        summary track_id { }
    }
    rank-profile personalized {
        first-phase {
            expression: rawScore(tags)
        }
    }
    rank-profile similar {
        inputs {
            query(user_liked) tensor<float>(trackid{})
        }
        first-phase {
            expression: sum(attribute(similar) * query(user_liked))
        }
    }
    rank-profile similar-t2 inherits similar {
        num-threads-per-search: 2
    }
    rank-profile popularity {
        num-threads-per-search: 1
        first-phase {
            expression: attribute(popularity)
        }
    }
}
Deploy the application again:
$ vespa deploy --wait 300 app
Adding a new field does not require a restart. Apply the partial updates:
$ vespa feed -t http://localhost:8080 updates.jsonl
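To spot-check that the updates were applied, fetch the document from the earlier example and look for the popularity field:

$ vespa document get id:music:track::TRQIQMT128E0791D9C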
With that feed job completed, it is possible to select the five tracks with the highest popularity by using the range() query operator with hitLimit:
$ vespa query \
    'yql=select track_id, popularity from track where {hitLimit:5,descending:true}range(popularity,0,Infinity)' \
    'ranking=popularity'
The search returned 1,352 documents, while we asked for just five. The reason is that the hitLimit annotation of the range operator only specifies a lower bound: documents tied with the same popularity value within the five largest values are also returned.
The range() query operator with hitLimit can be used to efficiently implement top-k selection, ranking only a subset of the documents in the index. For example, use the range search with hitLimit to run the track recommendation tensor computation over only the most popular tracks:
$ vespa query \
    'yql=select title, artist, track_id, popularity from track where {hitLimit:5,descending:true}range(popularity,0,Infinity) and !(track_id in (@userLiked))' \
    'input.query(user_liked)={{trackid:TRQIQMT128E0791D9C}:1.0,{trackid:TRWJIPT128E0791D99}:1.0,{trackid:TRGVORX128F4291DF1}:1.0}' \
    'ranking=similar' \
    'hits=5' \
    'userLiked=TRQIQMT128E0791D9C,TRWJIPT128E0791D99,TRGVORX128F4291DF1'
Notice that this query returns 1,349 documents, while the range search from the previous example returned 1,352. This is due to the not filter.
The range search with hitLimit can be used to efficiently select the top-k documents by a single-valued numeric attribute with fast-search. Some use cases which can be implemented efficiently with it:
- Selecting the most recent documents, using a field of type long to represent a timestamp (e.g., using Unix epoch).
- Selecting the most popular documents by a popularity or quality signal, as in this guide, using hitLimit.
.Do note that any other query or filter terms in the query are applied after having found the top-k documents, so an aggressive filter removing many documents might end up recalling 0 documents.
This behavior is illustrated with this query:
$ vespa query \
    'yql=select track_id, popularity from track where {hitLimit:5,descending:true}range(popularity,0,Infinity) and popularity=99'
This query fails to retrieve any documents: the range search finds 1,352 documents where popularity is 100, and intersecting that top-k result with the popularity=99 filter constraint leaves 0 results.
Using the range search query operator with hitLimit is practical for search use cases like auto-complete or search suggestions, where one typically uses match: prefix or n-gram matching with match: gram. Limiting the short first-character searches with a hitLimit range on popularity can greatly improve query performance while still matching popular suggestions. As the user types more characters, the number of matches drops sharply, so the hitLimit can be increased and ranking can take more factors into account than the single popularity attribute.
An alternative to range search with hitLimit is early termination with match-phase, which enables early termination of matching and first-phase ranking. Match-phase early termination uses an attribute field during matching and ranking to control the order in which documents are evaluated.
If a query is likely to produce more than ranking.matchPhase.maxHits hits per node, the search core terminates matching early and evaluates the query in the order dictated by the ranking.matchPhase.attribute field. Match-phase early termination requires a single-valued numeric field with attribute and fast-search. See Match phase query parameters.
The match-phase limit cannot terminate any potential second-phase ranking expression, only matching and first-phase ranking, hence the name: match-phase limit.
The following enables matchPhase early termination with a maxHits target of 100:
$ vespa query \
    'yql=select track_id, popularity from track where true' \
    'ranking=popularity' \
    'ranking.matchPhase.maxHits=100' \
    'ranking.matchPhase.attribute=popularity' \
    'hits=2'
Which will produce the following result:
In this case, totalCount became 1,476, a few more than the range search with hitLimit. Notice also the presence of coverage: degraded, which informs the client that this result was not fully evaluated over all matched documents. Read more about graceful result degradation.
Note that the example uses the popularity rank-profile, which was configured with one thread per search; for low settings of maxHits, this is the recommended setting:
rank-profile popularity {
    num-threads-per-search: 1
    first-phase {
        expression: attribute(popularity)
    }
}
The core difference from capped range search is that match-phase is safe: filters work inline with the search and are not applied after finding the top-k documents.
This query does not trigger match-phase early termination because there are few hits matching the query:
$ vespa query \
    'yql=select track_id, popularity from track where popularity=99' \
    'ranking=popularity' \
    'ranking.matchPhase.maxHits=100' \
    'ranking.matchPhase.attribute=popularity' \
    'hits=2'
Generally, prefer match-phase early termination over range search with hitLimit.
Match phase limiting can also be used in combination with text search queries:
$ vespa query \
    'yql=select title, artist, popularity from track where userQuery()' \
    'query=love songs' \
    'type=any' \
    'ranking=popularity' \
    'ranking.matchPhase.maxHits=100' \
    'ranking.matchPhase.attribute=popularity' \
    'hits=2'
Since this query uses type=any, it retrieves many more documents than the target matchPhase.maxHits, so early termination is triggered, causing the search core to match and rank the tracks with the highest popularity.
Early termination using match-phase limits is a powerful feature that can keep latency and cost in check for many large scale serving use cases where a document quality signal is available. Match phase termination also supports specifying a result diversity constraint. See Result diversification blog post. Note that result diversity is normally obtained with Vespa result grouping, the match-phase diversity is used to ensure that diverse hits are also collected if early termination kicks in.
This section introduces query tracing. Tracing helps understand where time (and cost) is spent and how to best optimize the query or schema settings. Query tracing is enabled using the trace.level, trace.timestamps and trace.explainLevel query parameters.
A simple example query with tracing enabled:
$ vespa query 'yql=select track_id from track where tags contains "rock"' \
    'trace.level=3' 'trace.timestamps=true' 'trace.explainLevel=1' 'hits=1'
The first part of the trace follows the query through the stateless container search chain. For each searcher invoked in the chain, a timestamp relative to the start of the query request is emitted:
The trace runs all the way until the query is dispatched to the content node(s) and the merged response is returned to the client. In this case, with tracing enabled, about 2 ms of processing happened in the stateless container before the query was put on the wire towards the content nodes. The first protocol phase is covered by the next trace message: the reply is read from the wire at timestamp 6 ms, so approximately 4 ms was spent in the first protocol (matching) phase, including network serialization and deserialization.
Inside this message are the content node traces of the query; timestamp_ms is relative to the start of the query on the content node. In this case, the content node used 1.98 ms to evaluate the first protocol phase of the query (duration_ms). More explanation of the content node traces is coming soon. These traces can help guide both feature-tuning decisions and scaling and sizing.
Later in the trace one can also see the second query protocol phase which is the summary fill:
And finally an overall breakdown of the two phases:
Also try the Trace Visualizer for a flame-graph of the query trace.
This concludes this tutorial. The following removes the container and the data:
$ docker rm -f vespa