This guide explores how to search and rank over structured multivalued fields, and how to build a query retrieval strategy. The examples in this guide use the weightedset field type. The generic map<key-type,value-type> field type does not currently support ranking and can only be used for matching and filtering.
When building a search application we need to think about:
There are many text matching options to consider when designing and mapping our document model to a Vespa document schema:
At query time, we can take the user query and translate it into a valid Vespa query request which implements our matching and retrieval strategy over the designed document schema.
The documents retrieved by the query are scored using a ranking model. Once a document is retrieved by the query logic, it can be scored using the full flexibility of the Vespa ranking framework.
In the following sections we explore matching and ranking over multivalued string fields.
Prerequisites:
- NO_SPACE: the vespaengine/vespa container image plus headroom for data requires disk space.
Assume we have the following sample document, with a structured tag-like field where a weight is associated with each element:
{
    "put": "id:photos:photo::0",
    "fields": {
        "title": "Mira in the sunset",
        "description": "A sunny afternoon with our dogs",
        "tags": {
            "no filter": 1,
            "light": 3,
            "black and white": 3,
            "clear sky": 2,
            "sunset dogs": 4
        }
    }
}
Structured data like the tags field, where we want to both match and rank, is best represented using the weightedset field type.
The Vespa weightedset field type can be used to represent:
How should we design our Vespa schema, and how should we match and search this data model for end-user free text queries?
schema photo {

    stemming: none

    document photo {

        field title type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field description type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field interestingness type float {
            indexing: summary | attribute
        }

        field tags type weightedset<string> {
            indexing: summary | index
            match: text
            index: enable-bm25
        }
    }

    fieldset default {
        fields: title, description, tags
    }

    rank-profile default {
        first-phase {
            expression: nativeRank
        }
    }
}
In the schema we disable stemming and enable the bm25 text ranking feature for all string fields.
Since all string fields share the same match settings, we can use a fieldset so that queries do not need to mention all three fields.
We also include a default rank profile (this is the implicit default rank profile) using the Vespa nativeRank text matching rank feature.
Along with the schema, we also need a services.xml file to make up a Vespa application package:
<?xml version="1.0" encoding="UTF-8"?>
<services version="1.0">
    <container id="default" version="1.0">
        <search />
        <document-api />
        <nodes>
            <node hostalias="node1"></node>
        </nodes>
    </container>
    <content id="photos" version="1.0">
        <redundancy>1</redundancy>
        <documents>
            <document type="photo" mode="index"/>
        </documents>
        <nodes>
            <node hostalias="node1" distribution-key="0" />
        </nodes>
    </content>
</services>
This example uses the vespa container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
    --publish 8080:8080 --publish 19071:19071 \
    vespaengine/vespa
Install the Vespa CLI using Homebrew:
$ brew install vespa-cli
Deploy the application:
$ vespa deploy --wait 300 my-app
Feed a sample document:
$ vespa document -v doc.json
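The feed command reads the document operation from doc.json. As a minimal sketch of producing that file, assuming it holds exactly the sample document shown earlier (the helper name write_feed_file is hypothetical and not part of any Vespa tooling):

```python
import json

# Sample document from this guide; "tags" maps each tag string to an integer weight.
sample_doc = {
    "put": "id:photos:photo::0",
    "fields": {
        "title": "Mira in the sunset",
        "description": "A sunny afternoon with our dogs",
        "tags": {
            "no filter": 1,
            "light": 3,
            "black and white": 3,
            "clear sky": 2,
            "sunset dogs": 4,
        },
    },
}


def write_feed_file(doc, path="doc.json"):
    """Serialize one document operation to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, indent=2)
    return path
```

Calling write_feed_file(sample_doc) in the working directory would create the doc.json file that the feed command refers to.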
Assuming a free text query sunset photos featuring dogs, translate the user query into a Vespa query request using YQL:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=sunset photos featuring dogs' \
    'type=all'
The above query returns 0 hits, since the query requires that all query terms match the document. By adding trace.level to the query request we can see how the query is parsed and executed against the content nodes:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=sunset photos featuring dogs' \
    'type=all' \
    'trace.level=3'
In the trace we can see the query which is dispatched to the content nodes:
query=[AND sunset photos featuring dogs]
Tracing is very useful when debugging why documents match or do not match.
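The vespa query commands in this guide pass their parameters to the container's HTTP search API. The sketch below builds the equivalent GET URL by hand. It assumes the container is reachable at http://localhost:8080 (the port published in the docker run command) and that the search endpoint is /search/; the function name build_query_url is hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(user_query, query_type="all", trace_level=None,
                    endpoint="http://localhost:8080/search/"):
    """Build a search URL with the same parameters as the CLI examples.

    The endpoint default is an assumption based on the port published earlier.
    """
    params = {
        "yql": "select * from photos where userQuery()",
        "query": user_query,
        "type": query_type,
    }
    if trace_level is not None:
        params["trace.level"] = str(trace_level)
    # urlencode percent-encodes spaces and special characters in each value.
    return endpoint + "?" + urlencode(params)


url = build_query_url("sunset photos featuring dogs", trace_level=3)
```

Keeping the parameters in a dictionary makes it easy to switch type or add trace.level per request without editing a long command line.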
Since the sample document does not contain the terms featuring or photos, the query fails to retrieve the example document. Relax the query matching by using type any instead of requiring that all terms match. See the model.type query API reference for supported query types:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=sunset photos featuring dogs' \
    'type=any'
Changing the type to any recalls the sample document, as we no longer require that all query terms match.
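The difference between all and any can be illustrated with a toy model. This is a simplified sketch, not how Vespa executes queries: it assumes plain lowercase whitespace tokenization, no stemming, and a single bag of terms over the three fields in the default fieldset.

```python
# Toy model of type=all versus type=any over the sample document.
FIELDS = {
    "title": "Mira in the sunset",
    "description": "A sunny afternoon with our dogs",
    "tags": ["no filter", "light", "black and white", "clear sky", "sunset dogs"],
}


def document_terms(fields):
    """Collect lowercased whitespace tokens from all fields (no stemming)."""
    terms = set()
    for value in fields.values():
        texts = value if isinstance(value, list) else [value]
        for text in texts:
            terms.update(text.lower().split())
    return terms


def matches(query, fields, query_type):
    """Return True if the document is recalled under the given query type."""
    terms = document_terms(fields)
    hits = [term in terms for term in query.lower().split()]
    return all(hits) if query_type == "all" else any(hits)


query = "sunset photos featuring dogs"
print(matches(query, FIELDS, "all"))  # False: photos and featuring are missing
print(matches(query, FIELDS, "any"))  # True: sunset and dogs are present
```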
With type any it is also possible to require that individual query terms match by prefixing them with +:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=+sunset photos featuring +dogs' \
    'type=any'
In this example sunset and dogs must be matched. Note that we have disabled stemming, so querying for dogs won't recall documents containing only dog. This is one of the reasons we disabled stemming: to demonstrate that stemming has an impact on recall. Requiring dog will cause the query to not recall our sample document:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=+sunset photos featuring +dog' \
    'type=any'
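The effect of required terms without stemming can be sketched as follows. This is a toy model under stated assumptions, not Vespa's query evaluation: terms are compared as exact lowercase strings, and a query with + terms is treated as matching only when every + term is present.

```python
# Terms of the sample document across title, description and tags (no stemming).
DOC_TERMS = set(
    "mira in the sunset a sunny afternoon with our dogs "
    "no filter light black and white clear sky sunset dogs".split()
)


def matches_with_required(query, doc_terms):
    """Toy model of type=any where '+' marks terms that must match."""
    tokens = query.lower().split()
    required = [t[1:] for t in tokens if t.startswith("+")]
    optional = [t for t in tokens if not t.startswith("+")]
    if required:
        # Every required term must be present as an exact string.
        return all(t in doc_terms for t in required)
    return any(t in doc_terms for t in optional)


print(matches_with_required("+sunset photos featuring +dogs", DOC_TERMS))  # True
print(matches_with_required("+sunset photos featuring +dog", DOC_TERMS))   # False
```

With stemming disabled, dog and dogs are different terms, so the second query fails on its required term.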
Now, explore how Vespa matches the multivalued tags field of type weightedset. Notice that we change back to type=all. In this example we also use the default-index query parameter to limit matching to the tags field.
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear sky' \
    'type=all' \
    'default-index=tags'
The query matches the document, which is no surprise since a tag contains the exact content clear sky.
Let us search for just clear instead:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear' \
    'type=all' \
    'default-index=tags'
This also matches the document, which demonstrates that matching is partial: a query does not need to match the set element exactly. clear matches clear sky, and sky will also match clear sky.
But what about black sky:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=black sky' \
    'type=all' \
    'default-index=tags'
This also matches the document. This is an example of cross-element matching. With weightedset using indexing:index with match:text, multi-term queries match across elements.
This might be a good decision, as it increases recall. In some cases, however, we want to differentiate an exact match from a partial match during ranking, so that exact matches are ranked higher than partial matches.
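Cross-element matching can be contrasted with a stricter per-element rule in a small toy model. The sketch assumes whitespace tokenization; the second function is a hypothetical alternative shown only for contrast and does not correspond to a setting described in this guide.

```python
TAGS = {
    "no filter": 1,
    "light": 3,
    "black and white": 3,
    "clear sky": 2,
    "sunset dogs": 4,
}


def matches_across_elements(query, tags):
    """Each query term may be found in a different element of the set."""
    field_terms = {term for element in tags for term in element.split()}
    return all(term in field_terms for term in query.split())


def matches_within_one_element(query, tags):
    """Hypothetical stricter rule: all terms must occur in the same element."""
    terms = query.split()
    return any(all(t in element.split() for t in terms) for element in tags)


print(matches_across_elements("black sky", TAGS))     # True: black + sky in two tags
print(matches_within_one_element("black sky", TAGS))  # False: no single tag has both
print(matches_within_one_element("clear sky", TAGS))  # True: the tag "clear sky"
```

The gap between the two results for black sky is what a ranking feature needs to capture if exact element matches should outrank partial ones.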
Having explored querying and matching, it is time to focus on how to rank the matched documents.
You might not have noticed, but in the above examples each query produced a relevance score per hit. This score was calculated using the default rank profile, which in our case uses nativeRank.
We can start by analyzing other rank features by asking Vespa to produce them for us. We use match-features to return rank features with the retrieved documents. We explicitly list which ranking features we want calculated and returned. Notice that we don't change the actual scoring; we still use nativeRank as the scoring function:
schema photo {

    stemming: none

    document photo {

        field title type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field description type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field interestingness type float {
            indexing: summary | attribute
        }

        field tags type weightedset<string> {
            indexing: summary | index
            match: text
            index: enable-bm25
        }
    }

    fieldset default {
        fields: title, description, tags
    }

    rank-profile default {
        first-phase {
            expression: nativeRank
        }
        match-features {
            bm25(title)
            bm25(description)
            bm25(tags)
            nativeRank
            nativeRank(title)
            nativeRank(description)
            elementSimilarity(tags)
            elementCompleteness(tags).elementWeight
            elementCompleteness(tags).fieldCompleteness
            elementCompleteness(tags).queryCompleteness
            elementCompleteness(tags).completeness
        }
    }
}
Re-deploy with the changed rank profile:
$ vespa deploy --wait 300 my-app
Now we will see a list of features in the response:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear sky' \
    'type=any'
The output includes matchfeatures where we can see the various scores for the features:
Look especially at the elementCompleteness and elementSimilarity rank features, which are examples of features for indexed multivalued string fields.
We can also notice that elementCompleteness(tags).fieldCompleteness is 1.0, which means that the tag was matched exactly, and that elementCompleteness(tags).elementWeight outputs the weight of the best matched element.
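To make the elementCompleteness outputs more concrete, here is a simplified sketch of how such values could be derived for the sample tags. It is an approximation under loud assumptions, not Vespa's implementation: field completeness is taken as the fraction of an element's terms found in the query, query completeness as the fraction of query terms found in the element, and the best element is chosen by the plain average of the two (that equal weighting is an assumption of this sketch).

```python
TAGS = {
    "no filter": 1,
    "light": 3,
    "black and white": 3,
    "clear sky": 2,
    "sunset dogs": 4,
}


def best_element_completeness(query, tags):
    """Return simplified completeness values for the best matching element."""
    q_terms = query.split()
    best = None
    for element, weight in tags.items():
        e_terms = element.split()
        matched_in_element = sum(1 for t in e_terms if t in q_terms)
        if matched_in_element == 0:
            continue  # element does not match the query at all
        matched_in_query = sum(1 for t in q_terms if t in e_terms)
        field_c = matched_in_element / len(e_terms)
        query_c = matched_in_query / len(q_terms)
        candidate = {
            "element": element,
            "elementWeight": weight,
            "fieldCompleteness": field_c,
            "queryCompleteness": query_c,
            "completeness": (field_c + query_c) / 2,  # assumed equal weighting
        }
        if best is None or candidate["completeness"] > best["completeness"]:
            best = candidate
    return best


result = best_element_completeness("clear sky", TAGS)
print(result["fieldCompleteness"], result["elementWeight"])  # 1.0 2
```

For the query clear sky the sketch agrees with the values described above: the tag is matched exactly, and the element weight is the weight of that tag.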
The elementSimilarity(tags) ranking feature is very flexible and even allows us to override the calculation and output new features.
In this example we define two new ranking features:
- elementSimilarity(tags).sumWeight, which sums field completeness x weight over the matching elements.
- elementSimilarity(tags).maxWeight, which takes the max of field completeness x weight over the matching elements.

schema photo {

    stemming: none

    document photo {

        field title type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field description type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field interestingness type float {
            indexing: summary | attribute
        }

        field tags type weightedset<string> {
            indexing: summary | index
            match: text
            index: enable-bm25
        }
    }

    fieldset default {
        fields: title, description, tags
    }

    rank-profile default {
        rank-properties {
            elementSimilarity(tags).output.sumWeight: "sum(f*w)"
            elementSimilarity(tags).output.maxWeight: "max(f*w)"
        }
        first-phase {
            expression: nativeRank
        }
        match-features {
            bm25(title)
            bm25(description)
            bm25(tags)
            nativeRank
            nativeRank(title)
            nativeRank(description)
            elementSimilarity(tags)
            elementSimilarity(tags).sumWeight
            elementSimilarity(tags).maxWeight
            elementCompleteness(tags).elementWeight
            elementCompleteness(tags).fieldCompleteness
            elementCompleteness(tags).queryCompleteness
            elementCompleteness(tags).completeness
        }
    }
}
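The two output expressions, sum(f*w) and max(f*w), can be worked through by hand for the sample tags. The sketch below assumes f is the fraction of an element's terms that occur in the query and w is the element weight; the real feature may compute field completeness differently, so treat the numbers as illustrative.

```python
TAGS = {
    "no filter": 1,
    "light": 3,
    "black and white": 3,
    "clear sky": 2,
    "sunset dogs": 4,
}


def element_similarity_outputs(query, tags):
    """Compute sum(f*w) and max(f*w) over the elements that match the query."""
    q_terms = set(query.split())
    products = []
    for element, weight in tags.items():
        e_terms = element.split()
        matched = sum(1 for t in e_terms if t in q_terms)
        if matched:
            f = matched / len(e_terms)  # simplified field completeness
            products.append(f * weight)
    return {
        "sumWeight": sum(products),
        "maxWeight": max(products, default=0.0),
    }


print(element_similarity_outputs("clear sky", TAGS))
# {'sumWeight': 2.0, 'maxWeight': 2.0}
```

For clear sky only one element matches, so both outputs equal 1.0 x 2. The two outputs diverge when several elements match partially, where sumWeight rewards breadth and maxWeight rewards the single best element.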
Re-deploy with the changed rank profile:
$ vespa deploy --wait 300 my-app
Now we will see a list of features in the response:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear sky' \
    'type=any'
Each hit returned contains a matchfeatures field where we can see the various scores for the features.
Now, we can include these features in a ranking expression used in first-phase to actually change the ranking.
The actual best scoring function is data dependent.
Training the function with machine learning is by far the easiest way to find it.
The bag-of-words bm25 ranking feature is not normalized, so combining it in a linear function is challenging, as the score range of the feature is unbounded. To overcome this, and to allow easy exploration without changing the rank profile, make the weights in the function query inputs that can be overridden on a per-query basis:
first-phase {
    expression {
        query(titleWeight)*bm25(title) +
        query(descriptionWeight)*bm25(description) +
        query(tagWeight)*elementSimilarity(tags).maxWeight
    }
}
schema photo {

    stemming: none

    document photo {

        field title type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field description type string {
            indexing: summary | index
            match: text
            index: enable-bm25
        }

        field interestingness type float {
            indexing: summary | attribute
        }

        field tags type weightedset<string> {
            indexing: summary | index
            match: text
            index: enable-bm25
        }
    }

    fieldset default {
        fields: title, description, tags
    }

    rank-profile tunable inherits default {
        inputs {
            query(titleWeight): 2
            query(descriptionWeight): 1
            query(tagWeight): 2
        }
        rank-properties {
            elementSimilarity(tags).output.sumWeight: "sum(f*w)"
            elementSimilarity(tags).output.maxWeight: "max(f*w)"
        }
        first-phase {
            expression {
                query(titleWeight)*bm25(title) +
                query(descriptionWeight)*bm25(description) +
                query(tagWeight)*elementSimilarity(tags).maxWeight
            }
        }
        match-features {
            bm25(title)
            bm25(description)
            bm25(tags)
            elementSimilarity(tags).maxWeight
            firstPhase
        }
    }
}
Re-deploy:
$ vespa deploy --wait 300 my-app
Run a query with the new rank profile:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear sky' \
    'type=any' \
    'ranking=tunable'
With the function above, since 'clear sky' does not match the title or description fields, the bm25 features for those fields become zero. Our elementSimilarity(tags).maxWeight feature is 2.0, so with the default query(tagWeight) of 2 the first-phase expression becomes 4, which is reflected in the hit relevance score.
Change query(tagWeight) in the query request and observe that the relevance becomes 6.0:
$ vespa query 'yql=select * from photos where userQuery()' \
    'query=clear sky' \
    'type=any' \
    'ranking=tunable' \
    'input.query(tagWeight)=3'
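The relevance values 4 and 6.0 follow directly from the linear first-phase expression. As a worked check, the sketch below mirrors that expression in Python, with default weights taken from the inputs block of the tunable rank profile; the function and parameter names are chosen here for illustration only.

```python
def first_phase(bm25_title, bm25_description, max_weight,
                title_weight=2.0, description_weight=1.0, tag_weight=2.0):
    """Mirror of the tunable first-phase expression, evaluated offline."""
    return (title_weight * bm25_title
            + description_weight * bm25_description
            + tag_weight * max_weight)


# 'clear sky' matches neither title nor description, so both bm25 values are 0;
# elementSimilarity(tags).maxWeight is 2.0 for the exactly matched tag.
print(first_phase(0.0, 0.0, 2.0))                  # 4.0 with the default tagWeight
print(first_phase(0.0, 0.0, 2.0, tag_weight=3.0))  # 6.0 with tagWeight overridden
```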
Similarly, we could add a document-only signal to our ranking function:
inputs {
    query(titleWeight): 2
    query(descriptionWeight): 1
    query(tagWeight): 2
    query(staticWeight): 1
}
rank-properties {
    elementSimilarity(tags).output.sumWeight: "sum(f*w)"
    elementSimilarity(tags).output.maxWeight: "max(f*w)"
}
first-phase {
    expression {
        query(titleWeight)*bm25(title) +
        query(descriptionWeight)*bm25(description) +
        query(tagWeight)*elementSimilarity(tags).maxWeight +
        query(staticWeight)*attribute(interestingness)
    }
}
That concludes the matching and ranking experiments. To shut down the container:
$ docker rm -f vespa