Many of our users use Vespa to power large-scale RAG applications. This blueprint aims to exemplify many of the best practices we have learned while supporting these users. While many RAG tutorials exist, this blueprint provides a customizable template you can adapt to your own use case.
This tutorial will show how to develop a high-quality RAG application with an evaluation-driven mindset, while serving as a resource you can revisit when making informed choices for your own application.
We will guide you through the following steps:
All the accompanying code can be found in our sample app repo.
Each step will contain reasoning behind the choices and design of the blueprint, as well as pointers for customizing to your own application.
Prerequisites:
- Disk space for the vespaengine/vespa container image, plus headroom for the data. Read more.
The sample use case is a document search application, for a user who wants to get answers and insights quickly from a document collection containing company documents, notes, learning material, and training logs. To make the blueprint more realistic, we required a dataset with more structured fields than are commonly found in public datasets. Therefore, we used a Large Language Model (LLM) to generate a custom one.
It is a toy example, with only 100 documents, but we think it will illustrate the necessary concepts. You can also feel confident that the blueprint will provide a starting point that can scale as you want, with minimal changes.
Below you can see a sample document from the dataset:
{
"put": "id:doc:doc::78",
"fields": {
"created_timestamp": 1717750000,
"modified_timestamp": 1717750000,
"text": "# Feature Brainstorm: SynapseFlow Model Monitoring Dashboard v1\n\n**Goal:** Provide users with basic insights into their deployed model's performance and health.\n\n**Key Metrics to Display:**\n- **Inference Latency:** Avg, p95, p99 (Histogram).\n- **Request Rate / Throughput:** Requests per second/minute.\n- **Error Rate:** Percentage of 5xx errors.\n- **CPU/Memory Usage:** Per deployment/instance.\n- **GPU Usage / Temp (if applicable).**\n\n**Visualizations:**\n- Time series graphs for all key metrics.\n- Ability to select time range (last hour, day, week).\n- Filter by deployment ID.\n\n**Data Sources:**\n- Prometheus metrics from model server (see `code_review_pr123_metrics.md`).\n- Kubernetes metrics (via Kube State Metrics or cAdvisor).\n\n**Future Ideas (v2+):**\n- Data drift detection.\n- Concept drift detection.\n- Alerting on anomalies or threshold breaches.\n- Custom metric ingestion.\n\n## <MORE_TEXT:HERE> (UI mock-up sketches, specific Prometheus queries)",
"favorite": true,
"last_opened_timestamp": 1717750000,
"open_count": 3,
"title": "feature_brainstorm_monitoring_dashboard.md",
"id": "78"
}
}
To evaluate the quality of the RAG application, we also need a set of representative queries with annotated relevant documents. Crucially, these queries should thoroughly cover your expected use case. More is better, but some eval is always better than none.
We used gemini-2.5-pro to create our queries and relevant document labels. Please check out our blog post to learn more about using LLM-as-a-judge.
We decided to generate some queries that need several documents to provide a good answer, and some that only need one document. If these queries are representative of the use case, they can be a great starting point for creating an initial ranking expression used to retrieve and rank candidate documents. But it can (and should) also be improved, for example by collecting user interaction data, human labeling, and/or using an LLM to generate relevance feedback on top of the initial ranking expression.
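For illustration, a labeled query can be represented as simply as a query string plus the ids of the documents judged relevant. The snippet below is a hypothetical sketch of such a record (the query text is invented, and the exact file format used by the sample app's eval scripts may differ); the doc ids correspond to the relevance labels for alex_q_01 shown in the training-data table later in this tutorial.

# Hypothetical example of a labeled evaluation query; format is illustrative only.
labeled_query = {
    "query_id": "alex_q_01",
    "query": "How do we monitor deployed SynapseFlow models?",  # invented query text
    "relevant_docs": ["50", "82", "1"],  # ids of documents judged relevant for this query
}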
Here is the schema that we will use for our sample application.
# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
schema doc {
    document doc {
        field id type string {
            indexing: summary | attribute
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field text type string {
        }
        field created_timestamp type long {
            indexing: attribute | summary
        }
        field modified_timestamp type long {
            indexing: attribute | summary
        }
        field last_opened_timestamp type long {
            indexing: attribute | summary
        }
        field open_count type int {
            indexing: attribute | summary
        }
        field favorite type bool {
            indexing: attribute | summary
        }
    }
    field title_embedding type tensor<int8>(x[96]) {
        indexing: input title | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
    field chunks type array<string> {
        indexing: input text | chunk fixed-length 1024 | summary | index
        index: enable-bm25
    }
    field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
        indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
    fieldset default {
        fields: title, chunks
    }
    document-summary no-chunks {
        summary id {}
        summary title {}
        summary created_timestamp {}
        summary modified_timestamp {}
        summary last_opened_timestamp {}
        summary open_count {}
        summary favorite {}
    }
    document-summary top_3_chunks {
        from-disk
        summary chunks_top3 {
            source: chunks
            select-elements-by: top_3_chunk_sim_scores # this also needs to be added as a summary-feature in the rank profile
        }
    }
}
Keep reading for an explanation and reasoning behind the choices in the schema.
When building a RAG application, your first key decision is choosing the "searchable unit." This is the basic block of information your system will search through and return as context to the LLM. For instance, if you have millions of documents, some hundreds of pages long, what should be your searchable unit?
Consider these points when selecting your searchable unit:
We recommend erring on the side of using slightly larger units.
With Vespa, it is now possible to return only the top k most relevant chunks of a document, and include and combine both document-level and chunk-level features in ranking.
Assume you have chosen a document as your searchable unit. Your documents may then contain text index fields of highly variable lengths. Consider for example a corpus of web pages. Some might be very long, while the average is well within the recommended size. See scaling retrieval size for more details.
While we recommend implementing guards against overly long documents in your feeding pipeline, you probably still do not want to return every chunk of the top k documents to an LLM for RAG.
In Vespa, we now have a solution for this problem. Below, we show how you can score both documents as well as individual chunks, and use that score to select the best chunks to be returned in a summary, instead of returning all chunks belonging to the top k ranked documents.
Compute closeness per chunk in a ranking function, and use elementwise(bm25(chunks), i, double) for a per-chunk text signal (see the rank feature reference). Elementwise rank functions and filtering on the content nodes are now available.
This allows you to pick a large document as the searchable unit while still addressing the potential drawbacks many encounter: you can index larger units, yet avoid data duplication and performance issues by returning only the most relevant parts.
Vespa also supports automatic chunking in the indexing language.
Here are the parts of the schema that define the searchable unit as a document with a text field and automatically chunk it into smaller parts of 1024 characters, each of which is embedded and indexed separately:
field chunks type array<string> {
    indexing: input text | chunk fixed-length 1024 | summary | index
    index: enable-bm25
}
field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
    indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
In Vespa, we can specify which chunks are returned using a summary feature, see docs for details. For this blueprint, we will return the top 3 chunks based on the similarity score of the chunk embeddings, which is calculated in the ranking phase. Note that this could be any chunk-level summary feature defined in your rank profile.
Here is how the summary feature is calculated in the rank-profile:
# This function unpacks the bits of each dimension of the mapped chunk_embeddings attribute tensor
function chunk_emb_vecs() {
    expression: unpack_bits(attribute(chunk_embeddings))
}
# This function calculates the dot product between the query embedding vector and the chunk embeddings (both are now float) over the x dimension
function chunk_dot_prod() {
    expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
}
# This function calculates the L2 norm of an input tensor
function vector_norms(t) {
    expression: sqrt(sum(pow(t, 2), x))
}
# Here we calculate cosine similarity by dividing the dot product by the product of the L2 norms of the query embedding and the chunk embeddings
function chunk_sim_scores() {
    expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
}
function top_3_chunk_text_scores() {
    expression: top(3, chunk_text_scores())
}
function top_3_chunk_sim_scores() {
    expression: top(3, chunk_sim_scores())
}
summary-features {
    top_3_chunk_sim_scores
}
Note that we want to use the float representation of the query embedding, and thus also need to convert the binary embeddings of the chunks to float. After that, we can calculate the similarity between the query embedding and the chunk embeddings using cosine similarity (the dot product, normalized by the norms of the embeddings).
See ranking expressions for more details on the top function, and other functions available for ranking expressions.
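To make the tensor math concrete, here is a small numpy sketch of the same computation: a float query embedding scored against binary chunk embeddings that are unpacked back to floats. The dimensions follow the schema (96 int8 values packing 768 bits), but the code is purely illustrative and not part of the sample app.

import numpy as np

# Illustrative only: mirrors chunk_emb_vecs / chunk_dot_prod / chunk_sim_scores above.
rng = np.random.default_rng(0)

query_float = rng.normal(size=768).astype(np.float32)           # query(float_embedding)
chunk_bits = rng.integers(0, 2, size=(5, 768), dtype=np.uint8)  # 5 chunks, 768 binary dims

# What pack_bits stores: 768 bits packed into 96 int8 values per chunk
packed = np.packbits(chunk_bits, axis=1).astype(np.int8)        # shape (5, 96)

# What unpack_bits recovers at ranking time: back to 768 float values per chunk
unpacked = np.unpackbits(packed.view(np.uint8), axis=1).astype(np.float32)  # shape (5, 768)

# Cosine similarity between the float query and each unpacked chunk vector
dot = unpacked @ query_float
sims = dot / (np.linalg.norm(unpacked, axis=1) * np.linalg.norm(query_float))

# top(3, ...) in the rank profile keeps the 3 best chunks; here we just show their indices and scores
top3 = np.argsort(sims)[::-1][:3]
print(top3, sims[top3])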
Now, we can use this summary feature in our document summary to return the top 3 chunks of the document, which will be used as context for the LLM. Note that we can also define a document summary that returns all chunks, which might be useful for another use case, such as deep research.
document-summary top_3_chunks {
    from-disk
    summary chunks_top3 {
        source: chunks
        select-elements-by: top_3_chunk_sim_scores # this also needs to be added as a summary-feature in the rank profile
    }
}
We recommend indexing different textual content as separate indexes. These can be searched together using fieldsets. In our schema, this is exemplified by the sections below, which define the title and chunks fields as separate indexed text fields.
...
field title type string {
    indexing: index | summary
    index: enable-bm25
}
field chunks type array<string> {
    indexing: input text | chunk fixed-length 1024 | summary | index
    index: enable-bm25
}
Whether you should have separate embedding fields depends on whether the added memory usage is justified by the quality improvement you could get from the additional embedding field. We chose to index both a title_embedding and a chunk_embeddings field for this blueprint, as using binary vectors keeps the cost low.
field title_embedding type tensor<int8>(x[96]) {
    indexing: input title | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
field chunk_embeddings type tensor<int8>(chunk{}, x[96]) {
    indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
}
Indexing several embedding fields may not be worth the cost for you. Evaluate whether the cost-quality trade-off is worth it for your application.
If you have different vector space representations of your document (e.g images), indexing them separately is likely worth it, as they are likely to provide signals that are complementary to the text-based embeddings.
We recommend modeling metadata and signals as structured fields in your schema. Below are some general recommendations, as well as the implementation in our blueprint schema.
Metadata — knowledge about your data:
In our blueprint schema, we include these metadata fields to demonstrate these concepts:
- id - document identifier
- title - document name/filename for display and text matching
- created_timestamp, modified_timestamp - temporal metadata for filtering and ranking by recency

Signals — observations about your data:
In our blueprint schema, we include several of these signals:
- last_opened_timestamp - user engagement signal for personalization
- open_count - popularity signal indicating document importance
- favorite - explicit user preference signal, can be used for boosting relevant content

These fields are configured as attribute | summary to enable efficient filtering, sorting, and grouping operations while being returned in search results. The timestamp fields allow for temporal filtering (e.g., "recent documents") and recency-based ranking, while usage signals like open_count and favorite can boost frequently accessed or explicitly marked important documents.
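To illustrate how such fields can be used at query time, here is a hedged pyvespa sketch that adds a hard filter on favorite and a recency filter on modified_timestamp to a hybrid query. The endpoint, timestamp cutoff, and query text are placeholders, and it assumes the hybrid query profile and no-chunks document summary introduced elsewhere in this tutorial.

from vespa.application import Vespa

# Illustrative only: endpoint, cutoff timestamp and query text are placeholders.
app = Vespa(url="http://localhost", port=8080)

yql = """
select * from doc
where (userInput(@query) or
      ({targetHits:100}nearestNeighbor(chunk_embeddings, embedding)))
  and favorite = true
  and modified_timestamp > 1715000000
"""

response = app.query(body={
    "queryProfile": "hybrid",             # picks up embeddings, ranking profile and coefficients
    "query": "model monitoring dashboard",
    "yql": yql,                           # override the profile's YQL to add the structured filters
    "presentation.summary": "no-chunks",  # metadata-only summary defined in the schema
})
for hit in response.hits:
    print(hit["relevance"], hit["fields"]["title"], hit["fields"]["open_count"])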
Consider parent-child relationships for low-cardinality metadata. Most large scale RAG application schemas contain at least a hundred structured fields.
Vespa supports both local LLMs and any OpenAI-compatible API for LLM generation. For details, see LLMs in Vespa.
The recommended way to provide an API key is by using the secret store in Vespa Cloud.
To enable this, you need to create a vault (if you don't have one already) and a secret through the Vespa Cloud console. If your vault is named sample-apps and contains a secret with the name openai-api-key, you would use the following configuration in your services.xml to set up the OpenAI client to use that secret:
<secrets>
    <openai-api-key vault="sample-apps" name="openai-api-key" />
</secrets>
<!-- Set up the client to OpenAI -->
<component id="openai" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
        <apiKeySecretRef>openai-api-key</apiKeySecretRef>
    </config>
</component>
Alternatively, for local deployments, you can set the X-LLM-API-KEY header in your query to use the OpenAI client for generation.
To test generation using the OpenAI client, post a query that runs the openai search chain, with format=sse. (Use format=json for a streaming JSON response including both the search hits and the LLM-generated tokens.)
vespa query \
--timeout 60 \
--header="X-LLM-API-KEY:<your-api-key>" \
yql='select *
from doc
where userInput(@query) or
({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))' \
query="Summarize the key architectural decisions documented for SynapseFlow's v0.2 release." \
searchChain=openai \
format=sse \
hits=5
This section provides recommendations for structuring your Vespa application package. See also the application package docs for more details on the application package structure. Note that this is not mandatory, and it might be simpler to start without query profiles and rank profiles, but as you scale out your application, it will be beneficial to have a well-structured application package.
Consider the following structure for our application package:
app
├── models
│   └── lightgbm_model.json
├── schemas
│   ├── doc
│   │   ├── collect-second-phase.profile
│   │   ├── collect-training-data.profile
│   │   ├── learned-linear.profile
│   │   ├── match-only.profile
│   │   └── second-with-gbdt.profile
│   └── doc.sd
├── search
│   └── query-profiles
│       ├── deepresearch-with-gbdt.xml
│       ├── deepresearch.xml
│       ├── hybrid-with-gbdt.xml
│       ├── hybrid.xml
│       ├── rag-with-gbdt.xml
│       └── rag.xml
├── security
│   └── clients.pem
└── services.xml
You can see that we have separated the query profiles and rank profiles into their own directories.
Query profiles let you maintain collections of query parameters in one file: clients choose a query profile, and the profile sets everything else. This lets us change behavior for a use case without involving clients.
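For example, with pyvespa a client only needs to pick a profile and pass the user query (the endpoint is a placeholder; the hybrid profile is defined below):

from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # endpoint is a placeholder

# The client only chooses a query profile and passes the user query;
# YQL, ranking profile, embeddings and hit count all come from the profile.
response = app.query(body={
    "queryProfile": "hybrid",
    "query": "Summarize the key architectural decisions documented for SynapseFlow's v0.2 release.",
})
for hit in response.hits:
    print(hit["id"], hit["relevance"])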
Let us take a closer look at three of the query profiles in our sample application: hybrid, rag, and deepresearch.
The hybrid query profile will be the one used by clients for traditional search, where the user is presented a limited number of hits. Our other query profiles will inherit this one (but may override some fields).
<query-profile id="hybrid">
<field name="schema">doc</field>
<field name="ranking.features.query(embedding)">embed(@query)</field>
<field name="ranking.features.query(float_embedding)">embed(@query)</field>
<field name="yql">
select *
from %{schema}
where userInput(@query) or
({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
</field>
<field name="hits">10</field>
<field name="ranking.profile">learned-linear</field>
<field name="presentation.summary">top_3_chunks</field>
</query-profile>
This will be the query profile where the openai searchChain is added, to generate a response based on the retrieved context. Here, we set some configuration that is specific to this use case.
<query-profile id="rag" inherits="hybrid">
<field name="hits">50</field>
<field name="searchChain">openai</field>
<field name="presentation.format">sse</field>
</query-profile>
Again, we will inherit from the hybrid query profile, but override with a targetHits value of 10 000 (original was 100) that prioritizes recall over latency. We will also increase the number of hits to be returned, and increase the timeout to 5 seconds.
<query-profile id="deepresearch" inherits="hybrid">
<field name="yql">
select *
from %{schema}
where userInput(@query) or
({label:"title_label", targetHits:10000}nearestNeighbor(title_embedding, embedding)) or
({label:"chunks_label", targetHits:10000}nearestNeighbor(chunk_embeddings, embedding))
</field>
<field name="hits">100</field>
<field name="timeout">5s</field>
</query-profile>
We will leave out the LLM generation for this one, and let an LLM agent on the client side be responsible for using this API call as a tool and for determining whether enough relevant context has been retrieved to answer.
Note that the targetHits parameter set here does not really make sense until your dataset reaches a certain scale.
As we add more rank profiles, we can also inherit the existing query profiles, only overriding the ranking.profile field to use a different rank profile. This is what we have done for the rag-with-gbdt and deepresearch-with-gbdt query profiles, which use the second-with-gbdt rank profile instead of the learned-linear rank profile.
<query-profile id="rag-with-gbdt" inherits="hybrid-with-gbdt">
<field name="hits">50</field>
<field name="searchChain">openai</field>
<field name="presentation.format">sse</field>
</query-profile>
To build a great RAG application, assume you'll need many ranking models. This allows you to bucket-test alternatives continuously and to serve different use cases, including rank profiles for data collection in different phases as well as the profiles used in production.
Separate common functions and setup into parent rank profiles, and use .profile files.
Before we move on, it might be useful to recap Vespa's phased ranking approach.
Below is a schematic overview of how to think about retrieval and ranking for this RAG blueprint. Since we are developing this as a tutorial using a small toy dataset, the application can be deployed on a single machine, using a single Docker container, where only one container node and one content node will run. This is obviously not the case for most real-world RAG applications, so it is crucial to keep this in mind as you scale your application.
It is worth noting that parameters such as targetHits (for the match phase) and rerank-count (for the first and second phase) are applied per content node. Also note that the stateless container nodes can be scaled independently to handle increased query load.
This section will contain important considerations for the retrieval-phase of a RAG application in Vespa.
The goal of the retrieval phase is to retrieve candidate documents efficiently, and maximize recall, without exposing too many documents to ranking.
As you could see from the schema, we create and index both a text representation and a vector representation for each chunk of the document. This will allow us to use both text-based features and semantic features for both recall and ranking.
The text and vector representation complement each other well:
Our recommendation is to default to hybrid retrieval:
select *
from doc
where userInput(@query) or
({label:"title_label", targetHits:1000}nearestNeighbor(title_embedding, embedding)) or
({label:"chunks_label", targetHits:1000}nearestNeighbor(chunk_embeddings, embedding))
In generic domains, or if you have fine-tuned an embedding model for your specific data, you might consider a vector-only approach:
select *
from doc
where rank({targetHits:10000}nearestNeighbor(embeddings_field, query_embedding), userInput(@query))
Notice that only the first argument of the rank operator is used to determine whether a document is a match, while all arguments are used for calculating rank features. This means we can do vector-only matching, but still use text-based features such as bm25 and nativeRank for ranking.
Note that if you do this, it makes sense to increase the number of targetHits for the nearestNeighbor operator.
For our sample application, we add three different retrieval operators (combined with OR): one weakAnd for text matching, and two nearestNeighbor operators for vector matching, one for the title and one for the chunks. This allows us to retrieve relevant documents based on both text and vector similarity, while also allowing us to return the most relevant chunks of the documents.
select *
from doc
where userInput(@query) or
({targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
The choice of embedding model is a trade-off between inference time (both indexing and query time), memory usage (embedding dimensions), and quality. There are many good open-source models available, and we recommend checking out the MTEB leaderboard and looking at the Retrieval column to gauge performance, while also considering the memory usage, vector dimensions, and context length of the model.
See model hub for a list of provided models ready to use with Vespa. See also Huggingface Embedder for details on using other models (exported as ONNX) with Vespa.
In addition to dense vector representation, Vespa supports sparse embeddings (token weights) and multi-vector (ColBERT-style) embeddings. See our example notebook of using the bge-m3 model, which supports both, with Vespa.
Vespa also supports Matryoshka embeddings, which can be a great way of reducing inference cost for retrieval phases, by using a subset of the embedding dimensions, while using more dimensions for increased precision in the later ranking phases.
For domain-specific applications or less popular languages, you may want to consider finetuning a model on your own data.
Another decision to make is which precision you will use for your embeddings. See binarization docs for an introduction to binarization in Vespa.
For most cases, binary vectors (in Vespa, packed into an int8 representation) will provide an attractive trade-off, especially for recall during the match phase. Consider whether this holds true for your application. Here is an example of a binarized chunk embedding field:
field binary_chunk_embeddings type tensor<int8>(chunk{}, x) {
indexing: input text | chunk fixed-length 1024 | embed | pack_bits | attribute | index
attribute { distance-metric: hamming }
}
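To make the storage trade-off concrete, here is a small illustrative numpy sketch (not from the sample app) of what binarization does: a 768-dimensional float32 embedding (3072 bytes) is thresholded at zero and packed into 96 int8 values (96 bytes), a 32x reduction, and distances between packed vectors become hamming distances (bit counts).

import numpy as np

# Illustrative binarization: threshold a float embedding at 0 and pack the sign bits.
rng = np.random.default_rng(0)
float_emb = rng.normal(size=768).astype(np.float32)   # 768 * 4 bytes = 3072 bytes

bits = (float_emb > 0).astype(np.uint8)               # one bit of information per dimension
packed = np.packbits(bits).astype(np.int8)            # 96 bytes, matches tensor<int8>(x[96])

# Hamming distance between two packed vectors = number of differing bits
other = np.packbits((rng.normal(size=768) > 0).astype(np.uint8)).astype(np.int8)
hamming = np.unpackbits(np.bitwise_xor(packed, other).view(np.uint8)).sum()
print(float_emb.nbytes, packed.nbytes, hamming)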
If you need higher-precision vector similarity, you can use bfloat16 precision, and consider paging these vectors to disk to avoid a large memory cost. Note that this means that when this field is accessed in ranking, the vectors will need to be read from disk, so you need to restrict the number of hits that access this field to avoid performance issues.
field chunk_embeddings type tensor<bfloat16>(chunk{}, x) {
indexing: input text | chunk fixed-length 1024 | embed | attribute
attribute: paged
}
For example, if you want to calculate closeness for a paged embedding vector in first-phase, consider configuring your retrieval operators (typically weakAnd and/or nearestNeighbor, optionally combined with filters) so that not too many hits are matched. Another option is to enable match-phase limiting, see the match-phase docs. In essence, you restrict the number of matches by specifying an attribute field.
In our blueprint, we chose to index binary vectors of the documents. This does not prevent us from using the float representation of the query embedding, though. By unpacking the binary document chunk embeddings to their float representations (using unpack_bits), we can calculate the similarity between query and document with slightly higher precision using a float-binary dot product, instead of hamming distance (binary-binary).
Below, you can see how we can do this:
rank-profile collect-training-data {
    inputs {
        query(embedding) tensor<int8>(x[96])
        query(float_embedding) tensor<float>(x[768])
    }
    function chunk_emb_vecs() {
        expression: unpack_bits(attribute(chunk_embeddings))
    }
    function chunk_dot_prod() {
        expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
    }
    function vector_norms(t) {
        expression: sqrt(sum(pow(t, 2), x))
    }
    function chunk_sim_scores() {
        expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
    }
    function top_3_chunk_text_scores() {
        expression: top(3, chunk_text_scores())
    }
    function top_3_chunk_sim_scores() {
        expression: top(3, chunk_sim_scores())
    }
}
Vespa gives you extensive control over linguistics. You can decide match mode, stemming, normalization, or control derived tokens.
It is also possible to use more specific operators than weakAnd to match only close occurrences (near/onear) or multiple alternatives (equiv), to weight items, to set connectivity, and to apply query-rewrite rules.
Don’t use this to increase recall — improve your embedding model instead.
Consider using it to improve precision when needed.
To know whether your retrieval phase is working well, you need to measure recall, the number of total matches, and the reported time spent. We can use VespaMatchEvaluator from the pyvespa client library to do this.
For this sample application, we set up an evaluation script that compares three different retrieval strategies, let us call them "retrieval arms":
- Semantic: the two nearestNeighbor operators.
- Text: userQuery() (weakAnd).
- Hybrid: the combination of both, using OR.
It is recommended to use a ranking profile that does not use any first-phase ranking, to run the match-phase evaluation faster.
The evaluation will output metrics such as match recall, the number of matched documents, and search time, and can optionally write per-query results (write_verbose=True) to identify "offending" queries with regards to recall or performance. This will be valuable input for tuning each of them.
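If you just want a quick manual sanity check before reaching for the evaluator, a rough sketch could look like the following (assumptions: a local endpoint, the toy-sized corpus, a hypothetical relevant_docs mapping like the labeled query shown earlier, and that the match-only rank profile in the application package declares the query embedding input):

from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # endpoint is a placeholder

# Hypothetical labels: query text -> ids of relevant documents.
relevant_docs = {"How do we monitor deployed SynapseFlow models?": {"50", "82", "1"}}

yql = """
select * from doc
where userInput(@query) or
      ({targetHits:100}nearestNeighbor(title_embedding, embedding)) or
      ({targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
"""

for query, relevant in relevant_docs.items():
    response = app.query(body={
        "yql": yql,
        "query": query,
        "ranking.profile": "match-only",                        # assumed to skip expensive ranking
        "ranking.features.query(embedding)": "embed(@query)",
        "presentation.summary": "no-chunks",
        "hits": 100,                                            # toy corpus: fetch everything matched
    })
    matched = {hit["fields"]["id"] for hit in response.hits}
    total = response.json["root"]["fields"]["totalCount"]
    recall = len(matched & relevant) / len(relevant)
    print(f"matched {total} docs, recall {recall:.2f} for: {query}")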
Run the complete evaluation script from the eval/ directory to see detailed comparisons between all three retrieval strategies on your dataset.
select * from doc where
({targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
Metric | Value |
---|---|
Match Recall | 1.0000 |
Average Recall per Query | 1.0000 |
Total Relevant Documents | 51 |
Total Matched Relevant | 51 |
Average Matched per Query | 100.0000 |
Total Queries | 20 |
Search Time Average (s) | 0.0090 |
Search Time Q50 (s) | 0.0060 |
Search Time Q90 (s) | 0.0193 |
Search Time Q95 (s) | 0.0220 |
The userQuery operator is just a convenience wrapper for weakAnd, see reference/query-language-reference.html. The default targetHits for weakAnd is 100, but it is overridable.
select * from doc where userQuery()
Metric | Value |
---|---|
Match Recall | 1.0000 |
Average Recall per Query | 1.0000 |
Total Relevant Documents | 51 |
Total Matched Relevant | 51 |
Average Matched per Query | 88.7000 |
Total Queries | 20 |
Search Time Average (s) | 0.0071 |
Search Time Q50 (s) | 0.0060 |
Search Time Q90 (s) | 0.0132 |
Search Time Q95 (s) | 0.0171 |
select * from doc where
({targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({targetHits:100}nearestNeighbor(chunk_embeddings, embedding)) or
userQuery()
Metric | Value |
---|---|
Match Recall | 1.0000 |
Average Recall per Query | 1.0000 |
Total Relevant Documents | 51 |
Total Matched Relevant | 51 |
Average Matched per Query | 100.0000 |
Total Queries | 20 |
Search Time Average (s) | 0.0076 |
Search Time Q50 (s) | 0.0055 |
Search Time Q90 (s) | 0.0150 |
Search Time Q95 (s) | 0.0201 |
We can see that all queries match all relevant documents, which is expected, since we use targetHits:100 in the nearestNeighbor operators, and this is also the default for weakAnd (and userQuery). By setting targetHits lower, we can see that recall will drop.
In general, you can increase recall by increasing targetHits in your retrieval operators (e.g., nearestNeighbor, weakAnd).
Conversely, if you want to reduce the latency of one of your retrieval "arms" at the cost of a small trade-off in recall, you can tune the weakAnd parameters. This has the potential to 3x the performance of the weakAnd part of your query, see the blog post.
Below are some empirically found default parameters that work well for most use cases:
rank-profile optimized inherits baseline {
    filter-threshold: 0.05
    weakand {
        stopword-limit: 0.6
        adjust-target: 0.01
    }
}
See the reference for more details on the weakAnd parameters. These can also be set as query parameters.
For the first-phase ranking, we must use a computationally cheap function, as it is applied to all documents matched in the retrieval phase. For many applications, this can amount to millions of candidate documents.
Common options include (learned) linear combination of features including text similarity features, vector closeness, and metadata. It could also be a heuristic handwritten function.
Text features should include nativeRank or bm25 — not fieldMatch (it is too expensive).
When deciding between bm25 and nativeRank, weigh the additional cost against the quality gain: for this blueprint, we opted for bm25 in the first phase, but you could evaluate and compare to see whether the additional cost of using nativeRank is justified by increased quality.
The features we will use for first-phase ranking are not normalized (i.e., they have values in different ranges). This means we can't just weight them equally and expect that to be a good proxy for relevance.
Below, we will show how to find (learn) optimal weights (coefficients) for each feature, so that we can combine them into a ranking expression of the form:
a * bm25(title) + b * bm25(chunks) + c * max_chunk_sim_scores() + d * max_chunk_text_scores() + e * avg_top_3_chunk_sim_scores() + f * avg_top_3_chunk_text_scores()
The first thing we need to do is collect training data. We do this using the VespaFeatureCollector from the pyvespa library.
These are the features we will include:
rank-profile collect-training-data {
    match-features {
        bm25(title)
        bm25(chunks)
        max_chunk_sim_scores
        max_chunk_text_scores
        avg_top_3_chunk_sim_scores
        avg_top_3_chunk_text_scores
    }
    # Since we need both binary embeddings (for match-phase) and float embeddings (for ranking), we define two inputs.
    inputs {
        query(embedding) tensor<int8>(x[96])
        query(float_embedding) tensor<float>(x[768])
    }
    rank chunks {
        element-gap: 0 # Fixed-length chunking should not cause any positional gap between elements
    }
    function chunk_text_scores() {
        expression: elementwise(bm25(chunks), chunk, float)
    }
    function chunk_emb_vecs() {
        expression: unpack_bits(attribute(chunk_embeddings))
    }
    function chunk_dot_prod() {
        expression: reduce(query(float_embedding) * chunk_emb_vecs(), sum, x)
    }
    function vector_norms(t) {
        expression: sqrt(sum(pow(t, 2), x))
    }
    function chunk_sim_scores() {
        expression: chunk_dot_prod() / (vector_norms(chunk_emb_vecs()) * vector_norms(query(float_embedding)))
    }
    function top_3_chunk_text_scores() {
        expression: top(3, chunk_text_scores())
    }
    function top_3_chunk_sim_scores() {
        expression: top(3, chunk_sim_scores())
    }
    function avg_top_3_chunk_text_scores() {
        expression: reduce(top_3_chunk_text_scores(), avg, chunk)
    }
    function avg_top_3_chunk_sim_scores() {
        expression: reduce(top_3_chunk_sim_scores(), avg, chunk)
    }
    function max_chunk_text_scores() {
        expression: reduce(chunk_text_scores(), max, chunk)
    }
    function max_chunk_sim_scores() {
        expression: reduce(chunk_sim_scores(), max, chunk)
    }
    first-phase {
        expression {
            # Not used in this profile
            bm25(title) +
            bm25(chunks) +
            max_chunk_sim_scores() +
            max_chunk_text_scores()
        }
    }
    second-phase {
        expression: random
    }
}
As you can see, we rely on bm25 and various vector similarity features (both document-level and chunk-level) for the first-phase ranking. These are relatively cheap to calculate, and will likely provide good enough ranking signals for the first phase.
Running the command below will save a .csv-file with the collected features, which can be used to train a ranking model for the first-phase ranking.
python eval/collect_pyvespa.py --collect_matchfeatures
Our output file looks like this:
query_id | doc_id | relevance_label | relevance_score | match_avg_top_3_chunk_sim_scores | match_avg_top_3_chunk_text_scores | match_bm25(chunks) | match_bm25(title) | match_max_chunk_sim_scores | match_max_chunk_text_scores |
---|---|---|---|---|---|---|---|---|---|
alex_q_01 | 50 | 1 | 0.660597 | 0.248329 | 8.444725 | 7.717984 | 0. | 0.268457 | 8.444725 |
alex_q_01 | 82 | 1 | 0.649638 | 0.225300 | 12.327676 | 18.611592 | 2.453409 | 0.258905 | 15.644889 |
alex_q_01 | 1 | 1 | 0.245849 | 0.358027 | 15.100841 | 23.010389 | 4.333828 | 0.391143 | 20.582403 |
alex_q_01 | 28 | 0 | 0.988250 | 0.278074 | 0.179929 | 0.197420 | 0. | 0.278074 | 0.179929 |
alex_q_01 | 23 | 0 | 0.968268 | 0.203709 | 0.182603 | 0.196956 | 0. | 0.203709 | 0.182603 |
Note that the relevance_score in this table is just the random expression we used in the second-phase of the collect-training-data rank profile, and it will be dropped before training the model.
As you recall, a first-phase ranking expression must be cheap to evaluate. This most often means a heuristic handwritten combination of match features, or a linear model trained on match features.
We will demonstrate how to train a simple Logistic Regression model to predict relevance based on the collected match features. The full training script can be found in the sample-apps repository.
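In essence, such a training script does something like the sketch below (illustrative: the file name is a placeholder, the column names follow the collected CSV shown above, and the actual script in the sample app may differ in details). The key step is transforming the coefficients learned on standardized features back to the scale of the original features, so they can be used directly in a Vespa ranking expression.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative sketch; file name is a placeholder for the collected feature CSV.
df = pd.read_csv("eval/output/training_data.csv")

y = df["relevance_label"]
X = df.drop(columns=["query_id", "doc_id", "relevance_label", "relevance_score"])

# Standardize so features on very different scales (bm25 vs. cosine similarity) are comparable.
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Transform coefficients back to the original (unscaled) feature space:
# w_scaled * (x - mean) / std + b  ==  (w_scaled / std) * x + (b - sum(w_scaled * mean / std))
w = model.coef_[0] / scaler.scale_
b = model.intercept_[0] - np.sum(model.coef_[0] * scaler.mean_ / scaler.scale_)

for name, coef in zip(X.columns, w):
    print(f"{name:35s} {coef:.6f}")
print(f"{'Intercept':35s} {b:.6f}")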
Some "gotchas" to be aware of:
query_id
and doc_id
columns before training.Run the training script
python eval/train_logistic_regression.py
Expect output like this:
------------------------------------------------------------
Cross-Validation Results (5-Fold, Standardized)
------------------------------------------------------------
Metric | Mean | Std Dev
------------------------------------------------------------
Accuracy | 0.9024 | 0.0294
Precision | 0.9236 | 0.0384
Recall | 0.8818 | 0.0984
F1-Score | 0.8970 | 0.0415
Log Loss | 0.2074 | 0.0353
ROC AUC | 0.9749 | 0.0103
Avg Precision | 0.9764 | 0.0117
------------------------------------------------------------
Transformed Coefficients (for original unscaled features):
--------------------------------------------------
avg_top_3_chunk_sim_scores : 13.383840
avg_top_3_chunk_text_scores : 0.203145
bm25(chunks) : 0.159914
bm25(title) : 0.191867
max_chunk_sim_scores : 10.067169
max_chunk_text_scores : 0.153392
Intercept : -7.798639
--------------------------------------------------
This seems quite good. With such a small dataset, however, it is easy to overfit. Let us evaluate on the unseen test queries to see how well the model generalizes.
First, we need to add the learned coefficients as inputs to a new rank profile in our schema, so that we can use them in Vespa.
rank-profile learned-linear inherits collect-training-data {
    match-features:
    inputs {
        query(embedding) tensor<int8>(x[96])
        query(float_embedding) tensor<float>(x[768])
        query(intercept) double
        query(avg_top_3_chunk_sim_scores_param) double
        query(avg_top_3_chunk_text_scores_param) double
        query(bm25_chunks_param) double
        query(bm25_title_param) double
        query(max_chunk_sim_scores_param) double
        query(max_chunk_text_scores_param) double
    }
    first-phase {
        expression {
            query(intercept) +
            query(avg_top_3_chunk_sim_scores_param) * avg_top_3_chunk_sim_scores() +
            query(avg_top_3_chunk_text_scores_param) * avg_top_3_chunk_text_scores() +
            query(bm25_title_param) * bm25(title) +
            query(bm25_chunks_param) * bm25(chunks) +
            query(max_chunk_sim_scores_param) * max_chunk_sim_scores() +
            query(max_chunk_text_scores_param) * max_chunk_text_scores()
        }
    }
    summary-features {
        top_3_chunk_sim_scores
    }
}
To allow for changing the parameters without redeploying the application, we will also add the values of the coefficients as query parameters to a new query profile.
<query-profile id="hybrid">
<field name="schema">doc</field>
<field name="ranking.features.query(embedding)">embed(@query)</field>
<field name="ranking.features.query(float_embedding)">embed(@query)</field>
<field name="ranking.features.query(intercept)">-7.798639</field>
<field name="ranking.features.query(avg_top_3_chunk_sim_scores_param)">13.383840</field>
<field name="ranking.features.query(avg_top_3_chunk_text_scores_param)">0.203145</field>
<field name="ranking.features.query(bm25_chunks_param)">0.159914</field>
<field name="ranking.features.query(bm25_title_param)">0.191867</field>
<field name="ranking.features.query(max_chunk_sim_scores_param)">10.067169</field>
<field name="ranking.features.query(max_chunk_text_scores_param)">0.153392</field>
<field name="yql">
select *
from %{schema}
where userInput(@query) or
({label:"title_label", targetHits:100}nearestNeighbor(title_embedding, embedding)) or
({label:"chunks_label", targetHits:100}nearestNeighbor(chunk_embeddings, embedding))
</field>
<field name="hits">10</field>
<field name="ranking.profile">learned-linear</field>
<field name="presentation.summary">top_3_chunks</field>
</query-profile>
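Because the coefficients are plain query features, you can also experiment with different weights at query time without redeploying anything. For example, with pyvespa (the endpoint and overridden value below are arbitrary and only illustrate the mechanism):

from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # endpoint is a placeholder

# Override a single coefficient from the hybrid query profile at query time.
response = app.query(body={
    "queryProfile": "hybrid",
    "query": "how do we monitor model performance?",
    "ranking.features.query(bm25_title_param)": 0.5,  # arbitrary value for illustration
})
print(response.hits[0]["id"], response.hits[0]["relevance"])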
Now we are ready to evaluate our first-phase ranking function, using the VespaEvaluator from the pyvespa library. Run the evaluation script:
python eval/evaluate_ranking.py
We run the evaluation script on a set of unseen test queries, and get the following output:
{
"accuracy@1": 0.0000,
"accuracy@3": 0.0000,
"accuracy@5": 0.0500,
"accuracy@10": 0.3000,
"precision@10": 0.0350,
"recall@10": 0.1341,
"precision@20": 0.0425,
"recall@20": 0.3886,
"mrr@10": 0.0477,
"ndcg@10": 0.0600,
"map@100": 0.0669,
"searchtime_avg": 0.0222,
"searchtime_q50": 0.0165,
"searchtime_q90": 0.0555,
"searchtime_q95": 0.0604
}
For the first-phase ranking, we care most about recall, as we just want to make sure that the candidate documents are ranked high enough to be included in the second-phase ranking (the default number of documents exposed to the second phase is 10 000, but this can be controlled by the rerank-count parameter).
We can see that our recall@20 is 0.39, which is not very good, but an OK start, and a lot better than random. We could later aim to improve on this by approximating a better function after we have learned one for second-phase ranking.
We can also see that our search time is quite fast, with an average of 22ms. You should consider whether this is well within your latency budget, as you want some headroom for second-phase ranking.
The ranking performance is not great, but this is expected for a simple linear model, where it only needs to be good enough to make sure that the most relevant documents are passed to the second-phase ranking, where ranking performance matters a lot more.
For the second-phase ranking, we can afford to use a more expensive ranking expression, since it will only run on the top-k documents from the first-phase ranking (defined by the rerank-count parameter, which defaults to 10,000 documents).
This is where we can significantly improve ranking quality by using more sophisticated models and features that would be too expensive to compute for all matched documents.
For second-phase ranking, we request Vespa's default set of rank features, which includes a comprehensive set of text features. See the rank features documentation for complete details.
We can collect both match features and rank features by running the same script as we did for first-phase ranking, with some additional parameters to collect rank features as well:
python eval/collect_pyvespa.py --collect_rankfeatures --collect_matchfeatures --collector_name rankfeatures-secondphase
This collects approximately 194 features, providing a rich feature set for training more sophisticated ranking models.
With the expanded feature set, we can train a Gradient Boosted Decision Tree (GBDT) model to predict document relevance. We use LightGBM for this purpose.
Vespa also supports XGBoost and ONNX models.
To train the model, run the following command (link to training script):
python eval/train_lightgbm.py --input_file eval/output/Vespa-training-data_match_rank_second_phase_20250623_135819.csv
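In essence, the training and export step looks something like the sketch below (illustrative: the input file name is a placeholder, and the sample app's script adds cross-validation and more careful feature handling, e.g. making feature names match Vespa rank feature names):

import json
import lightgbm as lgb
import pandas as pd

# Illustrative sketch; the input file name is a placeholder for the collected feature CSV.
df = pd.read_csv("eval/output/training_data_with_rankfeatures.csv")

y = df["relevance_label"]
X = df.drop(columns=["query_id", "doc_id", "relevance_label", "relevance_score"])

train_set = lgb.Dataset(X, label=y)
params = {"objective": "binary", "learning_rate": 0.05, "num_leaves": 15}
booster = lgb.train(params, train_set, num_boost_round=100)

# Vespa reads LightGBM models in their JSON dump format; place the file under app/models/.
with open("app/models/lightgbm_model.json", "w") as f:
    json.dump(booster.dump_model(), f)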
The training process includes several important considerations; see the training script for details.
Example training output:
------------------------------------------------------------
Cross-Validation Results (5-Fold)
------------------------------------------------------------
Metric | Mean | Std Dev
------------------------------------------------------------
Accuracy | 0.9214 | 0.0664
ROC AUC | 0.9863 | 0.0197
------------------------------------------------------------
Overall CV AUC: 0.9249 • ACC: 0.9216
------------------------------------------------------------
The trained model reveals which features are most important for ranking quality. For our sample application, the top features include:
Feature | Importance |
---|---|
nativeProximity | 168.8498 |
firstPhase | 151.7382 |
max_chunk_sim_scores | 69.4377 |
avg_top_3_chunk_text_scores | 56.5079 |
avg_top_3_chunk_sim_scores | 31.8700 |
nativeRank | 20.0716 |
nativeFieldMatch | 15.9914 |
elementSimilarity(chunks) | 9.7003 |
Key observations:
- The first-phase score (firstPhase) being important validates that our first-phase ranking provides a good foundation.

The trained LightGBM model is exported and added to your Vespa application package:
app/
├── models/
│ └── lightgbm_model.json
Create a new rank profile that uses this model:
rank-profile second-with-gbdt inherits collect-training-data {
...
second-phase {
expression: lightgbm("lightgbm_model.json")
}
...
}
And redeploy your application:
vespa deploy
Run the evaluate_ranking.py script to evaluate the GBDT-powered second-phase ranking on unseen test queries:
python evaluate_ranking.py --second_phase
Expected results show significant improvement over first-phase ranking:
{
"accuracy@1": 0.9000,
"accuracy@3": 0.9500,
"accuracy@5": 1.0000,
"accuracy@10": 1.0000,
"precision@10": 0.2350,
"recall@10": 0.9402,
"precision@20": 0.1275,
"recall@20": 0.9909,
"mrr@10": 0.9375,
"ndcg@10": 0.8586,
"map@100": 0.7780,
"searchtime_avg": 0.0328,
"searchtime_q50": 0.0305,
"searchtime_q90": 0.0483,
"searchtime_q95": 0.0606
}
Let us compare some selected metrics against the first-phase ranking results:
Metric | First-phase | Second-phase | Change |
---|---|---|---|
recall@10 | 0.1341 | 0.9402 | +0.8061 |
recall@20 | 0.3886 | 0.9909 | +0.6023 |
ndcg@10 | 0.0600 | 0.8586 | +0.7986 |
searchtime_avg | 0.0222 | 0.0328 | +0.0106 |
This represents a dramatic improvement over first-phase ranking. The slight increase in search time (from 22ms to 33ms on average) is well worth the quality improvement.
Create new query profiles that leverage the improved ranking:
<query-profile id="hybrid-with-gbdt" inherits="hybrid">
<field name="ranking.profile">second-with-gbdt</field>
<field name="hits">20</field>
</query-profile>
<query-profile id="rag-with-gbdt" inherits="hybrid-with-gbdt">
<field name="hits">50</field>
<field name="searchChain">openai</field>
<field name="presentation.format">sse</field>
</query-profile>
Test the improved ranking:
vespa query query="what are key points learned for finetuning llms?" queryProfile=hybrid-with-gbdt
For RAG applications with LLM generation:
vespa query \
--timeout 60 \
query="what are key points learned for finetuning llms?" \
queryProfile=rag-with-gbdt
Some areas to keep an eye on as you iterate:
- Model complexity considerations, e.g., expensive text features such as nativeProximity and fieldMatch.
- Feature engineering.
- Training data quality.
- Performance monitoring, e.g., tuning rerank-count based on quality vs. performance trade-offs.

The second-phase ranking represents a crucial step in building high-quality RAG applications, providing the precision needed for effective LLM context while maintaining reasonable query latencies.
We also have the option of configuring global-phase ranking, which can rerank the top k documents (as set by the rerank-count parameter) from the second-phase ranking.
Common options for the global phase are cross-encoders or another GBDT model trained with listwise objectives such as LambdaMART, to better separate the top-ranked documents. For RAG applications, we consider this less important than for search applications where the results are mainly consumed by a human, as LLMs don't care that much about the ordering of the results.
Finally, we will sketch out some opportunities for further improvements. As you have seen, we started out with only binary relevance labels for a few queries, and trained a model based on the relevant docs and a set of random documents.
This was useful initially, as we had no better way to retrieve the candidate documents. Now that we have a reasonably good second-phase ranking, we could potentially generate a new set of relevance labels for queries we did not have labels for, by having an LLM do relevance judgments of the top k returned hits. This training dataset would likely be even better at separating the top documents.
As you may have noted, we have not discussed what most people think about when discussing RAG evals: evaluating the "Generation" step. There are several tools available to do this, for example ragas and ARES. We refer to other sources for details, as this tutorial is probably enough to digest as it is.
In this tutorial, we have built a complete RAG application using Vespa, with our recommendations for how to approach the retrieval phase (binary vectors and text matching), first-phase ranking (a learned linear combination of relatively cheap features), and second-phase ranking (more expensive features and a GBDT model).
We hope that this tutorial, along with the provided code in our sample-apps repository, will serve as a useful reference for building your own RAG applications, with an evaluation-driven approach.
By using the principles demonstrated in this tutorial, you are empowered to build high-quality RAG applications that can scale to any dataset size, and any query load.
Q: Which embedding models can I use with Vespa? A: Vespa supports a variety of embedding models. For a list of Vespa-provided models on Vespa Cloud, see the Model hub. See also the embedding reference for how to use embedders. You can also use private models (gated by authentication with a Bearer token from the Vespa Cloud secret store).
Q: Do I need to use an LLM with Vespa? A: No, you are free to use Vespa as a search engine only. We provide the option of calling out to LLMs from within a Vespa application, for reduced latency compared to sending large result sets over the network several times, as well as the option to deploy local LLMs, optionally in your own infrastructure if you prefer. See Vespa Cloud Enclave.
Q: Why do we use binary vectors for the document embeddings? A: Binary vectors take up a lot less memory and are faster to compute distances on, with only a slight reduction in quality. See the blog post for details.
Q: How can you say that Vespa can scale to any data and query load? A: Vespa can scale both the stateless container nodes and content nodes of your application. See overview and elasticity for details.