A key technique in RAG applications, and vector search applications in general, is to split longer text into
chunks. This lets you:
- Generate a vector embedding for each chunk rather than for the entire document text, to capture the semantic information of the text at a meaningful level.
- Select specific chunks to add to the context window in GenAI applications, rather than the entire document content.
Vespa contains functionality for storing and embedding chunks, searching chunks, ranking with chunks, and selecting which chunks to return. Each is covered in a section below.
To add embeddings to your documents, use a tensor field:
schema myDocumentType {
    document myDocumentType {
        field myEmbedding type tensor<float>(x[384]) {
            indexing: attribute | index
        }
    }
}
This lets you add a single embedding to each document, but usually you want to have many.
In Vespa you can do that by adding mapped dimensions
to your tensor:
schema myDocumentType {
    document myDocumentType {
        field myEmbeddings type tensor<float>(chunk{}, x[384]) {
            indexing: attribute | index
        }
    }
}
With this you can feed tensors in JSON format
as part of your writes, e.g. writing an embedding tensor with chunks numbered 1 and 2:
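A sketch of such a write (the namespace and document id are hypothetical, and the 384-value arrays are truncated for brevity), using the blocks form available for tensors with one mapped and one indexed dimension:

{
    "put": "id:mynamespace:myDocumentType::doc1",
    "fields": {
        "myEmbeddings": {
            "blocks": {
                "1": [0.21, 0.38, 0.14, ...],
                "2": [0.17, 0.95, 0.32, ...]
            }
        }
    }
}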
You can of course combine this with chunking to have a single text field
chunked and embedded automatically:
schema myDocumentType {
    document myDocumentType {
        field myText type string {
        }
    }
    field myChunks type array<string> {
        indexing: input myText | chunk sentence | summary | index
    }
    field myEmbeddings type tensor<float>(chunk{}, x[384]) {
        indexing: input myText | chunk sentence | embed | attribute | index
    }
}
Some things to note:
- All fields of Vespa documents are stored, and here we represent the text both as a single field and as chunks of text. Won't that consume a lot of unnecessary space? No: thanks to modern compression, the overhead from this can be ignored.
- Why return the chunk array in results and not the full text field? Because for large texts we need to select a subset of the chunks rather than returning the full text.
- We are chunking twice here; won't this be inefficient? No, Vespa will reuse the result of the first invocation in cases like this.
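To make the write path concrete: with this schema you only feed the source text, and the chunk and embedding fields are derived automatically at feed time. A minimal sketch of such a write, assuming a hypothetical namespace, document id, and sample text:

{
    "put": "id:mynamespace:myDocumentType::doc1",
    "fields": {
        "myText": "Statins are widely prescribed drugs. Their association with cancer risk has been studied in several large cohorts."
    }
}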
Searching chunks
You can search in chunk text (if you added index), and in chunk embeddings (if you created embeddings).
Usually, you want to do both (hybrid search)
since text search gives you precise matches, and embedding nearest neighbor search gives you imprecise semantic matching.
A simple hybrid query can look like this:
yql=select * from doc where userInput(@query) or ({targetHits:10}nearestNeighbor(myEmbeddings, e))
input.query(e)=embed(@query)
query=Do Cholesterol Statin Drugs Cause Breast Cancer?
The embed function shown here can be used to embed the query text using the same model(s) as used for the chunks. If you embed outside Vespa, you can pass the tensor value instead. See the nearest neighbor guide for more.
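The same query can also be sent as a JSON POST body to the query API; a sketch using the field and parameter names from the example above:

{
    "yql": "select * from doc where userInput(@query) or ({targetHits:10}nearestNeighbor(myEmbeddings, e))",
    "input.query(e)": "embed(@query)",
    "query": "Do Cholesterol Statin Drugs Cause Breast Cancer?"
}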
Text matching works across chunks as if the chunks were re-joined into one text field. However, a proximity gap is inserted between each chunk, so that tokens in different chunks are by default very (infinitely) far apart when evaluating phrase and near matches (however, see the ranking section below on configuring this).
Nearest neighbor search with many chunks will retrieve the documents where any single chunk embedding
is close to the query embedding.
Ranking with chunks
Ranking in Vespa is done by mathematical expressions
(hand-written or machine-learned) combining rank features. You'll typically want to use features that capture
both how well vector embeddings and textual query terms matched the chunks.
For vector search, the closeness(dimension,field) feature will contain the closeness (a function of the distance) between the query vector and the closest chunk embedding. In addition, the closest(field) feature will return a tensor providing the label(s) of the chunk which was closest.
For text matching, all features are available as if the entire chunk array were a single string field, but with an infinitely large proximity gap between each element, to treat each element as independent. When the array elements are chunks of the same text, you'd prefer to get a relevance contribution from matching adjacent elements, since that means you are matching adjacent words in the source text. To achieve this, configure the elementGap of your chunk array to a low value (e.g. 0 to 3, depending on how well your chunking strategy identifies semantic transitions).
Using vector closeness and the normal text match features will help you rank documents mostly by their single best-matching chunk. Sometimes it is also useful to capture how well the text as a whole matches the query. For vectors, you can do this by computing and aggregating the closeness
to each vector using a tensor expression
in your ranking expression, while for text matching you can use the elementSimilarity(field) feature,
or the elementwise(bm25(field),dimension,cell_type)
feature which returns a tensor containing the bm25 score of each chunk.
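As a sketch of this whole-text approach, using the myEmbeddings and myChunks fields assumed earlier: per-chunk scores can be summed rather than only taking the best match. The profile name and the equal weighting of the two signals are arbitrary choices for illustration:

rank-profile aggregate {
    inputs {
        query(embedding) tensor<float>(x[384])
    }
    # per-chunk closeness derived from the euclidean distance to the query embedding
    function chunk_closeness() {
        expression: 1 / (1 + euclidean_distance(query(embedding), attribute(myEmbeddings), x))
    }
    first-phase {
        # aggregate semantic and text scores over all chunks instead of only the best one
        expression: sum(chunk_closeness) + sum(elementwise(bm25(myChunks), chunk, float))
    }
}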
Layered ranking: Selecting chunks to return
A search result will contain the top-ranked documents with all the fields you are requesting or have configured as summary fields, including all chunks of those documents, whether relevant or not. This is fine when every document has few chunks, but when documents can have many, there are two problems:
- Putting many irrelevant chunks into the context window of the LLM decreases quality, or may make the context window infeasibly large.
- Sending many chunks over the network increases latency and can impact other queries running at the same time.
To solve both of these, we can use
layered ranking:
Rank the chunks in the highest ranked documents, and select only the best ones.
To do this, specify the ranking function that will select the chunks to return,
using select-elements-by.
Here's a full example:
schema docs {
    document docs {
        field myEmbeddings type tensor<float>(chunk{}, x[386]) {
            indexing: attribute
        }
        field myChunks type array<string> {
            indexing: index | summary
            summary {
                select-elements-by: best_chunks
            }
        }
    }
    rank-profile default {
        inputs {
            query(embedding) tensor<float>(x[386])
        }
        # distance from the query embedding to each chunk embedding
        function my_distance() {
            expression: euclidean_distance(query(embedding), attribute(myEmbeddings), x)
        }
        # convert per-chunk distances to scores in (0, 1]
        function my_distance_scores() {
            expression: 1 / (1 + my_distance)
        }
        # per-chunk text scores as a tensor over the chunk dimension
        function my_text_scores() {
            expression: elementwise(bm25(myChunks), chunk, float)
        }
        # combine the vector and text score of each chunk
        function chunk_scores() {
            expression: merge(my_distance_scores, my_text_scores, f(a,b)(a+b))
        }
        # the three highest-scoring chunks, referenced by select-elements-by above
        function best_chunks() {
            expression: top(3, chunk_scores)
        }
        first-phase {
            expression: sum(chunk_scores())
        }
        summary-features {
            best_chunks
        }
    }
}
With this, we can use the powerful ranking framework in Vespa to select the best chunks to provide to the LLM, without sending unused chunks over the network.
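A sketch of a query against this schema, assuming an embedder component is configured so that embed(@query) can produce the query tensor (since myEmbeddings has no HNSW index here, the nearestNeighbor search will be exact):

{
    "yql": "select * from docs where userInput(@query) or ({targetHits:10}nearestNeighbor(myEmbeddings, embedding))",
    "input.query(embedding)": "embed(@query)",
    "query": "Do Cholesterol Statin Drugs Cause Breast Cancer?"
}

Each hit will then contain only the chunks selected by best_chunks in its myChunks summary field, plus the best_chunks tensor in its summary features.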