A common technique is to map unstructured data - say, text or images -
to points in an abstract vector space and then do the computation in that space.
For example, retrieve
similar data by finding nearby points in the vector space,
or use the vectors as input to a neural net.
This mapping is referred to as embedding.
Read more about embedding and embedding management in this
blog post.
Embedding vectors can be sent to Vespa in queries and writes:
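For example, a query request can pass a query vector directly as an input tensor, and a write can supply the vector as field data. A sketch, using a toy 4-dimensional vector (the field and tensor names embedding and q are assumptions; a real field uses the dimensionality declared in the schema):

```json
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding, q)",
    "input.query(q)": "[0.1, 0.2, 0.3, 0.4]"
}
```

In a document write, the same vector would appear as a field value, e.g. "embedding": [0.1, 0.2, 0.3, 0.4] in the fields object.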
Alternatively, you can use the embed function to generate the embeddings inside Vespa
to reduce vector transfer costs and make clients simpler:
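A sketch of calling embed in a query (embedderId and the tensor names are placeholders; with only a single configured embedder, the id can be omitted):

```json
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding, q)",
    "input.query(q)": "embed(embedderId, 'my text to embed')"
}
```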
Adding embeddings to schemas will change the characteristics of an application:
memory usage will grow, and feeding latency might increase.
Read more on how to address this in binarizing vectors.
Configuring embedders
Embedders are components which must be configured in your
services.xml. Components are shared and can be used across schemas.
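A minimal sketch of an embedder component in services.xml (the id, type, and file paths are placeholders; see the embedder reference for the exact configuration of each embedder type):

```xml
<container version="1.0">
    <component id="embedderId" type="hugging-face-embedder">
        <transformer-model path="models/model.onnx"/>
        <tokenizer-model path="models/tokenizer.json"/>
    </component>
</container>
```

The component id (embedderId here) is what the embed expressions in schemas and queries refer to.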
Both single and double quotes are permitted, and if you have only configured a single embedder,
you can skip the embedder id argument and the quotes.
The text argument can be supplied by a referenced parameter instead, using the @parameter syntax:
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding_field, query_embedding)",
    "text": "my text to embed",
    "input.query(query_embedding)": "embed(@text)"
}
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
        }
    }
    field embeddings type tensor<bfloat16>(x[384]) {
        indexing {
            input title | embed embedderId | attribute | index
        }
    }
}
Notice that the embedding field is defined outside the document clause in the schema.
If you have only configured a single embedder, you can skip the embedder id argument.
The input field can also be an array, where the output becomes a rank-two tensor; see
this blog post:
schema doc {
    document doc {
        field chunks type array<string> {
            indexing: index | summary
        }
    }
    field embeddings type tensor<bfloat16>(p{},x[5]) {
        indexing: input chunks | embed embedderId | attribute | index
    }
}
Provided embedders
Vespa provides several embedders as part of the platform.
Huggingface Embedder
An embedder using any Huggingface tokenizer,
including multilingual tokenizers,
to produce tokens which are then input to a supplied transformer model in ONNX model format:
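A sketch of a hugging-face-embedder configuration (the id and paths are placeholders; pooling-strategy and normalize are optional and model-dependent):

```xml
<component id="hf-embedder" type="hugging-face-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
    <pooling-strategy>mean</pooling-strategy>
    <normalize>true</normalize>
</component>
```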
The transformer-model specifies the embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to be compatible with Vespa's hugging-face-embedder.
See Limitations on Model Size and Complexity
for details on the ONNX model format supported by Vespa.
Use path to supply the model files from the application package,
url to supply them from a remote server, or
model-id to use a
model supplied by Vespa Cloud.
You can also use a model hosted in a private Huggingface Model Hub by adding your Huggingface API token
to the secret store and referring to the secret
using secret-ref in the model tag.
See model config reference for more details.
See the reference
for all configuration parameters.
Huggingface embedder models
The following are examples of text embedding models that can be used with the hugging-face-embedder
and their output tensor dimensionality.
The resulting tensor type can be float,
bfloat16 or using binarized quantization into int8.
See blog post Combining matryoshka with binary-quantization
for more examples of using the Huggingface embedder with binary quantization.
The following models use pooling-strategy mean,
which is the default pooling-strategy:
mxbai-embed-large-v1 produces tensor<float>(x[1024]). This model
is also useful for binarization, which can be triggered by using destination tensor<int8>(x[128]).
Use pooling-strategy cls and normalize true.
nomic-embed-text-v1.5 produces tensor<float>(x[768]). This model
is also useful for binarization, which can be triggered by using destination tensor<int8>(x[96]). Use normalize true.
Snowflake arctic model series:
snowflake-arctic-embed-xs produces tensor<float>(x[384]).
Use pooling-strategy cls and normalize true.
snowflake-arctic-embed-m produces tensor<float>(x[768]).
Use pooling-strategy cls and normalize true.
Bert embedder
The transformer-model specifies the embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to a compatible ONNX format.
The tokenizer-vocab specifies the Huggingface vocab.txt file, with one valid token per line.
Note that the Bert embedder does not support the tokenizer.json formatted tokenizer configuration files.
This means that tokenization settings like max tokens should be set explicitly.
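A sketch of a bert-embedder configuration (the id, paths, and max-tokens value are placeholders):

```xml
<component id="bert" type="bert-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
    <max-tokens>128</max-tokens>
</component>
```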
The transformer-output specifies the name given
to the embedding output in the model.onnx file;
this will differ depending on how the model is exported to
ONNX format. One common name is last_hidden_state,
especially in transformer-based models. Other common names are
output or
output_0,
embedding or
embeddings,
sentence_embedding,
pooled_output,
or
encoder_last_hidden_state.
The default is output_0.
The Bert embedder is limited to English (WordPiece) and
BERT-styled transformer models with three model inputs
(input_ids, attention_mask, token_type_ids).
Prefer using the Huggingface Embedder instead of the Bert embedder.
ColBERT embedder
An embedder supporting ColBERT models. The
ColBERT embedder maps text to token embeddings, representing a text as multiple
contextualized embeddings. This produces better quality than reducing all tokens into a single vector.
The transformer-model specifies the ColBERT embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to a compatible ONNX format.
The vespa-engine/col-minilm page on the HF
model hub has a detailed example of how to export a colbert checkpoint to ONNX format for accelerated inference.
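A sketch of a colbert-embedder configuration (the id, paths, and token limits are placeholders):

```xml
<component id="colbert" type="colbert-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
    <max-query-tokens>32</max-query-tokens>
    <max-document-tokens>256</max-document-tokens>
</component>
```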
The max-query-tokens controls the maximum number of query text tokens that are represented as vectors,
and similarly, max-document-tokens controls the document side. These parameters
can be used to control resource usage.
The ColBERT token embeddings are represented as a
mixed tensor: tensor<float>(token{}, x[dim]) where
dim is the vector dimensionality of the contextualized token embeddings.
The colbert model checkpoint on Hugging Face hub
uses 128 dimensions.
The embedder destination tensor is defined in the schema, and
depending on the target tensor cell precision definition
the embedder can compress the representation:
If the target tensor cell type is int8, the ColBERT embedder compresses the token embeddings with binarization for
the document to reduce storage to 1-bit per value, reducing the token embedding storage footprint
by 32x compared to using float. The query representation is not compressed with binarization.
The following demonstrates two ways to use the ColBERT embedder in
the document schema to embed a document field.
schema doc {
    document doc {
        field text type string {..}
    }
    field colbert_tokens type tensor<float>(token{}, x[128]) {
        indexing: input text | embed colbert | attribute
    }
    field colbert_tokens_compressed type tensor<int8>(token{}, x[16]) {
        indexing: input text | embed colbert | attribute
    }
}
The first field, colbert_tokens, stores the original representation, as the tensor destination
cell type is float. The second field, colbert_tokens_compressed, holds the compressed representation.
When using int8 tensor cell precision,
one should divide the original vector size by 8 (128/8 = 16).
You can also use bfloat16 instead of float to reduce storage by 2x compared to float.
field colbert_tokens type tensor<bfloat16>(token{}, x[128]) {
    indexing: input text | embed colbert | attribute
}
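The 32x compression from binarization can be sketched as sign-thresholding plus bit-packing: each value contributes one bit (its sign), and 8 bits pack into one byte. An illustration of the idea in plain Python (toy values; not Vespa's internal implementation):

```python
# 128 float-like values -> 16 int8 values, one bit (the sign) per value.
embedding = [((i * 37) % 19) - 9 for i in range(128)]   # toy values
bits = [1 if v > 0 else 0 for v in embedding]           # sign-based binarization

packed = []
for i in range(0, len(bits), 8):
    byte = 0
    for b in bits[i:i + 8]:
        byte = (byte << 1) | b
    # reinterpret the byte as a signed int8, matching the int8 cell type
    packed.append(byte - 256 if byte > 127 else byte)

print(len(packed))  # 16
```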
You can also use the ColBERT embedder with an array of strings (representing chunks):
schema doc {
    document doc {
        field chunks type array<string> {..}
    }
    field colbert_tokens_compressed type tensor<int8>(chunk{}, token{}, x[16]) {
        indexing: input chunks | embed colbert chunk | attribute
    }
}
Here, we need a second mapped dimension in the target tensor and a second argument to embed,
telling the ColBERT embedder the name of the tensor dimension to use for the chunks.
Notice that the examples above did not specify the index function for creating a
HNSW index.
The ColBERT representation is intended to be used as a ranking model
and not for retrieval with Vespa's nearestNeighbor query operator;
for retrieval, you can, e.g., use a document-level vector and/or lexical matching.
See the sample applications for using ColBERT in ranking with variants of the MaxSim similarity operator
expressed using Vespa tensor computation expressions. See:
colbert and
colbert-long.
SPLADE embedder
An embedder supporting SPLADE models. The
SPLADE embedder maps text to a mapped tensor, representing a text as a sparse vector of unique tokens and their weights.
The transformer-model specifies the SPLADE embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to a compatible ONNX format.
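A sketch of a splade-embedder configuration (the id and paths are placeholders):

```xml
<component id="splade" type="splade-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
</component>
```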
The splade token weights are represented as a
mapped tensor: tensor<float>(token{}).
The embedder destination tensor is defined in the schema.
The following demonstrates how to use the SPLADE embedder in the document schema to
embed a document field.
schema doc {
    document doc {
        field text type string {..}
    }
    field splade_tokens type tensor<float>(token{}) {
        indexing: input text | embed splade | attribute
    }
}
You can also use the SPLADE embedder with an array of strings (representing chunks). Here, also
using lower tensor cell precision bfloat16:
schema doc {
    document doc {
        field chunks type array<string> {..}
    }
    field splade_tokens type tensor<bfloat16>(chunk{}, token{}) {
        indexing: input chunks | embed splade chunk | attribute
    }
}
Here, we need a second mapped dimension in the target tensor and a second argument to embed,
telling the splade embedder the name of the tensor dimension to use for the chunks.
See the splade sample application for how to use SPLADE in ranking,
including how to use the SPLADE embedder with an array of strings (representing chunks).
Vespa Cloud: This content is applicable to Vespa Cloud deployments.
VoyageAI Embedder
An embedder that uses the VoyageAI embedding API
to generate high-quality embeddings for semantic search. This embedder calls the VoyageAI API service
and does not require local model files or ONNX inference. All embeddings returned by VoyageAI are normalized
to unit length, making them suitable for cosine similarity and
prenormalized-angular distance metrics
(see VoyageAI FAQ).
To use contextualized chunk embeddings,
configure the VoyageAI embedder with a voyage-context-* model and use it to embed an
array<string> field containing your document chunks:
schema doc {
    document doc {
        field chunks type array<string> {
            indexing: index | summary
        }
    }
    field embeddings type tensor<float>(chunk{}, x[1024]) {
        indexing: input chunks | embed voyage | attribute | index
        attribute {
            distance-metric: prenormalized-angular
        }
    }
}
When embedding array fields with a contextualized chunk embedding model, Vespa sends all chunks from a document in a single API request,
allowing Voyage to encode each chunk with context from the other chunks.
Be aware that the combined size of all chunks in a document must fit within the VoyageAI API's input token limit.
See Working with chunks for chunking strategies.
Input type detection
VoyageAI models distinguish between query and document embeddings for improved retrieval quality.
The embedder automatically detects the context and sets the appropriate input type based on whether
the embedding is performed during feed (indexing) or query processing in Vespa.
For advanced use cases where you need to control the input type programmatically,
you can use the destination property of the
Embedder.Context
when calling the embedder from Java code.
Best practices
For production deployments, we recommend configuring separate embedder components for feed and search operations.
This architectural pattern provides two key benefits: cost optimization and rate limit isolation.
In Vespa Cloud, it's best practice to configure these embedders in separate container clusters for feed and search.
The Voyage 4 model family features a shared embedding space
across different model sizes. This enables a cost-effective strategy where you can use a more powerful (and expensive) model
for document embeddings, while using a smaller, cheaper model for query embeddings.
Since document embedding happens once during indexing but query embedding occurs on every search request,
this approach can significantly reduce operational costs while maintaining quality.
The voyage-4-nano
model is available as an ONNX model for use with the
Hugging Face embedder.
Since it shares the same embedding space as the larger Voyage 4 models,
it can be used for query embeddings with local inference, trading some accuracy for lower cost
by eliminating API usage for queries entirely.
Rate limit isolation
Separating feed and search operations is particularly important for managing VoyageAI API rate limits.
Bursty document feeding operations can consume significant API quota, potentially causing rate limit errors
that affect search queries. By using separate API keys for feed and search embedders,
you ensure that feeding bursts don't negatively impact search.
Thread pool tuning
When using the VoyageAI embedder, container feed throughput is primarily limited by VoyageAI API latency
combined with the document processing thread pool size, not by CPU. Each document being fed blocks a thread
while waiting for the VoyageAI API response. To improve throughput, you likely have to increase the
document processing thread pool size,
assuming the content cluster is not the bottleneck.
For example, consider a container cluster with 2 nodes, each with 8 vCPUs. With the default document processing
thread pool size of 1 thread per vCPU, you have 16 total threads. If the average VoyageAI API latency is 200ms,
the maximum throughput is approximately 16 / 0.2 = 80 documents/second.
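The arithmetic above can be written out as a quick sanity check (numbers from the example; real throughput is also capped by the API rate limit):

```python
# Estimate the feed-throughput ceiling when each document blocks a
# document processing thread while waiting for the embedding API.
nodes = 2                    # container nodes
vcpus_per_node = 8
threads_per_vcpu = 1         # default document processing pool size
api_latency_s = 0.2          # average VoyageAI API latency

total_threads = nodes * vcpus_per_node * threads_per_vcpu
max_docs_per_second = total_threads / api_latency_s
print(max_docs_per_second)   # 80.0
```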
See container tuning for more details.
Note that the effective throughput can never exceed the rate limit of your VoyageAI API key.
Use the embedder metrics
to determine embedder latency and throughput.
For additional throughput improvements, consider enabling dynamic batching.
Dynamic batching
Dynamic batching combines multiple concurrent embedding requests into a single VoyageAI API call.
This is useful when throughput is constrained by VoyageAI's
RPM (requests per minute) limit
rather than the TPM (tokens per minute) limit.
Batching reduces RPM usage by combining requests; TPM usage is unaffected.
The max-size attribute sets the maximum number of requests in a single batch,
and max-delay sets the maximum time to wait for a full batch before sending a partial one.
Batching is disabled by default.
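A sketch of enabling batching. The max-size and max-delay attribute names come from the description above, but the enclosing element name and value syntax here are assumptions; check the embedder reference for the exact configuration:

```xml
<!-- inside the embedder component; element name and values are assumptions -->
<batching max-size="16" max-delay="10ms"/>
```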
Embedder performance
Embedding inference can be resource-intensive for larger embedding models. Factors that impact performance:
The embedding model parameters. Larger models are more expensive to evaluate than smaller models.
The sequence input length. Transformer models scale quadratically with input length. Since queries
are typically shorter than documents, embedding queries is less computationally intensive than embedding documents.
The number of inputs to the embed call. When encoding arrays, consider how many inputs a single document can have.
For CPU inference, increasing feed timeout settings
might be required when documents have many embed inputs.
Using GPU, especially for longer sequence lengths (documents),
can dramatically improve performance and reduce cost.
See the blog post on GPU-accelerated ML inference in Vespa Cloud.
With GPU-accelerated instances, using fp16 models instead of fp32 can increase throughput by as much as 3x.
Vespa's built-in embedders emit metrics for computation time and token sequence length.
These metrics are prefixed with embedder.
and listed in the Container Metrics reference documentation.
Third-party embedder implementations may inject the ai.vespa.embedding.Embedder.Runtime component to easily
emit the same predefined metrics, although emitting custom metrics is perfectly fine.
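Embedders can be configured to prepend instruction text: the hugging-face-embedder supports a prepend element with query and document children. A sketch (the id, URLs, and instruction string are placeholders):

```xml
<component id="mxbai" type="hugging-face-embedder">
    <transformer-model url="https://example.com/model.onnx"/>
    <tokenizer-model url="https://example.com/tokenizer.json"/>
    <prepend>
        <query>Represent this sentence for searching relevant passages:</query>
    </prepend>
</component>
```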
A prepend configuration adds instruction text to queries and field data before embedding.
Find a complete example in the ColBERT
sample application.
An alternative approach is using query profiles to prepend query data.
If you need to add a standard wrapper or a prefix instruction around the input text you want to embed,
use parameter substitution to supply the text, as in embed(myEmbedderId, @text),
and let the parameter (text here) be defined in a query profile,
which in turn uses value substitution
to insert another value supplied in the query request. The following is a concrete example
where queries should have a prefix instruction before being embedded into a vector representation.
Define a text input field in search/query-profiles/default.xml:
<query-profile id="default">
    <field name="text">Represent this sentence for searching relevant passages: %{user_query}</field>
</query-profile>
Then, at query request time, we can pass user_query as a request parameter. This parameter is used to produce
the text value, which is then embedded.
{
    "yql": "select * from doc where userQuery() or ({targetHits: 100}nearestNeighbor(embedding, e))",
    "input.query(e)": "embed(mxbai, @text)",
    "user_query": "space contains many suns"
}
The text that is embedded by the embedder is then:
Represent this sentence for searching relevant passages: space contains many suns.
Concatenating input fields
You can concatenate values in indexing using ".", and handle missing field values using
choice
to produce a single input for an embedder:
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
        }
        field body type string {
            indexing: summary | index
        }
    }
    field embeddings type tensor<bfloat16>(x[384]) {
        indexing {
            (input title || "") . " " . (input body || "") | embed embedderId | attribute | index
        }
        index: hnsw
    }
}
You can also use concatenation to add a fixed preamble to the string to embed.
Combining with for_each
The indexing expression can also use for_each and include other document fields.
For example, the E5 family of embedding models uses instructions along with the input. The following
expression prefixes the input with passage: followed by a concatenation of the title and a text chunk.
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
        }
        field chunks type array<string> {
            indexing: index | summary
        }
    }
    field embedding type tensor<bfloat16>(p{}, x[384]) {
        indexing {
            input chunks |
            for_each {
                "passage: " . (input title || "") . " " . ( _ || "")
            } | embed e5 | attribute | index
        }
        attribute {
            distance-metric: prenormalized-angular
        }
    }
}
Troubleshooting
This section covers common issues and how to resolve them.
Model download failure
If models fail to download, the Vespa stateless container service will not start, failing with
RuntimeException: Not able to create config builder for payload -
see example.
Check the Vespa log for more details.
The most common reasons for download failure are network issues or incorrect URLs.
This will also be visible in the Vespa status output as the container will not listen to its port:
vespa status -t http://127.0.0.1:8080
Container at http://127.0.0.1:8080 is not ready: unhealthy container at http://127.0.0.1:8080/status.html: Get "http://127.0.0.1:8080/status.html": EOF
Error: services not ready: http://127.0.0.1:8080
Tensor shape mismatch
The native embedder implementations expect that the output tensor has a specific shape.
If the shape is incorrect, you will see an error message during feeding like:
feed: got status 500 ({"pathId":"..","..","message":"[UNKNOWN(252001) @ tcp/vespa-container:19101/chain.indexing]:
Processing failed. Error message: java.lang.IllegalArgumentException: Expected 3 output dimensions for output name 'sentence_embedding': [batch, sequence, embedding], got 2 -- See Vespa log for details. "}) for put xx:not retryable
This means that the exported ONNX model output tensor does not have the expected shape. For example, the above is
logged by the hf-embedder that expects the output shape to be [batch, sequence, embedding] (A 3D tensor). This is because the embedder
implementation performs the pooling-strategy over the sequence dimension to produce a single embedding vector.
The batch size is always 1 for Vespa embeddings.
See onnx export for how to export models to ONNX format with the correct output shapes and
onnx debug for debugging input and output names.
Input names
The native embedder implementations expect the ONNX model to accept certain input names.
If the names are incorrect, the Vespa container service will not start,
and you will see an error message in the Vespa log like:
WARNING container Container.com.yahoo.container.di.Container
Caused by: java.lang.IllegalArgumentException: Model does not contain required input: 'input_ids'. Model contains: my_input
This means that the ONNX model accepts "my_input", while our configuration attempted to use "input_ids". The default
input names for the hf-embedder are "input_ids", "attention_mask" and "token_type_ids". These are overridable
in the configuration (reference). Some embedding models do not
use the "token_type_ids" input. We can specify this in the configuration by setting transformer-token-type-ids to empty,
illustrated by the following example.
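A sketch (the id and paths are placeholders; the empty transformer-token-type-ids element disables that input):

```xml
<component id="hf-embedder" type="hugging-face-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
    <transformer-token-type-ids/>
</component>
```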
Output names
The native embedder implementations expect the ONNX model to produce certain output names.
If the names are incorrect, the Vespa stateless container service will not start,
and you will see an error message in the Vespa log like:
Model does not contain required output: 'test'. Model contains: last_hidden_state
This means that the ONNX model produces "last_hidden_state", while our configuration attempted to use "test". The default
output name for the hf-embedder is "last_hidden_state". This is overridable
in the configuration. See reference.
EOF
If vespa status shows that the container is healthy, but you observe an EOF error during feeding, the stateless
container service has crashed and stopped listening on its port. This can be related to the embedder ONNX model size,
Docker container memory resource constraints, or the configured JVM heap size of the Vespa stateless container service.
vespa feed ext/1.json
feed: got error "Post "http://127.0.0.1:8080/document/v1/doc/doc/docid/1": unexpected EOF" (no body) for put id:doc:doc::1: giving up after 10 attempts
This could be related to insufficient stateless container (JVM) memory.
Check the container logs for OOM errors. See jvm-tuning for JVM tuning options (The default heap size is 1.5GB).
Container crashes could also be caused by too little memory allocated to the Docker or Podman container, which can cause the Linux kernel to kill processes to free memory.
See the Docker container memory documentation.