A common technique in modern big data serving applications is to map unstructured data - say, text or images -
to points in an abstract vector space and then do computation in that vector space. For example, you can retrieve
similar data by finding nearby points in the vector space,
or use the vectors as input to a neural net.
This mapping is usually referred to as embedding.
Read more about embedding and embedding management in this
blog post.
Without using Vespa's native embedding support, the embedding vectors must be provided by the client, as in the example below.
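For example, a query request where the client supplies the vector values directly could look like the following sketch. The field name embedding and the query tensor name query_embedding match the later examples; a real request would carry as many values as the embedding field's dimensionality (for example 384), not just the four shown here:
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,query_embedding)",
    "input.query(query_embedding)": "[0.12, -0.03, 0.25, 0.08]"
}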
By using the embedding features in Vespa, developers can instead generate the embeddings
inside the Vespa document and query flows.
Observe how embed is used to create the vector embeddings from text in the schemas and queries in the examples that follow.
Integrating embedding into the Vespa application simplifies the serving architecture
and cuts serving latencies significantly: vector data transfer is minimized, and there are fewer components and less serialization overhead.
Provided embedders
Vespa provides several embedders as part of the platform.
An example of a Vespa query request using Vespa embed functionality:
{"yql":"select * from doc where {targetHits:10)nearestNeighbor(embedding,query_embedding)","query":"semantic search","input.query(query_embedding)":"embed(semantic search)",}
Huggingface Embedder
An embedder using any Huggingface tokenizer,
including multilingual tokenizers,
to produce tokens which are then input to a supplied transformer model in ONNX model format.
The Huggingface Embedder provides embeddings directly suitable for retrieval
and ranking in Vespa. Used with the syntax for invoking the embedder in queries and during document indexing,
described in embedding a query text and embedding a document field,
it makes it easy to implement semantic search with no need for custom components
or client-side embedding inference.
The Huggingface embedder is configured in services.xml,
within the container tag:
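the following is a minimal sketch rather than a complete configuration (the component id e5 and the model file names are examples, and tokenizer-model is assumed here to point to the exported Huggingface tokenizer.json file):
<container version="1.0">
    <component id="e5" type="hugging-face-embedder">
        <transformer-model path="models/e5-small-v2.onnx"/>
        <tokenizer-model path="models/tokenizer.json"/>
    </component>
</container>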
The transformer-model specifies the embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to a compatible ONNX format.
When using path, the model files must be supplied in the Vespa
application package; the above example uses files in the models directory.
Alternatively, specify model-id when deployed on Vespa Cloud.
See the model config reference.
The following are examples of text embedding models that can be used with the hugging-face-embedder,
together with their output tensor dimensionality.
The resulting tensor type can be either float or bfloat16.
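An illustrative selection (not exhaustive; confirm the output dimensionality of the specific model you deploy):
intfloat/e5-small-v2                      tensor<float>(x[384]) or tensor<bfloat16>(x[384])
intfloat/e5-base-v2                       tensor<float>(x[768]) or tensor<bfloat16>(x[768])
sentence-transformers/all-MiniLM-L6-v2    tensor<float>(x[384]) or tensor<bfloat16>(x[384])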
Bert embedder
An embedder using the WordPiece embedder to produce tokens
which are then input to a supplied ONNX model, in the form expected by a BERT base model.
The Bert embedder is limited to English (WordPiece) and
BERT-styled transformer models with three model inputs
(input_ids, attention_mask, token_type_ids).
Prefer using the Huggingface Embedder instead of the Bert embedder.
The Bert embedder is configured in services.xml,
within the container tag:
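the following is a minimal sketch rather than a complete configuration (the component id, the model file names and the max-tokens value are examples; see the bert-embedder parameters reference for all options):
<container version="1.0">
    <component id="myBert" type="bert-embedder">
        <transformer-model path="models/minilm-l6-v2.onnx"/>
        <tokenizer-vocab path="models/vocab.txt"/>
        <max-tokens>128</max-tokens>
    </component>
</container>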
The transformer-model specifies the embedding model in ONNX format.
See exporting models to ONNX
for how to export embedding models from Huggingface to a compatible ONNX format.
The tokenizer-vocab specifies the Huggingface vocab.txt file, with one valid token per line.
Note that the Bert embedder does not support tokenizer.json-formatted tokenizer configuration files.
This means that tokenization settings such as max tokens must be set explicitly.
Embedding a query text
Where you would otherwise supply a tensor representing the vector point in a query,
you can, with an embedder configured, instead supply any text enclosed in embed().
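For example (a sketch; the embedder id e5, the field name embedding and the query tensor name q are illustrative):
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,q)",
    "input.query(q)": "embed(e5, \"semantic search\")"
}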
If you have only configured a single embedder, you can skip the embedder id argument and optionally also the quotes.
Prefer to specify the embedder id anyway, since introducing more embedder models later requires specifying the identifier.
Both single (') and double (") quotes are permitted.
Output from embed that cannot fit into the tensor dimensionality is truncated, retaining only the first values.
A single Vespa query can use multiple embedders or embed multiple texts with the same embedder:
{"yql":"select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(embedding,q2)) or userQuery()","query":"semantic search","input.query(q)":"embed(e5, \"contextualized search\")","input.query(q2)":"embed(e5, \"neural search\")""ranking":"semantic",}
The example above uses a JSON POST query. Notice how the input to the embedder is quoted. Since
adding new embedders to services.xml will break queries that do not specify an embedder id, it is recommended to always specify the embedder id.
{"yql":"select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(question_embedding,q))","query":"semantic search","input.query(q)":"embed(e5, \"contextualized search\")","ranking":"semantic",}
This uses the same embedding tensor as input to two nearestNeighbor query operators, searching two different embedding fields. For this to
work, both embedding and question_embedding must have the same dimensionality.
Embedding a document field
Use the Vespa indexing language
to convert one or more string fields into an embedding vector by using the embed function,
for example:
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
        }
        field body type string {
            indexing: summary | index
        }
    }
    field embeddings type tensor<bfloat16>(x[384]) {
        indexing {
            (input title || "") . " " . (input body || "") | embed embedderId | attribute | index
        }
        index: hnsw
    }
}
The above example uses two input fields and concatenates them into a single input string to the embedder.
See indexing choice for details.
If each document has multiple text segments, represent them in an array and store the vector embeddings
in a tensor field with one mapped and one indexed dimension.
The array indexes (0-based) are used as labels in the mapped tensor dimension.
See Revolutionizing Semantic Search with Multi-Vector HNSW Indexing in Vespa.
schema doc {
    document doc {
        field chunks type array<string> {
            indexing: index | summary
        }
    }
    field embeddings type tensor<bfloat16>(p{},x[5]) {
        indexing: input chunks | embed embedderId | attribute | index
        index: hnsw
    }
}
If you have only configured a single embedder, you can skip the embedder id argument.
The indexing expression can also use for_each and include other document fields.
For example, the E5 family of embedding models uses instructions along with the input. The following
expression prefixes the input with passage: followed by a concatenation of the title and a text chunk.
schema doc {
    document doc {
        field title type string {
            indexing: summary | index
        }
        field chunks type array<string> {
            indexing: index | summary
        }
    }
    field embedding type tensor<bfloat16>(p{}, x[384]) {
        indexing {
            input chunks |
                for_each {
                    "passage: " . (input title || "") . " " . ( _ || "")
                } | embed e5 | attribute | index
        }
        attribute {
            distance-metric: prenormalized-angular
        }
    }
}
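On the query side, E5-style models similarly expect a query: prefix, which can be included directly in the text passed to embed. A sketch, using the e5 embedder id and the embedding field from the schema above (the query text is illustrative):
{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,q)",
    "input.query(q)": "embed(e5, \"query: how does vespa embed text\")"
}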
Embedder ONNX models
Transformer-based models have named inputs and outputs that need to be compatible
with the input and output names used by the Bert embedder or the Huggingface embedder.
The simple-semantic-search
sample app includes two scripts to export models and vocabulary files using the default expected input and output names for the bert-embedder
and the huggingface-embedder. The input and output names to use can also be overridden by the various transformer-
input and output parameters of the huggingface-embedder
and the bert-embedder.
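As a sketch, an override of the names in services.xml could look like the following; this assumes a bert-embedder and the transformer-input-ids, transformer-attention-mask, transformer-token-type-ids and transformer-output parameters, so check the bert-embedder and huggingface-embedder references for the exact parameter names and defaults:
<component id="myBert" type="bert-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
    <transformer-input-ids>input_ids</transformer-input-ids>
    <transformer-attention-mask>attention_mask</transformer-attention-mask>
    <transformer-token-type-ids>token_type_ids</transformer-token-type-ids>
    <transformer-output>output_0</transformer-output>
</component>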
Debugging ONNX models
When loading ONNX models for embedders, the model must have the correct input and output parameters.
Vespa offers tools to inspect ONNX model files.
In the following, minilm-l6-v2.onnx is in the current working directory.
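One way to inspect the model is a sketch like the following, assuming the vespaengine/vespa Docker image, which bundles the vespa-analyze-onnx-model tool:
$ docker run -v `pwd`:/w \
  --entrypoint /opt/vespa/bin/vespa-analyze-onnx-model \
  vespaengine/vespa \
  /w/minilm-l6-v2.onnx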
If a model is loaded without the expected input and output parameter names, the Vespa container node will not start
(check vespa.log in the container running Vespa):
WARNING container Container.com.yahoo.container.di.Container
Caused by: java.lang.IllegalArgumentException: Model does not contain required input: 'input_ids'. Model contains: input
When this happens, a deploy looks like:
$ vespa deploy --wait 300
Uploading application package ... done
Success: Deployed .
Waiting up to 5m0s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
Embedder performance
Embedding inference can be resource-intensive for larger embedding models. Factors that impact performance:
The embedding model parameters. Larger models are more expensive to evaluate than smaller models.
The input sequence length. Transformer models scale quadratically with input length. Since queries
are typically shorter than documents, embedding queries is less resource-intensive than embedding documents.
The number of inputs to the embed call. When encoding arrays, consider how many inputs a single document can have.
For CPU inference, increasing feed timeout settings
might be required when documents have many embed inputs.
Vespa's built-in embedders, such as the Bert and Huggingface embedders, emit metrics for computation time and token sequence length.
These metrics are prefixed with embedder.
and listed in the Container Metrics reference documentation.
Third-party embedder implementations may inject the ai.vespa.embedding.EmbedderRuntime component to easily emit the same predefined metrics,
although emitting custom metrics is perfectly fine.