Embedding

A common technique in modern big data serving applications is to map unstructured data - say, text or images - to points in an abstract vector space and then do computation in that vector space. For example, similar data can be retrieved by finding nearby points in the vector space, or the vectors can be used as input to a neural net. This mapping is usually referred to as embedding. Read more about embedding and embedding management in this blog post.

Without using Vespa native embedding support, the embedding vectors must be provided by the client like below:

document- and query-embeddings
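
A minimal sketch of such a query, where the client computes the vector and passes it in the request (field and parameter names are illustrative, and a real vector would have the model's full dimensionality):

{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,query_embedding)",
    "input.query(query_embedding)": "[0.12, -0.03, 0.25, 0.08]"
}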

By using the embedding features in Vespa, developers can instead generate the embeddings into the Vespa document and query flows. Observe how embed is used to create the vector embeddings from text in the schema and queries:

Vespa's embedding feature, creating embeddings from text
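
A rough sketch of what this looks like (embedder id, field names, and dimensionality are illustrative): the schema embeds a text field at indexing time, and the query embeds the query text:

field embedding type tensor<float>(x[384]) {
    indexing: input title | embed myEmbedderId | attribute | index
}

input.query(q)=embed(myEmbedderId, "semantic search")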

Integrating embedding into the Vespa application simplifies the serving architecture and cuts serving latencies significantly: vector data transfer is minimized, and there are fewer components and less serialization overhead.

Provided embedders

Vespa provides several embedders as part of the platform.

An example of a Vespa query request using Vespa embed functionality:

{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,query_embedding)",
    "query": "semantic search",
    "input.query(query_embedding)": "embed(semantic search)"
}

See more usage examples in embedding a query text and embedding a document field.

Huggingface Embedder

An embedder using any Huggingface tokenizer, including multilingual tokenizers, to produce tokens that are then input to a supplied transformer model in ONNX format.

The huggingface-embedder supports all Huggingface tokenizer implementations and input length settings, while the bert-embedder only supports WordPiece tokenization.

The Huggingface Embedder provides embeddings directly suitable for retrieval and ranking in Vespa. Combined with the syntax for invoking the embedder in queries and during document indexing described in embedding a query text and embedding a document field, it makes it easy to implement semantic search with no need for custom components or client-side embedding inference.

The Huggingface embedder is configured in services.xml, within the container tag:

<container id="default" version="1.0">
    <component id="hf-embedder" type="hugging-face-embedder">
        <transformer-model path="my-models/model.onnx"/>
        <tokenizer-model path="my-models/tokenizer.json"/>
    </component>
    ...
</container>

When using path, the model files must be supplied in the Vespa application package; the above example uses files in the my-models directory. Models can also be referenced by url, as for the tokenizer in the example below, or by model-id when deployed on Vespa Cloud. See the model config reference.

<container id="default" version="1.0">
    <component id="e5" type="hugging-face-embedder">
        <transformer-model path="my-models/model.onnx"/>
        <tokenizer-model url="https://huggingface.co/intfloat/e5-base-v2/raw/main/tokenizer.json"/>
    </component>
    ...
</container>
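
When deploying on Vespa Cloud, a provided model can instead be referenced by model-id; a sketch, assuming e5-small-v2 is available as a provided model id:

<container id="default" version="1.0">
    <component id="e5" type="hugging-face-embedder">
        <transformer-model model-id="e5-small-v2"/>
    </component>
    ...
</container>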

See configuration reference for all the parameters.

Huggingface embedder models

Many text embedding models from Huggingface can be used with the hugging-face-embedder; the output tensor dimensionality depends on the model, and the resulting tensor type can be either float or bfloat16.

Such text embedding models can be used in combination with Vespa's nearest neighbor search using distance-metric angular.
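
For example, an embedding field configured for angular distance could look like this (field name, source field, and dimensionality are illustrative):

field embedding type tensor<float>(x[384]) {
    indexing: input text | embed | attribute | index
    attribute {
        distance-metric: angular
    }
    index: hnsw
}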

Check the Massive Text Embedding Benchmark (MTEB) and the MTEB leaderboard for help with choosing an embedding model.

Bert embedder

An embedder using the WordPiece tokenizer to produce tokens that are then input to a supplied ONNX model in the form expected by a BERT base model. The Bert embedder is limited to English (WordPiece) and BERT-style transformer models with three model inputs (input_ids, attention_mask, token_type_ids). Prefer the Huggingface Embedder over the Bert embedder.

The Bert embedder is configured in services.xml, within the container tag:

<container version="1.0">
  <component id="myBert" type="bert-embedder">
    <transformer-model path="models/e5-small-v2.onnx"/>
    <tokenizer-vocab url="https://huggingface.co/intfloat/e5-small-v2/raw/main/vocab.txt"/>
    <max-tokens>128</max-tokens>
  </component>
</container>
  • The transformer-model specifies the embedding model in ONNX format. See exporting models to ONNX for how to export embedding models from Huggingface to a compatible ONNX format.
  • The tokenizer-vocab specifies the Huggingface vocab.txt file, with one valid token per line. Note that the Bert embedder does not support tokenizer.json-formatted tokenizer configuration files, so tokenization settings like max tokens must be set explicitly.

See configuration reference for all configuration options.

Embedding a query text

Where you would otherwise supply a tensor representing the vector point in a query, you can, with an embedder configured, instead supply any text enclosed in embed(), e.g.:

input.query(q)=embed(myEmbedderId, "Hello%20world")

If you have only configured a single embedder, you can skip the embedder id argument and optionally also the quotes. However, prefer specifying the embedder id, since introducing more embedder models later will require the identifier to be specified.

Both single (') and double (") quotes are permitted.
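
For illustration, these forms are all accepted (assuming an embedder with id e5, and that it is the only one configured for the last form):

input.query(q)=embed(e5, "Hello world")
input.query(q)=embed(e5, 'Hello world')
input.query(q)=embed(Hello world)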

Note that query input tensors must be defined in the schema's rank-profile. See schema reference inputs.

inputs {
  query(q) tensor<float>(x[768])
  query(q2) tensor<float>(x[768])
}
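
For context, a minimal rank-profile sketch showing where such inputs are declared (profile name and ranking expression are illustrative):

rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[768])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}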

Output from embed that cannot fit into the tensor dimensionality is truncated, only retaining the first values.

A single Vespa query can use multiple embedders or embed multiple texts with the same embedder:

{
    "yql": "select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(embedding,q2)) or userQuery()",
    "query": "semantic search",
    "input.query(q)": "embed(e5, \"contextualized search\")",
    "input.query(q2)": "embed(e5, \"neural search\")",
    "ranking": "semantic"
}

The above example uses a JSON POST query. Notice how the input to the embedder is quoted. Since adding new embedders to services.xml will break queries that do not specify an embedder id, it is recommended to always specify the embedder id.

{
    "yql": "select id,title from paragraph where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(question_embedding,q))",
    "query": "semantic search",
    "input.query(q)": "embed(e5, \"contextualized search\")",
    "ranking": "semantic"
}

This example uses the same embedding tensor as input to two nearestNeighbor query operators, searching two different embedding fields. For this to work, both embedding and question_embedding must have the same dimensionality.
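
A sketch of two such embedding fields sharing the same dimensionality (source fields and sizes are illustrative):

field embedding type tensor<float>(x[384]) {
    indexing: input body | embed e5 | attribute | index
    index: hnsw
}

field question_embedding type tensor<float>(x[384]) {
    indexing: input question | embed e5 | attribute | index
    index: hnsw
}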

Embedding a document field

Use the Vespa indexing language to convert one or more string fields into an embedding vector by using the embed function, for example:

schema doc {

    document doc {

        field title type string {
            indexing: summary | index
        }

        field body type string {
            indexing: summary | index
        }

    }

    field embeddings type tensor<bfloat16>(x[384]) {
        indexing {
            (input title || "") . " " . (input body || "") | embed embedderId | attribute | index
        }
        index: hnsw
    }

}

The above example uses two input fields and concatenates them into a single input string to the embedder. See indexing choice for details.

If each document has multiple text segments, represent them in an array and store the vector embeddings in a tensor field with one mapped and one indexed dimension. The array indexes (0-based) are used as labels in the mapped tensor dimension. See Revolutionizing Semantic Search with Multi-Vector HNSW Indexing in Vespa.

schema doc {

    document doc {

        field chunks type array<string> {
            indexing: index | summary
        }

    }

    field embeddings type tensor<bfloat16>(p{},x[5]) {
        indexing: input chunks | embed embedderId | attribute | index
        index: hnsw
    }

}
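
To illustrate, for the tensor<bfloat16>(p{},x[5]) field above, a document with two chunks produces a tensor where the array indexes become the labels of the mapped dimension p (cell values are made up):

{
    {p:0,x:0}: 0.12, {p:0,x:1}: -0.03, {p:0,x:2}: 0.25, {p:0,x:3}: 0.08, {p:0,x:4}: -0.11,
    {p:1,x:0}: 0.31, {p:1,x:1}: 0.07, {p:1,x:2}: -0.19, {p:1,x:3}: 0.02, {p:1,x:4}: 0.16
}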

If you have only configured a single embedder, you can skip the embedder id argument.

The indexing expression can also use for_each and include other document fields. For example, the E5 family of embedding models uses instructions along with the input. The following expression prefixes the input with passage: followed by a concatenation of the title and a text chunk.

schema doc {

    document doc {

        field title type string {
            indexing: summary | index
        }

        field chunks type array<string> {
            indexing: index | summary
        }

    }
    field embedding type tensor<bfloat16>(p{}, x[384]) {
        indexing {
            input chunks |
                for_each {
                    "passage: " . (input title || "") . " " . ( _ || "")
                } | embed e5 | attribute | index
        }
        attribute {
            distance-metric: prenormalized-angular
        }
    }
}

See Indexing language execution value for details.
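
On the query side, E5-style models similarly expect a query: prefix, which can be included in the text passed to embed; a hedged sketch (document, field, and rank profile names are assumed from the examples above):

{
    "yql": "select * from doc where {targetHits:10}nearestNeighbor(embedding,q)",
    "input.query(q)": "embed(e5, \"query: how to prefix e5 inputs\")",
    "ranking": "semantic"
}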

Examples

Try the simple-semantic-search sample application. The commerce-product-ranking sample application demonstrates using multiple embedders, and the multi-vector-indexing sample app demonstrates how to use embedders with multiple document field inputs.

Exporting HF models to ONNX format

Transformer-based models have named inputs and outputs that need to be compatible with the input and output names used by the Bert embedder or the Huggingface embedder.

The simple-semantic-search sample app includes two scripts to export models and vocabulary files using the default expected input and output names for the bert-embedder and the huggingface-embedder. The input and output names to use can also be overridden with the various transformer input and output parameters of the huggingface-embedder and the bert-embedder.
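
As one possible approach outside the sample-app scripts (the Optimum exporter is not part of Vespa, and model and directory names are illustrative), a Huggingface model can be exported to ONNX with the Optimum CLI:

$ pip install optimum[exporters]
$ optimum-cli export onnx --model intfloat/e5-small-v2 --task feature-extraction e5-small-v2-onnx/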

Debugging ONNX models

When loading ONNX models for embedders, the model must have the expected input and output parameters. Vespa offers tools to inspect ONNX model files. Here, minilm-l6-v2.onnx is in the current working directory:

$ docker run -v `pwd`:/w \
  --entrypoint /opt/vespa/bin/vespa-analyze-onnx-model \
  vespaengine/vespa \
  /w/minilm-l6-v2.onnx

...
model meta-data:
  input[0]: 'input_ids' long[batch][sequence]
  input[1]: 'attention_mask' long[batch][sequence]
  input[2]: 'token_type_ids' long[batch][sequence]
  output[0]: 'output_0' float[batch][sequence][384]
  output[1]: 'output_1' float[batch][384]
...
The above model input and output names conform to the default bert-embedder parameters.

If a model does not have the expected input and output parameter names, the Vespa container node will not start (check vespa.log in the container running Vespa):

 WARNING container        Container.com.yahoo.container.di.Container
  Caused by: java.lang.IllegalArgumentException: Model does not contain required input: 'input_ids'. Model contains: input

When this happens, a deploy looks like:

$ vespa deploy --wait 300
Uploading application package ... done

Success: Deployed .

Waiting up to 5m0s for query service to become available ...
Error: service 'query' is unavailable: services have not converged

Embedder performance

Embedding inference can be resource intensive for larger embedding models. Factors that impact performance:

  • The embedding model parameters. Larger models are more expensive to evaluate than smaller models.
  • The input sequence length. Transformer-type models scale quadratically with input length. Since queries are typically shorter than documents, embedding queries is less resource intensive than embedding documents.
  • The number of inputs to the embed call. When encoding arrays, consider how many inputs a single document can have. For CPU inference, increasing feed timeout settings might be required when documents have many embed inputs.

Using GPU, especially for longer sequence lengths (documents), can dramatically improve performance and reduce cost. See the blog post on GPU-accelerated ML inference in Vespa Cloud. See also Vespa container GPU setup guide for using GPUs with Vespa.

Metrics

Vespa's built-in embedders such as Bert and HuggingFace emit metrics for computation time and token sequence length. These metrics are prefixed with embedder. and listed in the Container Metrics reference documentation. Third-party embedder implementations may inject the ai.vespa.embedding.EmbedderRuntime component to easily emit the same predefined metrics, although emitting custom metrics is perfectly fine.