A common technique in modern big data serving applications is to map the subject data - say, text or images - to points in an abstract vector space and then do computation in that vector space. For example, retrieve similar data by finding nearby points in the vector space, or using the vectors as input to a neural net. This mapping is usually referred to as embedding.
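As a toy illustration (this is not Vespa code, and the vectors are made up), "finding nearby points" is commonly done by comparing the angle between vectors using cosine similarity:

```java
// Toy illustration (not Vespa code): comparing embedding vectors with
// cosine similarity, a common measure of "nearness" in vector space.
public class CosineSimilarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With hypothetical embeddings for "cat", "kitten" and "car", cosine("cat", "kitten") would come out higher than cosine("cat", "car"), which is what makes nearest-neighbor retrieval over embeddings useful.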
Vespa provides built-in support for embedding, which is documented here.
Vespa provides a Java interface for defining components which can provide embeddings of text: com.yahoo.language.process.Embedder.
To define a custom embedder in an application and make it usable by Vespa (see below), implement this interface and add it as a component to services.xml:
<container version="1.0">
    <component id="myEmbedder"
               class="com.example.MyEmbedder"
               bundle="the name in <artifactId> in pom.xml"/>
</container>
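As a sketch of what such a component might look like: the real interface, com.yahoo.language.process.Embedder, produces a Vespa Tensor and takes a context argument; a simplified stand-in interface is used here so the example is self-contained, and the hashing-trick "model" is purely illustrative.

```java
// Simplified stand-in for com.yahoo.language.process.Embedder, so this
// sketch compiles on its own. The real interface produces a Vespa Tensor
// and takes a context argument.
interface SimpleEmbedder {
    float[] embed(String text, int dimensions);
}

// A purely illustrative embedder using the "hashing trick": each token
// increments one bucket of the output vector. A production embedder would
// typically run a tokenizer and an ONNX model here instead.
public class MyEmbedder implements SimpleEmbedder {

    @Override
    public float[] embed(String text, int dimensions) {
        float[] vector = new float[dimensions];
        for (String token : text.toLowerCase().split("\\s+")) {
            int bucket = Math.floorMod(token.hashCode(), dimensions);
            vector[bucket] += 1.0f; // count tokens per hash bucket
        }
        return vector;
    }
}
```

A real implementation would then be registered in services.xml as shown above.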
Vespa provides some embedders as part of the platform.
An embedder using WordPiece to produce tokens, which are then input to a supplied ONNX model in the form expected by a BERT base model. See export_model_from_hf.py for how to export a sentence-transformer model to ONNX format. See also troubleshooting model signature.
This provides embeddings directly suitable for retrieval and ranking in Vespa. Used with the syntax for invoking the embedder in queries and during indexing described below, it makes it possible to implement semantic search with no need for custom components or client-side embedding.
To set up the BertBase embedder, add it to services.xml:
<component id="myBert"
           class="ai.vespa.embedding.BertBaseEmbedder"
           bundle="model-integration">
    <config name="embedding.bert-base-embedder">
        <transformerModel path="models/myBertModel.onnx"/>
        <tokenizerVocab path="models/myTokenizerVocabulary.txt"/>
    </config>
</component>
See the options available for configuring the BertBase embedder in the full configuration definition. Notice that the BertBase embedder uses the mean pooling strategy by default.
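Mean pooling itself is simple to state: the per-token output vectors of the model are averaged, counting only the positions the attention mask marks as real tokens. A stand-alone sketch (not the actual BertBaseEmbedder code):

```java
// Illustrative sketch of mean pooling (not the actual BertBaseEmbedder
// code): average the model's per-token output vectors over the positions
// where the attention mask is 1, producing one sentence-level vector.
public class MeanPooling {

    public static float[] meanPool(float[][] tokenVectors, int[] attentionMask) {
        int dims = tokenVectors[0].length;
        float[] pooled = new float[dims];
        int count = 0;
        for (int t = 0; t < tokenVectors.length; t++) {
            if (attentionMask[t] == 0) continue; // skip padding positions
            for (int d = 0; d < dims; d++)
                pooled[d] += tokenVectors[t][d];
            count++;
        }
        for (int d = 0; d < dims; d++)       // assumes at least one
            pooled[d] /= count;              // unmasked token
        return pooled;
    }
}
```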
The model files used must be supplied by the application (or specified by id when on Vespa Cloud), for example from HuggingFace. Refer to adding files to the configuration for the full syntax for specifying model files by url, path or model-id.
A native Java implementation of SentencePiece. SentencePiece breaks text into chunks independent of spaces, which is robust to misspellings and works with CJK languages. It is also very fast.
This is suitable to use in conjunction with custom components which process the resulting encoding further to produce semantically meaningful vectors.
To use the SentencePiece embedder, add it to services.xml:
<component id="mySentencePiece"
           class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
           bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>model/en.wiki.bpe.vs10000.model</path>
            </item>
        </model>
    </config>
</component>
See the options available for configuring SentencePiece in the full configuration definition.
A native Java implementation of WordPiece, which is commonly used with BERT models.
This is suitable to use in conjunction with custom components which process the resulting encoding further to produce semantically meaningful vectors.
To use the WordPiece embedder, add it to services.xml:
<component id="myWordPiece"
           class="com.yahoo.language.wordpiece.WordPieceEmbedder"
           bundle="linguistics-components">
    <config name="language.wordpiece.word-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>models/bert-base-uncased-vocab.txt</path>
            </item>
        </model>
    </config>
</component>
See the options available for configuring WordPiece in the full configuration definition.
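The core of WordPiece tokenization can be sketched as greedy longest-prefix matching against a vocabulary, where pieces that continue a word carry a ## prefix. A simplified stand-alone sketch (not Vespa's implementation; the tiny vocabulary below is made up):

```java
import java.util.*;

// Simplified sketch of WordPiece tokenization (not Vespa's implementation):
// repeatedly take the longest vocabulary entry matching a prefix of the
// remaining word; continuation pieces are looked up with a "##" prefix.
public class WordPieceSketch {

    public static List<String> tokenize(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            while (end > start) { // try the longest candidate first
                String candidate = word.substring(start, end);
                if (start > 0) candidate = "##" + candidate;
                if (vocab.contains(candidate)) { piece = candidate; break; }
                end--;
            }
            if (piece == null) return List.of("[UNK]"); // no match: unknown
            pieces.add(piece);
            start = end;
        }
        return pieces;
    }
}
```

For example, with the vocabulary {play, ##ing}, "playing" tokenizes to [play, ##ing].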
Where you would otherwise supply a tensor representing the vector point in a query, you can instead, with an embedder configured, supply any text enclosed in embed(), e.g.:
ranking.features.query(myEmbedding)=embed(myEmbedderId, "Hello%20world")
If you have only configured a single embedder, you can skip the embedder id argument and optionally also the quotes. Both single and double quotes are permitted.
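For example, a hypothetical query combining this with a nearestNeighbor search (the field, embedder and tensor names here are illustrative):

```
yql=select * from doc where {targetHits: 10}nearestNeighbor(embeddingOfMyText, myEmbedding)
&ranking.features.query(myEmbedding)=embed(myEmbedderId, "Hello%20world")
```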
Use the indexing language to convert a text field into an embedding by using the embed function, for example:
schema doc {

    document doc {
        field myText type string {
            indexing: index | summary
        }
    }

    field embeddingOfMyText type tensor(x[5]) {
        indexing: input myText | embed myEmbedderId | attribute | index | summary
        index: hnsw
    }

}
If you have only configured a single embedder, you can skip the embedder id argument. Note that the field holding the embedding is defined outside the document clause in the schema, since it is generated from the myText field during indexing.
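To illustrate, feeding a document then only requires the text; the embedding field is populated during indexing (the document id and namespace here are hypothetical):

```json
{
    "put": "id:mynamespace:doc::1",
    "fields": {
        "myText": "Hello world"
    }
}
```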
When writing custom Java components (such as Searchers or Document processors), use embedders you have configured by having them injected in the constructor, just as any other component:
class MyComponent {

    private final ComponentRegistry<Embedder> embedders;

    public MyComponent(ComponentRegistry<Embedder> embedders) {
        // embedders contains all the embedders configured in your services.xml
        this.embedders = embedders;
    }

}
Try the simple-semantic-search sample application. A complete example application using multiple embedders can be found in this system test.
When loading models for the embedder, the model must have the correct input and output signatures. Here, minilm-l6-v2.onnx is in the current working directory:
$ docker run -v `pwd`:/w \
  --entrypoint /opt/vespa/bin/vespa-analyze-onnx-model \
  vespaengine/vespa \
  /w/minilm-l6-v2.onnx
...
model meta-data:
  input[0]: 'input_ids' long[batch][sequence]
  input[1]: 'attention_mask' long[batch][sequence]
  input[2]: 'token_type_ids' long[batch][sequence]
  output[0]: 'output_0' float[batch][sequence][384]
  output[1]: 'output_1' float[batch][384]
...
test setup:
  input[0]: tensor(d0[1],d1[1]) -> long[1][1]
  input[1]: tensor(d0[1],d1[1]) -> long[1][1]
  input[2]: tensor(d0[1],d1[1]) -> long[1][1]
  output[0]: float[1][1][384] -> tensor<float>(d0[1],d1[1],d2[384])
  output[1]: float[1][384] -> tensor<float>(d0[1],d1[384])
If a model with a different signature is loaded, the Vespa Container node will not start (check vespa.log in the container running Vespa):
[2022-10-18 18:18:31.761] WARNING container Container.com.yahoo.container.di.Container
  Failed to set up first component graph due to error when constructing one of the components
  exception=com.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException:
  Error constructing 'bert' of type 'ai.vespa.embedding.BertBaseEmbedder': null
Caused by: java.lang.IllegalArgumentException: Model does not contain required input: 'input_ids'. Model contains: input
  at ai.vespa.embedding.BertBaseEmbedder.validateName(BertBaseEmbedder.java:79)
  at ai.vespa.embedding.BertBaseEmbedder.validateModel(BertBaseEmbedder.java:68)
When this happens, a deploy looks like:
$ vespa deploy --wait 300
Uploading application package ... done

Success: Deployed .
Waiting up to 5m0s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
Use vespa-analyze-onnx-model like in the example above to analyze the signature.