A common technique in modern big data serving applications is to map the subject data - say, text or images - to points in an abstract vector space, and then do computation in that space: for example, retrieve similar data by finding nearby points, or use the vectors as input to a neural net. This mapping is usually referred to as embedding.
Vespa provides built-in support for embedding, which is documented here.
Vespa provides a Java interface for defining components which can provide embeddings of text: com.yahoo.language.process.Embedder
If you want to define your own embedder in an application and make it usable by Vespa (see below), you can implement this interface and add the class as a component to your services.xml:
<container version="1.0">
    <component id="com.example.MyEmbedder"
               bundle="the name in <artifactId> in your pom.xml"/>
</container>
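For reference, a minimal sketch of an implementation is shown below. It assumes the two embed methods of the interface at the time of writing (one producing token ids, one producing a tensor); the method bodies are toy logic for illustration, not a real model, and the exact interface may differ between Vespa versions:

package com.example;

import com.yahoo.language.process.Embedder;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

import java.util.List;
import java.util.stream.Collectors;

public class MyEmbedder implements Embedder {

    @Override
    public List<Integer> embed(String text, Context context) {
        // Toy "tokenization": one id per character.
        // A real embedder would run a proper tokenizer here.
        return text.chars().boxed().collect(Collectors.toList());
    }

    @Override
    public Tensor embed(String text, Context context, TensorType tensorType) {
        // Toy embedding: write the first token ids into the requested
        // tensor type. A real embedder would evaluate a model here.
        Tensor.Builder builder = Tensor.Builder.of(tensorType);
        List<Integer> tokens = embed(text, context);
        long size = tensorType.dimensions().get(0).size().orElse((long) tokens.size());
        for (long i = 0; i < size; i++)
            builder.cell(i < tokens.size() ? tokens.get((int) i) : 0, i);
        return builder.build();
    }

}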
You can only configure a single embedder. If you need more, create your own embedder and let it select and invoke the ones you would like.
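A sketch of such a delegating embedder, assuming the same interface as above and that Context.getDestination() identifies where the embedding is used, might look like:

package com.example;

import com.yahoo.language.process.Embedder;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

import java.util.List;

public class SelectingEmbedder implements Embedder {

    private final Embedder first;
    private final Embedder second;

    // How the wrapped embedders are obtained (injected or constructed here)
    // is up to the application; this constructor is illustrative.
    public SelectingEmbedder(Embedder first, Embedder second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public List<Integer> embed(String text, Context context) {
        return select(context).embed(text, context);
    }

    @Override
    public Tensor embed(String text, Context context, TensorType tensorType) {
        return select(context).embed(text, context, tensorType);
    }

    private Embedder select(Context context) {
        // Illustrative selection rule: route query-side embeddings to one
        // embedder and everything else to the other. getDestination() is
        // assumed here to return a string such as "query(myEmbedding)".
        return context.getDestination().startsWith("query") ? first : second;
    }

}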
Vespa provides two embedders as part of the platform: SentencePiece and WordPiece.
To use the SentencePiece embedder, add it to your services.xml, e.g.:
<component id="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
           bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>model/en.wiki.bpe.vs10000.model</path>
            </item>
        </model>
    </config>
</component>
See the options available for configuring SentencePiece in the full configuration definition.
The model(s) used must be supplied by the application package. You can find pre-trained models for many languages here.
The WordPiece embedder is a native Java implementation of WordPiece, which is commonly used with BERT models.
To use the WordPiece embedder, add it to your services.xml, e.g.:
<component id="com.yahoo.language.wordpiece.WordPieceEmbedder"
           bundle="linguistics-components">
    <config name="language.wordpiece.word-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>models/bert-base-uncased-vocab.txt</path>
            </item>
        </model>
    </config>
</component>
See the options available for configuring WordPiece in the full configuration definition.
Where you would otherwise supply a tensor representing the vector point in a query, you can instead supply any text enclosed in embed() when an embedder is configured. E.g.:
ranking.features.query(myEmbedding)=embed(Hello%20world)
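For example, a complete query request could pass the embedded vector to a nearestNeighbor search over a tensor field (here embeddingOfMyText, as defined in the schema example below; the field and parameter names are illustrative):

{
    "yql": "select * from doc where {targetHits: 10}nearestNeighbor(embeddingOfMyText, myEmbedding)",
    "ranking.features.query(myEmbedding)": "embed(Hello world)"
}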
You can use the indexing language to convert a text field into an embedding by using the embed function. For example:
schema doc {

    document doc {
        field myText type string {
            indexing: index | summary
        }
    }

    field embeddingOfMyText type tensor(x[5]) {
        indexing: input myText | embed | attribute | index | summary
        index: hnsw
    }

}
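A rank profile can then rank by closeness to this field. A minimal sketch (the profile name is illustrative; note that the type of query(myEmbedding) must also be declared, which is done differently across Vespa versions):

rank-profile semantic {
    first-phase {
        # Rank by vector closeness to the field targeted by nearestNeighbor
        expression: closeness(field, embeddingOfMyText)
    }
}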
If you are writing your own Java components (such as Searchers or Document processors), you can use an embedder you have configured by having it injected in the constructor, just as any other component:
public class MyComponent {

    private final Embedder embedder;

    public MyComponent(Embedder embedder) {
        this.embedder = embedder;
    }

}
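For example, a Searcher might use the injected embedder to turn the query string into a query tensor. This is a sketch: the destination string, tensor type and feature name are assumptions matching the examples above:

package com.example;

import com.yahoo.language.process.Embedder;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

public class MySearcher extends Searcher {

    private final Embedder embedder;

    public MySearcher(Embedder embedder) {
        this.embedder = embedder;
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Embed the raw query string and attach it as the query tensor a
        // rank profile would expect. Destination and tensor type are
        // illustrative assumptions.
        Tensor embedding = embedder.embed(query.getModel().getQueryString(),
                                          new Embedder.Context("query(myEmbedding)"),
                                          TensorType.fromSpec("tensor(x[5])"));
        query.getRanking().getFeatures().put("query(myEmbedding)", embedding);
        return execution.search(query);
    }

}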
You can find a complete example of configuring and using the SentencePiece embedder in this system test.