# Embedding

A common technique in modern big data serving applications is to map the subject data - say, text or images - to points in an abstract vector space, and then to do computation in that vector space: for example, retrieving similar data by finding nearby points, or using the vectors as input to a neural net. This mapping is usually referred to as embedding.

Vespa provides built-in support for embedding, which is documented here.

## Embedders

Vespa provides a Java interface for defining components which can provide embeddings of text: com.yahoo.language.process.Embedder

If you want to define your own embedder in an application and make it usable by Vespa (see below), you can implement this interface and add the class as a component to your services.xml:

```xml
<container version="1.0">
    <component id="com.example.MyEmbedder"
               bundle="the name in <artifactId> in your pom.xml" />
</container>
```
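As a sketch of what such a component might do, here is a minimal hash-based embedder. This is a simplified stand-in, not an implementation of the real com.yahoo.language.process.Embedder interface (which produces Vespa tensors); a plain double[] and the class name MyHashEmbedder are assumptions made so the example is self-contained.

```java
// Simplified sketch of an embedder: maps text to a fixed-size vector.
// NOT the real Vespa API - com.yahoo.language.process.Embedder returns a
// Tensor of a requested type; a plain double[] stands in for it here.
public class MyHashEmbedder {

    private final int dimensions;

    public MyHashEmbedder(int dimensions) {
        this.dimensions = dimensions;
    }

    // Map each whitespace-separated token to a dimension by hashing,
    // producing a bag-of-words style vector.
    public double[] embed(String text) {
        double[] vector = new double[dimensions];
        for (String token : text.toLowerCase().split("\\s+")) {
            int index = Math.floorMod(token.hashCode(), dimensions);
            vector[index] += 1.0;
        }
        return vector;
    }

}
```

A real embedder would instead build a tensor of the type requested by the caller, typically by running a tokenizer and a model.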


You can only configure a single embedder. If you need more, create your own embedder which selects and invokes the embedders you would like.

## Provided embedders

Vespa provides two embedders as part of the platform.

### SentencePiece embedder

A native Java implementation of SentencePiece. SentencePiece breaks text into chunks independent of spaces, which is robust to misspellings and works with CJK languages. It is also very fast.

```xml
<component id="com.yahoo.language.sentencepiece.SentencePieceEmbedder" bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>model/en.wiki.bpe.vs10000.model</path>
            </item>
        </model>
    </config>
</component>
```


See the options available for configuring SentencePiece in the full configuration definition.

The model(s) used must be supplied by the application package. You can find pre-trained models for many languages here.

### WordPiece embedder

A native Java implementation of WordPiece, which is commonly used with BERT models.

```xml
<component id="com.yahoo.language.wordpiece.WordPieceEmbedder" bundle="linguistics-components">
    <config name="language.wordpiece.word-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>models/bert-base-uncased-vocab.txt</path>
            </item>
        </model>
    </config>
</component>
```


See the options available for configuring WordPiece in the full configuration definition.
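To illustrate the idea behind WordPiece (this is a sketch of the algorithm, not Vespa's implementation): each word is split greedily into the longest pieces found in the vocabulary, with continuation pieces prefixed by "##". The tiny vocabulary and class name below are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Greedy longest-match-first tokenization in the style of WordPiece.
// Illustrative sketch only - not Vespa's implementation.
public class WordPieceSketch {

    private final Set<String> vocabulary;

    public WordPieceSketch(Set<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    public List<String> tokenize(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            // Try the longest remaining substring first, shrinking until
            // a vocabulary entry matches.
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) candidate = "##" + candidate; // continuation piece
                if (vocabulary.contains(candidate)) {
                    piece = candidate;
                    break;
                }
                end--;
            }
            if (piece == null) return List.of("[UNK]"); // no piece matches
            pieces.add(piece);
            start = end;
        }
        return pieces;
    }

}
```

With a vocabulary containing "play" and "##ing", the word "playing" tokenizes to ["play", "##ing"].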

## Embedding a query text

Where you would otherwise supply a tensor representing a point in the vector space in a query, you can, with an embedder configured, instead supply any text enclosed in embed(). For example:

```
ranking.features.query(myEmbedding)=embed(Hello%20world)
```
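The text inside embed() is percent-encoded like any other URL parameter value, which is where the %20 above comes from. A small sketch of building that parameter value from plain text (the helper name is made up for the example):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds the value of a query parameter of the form embed(<encoded text>).
// Illustrative helper, not part of any Vespa API.
public class EmbedParameter {

    public static String embedParameter(String text) {
        // URLEncoder encodes spaces as '+'; use %20 to match the example above
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8)
                                   .replace("+", "%20");
        return "embed(" + encoded + ")";
    }

}
```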


## Embedding a document field

You can use the indexing language to convert a text field into an embedding by using the embed function. For example:

```
schema doc {

    document doc {

        field myText type string {
            indexing: index | summary
        }

    }

    field embeddingOfMyText type tensor(x[5]) {
        indexing: input myText | embed | attribute | index | summary
        index: hnsw
    }

}
```


## Using an embedder from Java

If you are writing your own Java components (such as Searchers or Document processors), you can use an embedder you have configured by having it injected in the constructor, just as any other component:

```java
class MyComponent {

    private final Embedder embedder;

    public MyComponent(Embedder embedder) {
        this.embedder = embedder;
    }

}
```


## A complete example

You can find a complete example of configuring and using the SentencePiece embedder in this system test.