Cross-Encoder Transformer-based text ranking models are generally more effective than text embedding models,
as they take both the query and the document as input, with full cross-attention between all query and document tokens.
The downside of cross-encoder models is their computational complexity. This document is a guide
on how to export cross-encoder Transformer-based models from Hugging Face,
and how to configure them for use in Vespa.
Exporting cross-encoder models
For exporting models from Hugging Face to ONNX, we recommend the Optimum
library. Example usage for two relevant ranking models is shown below.
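For illustration, a minimal sketch using Optimum's Python API; the model ids below (one BERT-based and one RoBERTa-based cross-encoder) are illustrative choices, so substitute the models you want to serve:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Illustrative model ids - one BERT-based and one RoBERTa-based cross-encoder
for model_id, out_dir in [
    ("cross-encoder/ms-marco-MiniLM-L-6-v2", "models"),      # BERT-based
    ("cross-encoder/stsb-roberta-base", "models-roberta"),   # RoBERTa-based
]:
    # export=True converts the PyTorch checkpoint to ONNX on the fly
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    model.save_pretrained(out_dir)  # writes <out_dir>/model.onnx
    # tokenizer.json is used by the hugging-face-tokenizer embedder configured below
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)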
To speed up inference, Vespa avoids re-tokenizing the document text for every query: the document tokens are computed once, at indexing time. To do this, we configure the
huggingface-tokenizer-embedder
in the services.xml file:
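A minimal sketch of the embedder configuration; the component id (tokenizer) and the model path are placeholders for your own application package:

<container id="default" version="1.0">
    <component id="tokenizer" type="hugging-face-tokenizer">
        <model path="models/tokenizer.json"/>
    </component>
    <document-api/>
    <document-processing/>
    <search/>
</container>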
This allows us to use the tokenizer while indexing documents in Vespa and also at query time to
map (embed) query text to language model tokens.
Using tokenizer in schema
Assuming we have two fields that we want to index and use for re-ranking (title, body), we
can use the embed indexing expression to invoke the tokenizer configured above:
schema my_document {
    document my_document {
        field title type string {..}
        field body type string {..}
    }
    field tokens type tensor<float>(d0[512]) {
        indexing: (input title || "") . " " . (input body || "") | embed tokenizer | attribute
    }
}
The above will concatenate the title and body input fields and feed the result to the
hugging-face-tokenizer, which stores the output token ids as floats (for example, 101.0 for the BERT [CLS] token).
To use the generated tokens tensor in ranking, the tensor field must be defined with attribute.
Using the cross-encoder model in ranking
Cross-encoder models are not practical for retrieval over large document volumes due to their complexity, so we configure them
using phased ranking.
BERT-based model
BERT-based models have three inputs:
input_ids
token_type_ids
attention_mask
The onnx-model configuration specifies the input names
of the model and how to calculate them. It also specifies the model file, models/model.onnx.
Notice also the gpu-device setting:
GPU inference is not required, and Vespa will fall back to CPU if no GPU device is found.
See the section on performance.
rank-profile bert-ranker inherits default {
    inputs {
        query(q_tokens) tensor<float>(d0[32])
    }
    onnx-model cross_encoder {
        file: models/model.onnx
        input input_ids: my_input_ids
        input attention_mask: my_attention_mask
        input token_type_ids: my_token_type_ids
        gpu-device: 0
    }
    function my_input_ids() {
        expression: tokenInputIds(256, query(q_tokens), attribute(tokens))
    }
    function my_token_type_ids() {
        expression: tokenTypeIds(256, query(q_tokens), attribute(tokens))
    }
    function my_attention_mask() {
        expression: tokenAttentionMask(256, query(q_tokens), attribute(tokens))
    }
    first-phase {
        expression: #depends on the retriever used
    }
    # The output of this model is a tensor of size ["batch", 1]
    global-phase {
        rerank-count: 25
        expression: onnx(cross_encoder){d0:0,d1:0}
    }
}
The example above limits the sequence length to 256 using the built-in
convenience functions
for generating token sequence input to Transformer models. Note that tokenInputIds uses 101 as the start-of-sequence
([CLS]) token and 102 as the end-of-sequence ([SEP]) token. This is only compatible with BERT-based tokenizers. See the section on performance
about sequence length and its impact on inference performance.
RoBERTa-based model
RoBERTa-based models only have two inputs (input_ids and attention_mask). In addition, the default tokenizer's
start-of-sequence token is 1 and its end-of-sequence token is 2. In this case, we use the
customTokenInputIds function in the my_input_ids function. See
customTokenInputIds.
rank-profile roberta-ranker inherits default {
    inputs {
        query(q_tokens) tensor<float>(d0[32])
    }
    onnx-model cross_encoder {
        file: models/model.onnx
        input input_ids: my_input_ids
        input attention_mask: my_attention_mask
        gpu-device: 0
    }
    function my_input_ids() {
        expression: customTokenInputIds(1, 2, 256, query(q_tokens), attribute(tokens))
    }
    function my_attention_mask() {
        expression: tokenAttentionMask(256, query(q_tokens), attribute(tokens))
    }
    first-phase {
        expression: #depends on the retriever used
    }
    # The output of this model is a tensor of size ["batch", 1]
    global-phase {
        rerank-count: 25
        expression: onnx(cross_encoder){d0:0,d1:0}
    }
}
Using the cross-encoder model at query time
At query time, we need to tokenize the user query using the embed support.
The embed of the query text sets the query(q_tokens)
tensor that we defined in the rank profile:
{
    "yql": "select title, body from my_document where userQuery()",
    "query": "semantic search",
    "input.query(q_tokens)": "embed(tokenizer, \"semantic search\")",
    "ranking": "bert-ranker"
}
The retriever (query + first-phase ranking) can be anything, including
nearest neighbor search a.k.a. dense retrieval using bi-encoders.
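For illustration, a sketch of a query that retrieves with nearestNeighbor and re-ranks with the cross-encoder; it assumes an embedding field, a bi-encoder embedder component named embedder, and a query(q) tensor input in the rank profile, none of which are defined in the bert-ranker example above:

{
    "yql": "select title, body from my_document where {targetHits: 100}nearestNeighbor(embedding, q)",
    "query": "semantic search",
    "input.query(q)": "embed(embedder, \"semantic search\")",
    "input.query(q_tokens)": "embed(tokenizer, \"semantic search\")",
    "ranking": "bert-ranker"
}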
Performance
There are three major scaling dimensions:
The number of hits that are re-ranked (rerank-count). Complexity is linear with the number of hits that are re-ranked.
The size of the Transformer model used.
The input sequence length. Transformer models scale quadratically with the input sequence length.
For models larger than 30-40M parameters, we recommend using GPU to accelerate inference.
Quantization of model weights can drastically improve serving efficiency on CPU. See
Optimum Quantization.
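For illustration, a minimal sketch of dynamic int8 quantization using Optimum's Python API; the directory paths and the avx512_vnni configuration are assumptions that depend on where you exported the model and on your CPU:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model directory (contains model.onnx)
quantizer = ORTQuantizer.from_pretrained("models")

# Dynamic (weight-only) int8 quantization; pick the config matching your CPU
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.quantize(save_dir="models-quantized", quantization_config=qconfig)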
Examples
The MS Marco
sample application demonstrates using cross-encoders.
Using cross-encoders with multi-vector indexing
When using multi-vector indexing,
we can feed the tokens of the best (closest) paragraph, selected with the
closest() feature, into re-ranking with the cross-encoder model.
schema my_document {
    document my_document {
        field paragraphs type array<string> {..}
    }
    field tokens type tensor<float>(p{}, d0[512]) {
        indexing: input paragraphs | embed tokenizer | attribute
    }
    field embedding type tensor<float>(p{}, x[768]) {
        indexing: input paragraphs | embed embedder | attribute
    }
}
Notice that both tensor fields use the same mapped dimension name p.
The best_input function uses a tensor join between the closest(embedding) tensor and the tokens tensor,
which returns the tokens of the best-matching (closest) paragraph.
This tensor is used in the other Transformer-related functions
(tokenTypeIds, tokenAttentionMask, tokenInputIds) as the document tokens, as sketched below.
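A sketch of how a rank profile could wire this together; the profile name is illustrative, and it assumes retrieval uses nearestNeighbor(embedding, q), since closest(embedding) is only defined when the embedding field is used in the nearest neighbor search:

rank-profile max-paragraph-into-cross-encoder inherits default {
    inputs {
        query(q) tensor<float>(x[768])
        query(q_tokens) tensor<float>(d0[32])
    }
    onnx-model cross_encoder {
        file: models/model.onnx
        input input_ids: my_input_ids
        input attention_mask: my_attention_mask
        input token_type_ids: my_token_type_ids
    }
    first-phase {
        expression: closeness(field, embedding)
    }
    # closest(embedding) is 1.0 at the closest paragraph label and empty elsewhere;
    # joining it with tokens and max-reducing over p keeps only that paragraph's tokens
    function best_input() {
        expression: reduce(closest(embedding) * attribute(tokens), max, p)
    }
    function my_input_ids() {
        expression: tokenInputIds(256, query(q_tokens), best_input)
    }
    function my_token_type_ids() {
        expression: tokenTypeIds(256, query(q_tokens), best_input)
    }
    function my_attention_mask() {
        expression: tokenAttentionMask(256, query(q_tokens), best_input)
    }
    global-phase {
        rerank-count: 25
        expression: onnx(cross_encoder){d0:0,d1:0}
    }
}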