This document describes how to represent text embedding tensors in Vespa and how to build a scalable real-time semantic search engine using Vespa's Approximate Nearest Neighbor Search operator to search the embedding space generated by Google's Multilingual Universal Sentence Encoder.
The introduction of Transformer NLP models like BERT has led to significant advances in the state of the art for multiple tasks, including question answering, classification and ad-hoc document ranking. Transformer models give the best accuracy for ranking and question answering tasks when used as an interaction model with cross-attention between the question and the document. However, running online inference with query-document cross-attention models over large document collections is computationally prohibitive. This has led to increased interest in multi-stage retrieval and ranking architectures, where the first stage retrieves candidate documents using a cost-efficient scoring function and the expensive cross-attention model inference is limited to the top ranking documents from the first stage.
In ReQA: An Evaluation for End-to-End Answer Retrieval Models, Ahmad et al. introduce Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence level answer retrieval models. There, they establish a baseline for both traditional information retrieval (sparse term based) and neural (dense) encoding models on the Stanford Question Answering Dataset (SQuAD) v1.1 dataset.
In this document, we reproduce the work done by Ahmad et al. on the SQuAD 1.1 retrieval task using the Vespa serving engine. We replicate the results from the mentioned paper, which enables organizations to deploy state-of-the-art question answering retrieval systems with low effort using the scalable Vespa engine.
Vespa has support for storing and indexing dense tensor field types alongside traditional string fields, with support for sparse term based text ranking features like bm25 or Vespa's nativeRank. Having both traditional text ranking features and semantic similarity features expressed in the same engine is a powerful capability of Vespa, as it enables hybrid retrieval using both sparse and dense representations.
The work described in this document can be reproduced using the semantic-qa-retrieval sample application.
The Universal Sentence Encoder encodes text into a fixed-length dense embedding space that can be used for a broad range of tasks such as semantic similarity, semantic retrieval and other natural language processing (NLP) tasks. Google has released several sentence encoder models with different goals, and following the work of Ahmad et al. we use the Multilingual Universal Sentence Encoder for Question-Answer Retrieval.
The Universal Sentence Encoder for Question-Answer Retrieval enables us to process questions and candidate answer sentences independently and map the high dimensional sparse text representation to a relatively low dimensional dense tensor representation, where we can use Vespa's approximate nearest neighbor search operator to retrieve documents efficiently.
We can store and index the dense tensor embedding in Vespa using tensor fields and use Vespa's approximate nearest neighbor search operator to retrieve documents.
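To make the mechanics concrete, exact nearest neighbor search in embedding space amounts to ranking all stored vectors by their distance to the query vector. Below is a minimal brute-force sketch, using toy 4-dimensional vectors in place of the 512-dimensional encoder output; Vespa's approximate variant achieves the same goal without scanning every vector:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two equally sized vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, documents, k=2):
    # Brute-force exact search: score every document vector and keep the k
    # closest. Vespa's HNSW index approximates this without a full scan.
    scored = sorted(documents.items(), key=lambda kv: euclidean_distance(query, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 4-dimensional embeddings standing in for the 512-dimensional encoder output
sentence_embeddings = {
    "sentence-1": [0.9, 0.1, 0.0, 0.2],
    "sentence-2": [0.1, 0.8, 0.7, 0.0],
    "sentence-3": [0.8, 0.2, 0.1, 0.1],
}
query_embedding = [0.88, 0.12, 0.02, 0.18]

print(nearest_neighbors(query_embedding, sentence_embeddings))
```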
Papers and resources on Google's Universal Sentence Encoder:
A similar dual encoder architecture (question, document) is described in Dense Passage Retrieval for Open-Domain Question Answering by Facebook Research, where they demonstrate how a trained dense representation using a dual question encoder based on BERT outperforms traditional IR retrieval. Quote: "When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."
The SQuAD: 100,000+ Questions for Machine Comprehension of Text paper introduced the SQuAD dataset which is available for download at SQuAD-explorer.
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. In our experiments we use the train v1.1 dataset.
Sample questions and answers for a given paragraph context taken from a snapshot of the University_of_Notre_Dame Wikipedia page:
The answer_start field represents the character offset in the paragraph context where the answer to the question can be found. The SQuAD v1.1 train dataset consists of 87,599 questions and 18,896 paragraphs. The paragraphs can further be segmented into 91,729 sentences using a sentence tokenizer.
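The nested structure of the SQuAD JSON file, and the role of answer_start, can be illustrated with a hand-made miniature example (the question/answer pair below is our own, constructed only to mirror the file's structure):

```python
# Miniature example mirroring the structure of the SQuAD train-v1.1.json file:
# articles contain paragraphs, each paragraph has a context string and a list
# of question/answer pairs, and answer_start is a character offset into context.
squad = {
    "data": [
        {
            "title": "University_of_Notre_Dame",
            "paragraphs": [
                {
                    "context": "Architecturally, the school has a Catholic character.",
                    "qas": [
                        {
                            "question": "What character does the school have?",
                            "answers": [{"text": "Catholic", "answer_start": 34}],
                        }
                    ],
                }
            ],
        }
    ]
}

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                # The offset recovers the answer span from the context
                assert context[start:start + len(answer["text"])] == answer["text"]
```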
We model the SQuAD dataset in Vespa using two different document schema types: a context document type and a sentence document type.
Context document type:
schema context {
    document context {
        field context_id type int {
            indexing: summary | attribute
        }
        field text type string {
            indexing: summary | index
            index: enable-bm25
        }
    }
}
Sentence document type:
schema sentence {
    document sentence inherits context {
        field sentence_embedding type tensor<float>(x[512]) {
            indexing: attribute
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 500
                }
            }
        }
    }
}
See Approximate Nearest Neighbor Search using HNSW Index for details on the HNSW index settings and distance-metric. In this case, we use the euclidean distance metric.
In order to feed the SQuAD data, we need to convert it into our Vespa document schema and feed documents using the Vespa json format.
For each paragraph context we run a simple sentence tokenizer, published by Ahmad et al., to extract sentences from the paragraph context. We assign a unique id to each sentence, and likewise to each context. Sentence sample extracted from the above example paragraph:
We can feed the generated document set to our Vespa instance using any of the feed APIs, here using vespa feed. After this step, we have 18,896 context documents and 91,729 sentence documents stored in the same Vespa content cluster.
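A sketch of how a sentence document can be serialized to the Vespa JSON feed format; the document id namespace ("squad") and the placeholder embedding are our own illustrative choices, not mandated by Vespa:

```python
import json

def make_sentence_document(sentence_id, context_id, text, embedding):
    # A Vespa JSON feed operation: a "put" with a document id and the fields
    # declared in the sentence schema (which inherits context_id and text).
    return {
        "put": "id:squad:sentence::%d" % sentence_id,
        "fields": {
            "context_id": context_id,
            "text": text,
            "sentence_embedding": {"values": embedding},
        },
    }

doc = make_sentence_document(
    sentence_id=0,
    context_id=0,
    text="Architecturally, the school has a Catholic character.",
    embedding=[0.12] * 512,  # placeholder for the 512-dim encoder output
)
print(json.dumps(doc)[:60])
```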
The goal of the ReQA task is to retrieve sentences which have the answer for any given question. We can also compute context or paragraph level retrieval using the sentence level semantic similarity by aggregating over the sentence level scores. We can do this efficiently using the Vespa grouping API if we want to retrieve paragraphs instead of sentences.
We use the Vespa Query API to express our query request logic, with the YQL query language expressing the retrieval logic.
Using the sample question from the example, the POST HTTP query request is:
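A sketch of what such a request body can look like, built with the nearestNeighbor query operator; the query tensor name, the targetHits annotation value and the parameter keys below are illustrative, and the exact syntax may differ between Vespa versions:

```python
import json

# Output of the question encoder for the sample question (placeholder values)
question_embedding = [0.1] * 512

# Illustrative request body: the YQL selects sentences by approximate nearest
# neighbor search over the sentence_embedding field, and the encoded question
# is passed along as a query tensor. "query_embedding" is our own chosen name.
request_body = {
    "yql": "select * from sources sentence where "
           "{targetHits: 100}nearestNeighbor(sentence_embedding, query_embedding)",
    "hits": 10,
    "ranking.profile": "sentence-semantic-similarity",
    "input.query(query_embedding)": question_embedding,
}
print(json.dumps(request_body)[:60])
```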
The rank profile is defined in the sentence document schema:
rank-profile sentence-semantic-similarity inherits default {
    first-phase {
        expression: closeness(sentence_embedding)
    }
}
Here, closeness is a Vespa rank feature defined as 1/(1 + distance).
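A tiny worked example of this mapping from distance to rank score:

```python
def closeness(distance):
    # Vespa's closeness rank feature: 1 / (1 + distance), mapping smaller
    # distances to larger scores in (0, 1].
    return 1.0 / (1.0 + distance)

# Identical vectors (distance 0) score 1.0; larger distances approach 0
print(closeness(0.0))  # 1.0
print(closeness(1.0))  # 0.5
print(closeness(9.0))  # 0.1
```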
For paragraph level retrieval we use Vespa's grouping feature to retrieve paragraphs instead of sentences. As in the paper, we use the max sentence score in the paragraph to represent the paragraph level score. The query above is changed to add a grouping specification:
The grouping expression groups sentences by their context id and orders the groups (paragraphs) by the maximum rank score. Within each unique context id, we get the top ranking sentences ordered by their rank score, as assigned by the chosen rank profile.
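The paragraph level aggregation that this grouping expression performs server-side can be sketched client-side as follows, with hypothetical (context_id, relevance) pairs standing in for sentence hits:

```python
from collections import defaultdict

def paragraph_scores(sentence_hits):
    # sentence_hits: (context_id, relevance) pairs from sentence retrieval.
    # Paragraph score = max sentence score within the paragraph, as in the
    # paper; Vespa's grouping language computes this server-side.
    best = defaultdict(float)
    for context_id, score in sentence_hits:
        best[context_id] = max(best[context_id], score)
    # Order paragraphs by their best sentence score, descending
    return sorted(best.items(), key=lambda kv: -kv[1])

hits = [(7, 0.91), (7, 0.40), (3, 0.75), (3, 0.80), (12, 0.10)]
print(paragraph_scores(hits))  # [(7, 0.91), (3, 0.80), (12, 0.10)]
```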
We can also retrieve using a hybrid combination consisting of dense retrieval and regular query term matching:
We use logical disjunction (OR) to combine the nearest neighbor query operator, retrieving in the dense embedding space, with regular term based (sparse) retrieval. The ranking uses a simple linear combination of the bm25 score over the text and the previously described closeness rank feature:
rank-profile bm25-sentence-semantic-similarity inherits default {
    first-phase {
        expression: bm25(text) + closeness(sentence_embedding)
    }
}
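The effect of this unweighted linear combination can be sketched with placeholder scores; the bm25 values below are made up for illustration:

```python
def hybrid_score(bm25_score, distance):
    # Mirrors the first-phase expression bm25(text) + closeness(sentence_embedding):
    # an unweighted sum of the sparse (term based) and dense (semantic) signals.
    return bm25_score + 1.0 / (1.0 + distance)

# A document matched only by the dense retriever still gets a closeness score;
# one also matched on query terms gets the sum of both signals.
print(hybrid_score(bm25_score=0.0, distance=1.0))
print(hybrid_score(bm25_score=2.4, distance=1.0))
```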