Semantic Retrieval for Question Answering Applications

This document describes how to represent text embedding tensors in Vespa and how to build a scalable, real-time semantic search engine using Vespa's approximate nearest neighbor search operator to search the embedding space generated by Google's Multilingual Universal Sentence Encoder.

The introduction of Transformer NLP models like BERT has led to significant advances in the state of the art for many tasks, including question answering, classification, and ad-hoc document ranking. Transformer models give the best accuracy for ranking and question answering when used as an interaction model with cross-attention between the question and the document. However, running online inference with query-document cross-attention models over large document collections is computationally prohibitive. This cost has led to increased interest in multi-stage retrieval and ranking architectures, where the first stage retrieves candidate documents using a cheaper scoring function and the expensive cross-attention model inference is limited to the top-ranking documents from the first stage.

In ReQA: An Evaluation for End-to-End Answer Retrieval Models, Ahmad et al. introduce Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence-level answer retrieval models, and establish baselines for both traditional information retrieval (sparse, term-based) and neural (dense) encoding models on the Stanford Question Answering Dataset (SQuAD) v1.1. In this document we reproduce the work of Ahmad et al. on the SQuAD 1.1 retrieval task using the Vespa serving engine. We replicate the results from the paper, which enables organizations to deploy state-of-the-art question answering retrieval systems with low effort using the scalable Vespa engine.

Vespa supports storing and indexing dense tensor field types alongside traditional string fields, with support for sparse, term-based text ranking features like bm25 and Vespa's nativeRank. Having both traditional text ranking features and semantic similarity features expressed in the same engine is a powerful capability, as it enables hybrid retrieval using both sparse and dense representations.

The work described in this document can be reproduced using the semantic-qa-retrieval sample application.

About Google's Universal Sentence Encoder

The Universal Sentence Encoder encodes text into a fixed-length dense embedding that can be used for a broad range of tasks such as semantic similarity, semantic retrieval, and other natural language processing (NLP) tasks. Google has released several sentence encoder models with different goals, and following the work of Ahmad et al. we use the Multilingual Universal Sentence Encoder for Question-Answer Retrieval. It lets us process questions and candidate answer sentences independently, mapping the high-dimensional sparse text representation to a relatively low-dimensional dense tensor representation in which Vespa's approximate nearest neighbor search operator can retrieve documents efficiently.

  • Question text is encoded using the question encoder, which takes the question text as input and outputs a 512-dimensional dense tensor.
  • Each sentence is encoded using the response encoder, which takes the sentence and its surrounding context (e.g. the paragraph) as input and outputs a 512-dimensional dense tensor.
We store and index these dense tensor embeddings in Vespa using tensor fields and retrieve them with Vespa's approximate nearest neighbor search operator.
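The retrieval principle can be illustrated with a small, pure-Python sketch: candidate sentences are ranked by euclidean distance to the question in embedding space. The 4-dimensional vectors and names below are made up stand-ins for the real 512-dimensional embeddings.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 4-dimensional stand-ins for the real 512-dimensional embeddings.
question_embedding = [0.1, 0.3, -0.2, 0.4]
sentences = {
    "sentence-1": [0.1, 0.25, -0.15, 0.35],   # close to the question
    "sentence-2": [-0.5, 0.9, 0.7, -0.8],     # far from the question
}

# Rank candidate sentences by distance to the question (smaller is better).
ranked = sorted(sentences,
                key=lambda s: euclidean_distance(question_embedding, sentences[s]))
```

At scale, an exhaustive scan like this is too slow, which is why Vespa's HNSW index performs the search approximately.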

Image Courtesy https://tfhub.dev/google/universal-sentence-encoder/2

Papers and resources on Google's Universal Sentence Encoder:

A similar dual encoder architecture (question, document) is described in Dense Passage Retrieval for Open-Domain Question Answering by Facebook Research, which demonstrates how a trained dense representation using a BERT-based dual encoder outperforms traditional IR retrieval. Quote: When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.

About the SQuAD dataset

The SQuAD: 100,000+ Questions for Machine Comprehension of Text paper introduced the SQuAD dataset which is available for download here. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. In our experiments we use the train v1.1 dataset.

Sample questions and answers for a given paragraph context, taken from a snapshot of the University_of_Notre_Dame Wikipedia page, are shown below:

{
  "data": [
    {
      "title": "University_of_Notre_Dame",
      "paragraphs": [
        {
          "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary.
           Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\".
           Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection.
           It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.
           At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
          "qas": [
            {
              "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
              "answers": [
                {
                  "answer_start": 515,
                  "text": "Saint Bernadette Soubirous"
                }
              ],
              "id": "5733be284776f41900661182"
            },
            {
              "question": "What is in front of the Notre Dame Main Building?",
              "answers": [
                {
                  "answer_start": 188,
                  "text": "a copper statue of Christ"
                }
              ],
              "id": "5733be284776f4190066117f"
            }
          ]
        }
      ]
    }
  ]
}

The answer_start field represents the character offset where the answer to the question can be found in the context. The SQuAD v1.1 train dataset consists of 87,599 questions and 18,896 paragraphs. The paragraphs can be further segmented into 91,729 sentences using a sentence tokenizer.
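How the answer_start offset relates to sentence segmentation can be sketched in plain Python. The context, offsets, and naive tokenizer below are made up for illustration, not taken from SQuAD or from the tokenizer used in the paper.

```python
# Toy illustration of how answer_start offsets and sentence segmentation
# relate (the context and offsets here are made up, not from SQuAD).
context = "The dome is golden. A copper statue stands in front. The Grotto lies behind."
answer_start = 22
answer_text = "copper statue"

# answer_start is a character offset into the context string.
assert context[answer_start:answer_start + len(answer_text)] == answer_text

# A naive sentence tokenizer: split on '. ' (real tokenizers are more careful).
sentences = [s.strip() for s in context.split(". ")]

# Find the sentence containing the answer by tracking character offsets.
offset = 0
containing = None
for s in sentences:
    start = context.find(s, offset)
    if start <= answer_start < start + len(s):
        containing = s
        break
    offset = start + len(s)
```

This offset bookkeeping is what lets the retrieval task label a sentence as containing the answer.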

SQuAD Data modelling with Vespa

We model the SQuAD dataset in Vespa with two document schema types: a context document type and a sentence document type:

Context document type:
schema context {
  document context {
    field context_id type int {
      indexing: summary | attribute 
    }
    field text type string {
      indexing: summary | index
      index: enable-bm25
    }
  }
}
Sentence document type:
schema sentence {
  document sentence inherits context {
    field sentence_embedding type tensor<float>(x[512]) {
      indexing: attribute
       attribute {
        distance-metric: euclidean 
      }
      index {
        hnsw {
          max-links-per-node: 16 
          neighbors-to-explore-at-insert: 500
        }
      }
    }
  }
}

See Approximate Nearest Neighbor Search using HNSW Index for details on the HNSW index settings and distance-metric. In this case, we use the euclidean distance metric.

Converting the SQuAD json to Vespa json feed format

In order to feed the SQuAD data we need to convert it into our Vespa document schemas and feed documents using the Vespa JSON format.

For each paragraph context we run a simple sentence tokenizer published by Ahmad et al. to extract sentences from the paragraph. We assign each sentence a unique id, and likewise each context. Below is a sample of one sentence extracted from the example paragraph above:

{
  "put": "id:squad:sentence::5",
  "fields": {
    "context_id": 0,
    "text": "Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\".",
    "sentence_embedding": {
      "values": [
        -0.0528511106967926,
        0.00927420798689127,
        ......
        0.011870068497955799,
        -0.06848619878292084
      ]
    }
  }
}
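A minimal conversion sketch, assuming the SQuAD JSON structure shown earlier. The sentence_tokenize and encode helpers are placeholders supplied by the caller (the tokenizer from Ahmad et al. and the sentence encoder, respectively); they are not part of any Vespa API.

```python
def to_vespa_feed(squad, sentence_tokenize, encode):
    """Convert a parsed SQuAD dict into a list of Vespa put operations.

    sentence_tokenize and encode are caller-supplied placeholders:
    a sentence tokenizer and the response-side sentence encoder.
    """
    puts = []
    context_id = 0
    sentence_id = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            # One context document per paragraph.
            puts.append({
                "put": "id:squad:context::%d" % context_id,
                "fields": {"context_id": context_id, "text": paragraph["context"]},
            })
            # One sentence document per extracted sentence.
            for sentence in sentence_tokenize(paragraph["context"]):
                puts.append({
                    "put": "id:squad:sentence::%d" % sentence_id,
                    "fields": {
                        "context_id": context_id,
                        "text": sentence,
                        "sentence_embedding": {"values": encode(sentence)},
                    },
                })
                sentence_id += 1
            context_id += 1
    return puts
```

The resulting list can be serialized to the Vespa JSON feed format and fed with any of the feed clients.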

We can feed the generated document set to our Vespa instance using any of the feed APIs; here we use the Vespa HTTP client. After this step we have 18,896 context documents in one content db and 91,729 sentence documents in another, both in the same Vespa content cluster.

Sentence and Paragraph Retrieval

The goal of the ReQA task is to retrieve sentences containing the answer to a given question. We can also perform context (paragraph) level retrieval by aggregating the sentence-level semantic similarity scores; this can be done efficiently with the Vespa grouping API if we want to retrieve paragraphs instead of sentences.

We use the Vespa Search API to express our search request, with the YQL query language expressing the retrieval logic. Using the sample question from the example, the POST search request becomes:

{
  'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding));',
  'hits': 100,
  'ranking.features.query(query_embedding)': [-0.0466, ...,0.064],
  'ranking.profile': 'sentence-semantic-similarity' 
}
  • We use the nearestNeighbor search operator to retrieve the 100 closest sentences in embedding space, using euclidean distance as configured in the tensor's HNSW settings.
  • The 100 nearest sentences are ranked by the rank profile named in the ranking.profile parameter.
  • The dense tensor representation produced by the question encoder is passed in the ranking.features.query(query_embedding) parameter.
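The request body above can be assembled from Python; a minimal sketch, where the builder function name and the localhost endpoint are assumptions, not part of any Vespa client library:

```python
def build_search_request(query_embedding, hits=100,
                         ranking_profile="sentence-semantic-similarity"):
    """Build the POST body for the Vespa search request shown above."""
    yql = ('select * from sources sentence where '
           '([{"targetNumHits":100}]nearestNeighbor('
           'sentence_embedding,query_embedding));')
    return {
        "yql": yql,
        "hits": hits,
        "ranking.features.query(query_embedding)": query_embedding,
        "ranking.profile": ranking_profile,
    }

# Sending it (assumes a Vespa instance listening on http://localhost:8080):
# import json, urllib.request
# body = json.dumps(build_search_request(question_embedding)).encode()
# request = urllib.request.Request("http://localhost:8080/search/", data=body,
#                                  headers={"Content-Type": "application/json"})
# response = urllib.request.urlopen(request)
```

In practice, question_embedding would be the 512-dimensional output of the question encoder.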
The ranking profile is defined in the sentence document schema:
rank-profile sentence-semantic-similarity inherits default {
  first-phase {
    expression: closeness(sentence_embedding) 
  }
}

Here closeness is a Vespa rank feature defined as 1/(1 + distance). For paragraph-level retrieval we use Vespa's grouping feature to retrieve paragraphs instead of sentences. As in the paper, we use the maximum sentence score in a paragraph as the paragraph-level score. The query above is changed by adding a grouping specification:
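The relationship between distance and the closeness feature can be sketched numerically; the 1/(1 + distance) formula is Vespa's, while the helper names below are made up:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closeness(query, document):
    """Vespa's closeness rank feature: 1 / (1 + distance)."""
    return 1.0 / (1.0 + euclidean_distance(query, document))

# An identical vector has distance 0 and hence the maximum closeness of 1.0;
# closeness decreases monotonically as distance grows.
```

This transform turns a distance (smaller is better) into a score (larger is better), as the first-phase ranking expression requires.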

{
  'yql': 'select * from sources sentence where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding))| \\
    all(group(context_id) max(100) order(-max(relevance())) each( max(2) each(output(summary())) as(sentences)) as(paragraphs));',
  'hits': 0,
  'ranking.features.query(query_embedding)': [-0.0466, ...,0.064],
  'ranking.profile': 'sentence-semantic-similarity' 
}

The grouping expression groups sentences by context id and orders the groups (paragraphs) by their maximum rank score. For each unique context id we get the two top-ranking sentences (max(2)), ordered by the rank score assigned by the chosen ranking profile.
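The paragraph-level scoring that the grouping expression computes can be mimicked in plain Python. The sentence hits below are hypothetical (context_id, relevance) pairs:

```python
from collections import defaultdict

# Hypothetical sentence-level hits: (context_id, relevance score).
hits = [(0, 0.91), (0, 0.40), (1, 0.75), (2, 0.88), (1, 0.30)]

# Group sentences by context_id, then score each paragraph by its best
# sentence, mirroring order(-max(relevance())) in the grouping expression.
by_context = defaultdict(list)
for context_id, score in hits:
    by_context[context_id].append(score)

paragraphs = sorted(by_context, key=lambda c: max(by_context[c]), reverse=True)
```

Vespa performs this aggregation inside the content nodes, so only the grouped result is returned to the client.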

Hybrid retrieval using both dense (encoding) and sparse (term) representation

We can also retrieve using a hybrid combination consisting of dense retrieval and regular query term matching:

{
  'yql': 'select * from sources sentence  where ([{"targetNumHits":100}]nearestNeighbor(sentence_embedding,query_embedding)) or userQuery();',
  'query': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
  'type': 'any',
  'hits': 100,
  'ranking.features.query(query_embedding)': [-0.0466, ...,0.064],
  'ranking.profile': 'bm25-sentence-semantic-similarity' 
}

We use logical disjunction to combine the nearest neighbor query operator retrieving in dense embedding space with the regular term based (sparse) retrieval. We use a simple linear combination of the bm25 score on text and the previously described closeness ranking feature:

rank-profile bm25-sentence-semantic-similarity inherits default {
  first-phase {
    expression: bm25(text) + closeness(sentence_embedding) 
  }
}
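The linear combination in this rank profile can be sketched numerically; the candidate names and feature values below are made up:

```python
def hybrid_score(bm25_score, closeness_score):
    """First-phase expression: bm25(text) + closeness(sentence_embedding)."""
    return bm25_score + closeness_score

# Hypothetical candidates with (bm25, closeness) feature values.
candidates = {
    "dense-only match": (0.0, 0.82),   # retrieved only via nearestNeighbor
    "term + dense match": (7.1, 0.78), # also matches query terms
}
ranked = sorted(candidates, key=lambda s: hybrid_score(*candidates[s]),
                reverse=True)
```

A sentence retrieved only through the dense representation still scores via closeness, while strong term matches add bm25 on top.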