Nearest Neighbor Search

Searching for the nearest neighbors of a data point in high dimensional vector space is an important problem for many real time applications. For example, in Computer Vision, searching for close data points in high dimensional vector space enables finding the most similar cats in large image datasets. In Information Retrieval, large pre-trained natural language Transformer models like BERT, allows representing text sentences in dense embedding space, where nearest neighbor search could serve as an effective multilingual neural retrieval function.

In many of these real world applications of nearest neighbor search, the search is constrained by real time query filters applied over the data points metadata. For example, in E-Commerce search applications with constantly evolving metadata, a search for nearest products for a query in vector space would typically be constrained by inventory status and price.

Expressing the search for nearest neighbors as a query operator allows combining the search for nearest neighbors with traditional query filters using boolean query operators.

The following sections demonstrates how to use the nearestNeighbor query operator in Vespa. The examples uses the brute force variant which has perfect accuracy but which is computationally expensive for large document volumes as the distance needs to be calculated for every document which matches the boolean query filters in the query. See how to speed up evaluation of nearest neighbor search by adding a HNSW index in the Approximate Nearest Neighbor Search.

We demonstrate how to use the nearestNeighbor query operator using a E-Commerce search use case:

  • Related products by product image similarity using SIFT image encodings
  • Related products using text (title) embeddings similarity from a Transformer model (BERT)
  • Related products using a hybrid combination of the above
  • Hybrid Search using traditional IR term based matching in combination with neural vector space retrieval
We define the following product document schema:

schema product {

  document product {
   
    field title type string {
      indexing: summary | index
    }

    field in_stock type bool {
      indexing: summary | attribute 
    }

    field price type float {
      indexing: summary | attribute 
    }

    field popularity type float {
      indexing: summary | attribute 
    }

    field image_sift_encoding type tensor<float>(x[128]) {
      indexing: summary | attribute
      attribute {
        distance-metric: euclidean
      }
    }

    field title_bert_embedding type tensor<float>(x[768]) {
      indexing: summary | attribute
      attribute {
        distance-metric: angular
      }
    }
  }
}

The product document type is made up of six fields. Fields like title, in_stock, price and popularity are self-explanatory but the tensor type fields needs more explanation.

The image_sift_encoding field is used to store the scale-invariant feature transform (SIFT) encoding of a product catalog image. The tensor definition specifies that it's indexed (dense) and has exactly 128 dimensions and uses float value precision. When searching for nearest neighbors in the SIFT vector space we use euclidean distance as our distance metric.

The title_bert_embedding field is used to store the sentence-BERT embedding which has 768 dimensions and we use angular as our distance metric. There are different approaches for using embeddings from large transformer natural language models like BERT for search, but that is out of scope for this document. Please see the referenced paper for details on how to use BERT as a representation model where query and document text can be encoded independently.

Vespa supports 4 distance metrics to use when searching for nearest neighbors:

  • euclidean
  • angular
  • innerproduct
  • geodegrees
See distance-metric reference documentation for details.

We also need to define the types of the query tensors in the application package, see query feature types.

Lastly, we need to configure how to rank our products. We start easy and define a rank profile which rank products by how similar the given query image is in the SIFT vector space:

  rank-profile image-similarity inherits default {
    first-phase {
      expression: closeness(field, image_sift_encoding) 
    }
  }

We configured the tensor distance-metric as euclidean, and the closeness(field, image_sift_encoding) Vespa rank feature which returns a value in the range [0, 1], where 0 is infinite distance, and 1 is zero distance.

This is convenient because Vespa sorts documents by decreasing relevancy score and we want the closest documents to be ranked highest. This relevance score is decided by an expression configured in the ranking profile, see the ranking documentation.

The first-phase is part of Vespa's phased ranking support. Phased ranking or multi-stage document ranking enables re-ranking of the top k subset of the hits which are retrieved from the previous stage. Highly efficient query operators like nearestNeighbor or wand are effectively zero-phase retrieval as they are both top-k heap based and throws away hits which are not making it into the top-K set. This pruning of low ranking hits in the zero-phase retrieval reduces the number of hits which gets fully ranked using the first-phase function and hence reduces latency and cost.

Indexing product data

Once we have deployed the document schema, one can index the data using the json feed format. In the example below we feed two documents using the indexed tensors short form. Documents feed becomes visible in search immediately (real time indexing).

[
  {
    "put": "id:shopping:product::998211",
    "fields": {
      "title": "Linen Blend Wide Leg Trousers",
      "in_stock": true,
      "price": 29.95,
      "popularity": 0.342,
      "image_sift_encoding": {
        "values": [
          0.9147281579191466,
          0.5453696694173961,
          0.7529545687063771,
          ..
         ]
      },
       "title_bert_embedding": {
        "values": [
          0.16766378547490635,
          0.3737005826272204,
          0.040492891373747675,
          ..
         ]
        }
      }
  },
  {
    "put": "id:shopping:product::97711",
    "fields": {
      "title": "Loose-fit White Pens",
      "in_stock": false,
      "price": 99.99,
      "popularity": 0.538,
      "image_sift_encoding": {
        "values": [
          0.9785931815169806,
          0.5697209315543527,
          0.5352198004501647,
          ..
        ]
      },
      "title_bert_embedding": {
        "values": [
          0.03515862084651311,
          0.24585168798559187,
          0.6123057708571111,
          ..
        ]
      }
    }
  }
]

The above json data can be fed to Vespa using any of the Vespa feeding api's. Vespa only supports real time indexing so there is no explicit batch support and documents become searchable as they are fed to the system with low visibility delay. Tensor fields can be partially updated as regular fields (without having to re-index the entire document).

Querying product data

Finally, we are able to start querying using the nearestNeighbor query operator. This operator expects two arguments; the document tensor field we want to search and the query tensor we want to find neighbors for. How many neighbors we want is controlled by the targetHits annotation which is a required parameter. The query tensor is sent as a query ranking feature and the query tensor name is referenced in second argument of the nearestNeighbor operator.

Related Products using Image Similarity

In this example we use the nearestNeighbor query operator to recommend similar products based on image similarity. For a given image (e.g an product image shown in the product search result page) we can find products which have a similar product image. We can do this by fetching the SIFT encoding of the image we want to get similar images for and use this tensor as input to the nearestNeighbor query operator and search over the collection of SIFT encodings.

We are also not interested in the corpus wide global nearest neighbors but only products where in_stock is true. We express the query logic using the YQL query language and our query api request becomes:

{
  "yql": "select title, price from sources products where ([{\"targetHits\": 10}]nearestNeighbor(image_sift_encoding,query_vector_sift)) and in_stock = true;",
  "ranking.features.query(query_vector_sift)": [
    0.22507139604882176,
    0.11696498718517367,
    0.9418422036734729,
    ...
  ],
  "ranking.profile": "image-similarity", 
  "hits": 10 
}

The YQL express our query logic using logical conjunction and so that our nearest neighbor search is restricted to documents where in_stock is true. The targetHits is the desired number of close neighbors we want. The total number of hits which is ranked by the ranking profile depends on the query filters and how fast the nearest neighbor search algorithm converges. The operator might expose more than 10 hits to the first-phase ranking expression. Since we are filtering on in_stock one might also get 0 results if the entire product catalog is out of stock. The ranking.profile parameter chooses the rank-profile we described earlier which simply ranks documents based on how similar they are in the SIFT vector encoding space. Finally, the hits parameter controls how many hits are returned in the response. If the documents are partitioned over multiple content nodes the targetHits is per node, so with e.g 10 nodes we get at least 100 nearest neighbors fully evaluated using the first-phase ranking function. In our example the first-phase expression is simply the closeness score as calculated by the nearestNeighbor operator and hence the hits are not re-ordered but assigned the same score as calculated in the zero-phase by the nearestNeighbor query operator. We could re-order hits by changing the first-phase ranking function, and the following would re-rank the nearest neighbors which are in stock using the price:

rank-profile image-similarity-price {
  first-phase {
    expression: if(isNan(attribute(price)) == 1,0.0, attribute(price)) 
  }
}

In this example the ranking profile simply ignores the closeness score assigned by the nearestNeighbor operator and instead re-rank the most similar images by their price, ranked in decreasing order.

A more advanced example is given below where we retrieve the closest neighbors in SIFT-encoding vector space (with in stock filter) and ranks those neighbors using a combination of the euclidean distance rank and the product popularity (e.g. calculated offline based on past user behavior with the product) and finally re-ranks the top 100 hits per node using a more advanced ONNX imported Machine Learned (ML) model. These are examples of multi-stage or phased ranking.

rank-profile image-similarity-with-onnx-stage {
  first-phase {
    expression: closeness(field, image_sift_encoding)*attribute(popularity) 
  }
  second-phase {
    rerank-count: 100
    expression: onnx("image-similarity-model.onnx")
  }
 }

With the above multi-stage ranking profiles there are three stages:

  • The zero-phase nearest neighbor search which only rank documents by the closeness feature using euclidean distance
  • The first-phase which ranks the top targetHits from zero-phase
  • The second-phase which re-ranks the top 100 hits from first-phase ranking stage

Related Products using BERT Text Embedding Similarity

As an alternative to using product image similarity to find related items we look at using the BERT embedding from the product title instead. Similar to the image similarity case we assume that we have the BERT embedding tensor representation for the product we want to find related or similar items for, and can use this to search for similar products using the angular distance metric. We define our ranking function:

rank-profile title-similarity-using-bert {
  first-phase {
    expression: closeness(field, title_bert_embedding)
  }
}

Searching for 10 nearest neighbors using angular distance in the BERT embedding vector space and we are only interested in products where in_stock is true:

{
  "yql": "select title,price from sources products where ([{\"targetNumHits\": 10}]nearestNeighbor(title_bert_embedding,query_vector_bert)) and in_stock = true;",
  "ranking.features.query(query_vector_bert)": [
    0.7528874147920314,
    0.712266471788029,
    0.8813024904379954,
    ...
  ],
  "ranking.profile": "title-similarity-using-bert",
  "hits": 10
}

Related Products using hybrid BERT Text Embedding Similarity

We can also combine and retrieve products in a single query request using disjunction (OR) logical operator to retrieve products which are close in either vector space and we define our ranking function as a weighted linear combination of the closeness scores produced by two nearestNeighbor query operators:

rank-profile hybrid-similarity {
  first-phase {
    expression: query(bert_weight)*closeness(field, title_bert_embedding) + (1 - query(bert_weight))* closeness(field, image_sift_encoding)
  }
}

The search request need to pass both the BERT embedding tensor and the SIFT encoding tensor and we use logical disjunction (OR) to combine the two nearestNeighbor query operators. Our search request becomes:

{
  "yql": "select title, price from sources products where (([{\"targetNumHits\": 10}]nearestNeighbor(title_bert_embedding,query_vector_bert)) 
or ([{\"targetNumHits\": 10}]nearestNeighbor(image_sift_encoding,query_vector_sift))) and in_stock = true;",
  "hits": 10,
  "ranking.features.query(query_vector_sift)": [
    0.5158057404746615,
    0.6441591144976925,
    0.24075880767944935,
    ...
  ],
  "ranking.features.query(query_vector_bert)": [
    0.708719978302217,
    0.4800850502044147,
    0.03310770782193717,
    ... 
  ],
  "ranking.features.query(bert_weight)": 0.1,
  "ranking.profile": "hybrid-similarity"
}

We ask for 10 nearest neighbors in the title BERT embedding vector space or 10 nearest neighbors in the SIFT encoding space which in total will expose at least 20 documents to the first-phase ranking function.

Hybrid Neural Term Retrieval

In the following example we have a user input query (for example 'white trousers for woman') and we search for both products which matches the query terms and for the nearest neighbors using the neural BERT embedding space. In this case we need to produce the BERT embedding of the input text query and we use logical disjunction. This allows also retrieving documents which does not use the exact same vocabulary as in the user query. Our search request becomes:

{
  "yql": "select title, price from sources products where in_stock = true and (userQuery() or ([{\"targetNumHits\": 10}]nearestNeighbor(title_bert_embedding,query_vector_bert)));",
  "hits": 10,
  "ranking.features.query(query_vector_bert)": [
    0.10209943251234721,
    0.6693870055804121,
    0.5088424671764654,
    ...
  ],
  "ranking.profile": "hybrid-term-neural-retrieval",
  "query": "white pants for women",
  "type": "any"
}

The query will retrieve products based on the title BERT embedding similarity and term based matching. We use the following ranking profile:

  rank-profile hybrid-term-neural-retrieval {
    first-phase {
      expression: 0.5*fieldMatch(title) + 0.5*closeness(field, title_bert_embedding)
    }
  }

The retrieved hits are ranked using a linear combination of the closeness and the fieldMatch rank features. fieldMatch is a native built in Vespa text matching feature which also takes term proximity into account when computing the text ranking score.

Using Nearest Neighbor from a Searcher Component

As with all query operators in Vespa, one can also build the query tree programmatically in a custom Searcher component.

In the example below we execute the incoming query and fetch the tensor of the top ranking hit ranked by whatever rank-profile was specified in the search request. We read the image_sift_encoding tensor from the top ranking hit and use it as the input for the nearestNeighbor query operator to find related products (using the rank profile described in earlier sections).

package ai.vespa.example.searcher;

import com.yahoo.prelude.query.NearestNeighborItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;

public class FeedbackSearcher extends Searcher {
    
    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        if (result.getTotalHitCount() == 0)
            return result;
        execution.fill(result); // fetch the summary data of the hits if needed
        Hit bestHit = result.hits().get(0);
        Tensor sift_encoding = (Tensor) bestHit.getField("image_sift_encoding");
        Query relatedQuery = new Query();
        relatedQuery.getRanking().setProfile("image-similarity");
        relatedQuery.getRanking().getFeatures().put("query(query_vector_sift)", sift_encoding);
        NearestNeighborItem nn = new NearestNeighborItem("image_sift_encoding",  "query_vector_sift");
        nn.setTargetNumHits(10);
        relatedQuery.getModel().getQueryTree().setRoot(nn);
        return execution.search(relatedQuery);
    }
}