• [+] expand all

Approximate Nearest Neighbor Search using HNSW Index

For an introduction to nearest neighbor search, see nearest neighbor search documentation, for practical usage of Vespa's nearest neighbor search, see nearest neighbor search - a practical guide, and to have Vespa create vectors for you, see embedding. This document describes how to speed up searches for nearest neighbors by adding a HNSW index to the tensor field.

Vespa implements a modified version of the Hierarchical Navigable Small World (HNSW) graph algorithm paper. The implementation in Vespa supports:

  • Filtering - The search for nearest neighbors can be constrained by query filters as the nearest neighbor search in Vespa is expressed as a query operator. The nearestNeighbor query operator can be combined with other filters or query terms using the Vespa query language. See many query examples in the practical guide.

  • Multi-vector Indexing - Since Vespa 8.144 multiple vectors per document can be indexed. In this case documents are retrieved by the closest vector in each document compared to the query vector. See the Multi-vector indexing sample application for examples. For use cases and implementation details see the following blog post: Revolutionizing semantic search with multi-vector HNSW indexing in Vespa

  • Real Time Indexing - CRUD (Create, Add, Update, Remove) vectors in the index with low latency and high throughput.

  • Mutable HNSW Graph - No query or indexing overhead from searching multiple HNSW graphs. In Vespa, there is one graph per tensor field per content node. No segmented or partitioned graph where a query against a content node need to scan multiple HNSW graphs.

  • Multithreaded Indexing - The costly part when performing real time changes to the HNSW graph is distance calculations while searching the graph layers to find which links to change. These distance calculations are performed by multiple indexing threads.

The query examples in nearest neighbor search uses exact search, which has perfect accuracy. However, this is computationally expensive for large document volumes as distances are calculated for every document which matches the query filters.

To enable fast approximate matching, the tensor field definition needs an index directive. A Vespa document schema can declare multiple tensor fields with HNSW enabled.

field image_embeddings type tensor<float>(i{},x[512]) {
  indexing: summary | attribute | index
  attribute {
    distance-metric: angular
  }
  index {
    hnsw {
      max-links-per-node: 16
      neighbors-to-explore-at-insert: 100
    }
  }
}

field text_embedding type tensor<float>(x[384]) {
  indexing: summary | attribute | index
  attribute {
    distance-metric: prenormalized-angular
  }
  index {
    hnsw {
      max-links-per-node: 24
      neighbors-to-explore-at-insert: 200
    }
  }
}

In the schema snippet above, fast approximate search is enabled by building an HNSW index for the image_embeddings and the text_embedding tensor fields. image_embeddings indexes multiple vectors per document, while text_embedding indexes one vector per document.

The two vector fields use different distance-metric and HNSW index settings:

  • max-links-per-node - a higher value increases recall accuracy, but also memory usage, indexing and search cost.
  • neighbors-to-explore-at-insert - a higher value increases recall accuracy, but also indexing cost.

Choosing the value of these parameters affects both accuracy, search performance, memory usage and indexing performance. See Billion-scale vector search with Vespa - part two for a detailed description of these tradeoffs. See HNSW index reference for details on the index parameters.

Indexing throughput

Real-time indexing throughput

The HNSW settings impacts indexing throughput. Higher values of max-links-per-node and neighbors-to-explore-at-insert reduces indexing throughput. Example from Billion-scale vector search with Vespa - part two.

Memory usage

Higher value of max-links-per-node impacts memory usage, higher values means higher memory usage:

Memory footprint

Accuracy

Accuracy

Higher max-links-per-node and neighbors-to-explore-at-insert improves the quality of the graph and recall accuracy. As the search-time parameter hnsw.exploreAdditionalHits is increased, the lower combination reaches about 70% recall@10, while the higher combination reaches about 92% recall@10. The improvement in accuracy needs to be weighted against the impact on indexing performance and memory usage.

With an HNSW index enabled on the tensor field one can choose between approximate or exact (brute-force) search by using the approximate query annotation

{
  "yql": "select * from doc where {targetHits: 100, approximate:false}nearestNeighbor(image_embeddings,query_image_embedding)",
  "hits": 10
  "input.query(query_image_embedding)": [0.21,0.12,....],
  "ranking.profile": "image_similarity" 
}

By default, approximate is true when searching a tensor field with HNSW index enabled. The approximate parameter allows quantifying the accuracy loss of using approximate search. The loss can be calculated by performing an exact neighbor search using approximate:false and compare the retrieved documents with approximate:true and calculate the overlap@k metric.

Note that exact searches over a large vector volume require adjustment of the query timeout. The default query timeout is 500ms, which will be too low for an exact search over many vectors.

In addition to targetHits, there is a hnsw.exploreAdditionalHits parameter which controls how many extra nodes in the graph (in addition to targetHits) that are explored during the graph search. This parameter is used to tune accuracy quality versus query performance.

Combining approximate nearest neighbor search with filters

The nearestNeighbor query operator can be combined with other query filters using the Vespa query language and its query operators. There are two high-level strategies for combining query filters with approximate nearest neighbor search:

These strategies can be configured in a rank profile using approximate-threshold and post-filter-threshold. See Controlling the filtering behavior with approximate nearest neighbor search for more details.

Note that when using pre-filtering the following query operators are not included when evaluating the filter part of the query:

These are instead evaluated after the approximate nearest neighbors are retrieved, more like a post-filter. This might cause the search to expose fewer hits to ranking than the wanted targetHits.

Since Vespa 8.78 the pre-filter can be evaluated using multiple threads per query. This can be used to reduce query latency for larger vector datasets where the cost of evaluating the pre-filter is significant. Note that searching the HNSW index is always single-threaded per query. Multithreaded evaluation when using post-filtering has always been supported, but this is less relevant as the HNSW index search first reduces the document candidate set based on targetHits.

Nearest Neighbor Search Considerations

  • targetHits: The targetHits specifies how many hits one wants to expose to ranking per content node. Nearest neighbor search is typically used as an efficient retriever in a phased ranking pipeline. See performance sizing.

  • Pagination: Pagination uses the standard hits and offset query api parameters. There is no caching of results in between pagination requests, so a query for a higher offset will cause the search to be performed over again. This aspect is no different from sparse search not using nearest neighbor query operator.

  • Total hit count is not accurate: Technically, all vectors in the searchable index are neighbors. There is no strict boundary between a match and no match. Both exact (approximate:false) and approximate (approximate:true) usages of the nearestNeighbor query operator does not produce an accurate totalCount. This is the same behavior as with sparse dynamic pruning search algorithms like weakAnd and wand.

  • Grouping counts are not accurate: Grouping counts from grouping are not accurate when using nearestNeighbor search. This is the same behavior as with other dynamic pruning search algorithms like weakAnd and wand. See the Result diversification blog post on how grouping can be combined with nearest neighbor search.

Memory

Vespa tensor fields are in-memory data structures and so is the HNSW graph data structure. For large vector datasets the primary memory resource usage relates to the raw vector field memory usage.

Using lower tensor cell type precision can reduce memory footprint significantly, for example using bfloat16 instead of float saves close to 50% memory usage without significant accuracy loss.

Vespa tensor cell value types include:

  • int8 - 1 byte per value. Used to represent binary vectors, for example 64 bits can be represented using 8 int8 values.
  • bfloat16 - 2 bytes per value. See bfloat16 floating-point format.
  • float - 4 bytes per value. Standard float.
  • double - 8 bytes per value. Standard double.

Search latency and document volume

The HNSW greedy search algorithm is sublinear (close to log(N) where N is the number of vectors in the graph). This has interesting properties when attempting to add more nodes horizontally using flat data distribution. Even if the document volume per node is reduced by a factor of 10, the search latency is only reduced by 50%. Still, flat scaling helps scale document volume, and increasing indexing throughput as vectors are partitioned randomly over a set of nodes.

Pure vector search applications (without filtering, or re-ranking) should attempt to scale up document volume by using larger instance type and maximize the number of vectors per node. To scale with query throughput, use grouped data distribution to replicate content.

Note that strongly sublinear search is not necessarily true if the application uses nearest neighbor search for candidate retrieval in a multi-phase ranking pipeline, or combines nearest neighbor search with filters.

HNSW Operations

Changing the distance-metric for a tensor field with hnsw index requires restarting, but not re-indexing (re-feed vectors). Similar, changing the max-links-per-node and neighbors-to-explore-at-insert construction parameters requires re-starting.