For an introduction to nearest neighbor search, see nearest neighbor search documentation, for practical usage of Vespa’s nearest neighbor search, see nearest neighbor search - a practical guide, and to have Vespa create vectors for you, see embedding. This document describes how to speed up searches for nearest neighbors by adding a HNSW index to the first-order dense tensor field.
Vespa implements a modified version of the Hierarchical Navigable Small World (HNSW) graph algorithm paper. The implementation in Vespa supports:
Filtering - The search for nearest neighbors can be constrained by query filters as the nearest neighbor search in Vespa is expressed as a query operator. The nearestNeighbor query operator can be combined with other filters or query terms using the Vespa query language. See many query examples in the practical guide.
Real Time Indexing - CRUD (Create, Add, Update, Remove) vectors in the index with low latency and high throughput.
Mutable HNSW Graph - No query or indexing overhead from searching multiple HNSW graphs. In Vespa, there is one graph per tensor field per content node. No segmented or partitioned graph where a query against a content node need to scan multiple HNSW graphs.
Multithreaded Indexing - The costly part when performing real time changes to the HNSW graph is distance calculations while searching the graph layers to find which links to change. These distance calculations are performed by multiple indexing threads.
The query examples in nearest neighbor search uses exact search, which has perfect accuracy. However, this is computationally expensive for large document volumes as distances are calculated for every document which matches the query filters.
To enable fast approximate matching, the first-order dense tensor field definition
needs an index
directive. A Vespa document schema can declare multiple tensor fields with HNSW
enabled.
field image_embedding type tensor<float>(x[512]) { indexing: summary | attribute | index attribute { distance-metric: euclidean } index { hnsw { max-links-per-node: 16 neighbors-to-explore-at-insert: 200 } } } field text_embedding type tensor<float>(x[384]) { indexing: summary | attribute | index attribute { distance-metric: angular } index { hnsw { max-links-per-node: 24 neighbors-to-explore-at-insert: 200 } } }
In the schema snippet above, fast approximate search is enabled by building an HNSW
index for the
image_embedding
and the text_embedding
tensor fields.
The two vector fields use different distance-metric
and HNSW
index settings:
max-links-per-node
- a higher value increases recall accuracy, but also memory usage, indexing and search cost.neighbors-to-explore-at-insert
- a higher value increases recall accuracy, but also indexing cost.Choosing the value of these parameters affects both accuracy, search performance, memory usage and indexing performance. See Billion-scale vector search with Vespa - part two for a detailed description of these tradeoffs. See HNSW index reference for details on the index parameters.
The HNSW
settings impacts indexing throughput. Higher values of max-links-per-node
and neighbors-to-explore-at-insert
reduces indexing throughput. Example from Billion-scale vector search with Vespa - part two.
Higher value of max-links-per-node
impacts memory usage, higher values means higher memory usage:
Higher max-links-per-node
and neighbors-to-explore-at-insert
improves the quality of the graph and
recall accuracy. The lower combination reaches about 70% recall@10, while the higher combination reaches
about 92% recall@10. The improvement in accuracy needs to be weighted against the impact on indexing performance and
memory usage.
With an HNSW index enabled on the tensor field one can choose between approximate or exact (brute-force) search by using the approximate query annotation
{ "yql": "select * from doc where {targetHits: 100, approximate:false}nearestNeighbor(image_embedding,query_image_embedding)", "hits": 10 "input.query(query_image_embedding)": [0.21,0.12,....], "ranking.profile": "image_similarity" }
By default, approximate
is true when searching a tensor field with HNSW
index enabled.
The approximate
parameter allows quantifying the accuracy loss of using approximate search.
The loss can be calculated by performing an exact neighbor search using approximate:false
and
compare the retrieved documents with approximate:true
and calculate the overlap@k metric.
Note that exact searches over a large vector volume require adjustment of the query timeout. The default query timeout is 500ms, which will be too low for an exact search over many vectors.
In addition to targetHits,
there is a hnsw.exploreAdditionalHits parameter
which controls how many extra nodes in the graph (in addition to targetHits
)
that are explored during the graph search. This parameter is used to tune accuracy quality versus query performance.
The nearestNeighbor query operator can be combined with other query filters using the Vespa query language and its query operators. There are two high-level strategies for combining query filters with approximate nearest neighbor search:
These strategies can be configured in a rank profile using post-filter-threshold and approximate-threshold. See Controlling the filtering behavior with approximate nearest neighbor search for more details.
Note that when using pre-filtering
the following query operators are not included when evaluating the filter part of the query:
These are instead evaluated after the approximate nearest neighbors are retrieved, more like a post-filter
.
This might cause the search to expose fewer hits to ranking than the wanted targetHits
.
Since Vespa 8.78.45 the pre-filter
can be evaluated using
multiple threads per query.
This can be used to reduce query latency for larger vector datasets where the cost of evaluating the pre-filter
is significant.
Note that searching the HNSW
index is always single-threaded per query.
Multithreaded evaluation when using post-filtering
has always been supported,
but this is less relevant as the HNSW
index search first reduces the document candidate set based on targetHits
.
targetHits: The targetHits specifies how many hits one wants to expose to ranking per conent node. Nearest neighbor search is typically used as an efficient retriever in a phased ranking pipeline. See performance sizing.
Pagination:
Pagination uses the standard hits
and offset query api parameters.
There is no caching of results in between pagination requests,
so a query for a higher offset
will cause the search to be performed over again.
This aspect is no different from sparse search not using nearest neighbor query operator.
Total hit count is not accurate:
Technically, all vectors in the searchable index are neighbors. There is no strict boundary between a match
and no match. Both exact (approximate:false
) and approximate (approximate:true
) usages
of the nearestNeighbor query operator
does not produce an accurate totalCount
.
This is the same behavior as with sparse dynamic pruning search algorithms like
weakAnd and wand.
Grouping counts are not accurate: Grouping counts from grouping are not accurate when using nearestNeighbor search. This is the same behavior as with other dynamic pruning search algorithms like weakAnd and wand. See the Result diversification blog post on how grouping can be combined with nearest neighbor search.
Vespa tensor fields are in-memory data structures and so is the HNSW
graph data structure.
For large vector datasets the primary memory resource usage relates to the raw vector field memory usage.
Using lower tensor cell type precision can reduce memory footprint significantly, for example using bfloat16
instead of float
saves close to 50% memory usage without significant accuracy loss.
Vespa tensor cell value types include:
int8
- 1 byte per value. Used to represent binary vectors, for example 64 bits can be represented using 8 int8
values.bfloat16
- 2 bytes per value. See bfloat16 floating-point format.float
- 4 bytes per value. Standard float.double
- 8 bytes per value. Standard double.The HNSW
greedy search algorithm is sublinear (close to log(N) where N is the number of vectors in the graph).
This has interesting properties when attempting to add more
nodes horizontally using flat data distribution.
Even if the document volume per node is reduced by a factor of 10, the search latency is only reduced by 50%.
Still, flat scaling helps scale document volume, and increasing indexing throughput as vectors are partitioned
randomly over a set of nodes.
Pure vector search applications (without filtering, or re-ranking) should attempt to scale up document volume by using larger instance type and maximize the number of vectors per node. To scale with query throughput, use grouped data distribution to replicate content.
Note that strongly sublinear search is not necessarily true if the application uses nearest neighbor search for candidate retrieval in a multi-phase ranking pipeline, or combines nearest neighbor search with filters.
Changing the distance-metric
for a tensor field with hnsw
index requires restarting,
but not re-indexing (re-feed vectors). Similar, changing the max-links-per-node
and
neighbors-to-explore-at-insert
construction parameters requires re-starting.