This sample application combines two other sample applications, text-image-search and billion-scale-vector-search, to implement cost-efficient, large-scale image search over multimodal, AI-powered vector representations.
This sample app uses the LAION-5B dataset, the largest openly accessible image-text dataset in the world.
Large image-text models like ALIGN, BASIC, Turing Bletchley, FLORENCE & GLIDE have shown better and better performance compared to previous flagship models like CLIP and DALL-E. Most of them have been trained on billions of image-text pairs, and unfortunately, no datasets of this size had been openly available until now. To address this problem we present LAION-5B, a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. 2.3B contain English language, 2.2B samples are from 100+ other languages, and 1B samples have texts that do not allow a certain language assignment (e.g. names).
The LAION-5B dataset was used to train the popular Stable Diffusion text-to-image generative model.
Note the following about the LAION-5B dataset:
Be aware that this large-scale dataset is un-curated: the collected links may lead to strongly discomforting and disturbing content for a human viewer.
The released dataset does not contain the image data itself, but CLIP-encoded vector representations of the images, along with metadata like url and caption.
The app can be used to implement several use cases over the LAION dataset, or adapted to your own large-scale vector dataset: vector search by text prompt, as well as traditional text search over the caption or url fields in the LAION dataset using Vespa's standard text-matching functionality. All of this is combined using Vespa's query language, also in combination with filters.
The sample application demonstrates many Vespa primitives:
The app uses Vespa's support for bfloat16 tensors, saving 50% of storage compared to the full float representation. The app also contains several custom container components, such as the CLIPEmbeddingSearcher used in the query examples below.
The following steps demonstrate the app using a smaller subset of the LAION-5B vector dataset, suitable for experimenting with the app on a laptop.
Requirements:
Verify Docker Memory Limits:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using a Docker image:
$ vespa config set target local
Use the multi-node high availability template as inspiration for multi-node, on-premise deployments.
Pull and start the Vespa docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
Verify that the configuration service (deploy API) is ready:
$ vespa status deploy --wait 300
Download this sample application:
$ vespa clone billion-scale-image-search myapp && cd myapp
Setup:
Create a tenant on Vespa Cloud:
Go to console.vespa-cloud.com and create your tenant (unless you already have one).
Install the Vespa CLI using Homebrew:
$ brew install vespa-cli
Windows/No Homebrew? See the Vespa CLI page to download directly.
Configure the Vespa client:
$ vespa config set target cloud
$ vespa config set application vespa-team.autotest
Use the tenant name from step 1 instead of "vespa-team", and replace it in the other steps in this guide, too.
Get Vespa Cloud control plane access:
$ vespa auth login
Follow the instructions from the command to authenticate.
Clone a sample application:
$ vespa clone billion-scale-image-search myapp && cd myapp
See sample-apps for other sample apps you can clone.
Add a certificate for data plane access to the application:
$ vespa auth cert app
It is a good idea to take note of the path to the .pem files written here.
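The certificate and key pair is required for data-plane requests against the Vespa Cloud endpoint; for example, a sketch using Python (the endpoint URL and .pem paths are placeholders - use the values printed by the CLI):

import requests

response = requests.get(
    "https://myapp.vespa-team.aws-us-east-1c.dev.z.vespa-app.cloud/search/",  # placeholder endpoint
    params={"yql": "select * from image where true", "hits": 0},
    cert=("data-plane-public-cert.pem", "data-plane-private-key.pem"),  # placeholder paths
)
print(response.status_code)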
These instructions use the first split file (0000) of a total of 2314 files in the LAION2B-en split. Download the vector data file:
$ curl --http1.1 -L -o img_emb_0000.npy \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/img_emb_0000.npy
Download the metadata file:
$ curl -L -o metadata_0000.parquet \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/metadata_0000.parquet
Install Python dependencies to process the files:
$ python3 -m pip install pandas numpy requests mmh3 pyarrow
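Optionally, sanity-check the downloaded files before processing; a quick sketch (the exact metadata columns are whatever the LAION release provides):

import numpy as np
import pandas as pd

embeddings = np.load("img_emb_0000.npy")
print(embeddings.shape, embeddings.dtype)   # on the order of a million 768-dim CLIP vectors

metadata = pd.read_parquet("metadata_0000.parquet")
print(metadata.columns.tolist())            # url, caption and other per-image metadata
print(len(metadata) == len(embeddings))     # rows should pair up one-to-one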
Generate centroids. This process randomly selects vectors from the dataset to represent centroids. Performing an incremental clustering could improve vector search recall and allow indexing fewer centroids; for simplicity, this tutorial uses random sampling. A minimal sketch of the sampling idea follows the command below.
$ python3 app/src/main/python/create-centroid-feed.py img_emb_0000.npy > centroids.jsonl
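The sketch below illustrates the random sampling idea; the document type, id scheme, and field names are assumptions for illustration - create-centroid-feed.py defines the actual feed format.

import json
import numpy as np

# Load the CLIP image embeddings and sample a subset of rows as centroids.
embeddings = np.load("img_emb_0000.npy")
rng = np.random.default_rng(0)
sample = rng.choice(len(embeddings), size=4096, replace=False)  # illustrative count

# Emit one Vespa put operation per centroid, one JSON object per line.
for i in sample:
    print(json.dumps({
        "put": f"id:laion:centroid::{int(i)}",                    # assumed id scheme
        "fields": {"vector": embeddings[i].astype(float).tolist()},
    }))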
Generate the image feed. This merges the embedding data with the metadata and creates a Vespa JSONL feed file, with one JSON operation per line.
$ python3 app/src/main/python/create-joined-feed.py metadata_0000.parquet img_emb_0000.npy > feed.jsonl
To process the entire dataset, we recommend starting several processes, each operating on separate split files, as the processing implementation is single-threaded; see the sketch below.
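For example, a sketch that fans out one create-joined-feed.py process per split file (assuming the additional split files 0001-0003 have been downloaded):

import subprocess

# Start one feed-generation process per split file; the script writes to stdout.
processes = []
for i in range(4):
    split = f"{i:04d}"
    out = open(f"feed_{split}.jsonl", "w")
    processes.append(subprocess.Popen(
        ["python3", "app/src/main/python/create-joined-feed.py",
         f"metadata_{split}.parquet", f"img_emb_{split}.npy"],
        stdout=out,
    ))
for p in processes:
    p.wait()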
src/main/application/models has three small ONNX models:

- vespa_innerproduct_ranker.onnx for vector similarity (inner dot product) between the query and the vectors in the stateless container.
- vespa_pairwise_similarity.onnx for matrix multiplication between the top retrieved vectors.
- pca_transformer.onnx for dimension reduction, projecting the 768-dim vector space to a 128-dimensional space.

These ONNX model files are generated by specifying the compute operation using PyTorch and torch's ability to export the model to ONNX format.
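For example, a minimal sketch of how the inner-product ranker could be defined and exported (the module, tensor shapes, and input/output names here are illustrative assumptions, not the app's exact export script):

import torch

class InnerProduct(torch.nn.Module):
    def forward(self, query, documents):
        # Inner dot product between one query vector and a batch of document vectors
        return torch.matmul(documents, query.transpose(0, 1))

model = InnerProduct()
query = torch.rand(1, 768)          # a 768-dim CLIP query vector
documents = torch.rand(100, 768)    # a batch of retrieved document vectors
torch.onnx.export(
    model, (query, documents), "vespa_innerproduct_ranker.onnx",
    input_names=["query", "documents"], output_names=["scores"],
    dynamic_axes={"documents": {0: "batch"}, "scores": {0: "batch"}},
)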
Build the sample app (make sure you have JDK 17; verify with mvn -v). This step also downloads a pre-exported ONNX model for mapping the prompt text to the CLIP vector embedding space.
$ mvn clean package -U -f app
Deploy the application package built in the previous step:
$ vespa deploy --wait 300 ./app
It is possible to deploy this app to Vespa Cloud. For Vespa Cloud deployments to the dev environment, replace src/main/application/services.xml with src/main/application/services-cloud.xml - the cloud deployment uses dedicated clusters for feed and query.
Wait for the application endpoint to become available:
$ vespa status --wait 300
Run the Vespa System Tests, which run a set of basic tests to verify that the application is working as expected:
$ vespa test app/src/test/application/tests/system-test/feed-and-search-test.json
The centroid vectors must be indexed first:
$ vespa feed centroids.jsonl
$ vespa feed feed.jsonl
Track the number of documents while feeding:
$ vespa query 'yql=select * from image where true' \
  hits=0 \
  ranking=unranked
Fetch a single document using the document API:
$ vespa document get \
  id:laion:image::5775990047751962856
The response contains all fields, including the full vector representation and the reduced vector, plus all the metadata - everything is represented in the same schema.
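The same document can also be fetched over Vespa's Document V1 HTTP API; a sketch against the local deployment:

import requests

# id:laion:image::<number> maps to namespace "laion", document type "image"
doc = requests.get(
    "http://localhost:8080/document/v1/laion/image/docid/5775990047751962856"
).json()
print(sorted(doc["fields"].keys()))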
The following provides a few query examples. prompt is a run-time query parameter used by the CLIPEmbeddingSearcher, which encodes the prompt text into a CLIP vector representation using the embedded CLIP model:
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Results are filtered by a constraint on the nsfw field. Note that even if an image is classified as unlikely, its content might still be explicit, as the NSFW classifier is not 100% accurate. The returned images are ranked by CLIP similarity (the score is found in each hit's relevance field).
The following query adds another filter, restricting the search so that only images crawled from urls containing shutterstock.com are retrieved.
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Another query restricts the search further, adding a phrase constraint, caption contains phrase("sandy", "beach"):
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com" and caption contains phrase("sandy", "beach")' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
A regular text query, matching over the default fieldset which searches the caption and the url fields, ranked by the text ranking profile:
$ vespa query \
  'yql=select documentid, caption, url from image where nsfw contains "unlikely" and userQuery()' \
  'hits=10' \
  'query=two dogs running on a sandy beach' \
  'ranking=text'
The text rank profile uses
nativeRank, one of Vespa's many
text matching rank features.
There are several non-native query request parameters that control the vector search accuracy and performance tradeoffs. These can be set with the request, e.g., /search/?spann.clusters=12.
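For example, a sketch of a query setting these parameters programmatically against a local deployment (the values are illustrative; the parameters are described below):

import requests

response = requests.get("http://localhost:8080/search/", params={
    "yql": 'select documentid, caption, url from image where nsfw contains "unlikely"',
    "prompt": "two dogs running on a sandy beach",
    "hits": 10,
    "spann.clusters": 12,   # fewer centroids: faster search, lower recall
    "rank-count": 500,      # fewer vectors fully re-ranked in the container
})
print(response.json()["root"]["fields"]["totalCount"])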
- spann.clusters, default 64: the number of centroids in the reduced vector space used to restrict the image search. A higher number improves recall, but increases computational complexity and disk reads.
- rank-count, default 1000: the number of vectors that are fully re-ranked in the container using the full vector representation. A higher number improves recall, but increases computational complexity and network usage.
- collapse.enable, default true: controls de-duping of the top-ranked results using image-to-image similarity.
- collapse.similarity.max-hits, default 1000: the number of top-ranked hits to perform de-duping of. Must be less than rank-count.
- collapse.similarity.threshold, default 0.95: how similar a given image-to-image pair must be before it is considered a duplicate.

There are several areas that could be improved.