This sample application combines two other sample applications, text-image-search and billion-scale-vector-search, to implement cost-efficient, large-scale image search over multimodal, AI-powered vector representations.
This sample app uses the LAION-5B dataset, the largest openly accessible image-text dataset in the world.
Large image-text models like ALIGN, BASIC, Turing Bletchley, FLORENCE & GLIDE have shown better and better performance compared to previous flagship models like CLIP and DALL-E. Most of them have been trained on billions of image-text pairs, and unfortunately, no datasets of this size had been openly available until now. To address this problem we present LAION-5B, a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. 2.3B contain English language, 2.2B samples are from 100+ other languages, and 1B samples have texts that do not allow a certain language assignment (e.g. names).
The LAION-5B dataset was used to train the popular Stable Diffusion text-to-image generative model.
Note the following about the LAION-5B dataset:
Be aware that this large-scale dataset is un-curated: the collected links may lead to strongly discomforting and disturbing content for a human viewer.
The released dataset does not contain the image data itself, but CLIP-encoded vector representations of the images, along with metadata like url and caption.
The app can be used to implement several use cases over the LAION dataset, or adapted to your own large-scale vector dataset: vector search by text prompt, as well as traditional text search over the caption or url fields in the LAION dataset using Vespa's standard text-matching functionality. All of this is combined using Vespa's query language, also in combination with filters.
The sample application demonstrates many Vespa primitives:
The app uses Vespa's support for bfloat16 tensors, saving 50% of storage compared to the full float representation. The app also contains several custom container components, such as the CLIPEmbeddingSearcher used in the query examples below.
The following steps demonstrate the app using a smaller subset of the LAION-5B vector dataset, suitable for experimenting with the app on a laptop.
Requirements:
Verify Docker Memory Limits:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using a Docker image:
$ vespa config set target local
Use the multi-node high availability template as inspiration for multi-node, on-premise deployments.
Pull and start the Vespa docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
Verify that the configuration service (deploy API) is ready:
$ vespa status deploy --wait 300
Download this sample application:
$ vespa clone billion-scale-image-search myapp && cd myapp
Setup:
Create a tenant on Vespa Cloud:
Go to console.vespa-cloud.com and create your tenant (unless you already have one).
Install the Vespa CLI using Homebrew:
$ brew install vespa-cli
Windows/No Homebrew? See the Vespa CLI page to download directly.
Configure the Vespa client:
$ vespa config set target cloud
$ vespa config set application vespa-team.autotest
Use the tenant name from step 1 instead of "vespa-team", and replace it in the other steps in this guide, too.
Get Vespa Cloud control plane access:
$ vespa auth login
Follow the instructions from the command to authenticate.
Clone a sample application:
$ vespa clone billion-scale-image-search myapp && cd myapp
See sample-apps for other sample apps you can clone.
Add a certificate for data plane access to the application:
$ vespa auth cert app
It is a good idea to take note of the path to the .pem files written here.
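The certificate and key pair is required for data-plane requests against the Vespa Cloud endpoint; for example, a sketch using Python (the endpoint URL and .pem paths are placeholders - use the values printed by the CLI):

import requests

response = requests.get(
    "https://myapp.vespa-team.aws-us-east-1c.dev.z.vespa-app.cloud/search/",  # placeholder endpoint
    params={"yql": "select * from image where true", "hits": 0},
    cert=("data-plane-public-cert.pem", "data-plane-private-key.pem"),  # placeholder paths
)
print(response.status_code)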
These instructions use the first split file (0000) of a total of 2314 files in the LAION2B-en split. Download the vector data file:
$ curl --http1.1 -L -o img_emb_0000.npy \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/img_emb_0000.npy
Download the metadata file:
$ curl -L -o metadata_0000.parquet \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/metadata_0000.parquet
Install Python dependencies to process the files:
$ python3 -m pip install pandas numpy requests mmh3 pyarrow
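Optionally, sanity-check the downloaded files before processing; a quick sketch (the exact metadata columns are whatever the LAION release provides):

import numpy as np
import pandas as pd

embeddings = np.load("img_emb_0000.npy")
print(embeddings.shape, embeddings.dtype)   # on the order of a million 768-dim CLIP vectors

metadata = pd.read_parquet("metadata_0000.parquet")
print(metadata.columns.tolist())            # url, caption and other per-image metadata
print(len(metadata) == len(embeddings))     # rows should pair up one-to-one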
Generate centroids. This process randomly selects vectors from the dataset to represent centroids. Performing an incremental clustering could improve vector search recall and allow indexing fewer centroids; for simplicity, this tutorial uses random sampling. A minimal sketch of the sampling idea follows the command below.
$ python3 app/src/main/python/create-centroid-feed.py img_emb_0000.npy > centroids.jsonl
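The sketch below illustrates the random sampling idea; the document type, id scheme, and field names are assumptions for illustration - create-centroid-feed.py defines the actual feed format.

import json
import numpy as np

# Load the CLIP image embeddings and sample a subset of rows as centroids.
embeddings = np.load("img_emb_0000.npy")
rng = np.random.default_rng(0)
sample = rng.choice(len(embeddings), size=4096, replace=False)  # illustrative count

# Emit one Vespa put operation per centroid, one JSON object per line.
for i in sample:
    print(json.dumps({
        "put": f"id:laion:centroid::{int(i)}",                    # assumed id scheme
        "fields": {"vector": embeddings[i].astype(float).tolist()},
    }))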
Generate the image feed. This merges the embedding data with the metadata and creates a Vespa JSONL feed file, with one JSON operation per line.
$ python3 app/src/main/python/create-joined-feed.py metadata_0000.parquet img_emb_0000.npy > feed.jsonl
To process the entire dataset, we recommend starting several processes, each operating on separate split files, as the processing implementation is single-threaded; see the sketch below.
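For example, a sketch that fans out one create-joined-feed.py process per split file (assuming the additional split files 0001-0003 have been downloaded):

import subprocess

# Start one feed-generation process per split file; the script writes to stdout.
processes = []
for i in range(4):
    split = f"{i:04d}"
    out = open(f"feed_{split}.jsonl", "w")
    processes.append(subprocess.Popen(
        ["python3", "app/src/main/python/create-joined-feed.py",
         f"metadata_{split}.parquet", f"img_emb_{split}.npy"],
        stdout=out,
    ))
for p in processes:
    p.wait()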
src/main/application/models has three small ONNX models:

- vespa_innerproduct_ranker.onnx for vector similarity (inner dot product) between the query and the vectors in the stateless container.
- vespa_pairwise_similarity.onnx for matrix multiplication between the top retrieved vectors.
- pca_transformer.onnx for dimension reduction, projecting the 768-dim vector space to a 128-dimensional space.

These ONNX model files are generated by specifying the compute operation using PyTorch and torch's ability to export the model to ONNX format.
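For example, a minimal sketch of how the inner-product ranker could be defined and exported (the module, tensor shapes, and input/output names here are illustrative assumptions, not the app's exact export script):

import torch

class InnerProduct(torch.nn.Module):
    def forward(self, query, documents):
        # Inner dot product between one query vector and a batch of document vectors
        return torch.matmul(documents, query.transpose(0, 1))

model = InnerProduct()
query = torch.rand(1, 768)          # a 768-dim CLIP query vector
documents = torch.rand(100, 768)    # a batch of retrieved document vectors
torch.onnx.export(
    model, (query, documents), "vespa_innerproduct_ranker.onnx",
    input_names=["query", "documents"], output_names=["scores"],
    dynamic_axes={"documents": {0: "batch"}, "scores": {0: "batch"}},
)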
Build the sample app (make sure you have JDK 17; verify with mvn -v). This step also downloads a pre-exported ONNX model for mapping the prompt text to the CLIP vector embedding space.
$ mvn clean package -U -f app
Deploy the application package built in the previous step:
$ vespa deploy --wait 300 ./app
It is possible to deploy this app to Vespa Cloud. For Vespa Cloud deployments to the dev environment, replace src/main/application/services.xml with src/main/application/services-cloud.xml - the cloud deployment uses dedicated clusters for feed and query.
Wait for the application endpoint to become available:
$ vespa status --wait 300
Run the Vespa System Tests, which run a set of basic tests to verify that the application is working as expected:
$ vespa test app/src/test/application/tests/system-test/feed-and-search-test.json
The centroid vectors must be indexed first:
$ vespa feed centroids.jsonl
$ vespa feed feed.jsonl
Track the number of documents while feeding:
$ vespa query 'yql=select * from image where true' \
  hits=0 \
  ranking=unranked
Fetch a single document using the document API:
$ vespa document get \
  id:laion:image::5775990047751962856
The response contains all fields, including the full vector representation and the reduced vector, plus all the metadata - everything is represented in the same schema.
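The same document can also be fetched over Vespa's Document V1 HTTP API; a sketch against the local deployment:

import requests

# id:laion:image::<number> maps to namespace "laion", document type "image"
doc = requests.get(
    "http://localhost:8080/document/v1/laion/image/docid/5775990047751962856"
).json()
print(sorted(doc["fields"].keys()))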
The following provides a few query examples. prompt is a run-time query parameter used by the CLIPEmbeddingSearcher, which encodes the prompt text into a CLIP vector representation using the embedded CLIP model:
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Results are filtered by a constraint on the nsfw field. Note that even if an image is classified as unlikely, its content might still be explicit, as the NSFW classifier is not 100% accurate. The returned images are ranked by CLIP similarity (the score is found in each hit's relevance field).
The following query adds another filter, restricting the search so that only images crawled from urls containing shutterstock.com are retrieved.
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Another query restricts the search further, adding a phrase constraint, caption contains phrase("sandy", "beach"):
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com" and caption contains phrase("sandy", "beach")' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
A regular text query, matching over the default fieldset which searches the caption and the url fields, ranked by the text ranking profile:
$ vespa query \
  'yql=select documentid, caption, url from image where nsfw contains "unlikely" and userQuery()' \
  'hits=10' \
  'query=two dogs running on a sandy beach' \
  'ranking=text'
The text rank profile uses
nativeRank, one of Vespa's many
text matching rank features.
There are several non-native query request parameters that control the vector search accuracy and performance tradeoffs. These can be set with the request, e.g., /search/?spann.clusters=12.
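For example, a sketch of a query setting these parameters programmatically against a local deployment (the values are illustrative; the parameters are described below):

import requests

response = requests.get("http://localhost:8080/search/", params={
    "yql": 'select documentid, caption, url from image where nsfw contains "unlikely"',
    "prompt": "two dogs running on a sandy beach",
    "hits": 10,
    "spann.clusters": 12,   # fewer centroids: faster search, lower recall
    "rank-count": 500,      # fewer vectors fully re-ranked in the container
})
print(response.json()["root"]["fields"]["totalCount"])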
- spann.clusters, default 64: the number of centroids in the reduced vector space used to restrict the image search. A higher number improves recall, but increases computational complexity and disk reads.
- rank-count, default 1000: the number of vectors that are fully re-ranked in the container using the full vector representation. A higher number improves recall, but increases computational complexity and network usage.
- collapse.enable, default true: controls de-duping of the top-ranked results using image-to-image similarity.
- collapse.similarity.max-hits, default 1000: the number of top-ranked hits to perform de-duping of. Must be less than rank-count.
- collapse.similarity.threshold, default 0.95: how similar a given image-to-image pair must be before it is considered a duplicate.

There are several areas that could be improved.