How to use FiftyOne and Qdrant to Search through Billions of Images in Computer Vision Applications

Plaban Nayak
11 min read · Jan 30, 2024

Enhancing Computer Vision Workflows with Multi-Modal Search: A Deep Dive into CLIP Integration

Machine learning has witnessed a remarkable evolution in recent years, with one of the most exciting developments being the significant strides in multi-modal AI. This progress has fostered a synergistic relationship between computer vision and natural language processing, catalyzed by breakthroughs such as OpenAI’s CLIP model. Employing a contrastive learning technique, CLIP seamlessly embeds diverse multimedia content — ranging from language to images — into a unified latent space.

Multi-modal models like CLIP have driven advances in several areas, including zero-shot image classification, knowledge transfer, synthetic data generation, and semantic search. This article focuses on the last of these: using CLIP for natural language image search.

Unveiling the Integration

Traditionally, vector search tools and libraries have stood apart from computer vision tooling, so cross-modal search has been something to bolt on rather than build in. This article shows how to incorporate natural language image search directly into your computer vision workflows. The integration involves three key components: the open-source computer vision toolkit FiftyOne, the vector database Qdrant, and the CLIP model from OpenAI.

Understanding CLIP’s Role

At the heart of our integration lies OpenAI’s CLIP model, which acts as the bridge between textual and visual domains. CLIP’s contrastive learning methodology allows it to encode textual descriptions and corresponding images into a shared latent space. This intrinsic capability forms the foundation for our natural language image search.
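
To make the shared-space idea concrete, here is a minimal sketch that ranks images against a text query by cosine similarity. The 3-dimensional vectors, the file names, and the `cosine_similarity` helper are illustrative assumptions and were not produced by CLIP (the CLIP model used later in this article produces 512-dimensional embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "image embeddings" (hypothetical values, not CLIP output)
image_embeddings = {
    "giraffe.jpg": [0.9, 0.1, 0.0],
    "baseball.jpg": [0.1, 0.9, 0.1],
    "rainy_street.jpg": [0.0, 0.2, 0.9],
}

# Toy "text embedding" standing in for a prompt such as "a gloomy day"
query_embedding = [0.1, 0.1, 0.8]

# Rank images from most to least similar to the query
ranked = sorted(
    image_embeddings,
    key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
    reverse=True,
)
print(ranked)
```

Because text and images live in the same space, the same ranking step works whether the query vector came from a prompt or from another image.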

OpenAI’s CLIP (Contrastive Language-Image Pre-training) is a deep learning model designed to process natural language and images in a unified framework. It uses a contrastive learning approach to learn a shared representation space for text and images, which lets it relate the two modalities. OpenAI announced the model in the blog post “CLIP: Connecting Text and Images” and described it in the research paper “Learning Transferable Visual Models From Natural Language Supervision.”

Key characteristics and components of the CLIP model include:

  • Unified Embedding Space : CLIP is trained to embed images and corresponding textual descriptions into the same high-dimensional space. This shared space allows for direct comparison and understanding of the relationships between visual and textual concepts.
  • Contrastive Learning : The training of CLIP involves contrastive learning, a technique where the model learns by contrasting positive pairs (correct image-text pairs) with negative pairs (incorrect image-text pairs). This process encourages the model to bring similar images and texts closer together while pushing dissimilar pairs apart in the embedding space.
  • Vision Transformer (ViT) Architecture : The CLIP variant used in this article (ViT-B/32) uses a Vision Transformer (ViT) as its image encoder; OpenAI also released ResNet-based variants. ViT divides images into fixed-size patches and processes them using transformer layers, allowing the model to capture both local and global information in images. Text is encoded by a separate Transformer.
  • Text Tokenization : CLIP uses a text tokenization method to convert textual descriptions into a format suitable for input to the model. This ensures that both images and text can be processed consistently.
  • Zero-Shot Learning : One of the notable features of CLIP is its ability to perform zero-shot learning. This means the model can generalize to recognize images or understand text related to categories it has not seen during training. This is achieved by leveraging the shared embedding space, allowing the model to make predictions based on semantic similarities.
  • Versatility in Tasks : CLIP has demonstrated effectiveness across a wide range of tasks, including image classification, object detection, natural language image retrieval, and even tasks that require understanding nuanced textual prompts.
  • Pre-Trained Models : OpenAI provides pre-trained CLIP models in various configurations, allowing users to leverage the model’s capabilities without extensive training. These pre-trained models can be fine-tuned for specific tasks if needed.
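
The contrastive objective described in the list above can be sketched in a few lines. This is a simplified illustration, not OpenAI's training code: it takes a small, hand-written image-text similarity matrix and uses a fixed temperature of 0.07, whereas CLIP learns the temperature during training. The function names and the matrices are assumptions made for the example.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the correct column."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    similarity[i][j] is the cosine similarity of image i and text j;
    the matching (positive) pairs sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    # image -> text: each row should select its own column
    loss_i2t = sum(softmax_cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: each column should select its own row
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Matching pairs score highest on the diagonal
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
# Images 0 and 1 are most similar to each other's captions
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.0, 0.9]]

print(contrastive_loss(aligned), contrastive_loss(shuffled))
```

The loss is low when each image is most similar to its own caption and high otherwise, which is the pressure that pulls matching pairs together in the embedding space.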

Applications of CLIP span various domains, including computer vision, natural language processing, and AI-driven systems that require a unified understanding of both visual and textual information. The model’s versatility and performance make it a valuable tool for developers and researchers working on multi-modal AI applications.

FiftyOne: Bridging Vision and Language

FiftyOne serves as the glue that seamlessly connects CLIP with your computer vision workflows. This open-source toolkit provides an intuitive interface for exploring, visualizing, and iterating over your multi-modal datasets. We will delve into how FiftyOne facilitates the integration, enabling you to harness the power of CLIP with ease.

FiftyOne is an open-source Python package designed to facilitate the exploration, visualization, and analysis of computer vision datasets. It provides a user-friendly interface for working with diverse datasets, especially those involving images and annotations. FiftyOne aims to simplify the process of understanding and debugging machine learning models by offering tools for visualizing model predictions, analyzing dataset statistics, and iteratively refining the dataset during the development process.

Key features of the FiftyOne package include:

  • Interactive Exploration : FiftyOne provides an interactive, customizable UI for exploring images and their associated annotations, which helps build a deeper understanding of the dataset’s characteristics.
  • Visualization Tools : Users can visualize images, ground truth annotations, and model predictions directly within the FiftyOne interface. This is particularly useful for inspecting model outputs and assessing the model’s performance.
  • Dataset Statistics : The package offers tools to compute and visualize various statistics about the dataset, such as class distribution, label co-occurrence, and image quality metrics. This aids in gaining insights into the dataset’s composition and potential biases.
  • Debugging Models : FiftyOne is designed to help users debug and analyze the outputs of machine learning models. It allows users to visualize model predictions, compare them with ground truth annotations, and identify areas where the model may need improvement.
  • Annotation Integration : The package supports various annotation formats commonly used in computer vision tasks, including bounding boxes, segmentation masks, keypoints, and classification labels. This flexibility makes it suitable for a wide range of computer vision applications.
  • Iterative Dataset Refinement : Users can iteratively refine datasets by adding or modifying annotations directly within the FiftyOne interface. This supports an agile development process, where datasets can be improved based on insights gained during exploration.
  • Compatibility with Deep Learning Frameworks : FiftyOne integrates with popular deep learning frameworks such as TensorFlow and PyTorch. This allows users to seamlessly incorporate it into their machine learning workflows.
  • Extensibility : The package is designed to be extensible, and users can build custom plugins and extensions to tailor it to their specific needs.

FiftyOne is a valuable tool for researchers, data scientists, and machine learning practitioners working on computer vision projects. It enhances the efficiency of the data exploration and model development process by providing a unified platform for visualizing and interacting with image datasets.

Qdrant: Powering Vector Search

To augment our natural language image search, we leverage Qdrant — a vector database that excels in handling high-dimensional data efficiently. Qdrant’s ability to index and search vectors at scale is pivotal in making our integration scalable and performant.

Qdrant “is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload.” You can think of payloads as additional pieces of information that help you narrow a search and that you can return to your users alongside the results.
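
As a rough mental model of "points with a payload", the sketch below stores an id, a vector, and a payload per point and runs a brute-force search with an optional payload filter. It is not how Qdrant is implemented (Qdrant uses approximate nearest-neighbour indexes rather than a full scan), and the `Point` class, the `search` function, and the `split` payload key are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Point:
    id: int
    vector: list
    payload: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(points, query, limit=3, must_match=None):
    """Brute-force similarity search with an optional exact-match payload filter."""
    must_match = must_match or {}
    candidates = [
        p for p in points
        if all(p.payload.get(key) == value for key, value in must_match.items())
    ]
    scored = sorted(candidates, key=lambda p: cosine(query, p.vector), reverse=True)
    return [(p.id, p.payload) for p in scored[:limit]]


points = [
    Point(0, [1.0, 0.0], {"image_path": "a.jpg", "split": "validation"}),
    Point(1, [0.9, 0.1], {"image_path": "b.jpg", "split": "train"}),
    Point(2, [0.0, 1.0], {"image_path": "c.jpg", "split": "validation"}),
]

# Only points whose payload has split == "validation" are considered
hits = search(points, [1.0, 0.05], limit=2, must_match={"split": "validation"})
print(hits)
```

The payload plays two roles here: it restricts which points are eligible, and it travels back with each hit so the caller can show something meaningful, such as an image path.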

You can get started using Qdrant with the Python qdrant-client, by pulling the latest docker image of Qdrant and connecting to it locally, or by trying out Qdrant’s Cloud free tier option until you are ready to make the full switch.

Step-by-Step Guide

Now, let’s walk through the steps to incorporate natural language image search into your computer vision workflows using FiftyOne, Qdrant, and CLIP.

Step 1: Data Preparation

Install the required dependencies.

# install FiftyOne
%pip install fiftyone
# Install Qdrant
%pip install qdrant-client

Import the required dependencies and download the dataset and model.

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("coco-2017", split="validation")
model = foz.load_zoo_model("clip-vit-base32-torch")

Sample dataset used: the validation split of the MS COCO 2017 dataset from the FiftyOne Dataset Zoo (5,000 images).

Model: the PyTorch implementation of the CLIP model from the FiftyOne Model Zoo.

Note: We can also download OpenAI’s CLIP model directly from source by following the instructions here.

dataset.values

######################Response############
<bound method SampleCollection.values of Name: coco-2017-validation
Media type: image
Num samples: 5000
Persistent: True
Tags: []
Sample fields:
    id: fiftyone.core.fields.ObjectIdField
    filepath: fiftyone.core.fields.StringField
    tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    embedding: fiftyone.core.fields.VectorField>

dataset.values("id")[0]

####################Response################
65b115cf884e52b88aef0381

## Visualize the data sample
session = fo.launch_app(dataset)
session

###################Response###############
INFO:fiftyone.core.session.session:
Welcome to FiftyOne v0.23.3

Dataset: coco-2017-validation
Media type: image
Num samples: 5000
Selected samples: 0
Selected labels: 0
Session type: colab

Step 2: Instantiate the Vector Store

from qdrant_client import QdrantClient
client = QdrantClient(":memory:")

Step 3: Configure Settings to Handle Torch Data

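import numpy as np
import torch
# recent setuptools releases no longer expose `pkg_resources.packaging`,
# so import the `packaging` library directly
from packaging import version

# run on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# token tensors use torch.long on PyTorch < 1.8.0 and torch.int otherwise,
# mirroring the dtype choice in OpenAI's CLIP tokenizer
if version.parse(torch.__version__) < version.parse("1.8.0"):
    dtype = torch.long
else:
    dtype = torch.int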
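
The version check above parses both version strings before comparing them, and the sketch below shows why a plain string comparison would not do. `parse_version` is a deliberately simplified, hypothetical stand-in for `packaging.version.parse`: it handles local suffixes such as `+cu118` but ignores pre-release tags.

```python
def parse_version(text):
    """Turn a string like '1.10.0+cu118' into (1, 10, 0) for numeric comparison."""
    release = text.split("+")[0]
    return tuple(int(part) for part in release.split(".") if part.isdigit())


# Lexicographic comparison gets this wrong: "1.10.0" sorts before "1.8.0"
string_says_older = "1.10.0" < "1.8.0"

# Numeric comparison gets it right: 1.10.0 is newer than 1.8.0
tuple_says_older = parse_version("1.10.0") < parse_version("1.8.0")

print(string_says_older, tuple_says_older)
```

With a string comparison, PyTorch 1.10 and later 1.x releases would be misclassified as older than 1.8.0 and would receive the wrong dtype branch.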

Step 4: Generate CLIP Embeddings

Utilize CLIP to generate embeddings for both textual descriptions and images. This step establishes a common latent space for cross-modal retrieval.

With FiftyOne’s compute_embeddings() method, we can generate embeddings for all of the images in our dataset in one fell swoop, storing them in an embedding field:

dataset.compute_embeddings(
    model,
    embeddings_field="embedding"
)

########## Response Logs #####################
100% |███████████████| 5000/5000 [24.5m elapsed, 0s remaining, 3.4 samples/s]

When generating embeddings for many samples, it is best to compute them once ahead of time. If we intend to use the embeddings more than once, we should also persist the dataset so that they survive across sessions:

dataset.persistent = True
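
A quick back-of-the-envelope calculation shows why recomputing embeddings is worth avoiding. The sketch below takes the 3.4 samples/s throughput from the log above and assumes, purely for illustration, a single worker running at that constant rate.

```python
def embedding_time_seconds(num_images, samples_per_second):
    """Wall-clock time to embed `num_images` at a constant throughput."""
    return num_images / samples_per_second


# The 5,000-image COCO validation split at the logged rate
coco_val_minutes = embedding_time_seconds(5_000, 3.4) / 60

# One billion images at the same (assumed constant) rate
billion_years = embedding_time_seconds(1_000_000_000, 3.4) / (3600 * 24 * 365)

print(f"{coco_val_minutes:.1f} minutes, {billion_years:.1f} years")
# prints "24.5 minutes, 9.3 years"
```

At larger scales, embedding generation would therefore need faster hardware, batching, and many parallel workers, and the results should be stored rather than regenerated.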

Visualize the Embeddings in a 2D Space

We can project the embeddings into two dimensions with the FiftyOne Brain’s compute_visualization() method:

import fiftyone.brain as fob
from fiftyone import ViewField as F

# perform dimensionality reduction using t-SNE
results = fob.compute_visualization(
    dataset,
    embeddings="embedding",
    method="tsne"
)

# visualize results, labeling by number of objects in image
results.visualize(labels=F("ground_truth.detections").length())

Step 5: Helper Function for Generating Query Embeddings

The following function takes in a text prompt and the FiftyOne-wrapped PyTorch CLIP model we loaded, and performs the logic of generating the corresponding embedding vector.


def get_text_embedding(prompt, clip_model):
    tokenizer = clip_model._tokenizer

    # standard start-of-text token
    sot_token = tokenizer.encoder["<|startoftext|>"]

    # standard end-of-text token
    eot_token = tokenizer.encoder["<|endoftext|>"]

    prompt_tokens = tokenizer.encode(prompt)
    all_tokens = [[sot_token] + prompt_tokens + [eot_token]]

    # zero-padded token tensor of shape (1, context_length)
    text_features = torch.zeros(
        len(all_tokens),
        clip_model.config.context_length,
        dtype=dtype,
        device=device,
    )

    # insert tokens into feature vector
    text_features[0, : len(all_tokens[0])] = torch.tensor(all_tokens)

    # encode text
    embedding = clip_model._model.encode_text(text_features).to(device)

    # convert to list for qdrant
    return embedding.tolist()


prompt = "a picture of a giraffe"
query_vector = get_text_embedding(prompt, model)
print(len(query_vector))     # number of prompts embedded
print(len(query_vector[0]))  # embedding dimension

The first print shows 1 (a single prompt was embedded) and the second shows 512, the dimension of the embedding.
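
The token-padding step inside the helper can be looked at in isolation. The sketch below uses plain lists instead of tensors; 49406 and 49407 are the start-of-text and end-of-text ids in CLIP's vocabulary and 77 is its context length, while the prompt token ids are made-up placeholders rather than real tokenizer output. The length check is an addition that the helper above does not perform.

```python
SOT_TOKEN = 49406      # <|startoftext|> id in CLIP's vocabulary
EOT_TOKEN = 49407      # <|endoftext|> id in CLIP's vocabulary
CONTEXT_LENGTH = 77    # number of token slots CLIP's text encoder expects


def pad_tokens(prompt_tokens, context_length=CONTEXT_LENGTH):
    """Wrap tokens in start/end markers and right-pad with zeros."""
    tokens = [SOT_TOKEN] + list(prompt_tokens) + [EOT_TOKEN]
    if len(tokens) > context_length:
        raise ValueError(
            f"prompt needs {len(tokens)} tokens, context length is {context_length}"
        )
    return tokens + [0] * (context_length - len(tokens))


# Hypothetical ids standing in for tokenizer.encode("a picture of a giraffe")
padded = pad_tokens([320, 1674, 539, 320, 22826])
print(len(padded), padded[:8])
```

A prompt that tokenizes to more than 75 tokens would not fit, so long prompts need to be shortened or truncated before they are embedded.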

Step 6: Create Qdrant Index

from qdrant_client.http import models
COLLECTION_NAME = 'images'

client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "text-dense": models.VectorParams(
            size=512,  # Vector size is defined by the model used
            distance=models.Distance.COSINE,
        )
    },
)
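
The vector size also determines how much storage the collection needs. The estimate below assumes each component is stored as a 32-bit float and counts only the raw vectors; index structures and payloads add to this, so treat the numbers as a lower bound.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors, dim=512):
    """Bytes needed to hold `num_vectors` float32 vectors of size `dim`."""
    return num_vectors * dim * BYTES_PER_FLOAT32


coco_val_mb = raw_vector_bytes(5_000) / 1e6          # megabytes
billion_tb = raw_vector_bytes(1_000_000_000) / 1e12  # terabytes

print(f"{coco_val_mb:.2f} MB for 5,000 images, {billion_tb:.3f} TB for a billion")
```

The 5,000-image example fits comfortably in the in-memory client used here, while a billion-image collection calls for a persistent, likely distributed deployment.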

Step 7: Load Embeddings into the Vector Store

Leverage Qdrant to index CLIP embeddings, facilitating efficient vector search. Qdrant’s scalability ensures that the search process remains performant even with large datasets.

from tqdm import tqdm

# read the file paths and embeddings from the dataset once, not on every iteration
filepaths = dataset.values("filepath")
final_embeddings = dataset.values("embedding")

for i in tqdm(range(len(final_embeddings))):
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[
            models.PointStruct(
                id=i,
                payload={"image_path": filepaths[i]},  # Add any additional payload if necessary
                vector={"text-dense": final_embeddings[i].tolist()},
            )
        ],
    )
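
The loop above sends one request per image. Since `client.upsert` accepts a list of points, a common refinement is to group records into batches first. The helper below is a generic sketch of that grouping step; the `batched` name, the batch size, and the stand-in records are assumptions, and each batch would still need to be converted into `models.PointStruct` objects before being passed to `client.upsert`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in records; in the article these would hold the id, image path, and embedding
records = [{"id": i, "image_path": f"img_{i}.jpg"} for i in range(10)]

batches = list(batched(records, batch_size=4))
print([len(batch) for batch in batches])  # prints [4, 4, 2]
```

With 5,000 images and a batch size of, say, 256, this would reduce the number of upsert calls from 5,000 to 20.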

Check if the load has been successful.

#check if the update is successful
client.count(
    collection_name=COLLECTION_NAME,
    exact=True,
)
# inspect the first few stored points
client.scroll(
    collection_name=COLLECTION_NAME,
)

#########################Response Logs ######################
([Record(id=0, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000139.jpg'}, vector=None, shard_key=None),
Record(id=1, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000285.jpg'}, vector=None, shard_key=None),
Record(id=2, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000632.jpg'}, vector=None, shard_key=None),
Record(id=3, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000724.jpg'}, vector=None, shard_key=None),
Record(id=4, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000776.jpg'}, vector=None, shard_key=None),
Record(id=5, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000785.jpg'}, vector=None, shard_key=None),
Record(id=6, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000802.jpg'}, vector=None, shard_key=None),
Record(id=7, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000872.jpg'}, vector=None, shard_key=None),
Record(id=8, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000000885.jpg'}, vector=None, shard_key=None),
Record(id=9, payload={'image_path': '/root/fiftyone/coco-2017/validation/data/000000001000.jpg'}, vector=None, shard_key=None)],
10)

Step 8: Querying the Dataset

Query 1:

prompt = "a gloomy day"
query_vector = get_text_embedding(prompt, model)
type(query_vector)

Search the Qdrant vector store for content matching Query 1:

results = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text-dense",
                vector=query_vector[0],
            ),
            limit=5,
        ),
    ],
)

# get ids of gloomiest samples
top_k_ids = [res.id for res in results[0]]
sample_ids = dataset.values("id")
ids = [sample_ids[i] for i in top_k_ids]
print(ids)

# view these samples, ordered by "gloominess"
view = dataset.select(ids, ordered=True)
session.view = view.view()

##################RESPONSE ########################
[[ScoredPoint(id=3070, version=0, score=0.2837843719099665, payload=None, vector=None, shard_key=None),
ScoredPoint(id=4945, version=0, score=0.27531836526617415, payload=None, vector=None, shard_key=None),
ScoredPoint(id=3767, version=0, score=0.27227966878915677, payload=None, vector=None, shard_key=None),
ScoredPoint(id=3266, version=0, score=0.27125659838782235, payload=None, vector=None, shard_key=None),
ScoredPoint(id=4276, version=0, score=0.2668262212925451, payload=None, vector=None, shard_key=None)]]


Image ids :['65b24419ddead206eaea9a55', '65b24427ddead206eaead73e', '65b2441fddead206eaeab17d', '65b2441addead206eaeaa0c4', '65b24423ddead206eaeac208']

Images matching Query 1

Query 2:


prompt = "a person holding a baseball bat"
query_vector = get_text_embedding(prompt, model)

results = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text-dense",
                vector=query_vector[0],
            ),
            limit=15,
        ),
    ],
)

# get ids of the best-matching samples
top_k_ids = [res.id for res in results[0]]
sample_ids = dataset.values("id")
ids = [sample_ids[i] for i in top_k_ids]
print(ids)

# view these samples, ordered by similarity to the prompt
view = dataset.select(ids, ordered=True)
session.view = view.view()

#################RESPONSE###########################
Image ids found as similar matches: ['65b24414ddead206eaea85a8', '65b24413ddead206eaea8159', '65b24404ddead206eaea3fdd', '65b2441addead206eaea9e98', '65b24426ddead206eaead347', '65b24416ddead206eaea8e9c', '65b24419ddead206eaea9a53', '65b24401ddead206eaea37a4', '65b24407ddead206eaea4e31', '65b2441eddead206eaeaae0e', '65b24419ddead206eaea9c7c', '65b2441eddead206eaeaaec1', '65b24418ddead206eaea9685', '65b24413ddead206eaea8062', '65b24421ddead206eaeab90d']

Images matching Query 2

Query 3:


prompt = "panda in the zoo"
query_vector = get_text_embedding(prompt, model)

results = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text-dense",
                vector=query_vector[0],
            ),
            limit=15,
        ),
    ],
)

# get ids of the best-matching samples
top_k_ids = [res.id for res in results[0]]
sample_ids = dataset.values("id")
ids = [sample_ids[i] for i in top_k_ids]
print(ids)

# view these samples, ordered by similarity to the prompt
view = dataset.select(ids, ordered=True)
session.view = view.view()

Images matching Query 3

Query 4:


prompt = "a child playing with a dog"
query_vector = get_text_embedding(prompt, model)

results = client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text-dense",
                vector=query_vector[0],
            ),
            limit=15,
        ),
    ],
)

# get ids of the best-matching samples
top_k_ids = [res.id for res in results[0]]
sample_ids = dataset.values("id")
ids = [sample_ids[i] for i in top_k_ids]
print(ids)

# view these samples, ordered by similarity to the prompt
view = dataset.select(ids, ordered=True)
session.view = view.view()

Images matching Query 4
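
Each query above repeats the same step of translating Qdrant point ids back into FiftyOne sample ids. That step could be factored into a small helper, sketched below. The `hits_to_sample_ids` name is hypothetical, and the `SimpleNamespace` objects are stand-ins for the scored points that Qdrant returns. The sketch relies on the same assumption as the article's code: each point id is the position of the sample in `dataset.values("id")`.

```python
from types import SimpleNamespace


def hits_to_sample_ids(hits, sample_ids):
    """Map search hits (objects with an integer `.id`) to sample ids, keeping rank order."""
    return [sample_ids[hit.id] for hit in hits]


# Stand-ins for Qdrant results, already sorted by descending score
hits = [
    SimpleNamespace(id=2, score=0.28),
    SimpleNamespace(id=0, score=0.27),
]

# Stand-in for dataset.values("id")
sample_ids = ["sample-a", "sample-b", "sample-c"]

matched = hits_to_sample_ids(hits, sample_ids)
print(matched)  # prints ['sample-c', 'sample-a']
```

Because the mapping depends on list position, it breaks if the dataset is reordered or filtered after the upload. Storing the FiftyOne sample id in each point's payload would remove that dependency.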

Conclusion

The amalgamation of FiftyOne, Qdrant, and CLIP introduces a new dimension to computer vision workflows. By seamlessly incorporating natural language image search, users can unlock novel applications, ranging from content discovery to interactive exploration of multi-modal datasets. As the landscape of multi-modal AI continues to evolve, this integration stands at the forefront, exemplifying the potential of cross-pollination between computer vision and natural language processing.
