Zero Shot Object Detection with OpenAI's CLIP
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was a world-changing competition hosted annually from 2010 until 2017. During this time, the competition acted as the catalyst for the explosion of deep learning and was the place to find state-of-the-art image classification, object localization, and object detection.
Researchers fine-tuned better-performing computer vision (CV) models to achieve ever more impressive results year after year. But there was an unquestioned assumption causing problems.
We assumed that every new task required model fine-tuning, that fine-tuning required a lot of data, and that collecting this data needed both time and capital.
It wasn’t until very recently that this assumption was questioned and proven wrong.
The astonishing rise of multi-modal models has made the impossible possible across various domains and tasks. One of those is zero-shot object detection and localization.
“Zero-shot” means applying a model without the need for fine-tuning. We take a multi-modal model and use it to detect objects in images from one domain, then switch to an entirely different domain without the model seeing a single training example from the new domain.
Not needing a single training example means we completely skip the hard part of data annotation and model training. We can focus solely on application of our models.
In this chapter, we will explore how to apply OpenAI’s CLIP to this task—using CLIP for localization and detection across domains with zero fine-tuning.
Classification, Localization, and Detection
Image classification is one of the most straightforward tasks in visual recognition and the first step on the way to object detection. It consists of assigning a categorical label to an image.
We could have an image classification model that identifies animals and could classify images of dogs, cats, mice, etc. If we pass the above image into this model, we’d expect it to return the class “dog”.
Object localization takes this one step further by “localizing” the identified object.
When we localize the object, we identify the object’s coordinates on the image. That typically includes a set of patches where the object is located or a bounding box defined by (x, y) coordinates, box width, and box height.
Object detection can be thought of as the next step. With detection, we are localizing multiple object instances within the same image.
In the example above, we are detecting two different objects within the image, a cat and a dog. Both objects are localized, and the results are returned.
Object detection can also identify multiple instances of the same object in a single image. If we added another dog to the previous image, an object detection algorithm could detect two dogs and a single cat.
Zero Shot CLIP
OpenAI’s CLIP is a multi-modal model pretrained on a massive dataset of text-image pairs. It can identify text and images with similar meanings by encoding both modalities into a shared vector space.
CLIP’s broad pretraining means it can perform effectively across many domains. We can adjust the task being performed (i.e. from classification to detection) with just a few lines of code. A big part of this flexibility is thanks to the multi-modal vector embeddings built by CLIP.
These vector embeddings allow us to switch between text-to-image search, image classification, and object detection. We simply adjust how we preprocess data being fed into CLIP, or how we interpret the similarity scores between the CLIP embeddings. The model itself requires no modification.
For classification, we need to give CLIP a list of our class labels, and it will encode them into a vector space:
From there, we give CLIP the images we’d like to classify. CLIP will encode them in the same vector space, and we find which of the class label embeddings is nearest to our image embeddings.
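The nearest-label lookup described above can be sketched with small hand-written vectors standing in for real CLIP embeddings. The vectors, the labels, and the `classify` helper are illustrative assumptions, not CLIP outputs:

```python
import numpy as np

def classify(image_emb, label_embs, labels):
    """Return the label whose embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = label_embs @ image_emb  # one cosine similarity per label
    return labels[int(np.argmax(sims))], sims

labels = ["a dog", "a cat", "a mouse"]
# toy 3-d "embeddings" chosen by hand for illustration
label_embs = np.array([[1.0, 0.1, 0.0],
                       [0.1, 1.0, 0.0],
                       [0.0, 0.1, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])  # pretend this came from a dog photo

best, sims = classify(image_emb, label_embs, labels)
print(best)  # a dog
```

With real CLIP embeddings only the source of the vectors changes; the comparison step stays the same.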
We can apply similar logic to using CLIP in a zero-shot object localization setting. As before, we create a class label embedding like "a fluffy cat". But, unlike before, we don’t feed the entire image into CLIP.
To localize an object, we break the image into many small patches. We then pass a window over these patches, moving across the entire image and generating an image embedding for each unique window.
We can calculate the similarity between these patch image embeddings and our class label embeddings — returning a score for each patch.
After calculating the similarity scores for every patch, we collate them into a map of relevance across the entire image. We use that “map” to identify the location of the object of interest.
From there, we can recreate the traditional approach of creating a “bounding box” around the object.
Both of these visuals capture the same information but display it in different ways.
Occlusion is another method of localization where we slide a black patch across the image. The idea is that we identify similarity by the “absence” of an object.
If the black patch covers the object we are looking for, the similarity score will drop. We then take that position as the assumed location of our object.
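As a rough sketch of the occlusion idea, the snippet below slides a black square over a toy image and records how far a score drops at each position. The `occlusion_map` helper and the brightness-based `score_fn` are assumptions made for illustration; in practice the score would be the CLIP similarity between the occluded image and the text prompt.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4):
    """Slide a black square over `image` and record how much the score drops."""
    base = score_fn(image)
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    drops = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            # black out one patch-sized square
            occluded[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
            drops[r, c] = base - score_fn(occluded)
    return drops

# toy image: dark background with a bright "object" in rows 8-11, columns 4-7
image = np.zeros((16, 16))
image[8:12, 4:8] = 1.0

# stand-in for the CLIP similarity: total brightness of the image
drops = occlusion_map(image, score_fn=lambda im: im.sum())
row, col = np.unravel_index(np.argmax(drops), drops.shape)
print(row, col)  # 2 1
```

The largest drop occurs where the black square hides the object, which is the position we take as its location.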
There is a fine line between object localization and object detection. With object localization, we perform a “classification” of a single object followed by the localization of that object. With object detection, we perform localization for multiple classes and/or objects.
With our cat and butterfly image, we could search for two objects: "a fluffy cat" and "a butterfly". We use object localization to identify each individual object, but by iteratively identifying multiple objects, this becomes object detection.
We stick with the bounding box visualizations for object detection, as the other method makes it harder to visualize multiple objects within the same image.
We have covered the idea behind object localization and detection in a zero-shot setting with CLIP. Now let’s take a look at how to implement it.
Detection with CLIP
Before we move on to any classification, localization, or detection task, we need images to process. We will use a small demo dataset named jamescalam/image-text-demo hosted on Hugging Face datasets.
The dataset contains the image of a butterfly landing on a cat’s nose. We can view it in a Jupyter notebook with the following:
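A minimal sketch of the download step, assuming the Hugging Face `datasets` library is installed. The `load_demo_image` helper, the `"train"` split, and the `"image"` column name are assumptions, and `index` is whichever row holds the cat and butterfly picture:

```python
def load_demo_image(index, split="train"):
    """Download the demo dataset and return the image stored at `index`."""
    # imported here so that defining the helper needs neither the library
    # nor a network connection; both are needed once it is called
    from datasets import load_dataset

    data = load_dataset("jamescalam/image-text-demo", split=split)
    return data[index]["image"]  # "image" column name is an assumption
```

In a Jupyter notebook, leaving the returned PIL image as the last expression of a cell renders it inline.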
We have downloaded the image, but it is not in the format we need for localization. For that, we must break the image into smaller patches.
To create the patches, we must first convert our PIL image object into a PyTorch tensor. We can do this using torchvision.transforms.
Our tensor has 3 color channels (RGB), a height of 5184 pixels, and width of 3456 pixels.
Assuming each patch has an equal height and width of 256 pixels, we must reshape this tensor into a tensor of shape (1, 20, 13, 3, 256, 256), where 20 and 13 are the number of patches along the height and width of the image and 1 represents the batch dimension.
We first add the batch dimension and move the color channels' dimension behind the height and width dimensions.
Following this, we break up the image into horizontal patches first. All patches will be square with dimensions of 256x256, so the horizontal patch height equals 256 pixels.
We need one more unfold to create the vertical space between patches.
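The three reshaping steps can be sketched with `Tensor.unfold`. A smaller random tensor stands in for the real image here (an assumption to keep the example light); with the full (3, 5184, 3456) tensor the same calls produce the (1, 20, 13, 3, 256, 256) shape:

```python
import torch

patch = 256
img = torch.rand(3, 1100, 800)  # stand-in for the (3, 5184, 3456) image tensor

# unfolding the channel dim with a window of 3 leaves a leading dim of size 1
# (the batch dim) and appends the three color channels as the last dim
patches = img.unfold(0, 3, 3)
print(patches.shape)  # torch.Size([1, 1100, 800, 3])

# cut along the height into non-overlapping strips of `patch` pixels
patches = patches.unfold(1, patch, patch)
print(patches.shape)  # torch.Size([1, 4, 800, 3, 256])

# cut along the width to finish the square patches
patches = patches.unfold(2, patch, patch)
print(patches.shape)  # torch.Size([1, 4, 3, 3, 256, 256])
```

Pixels that do not fill a whole patch at the bottom and right edges are dropped, which is why 5184 and 3456 pixels yield 20 and 13 patches.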
Every patch is tiny, and looking at a single patch gives us little-to-no information about the image’s content. Rather than feeding single patches to CLIP, we merge multiple patches to create a big patch passed to CLIP.
We call this grouping of patches a window. A larger window size captures more global views of the image, whereas a smaller window can produce a more precise map at the risk of missing larger objects. To slide across the image and create a big_patch at each step, we do the following:
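One way to sketch the sliding window is a generator that merges each group of patches back into a single image. The `iter_windows` helper, the window size of 3, and the tiny 8x8-pixel patches are assumptions chosen for illustration:

```python
import torch

def iter_windows(patches, window, stride=1):
    """Yield (Y, X, big_patch) for each window position over the patch grid.

    `patches` has shape (1, rows, cols, 3, p, p); each `big_patch` has shape
    (window * p, window * p, 3).
    """
    _, rows, cols, channels, p, _ = patches.shape
    for Y in range(0, rows - window + 1, stride):
        for X in range(0, cols - window + 1, stride):
            block = patches[0, Y:Y + window, X:X + window]  # (w, w, 3, p, p)
            # interleave grid and pixel axes, then flatten into one image
            big_patch = block.permute(0, 3, 1, 4, 2).reshape(
                window * p, window * p, channels)
            yield Y, X, big_patch

# small stand-in: a 5x4 grid of 8x8-pixel patches
img = torch.rand(3, 40, 32)
patches = img.unfold(0, 3, 3).unfold(1, 8, 8).unfold(2, 8, 8)
windows = list(iter_windows(patches, window=3))
print(len(windows), windows[0][2].shape)  # 6 torch.Size([24, 24, 3])
```

Each merged image is channels-last, so it can be displayed directly or handed to an image preprocessor.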
We will re-use this logic later when creating our patch image embeddings. Before we do that, we must initialize CLIP.
CLIP and Localization
The Hugging Face transformers library contains an implementation of CLIP named openai/clip-vit-base-patch32. We can download and initialize it like so:
Note that we also move the model to a CUDA-enabled GPU if possible to reduce inference times.
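A sketch of the initialization, assuming the `transformers` library is installed. The `pick_device` and `load_clip` helper names are assumptions; `CLIPModel` and `CLIPProcessor` are the library's CLIP classes:

```python
import torch

model_id = "openai/clip-vit-base-patch32"

def pick_device():
    """Use a CUDA GPU when one is visible, otherwise fall back to the CPU."""
    return "cuda" if torch.cuda.is_available() else "cpu"

def load_clip(model_id=model_id):
    """Download CLIP and its processor, and move the model to the device."""
    # imported here so the library is only needed once the function is called
    from transformers import CLIPModel, CLIPProcessor

    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPModel.from_pretrained(model_id).to(pick_device())
    return model, processor
```

Calling `load_clip()` downloads the weights on first use, so it needs network access.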
With CLIP initialized, we can rerun the patch sliding logic, but this time we will calculate the similarity between each big_patch and the text label "a fluffy cat".
Here we have also added scores and runs that we will use to calculate the mean score for each patch. We calculate the scores tensor as the sum of every big_patch score calculated while the patches were within the window.
Some patches will be seen more often than others (for example, the top-left patch is seen once), so the scores will be much greater for patches viewed more frequently. That is why we use the runs tensor to keep track of the “visit frequency” for each patch. With both tensors populated, we calculate the mean score:
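The bookkeeping with scores and runs can be sketched without CLIP by passing in a stand-in scoring function. The `mean_patch_scores` helper, the window size of 6, and the constant `score_fn` are assumptions; in the walkthrough the score is the CLIP similarity between each big_patch and the text label:

```python
import torch

def mean_patch_scores(rows, cols, window, score_fn, stride=1):
    """Average the window scores over every patch each window covered."""
    scores = torch.zeros(rows, cols)
    runs = torch.zeros(rows, cols)
    for Y in range(0, rows - window + 1, stride):
        for X in range(0, cols - window + 1, stride):
            score = score_fn(Y, X)  # in the walkthrough: CLIP similarity
            scores[Y:Y + window, X:X + window] += score
            runs[Y:Y + window, X:X + window] += 1
    # clamp guards against division by zero for patches no window reached
    return scores / runs.clamp(min=1), runs

# stand-in scorer: every window scores 2.0
scores, runs = mean_patch_scores(20, 13, window=6, score_fn=lambda Y, X: 2.0)
print(runs[0, 0].item(), runs[10, 6].item())  # 1.0 36.0
```

The corner patch is visited once while a central patch is visited 36 times, yet dividing by runs gives both the same mean score.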
The scores tensor typically contains a smooth gradient of values as a byproduct of the scoring function sliding over each window. This means the scores gradually fade to 0.0 the further they are from the object of interest.
We cannot accurately visualize the object location with the current scores. Ideally, we should push low scores to zero while maintaining a range of values for higher scores. We can do this by clipping our outputs and normalizing the remaining values.
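A minimal sketch of the clip-and-normalize step. Using the mean score as the cut-off is an assumption made here for illustration, as is the `clip_and_normalize` helper name:

```python
import torch

def clip_and_normalize(scores):
    """Zero out below-average scores, then rescale the rest to [0, 1]."""
    clipped = torch.clamp(scores - scores.mean(), min=0)
    span = clipped.max() - clipped.min()
    if span == 0:  # flat map: nothing stands out
        return torch.zeros_like(clipped)
    return (clipped - clipped.min()) / span

raw = torch.tensor([[20.0, 21.0, 22.0],
                    [20.0, 26.0, 30.0]])
normalized = clip_and_normalize(raw)
print(normalized)
```

Scores at or below the mean become 0, and the highest remaining score becomes 1.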
With that, our patch scores are ready, and we can move on to visualizing the results.
Each patch in the (20, 13) patches tensor is assigned a similarity score within the range of 0 (not similar) to 1 (perfect match).
If we can align the scores with the original image pixels, we can multiply each pixel by its corresponding similarity score. Those near 0 will be dark, and those near 1 will maintain their original brightness.
The only problem is that these two tensors are not the same shape:
[Out]: (torch.Size([20, 13]), torch.Size([1, 20, 13, 3, 256, 256]))
We need to reshape patches to align with scores. To do that, we use squeeze to remove the batch dimension at position 0 and then re-order the dimensions using permute.
[Out]: torch.Size([256, 256, 3, 20, 13])
From there, we multiply the adjusted patches and scores to return the brightness-adjusted patches. These need to be permuted again to be visualized with matplotlib.
[Out]: torch.Size([20, 13, 3, 256, 256])
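The reshape, multiply, and reshape-back sequence can be sketched as below. Random tensors stand in for the real patches and scores, and the patches are 8x8 pixels rather than 256x256 to keep the example light (both are assumptions):

```python
import torch

# stand-ins with the walkthrough's 20x13 grid but 8x8-pixel patches
patches = torch.rand(1, 20, 13, 3, 8, 8)
scores = torch.rand(20, 13)

# drop the batch dim, then put the grid dims last so they line up with scores
adj_patches = patches.squeeze(0).permute(3, 4, 2, 0, 1)
print(adj_patches.shape)  # torch.Size([8, 8, 3, 20, 13])

# broadcasting scales every pixel of patch (y, x) by scores[y, x]
adj_patches = adj_patches * scores

# restore the (rows, cols, channels, height, width) layout for plotting
adj_patches = adj_patches.permute(3, 4, 2, 0, 1)
print(adj_patches.shape)  # torch.Size([20, 13, 3, 8, 8])
```

Broadcasting matches the trailing (20, 13) dimensions, which is why the grid dimensions are moved to the end before multiplying.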
Now we’re ready to visualize:
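One possible way to draw the patches is a grid of matplotlib subplots with no spacing between them. The `plot_patch_grid` helper and the figure sizing are assumptions, and a small random tensor stands in for the brightness-adjusted patches:

```python
import matplotlib.pyplot as plt
import torch

def plot_patch_grid(adj_patches):
    """Draw a (rows, cols, 3, h, w) patch tensor as a gap-free grid."""
    rows, cols = adj_patches.shape[:2]
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 0.5, rows * 0.5),
                             squeeze=False)
    for y in range(rows):
        for x in range(cols):
            # imshow expects channels last
            axes[y, x].imshow(adj_patches[y, x].permute(1, 2, 0).numpy())
            axes[y, x].axis("off")
    fig.subplots_adjust(wspace=0, hspace=0)
    return fig

fig = plot_patch_grid(torch.rand(4, 3, 3, 8, 8))
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.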
That works well. We can repeat the same but with the prompt "a butterfly" to return:
CLIP shows another good result and demonstrates how easy it is to add new labels to classification and localization tasks with CLIP.
Before moving on to object detection, we need to rework the visualization to handle multiple objects.
The standard way to outline objects for localization and detection is to use a bounding box. We will do the same using the scores calculated previously for the "a butterfly" prompt.
The bounding box requires a defined edge, unlike our previous visual, which had a more continuous fade to black. To do this, we need to set a threshold for what is positive or negative, and we will use 0.5.
We can now detect the non-zero positions with the np.nonzero function. The output values represent the coordinates of patches with scores > 0.5.
Because scores is indexed by patch row and then patch column, the first column holds the y-coordinates (rows) of non-zero positions, and the second column holds the respective x-coordinates (columns).
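The thresholding and lookup can be sketched on a small hand-made score map (the values are assumptions). With a NumPy array, `np.nonzero` returns one index array per dimension, so the arrays are stacked into two columns here; the scores array is indexed as (row, column), so column 0 of the result is the vertical patch index:

```python
import numpy as np

scores = np.array([[0.0, 0.1, 0.0, 0.0],
                   [0.0, 0.7, 0.9, 0.0],
                   [0.0, 0.6, 1.0, 0.2]])

detection = scores > 0.5  # boolean mask of "positive" patches

# stack the per-dimension index arrays into one (row, column) pair per patch
coords = np.transpose(np.nonzero(detection))
print(coords)
# [[1 1]
#  [1 2]
#  [2 1]
#  [2 2]]
```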
Our bounding box will take each of the edges produced by these non-zero coordinates.
We need the minimum and maximum x and y coordinates to find the box corners.
These give us the bounding box coordinates based on patches rather than pixels. To get the pixel coordinates (for the visual), we multiply the coordinates by the patch size in pixels (patch). After that, we calculate the box height and width.
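A sketch of the corner calculation, from patch indices to a pixel box. The `patch_box` helper is an assumption, as is adding 1 to the maximum indices so that the box reaches the far edge of the last positive patch:

```python
import numpy as np

def patch_box(detection, patch):
    """Return (x_min, y_min, width, height) in pixels for a boolean patch mask."""
    rows, cols = np.nonzero(detection)
    # +1 so the box reaches the far edge of the last positive patch
    y_min, y_max = int(rows.min()), int(rows.max()) + 1
    x_min, x_max = int(cols.min()), int(cols.max()) + 1
    # scale patch indices up to pixel coordinates
    x_min, x_max = x_min * patch, x_max * patch
    y_min, y_max = y_min * patch, y_max * patch
    return x_min, y_min, x_max - x_min, y_max - y_min

detection = np.zeros((20, 13), dtype=bool)
detection[3:6, 2:4] = True  # positive patches in rows 3-5, columns 2-3
print(patch_box(detection, patch=256))  # (512, 768, 512, 768)
```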
With the x_min, y_min, width, and height values we can use matplotlib.patches to create the bounding box. Before we do that, we convert the original PIL image into a matplotlib-friendly format.
Now we visualize everything together:
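The drawing step might look like the sketch below, which uses `matplotlib.patches.Rectangle`. The `show_box` helper, the line styling, and the random stand-in image are assumptions:

```python
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np

def show_box(image, x_min, y_min, width, height):
    """Plot a channels-last image with one rectangle drawn over it."""
    fig, ax = plt.subplots()
    ax.imshow(image)
    rect = mpatches.Rectangle((x_min, y_min), width, height,
                              linewidth=3, edgecolor="yellow", facecolor="none")
    ax.add_patch(rect)
    return fig, ax

chw = np.random.rand(3, 120, 80)  # stand-in for the image tensor as an array
image = np.moveaxis(chw, 0, -1)   # (3, H, W) -> (H, W, 3) for matplotlib
fig, ax = show_box(image, x_min=16, y_min=32, width=32, height=48)
```

The rectangle is anchored at its top-left corner because `imshow` places the image origin at the top left by default.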
There we have our bounding box visual.
We finally have everything we need to perform object detection for multiple object classes within the same image. The logic is a loop over what we have already built, and we can package it into a neater function like so:
(Find the full code here)
Now we pass a list of class labels and the image to detect. The function will return our image with each detected object annotated with a bounding box.
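The overall loop can be sketched as follows. This is not the packaged function from the walkthrough: the `detect` helper and its `score_map_fn` argument are assumptions, and hand-made score maps stand in for the normalized CLIP scores of each label:

```python
import numpy as np

def detect(labels, score_map_fn, patch, threshold=0.5):
    """Return {label: (x_min, y_min, width, height)} for each label found."""
    boxes = {}
    for label in labels:
        scores = score_map_fn(label)  # normalized patch scores in [0, 1]
        rows, cols = np.nonzero(scores > threshold)
        if rows.size == 0:
            continue  # nothing above the threshold for this label
        x_min, y_min = int(cols.min()) * patch, int(rows.min()) * patch
        x_max = (int(cols.max()) + 1) * patch
        y_max = (int(rows.max()) + 1) * patch
        boxes[label] = (x_min, y_min, x_max - x_min, y_max - y_min)
    return boxes

# stand-in score maps; in the walkthrough these come from CLIP
fake_maps = {label: np.zeros((20, 13))
             for label in ("a fluffy cat", "a butterfly", "a dog")}
fake_maps["a fluffy cat"][8:16, 3:11] = 0.9
fake_maps["a butterfly"][5:8, 6:9] = 0.8

boxes = detect(list(fake_maps), fake_maps.__getitem__, patch=256)
print(boxes)
```

Labels with no patch above the threshold, such as "a dog" here, are left out of the result.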
The current implementation is limited to displaying a single object from each class, but this can be solved with a small amount of additional logic.
That’s it for this walkthrough of zero-shot object localization and detection with OpenAI’s CLIP. Zero-shot opens the doors to many organizations and domains that could not perform good object detection due to a lack of training data or compute resources — which is the case for the vast majority of companies.
Multi-modality and CLIP are just part of a trend towards more broadly applicable ML with a much lower barrier to entry. Zero-to-few-shot learning unlocks those previously inaccessible projects and presents us with what will undoubtedly be a giant leap forward in ML capability and adoption across the globe.
 O. Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge (2014)
 A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks (2012), NeurIPS
 A. Radford, J. Kim, et al., Learning Transferable Visual Models From Natural Language Supervision (2021)
 F. Bianchi, Domain-Specific Multi-Modal Machine Learning with CLIP (2022), Pinecone Workshop
 R. Pisoni, Searching Across Images and Text: Intro to OpenAI’s CLIP (2022), Pinecone Workshop