# Zero Shot Object Detection with OpenAI's CLIP

> How to apply CLIP to object detection in a zero-shot setting.

The **I**mageNet **L**arge **S**cale **V**isual **R**ecognition **C**hallenge (ILSVRC)[1] was a world-changing competition hosted annually from 2010 until 2017. During this time, the competition acted as the catalyst for the explosion of deep learning[2] and was the place to find state-of-the-art image classification, object localization, and object detection.

Researchers fine-tuned better-performing computer vision (CV) models to achieve ever more impressive results year after year. But there was an unquestioned assumption causing problems.

We assumed that every new task required model fine-tuning, that fine-tuning required _a lot_ of data, and that data cost both time and capital.

It wasn’t until very recently that this assumption was questioned and proven wrong.

The astonishing rise of multi-modal models has made the impossible possible across various domains and tasks. One of those is zero-shot object detection and localization.

“Zero-shot” means applying a model without the need for fine-tuning. We take a multi-modal model and use it to detect objects in one domain, then switch to an entirely different domain _without_ the model seeing a single training example from the new domain.

Not needing a single training example means we completely skip the hard part of data annotation and model training. We can focus solely on application of our models.

In this chapter, we will explore how to apply OpenAI’s CLIP to this task—using CLIP for localization and detection across domains with _zero_ fine-tuning.

[Video](https://www.youtube.com/watch?v=i3OYlaoj-BM)


---

## Classification, Localization, and Detection

[Image classification](https://www.pinecone.io/learn/series/image-search/zero-shot-image-classification-clip/) is one of the most straightforward tasks in visual recognition and the first step on the way to object detection. It consists of assigning a categorical label to an image.

![Image classified as (1) “dog”.](https://cdn.sanity.io/images/vr8gru94/production/9edbadc5922e74f402f946a08594b4cc2238604e-2749x1164.png)


We could have an image classification model that identifies animals and could classify images of dogs, cats, mice, etc. If we pass the above image into this model, we’d expect it to return the class _“dog”_.

**Object localization** takes this one step further by _“localizing”_ the identified object.

![(1) Image classified as "dog" and (2) object localized.](https://cdn.sanity.io/images/vr8gru94/production/fe40408fa76700a5e6cd0d73cd2609d389aa67b0-2747x1173.png)


When we _localize_ the object, we identify the object’s coordinates on the image. That typically includes a set of patches where the object is located or a bounding box defined by ($x,y$) coordinates, box width, and box height.
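For illustration, a bounding box in this format can be captured in a small structure (this `BoundingBox` class is hypothetical, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x: int       # top-left corner x, in pixels
    y: int       # top-left corner y, in pixels
    width: int   # box width, in pixels
    height: int  # box height, in pixels

# e.g. a box whose top-left corner sits at pixel (256, 768)
box = BoundingBox(x=256, y=768, width=1024, height=2048)
```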

**Object detection** can be thought of as the next step. With detection, we are _localizing_ multiple object instances within the same image.

![(1) Object localized and classified as "cat" and (2) object localized and classified as "dog".](https://cdn.sanity.io/images/vr8gru94/production/1153cd23011f0a2fe8c46d2eaca58449db347fc9-2698x1305.png)


In the example above, we are detecting two different objects within the image, a cat _and_ a dog. Both objects are localized, and the results are returned.

Object detection can also identify multiple instances of the _same_ object in a single image. If we added another dog to the previous image, an object detection algorithm could detect two dogs and a single cat.

## Zero Shot CLIP

[OpenAI’s CLIP](https://www.pinecone.io/learn/series/image-search/clip/) is a _multi-modal_ model pretrained on a massive dataset of text-image pairs [3]. It can identify text and images with similar meanings by encoding both modalities into a shared vector space.

![CLIP is able to encode different text and images into the same vector space.](https://cdn.sanity.io/images/vr8gru94/production/c329639c27a4385d2a254cd1df791ce0169f8e87-1620x852.png)


CLIP’s broad pretraining means it can perform effectively across many domains. We can adjust the task being performed (i.e. from classification to detection) with just a few lines of code. A big part of this flexibility is thanks to the multi-modal vector embeddings built by CLIP.

These vector embeddings allow us to switch between [text-to-image search](https://www.pinecone.io/learn/series/image-search/clip/), [image classification](https://www.pinecone.io/learn/series/image-search/zero-shot-image-classification-clip/), and object detection. We simply adjust how we preprocess data being fed into CLIP, or how we interpret the similarity scores between the CLIP embeddings. The model itself requires no modification.

For classification, we need to give CLIP a list of our class labels, and it will encode them into a vector space:

![By encoding both images and class labels into the same vector space, we can identify each image’s most similar class label.](https://cdn.sanity.io/images/vr8gru94/production/1bbd5d8f31158ecd81c2e746a49e93c7c01b6115-1858x896.png)


From there, we give CLIP the images we’d like to classify. CLIP will encode them in the same vector space, and we find which of the class label embeddings is nearest to our image embeddings.
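This nearest-label lookup reduces to a cosine-similarity comparison between embeddings. A minimal sketch, assuming we already have CLIP-produced vectors (the `image_emb` and `label_embs` names here are hypothetical placeholders):

```python
import torch

def classify(image_emb: torch.Tensor, label_embs: torch.Tensor, labels: list) -> str:
    # normalize so the dot product equals cosine similarity
    image_emb = image_emb / image_emb.norm()
    label_embs = label_embs / label_embs.norm(dim=-1, keepdim=True)
    # similarity between the image and every class label
    sims = label_embs @ image_emb
    # the nearest label embedding wins
    return labels[sims.argmax().item()]
```

With real CLIP embeddings, `labels` would hold the text prompts and `label_embs` the vectors CLIP encoded for them.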

### Object Localization

We can apply similar logic to using CLIP in a zero-shot object localization setting. As before, we create a class label embedding like `"a fluffy cat"`. But, unlike before, we don’t feed the entire image into CLIP.

To localize an object, we break the image into many small patches. We then pass a `window` over these patches, moving across the entire image and generating an image embedding for each window position.

We can calculate the similarity between these patch image embeddings and our class label embeddings — returning a score for each patch.

After calculating the similarity scores for every patch, we collate them into a map of relevance across the entire image. We use that “map” to identify the location of the object of interest.

![We split images into small patches, which we can use to create mini-images that are encoded and compared to each encoded label. Producing a set of scores for each part of the image.](https://cdn.sanity.io/images/vr8gru94/production/be4800918976efd9d974d9e5453985a5106f2558-2389x1455.png)


From there, we can recreate the traditional approach of creating a “bounding box” around the object.

![We can use the scored patches of the image to find a bounding box that encapsulates the object of interest.](https://cdn.sanity.io/images/vr8gru94/production/f2cc59b9623c22b2def1eec1dfb625ffa69f91a5-609x810.png)


Both of these visuals capture the same information but display it in different ways.

#### Occlusion Algorithm

Occlusion is another method of localization where we slide a black patch across the image. The idea is that we identify the object through the “absence” of similarity when it is covered [4][5].

[Video](https://d33wubrfki0l68.cloudfront.net/d139e5634d1991c9b5a4364e42498b34f42c7134/975d8/images/zero-shot-object-detection-clip-8.mp4)


If the black patch covers the object we are looking for, the similarity score will drop. We then take that position as the assumed location of our object.
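The occlusion loop can be sketched as follows. The `score_fn` here is a hypothetical stand-in for CLIP's image-text similarity; in practice it would embed the occluded image with CLIP and compare it against the label embedding:

```python
import torch

def occlusion_map(image: torch.Tensor, score_fn, patch: int = 64) -> torch.Tensor:
    """Slide a black square across the image; where the similarity
    score drops the most is the assumed object location."""
    _, H, W = image.shape
    base = score_fn(image)  # score for the unmodified image
    drops = torch.zeros(H // patch, W // patch)
    for i, y in enumerate(range(0, H - patch + 1, patch)):
        for j, x in enumerate(range(0, W - patch + 1, patch)):
            occluded = image.clone()
            occluded[:, y:y + patch, x:x + patch] = 0.0  # black out one patch
            drops[i, j] = base - score_fn(occluded)  # large drop => object here
    return drops
```

The patch whose occlusion causes the largest score drop marks the assumed object location.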

### Object Detection

There is a fine line between object localization and object detection. With object localization, we perform a “classification” of a single object followed by the localization of that object. With object detection, we perform localization for multiple classes and/or objects.

With our cat and butterfly image, we could search for two objects; `"a fluffy cat"` and `"a butterfly"`. We use object localization to identify each _individual_ object, but by iteratively identifying multiple objects, this becomes _object detection_.
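In code, detection is just a loop: one patch-level score map per prompt, each thresholded into its own bounding box. A minimal sketch, assuming `score_maps` maps each prompt to a `(rows, cols)` score map like the ones built later in this chapter:

```python
import torch

def box_from_scores(scores: torch.Tensor, threshold: float = 0.5, patch: int = 256):
    """Turn one prompt's patch-level score map into a pixel bounding box."""
    coords = torch.nonzero(scores > threshold)  # (y, x) of positive patches
    y_min, x_min = coords.min(dim=0).values.tolist()
    y_max, x_max = (coords.max(dim=0).values + 1).tolist()
    # convert patch coordinates to pixels: (x, y, width, height)
    return (x_min * patch, y_min * patch,
            (x_max - x_min) * patch, (y_max - y_min) * patch)

def detect(score_maps: dict) -> dict:
    # one localization pass per prompt makes this object detection
    return {prompt: box_from_scores(scores) for prompt, scores in score_maps.items()}
```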

![Object detection differs from localization by allowing the detection of multiple classes and multiple objects belonging to each class.](https://cdn.sanity.io/images/vr8gru94/production/62007031e68f525c7fdd6c5f8551c6b3cdc1ce7e-609x810.png)


We stick with the bounding box visualizations for object detection, as the other method makes it harder to visualize multiple objects within the same image.

We have covered the idea behind object localization and detection in a zero-shot setting with CLIP. Now let’s take a look at how to implement it.

### Detection with CLIP

Before we move on to any classification, localization, or detection task, we need images to process. We will use a small demo dataset named `jamescalam/image-text-demo` hosted on Hugging Face _datasets_.

```python
# import dataset
from datasets import load_dataset

data = load_dataset(
    "jamescalam/image-text-demo",
    split="train",
    revision="180fdae"
)
data
```

```
[Out]: Dataset({
    features: ['text', 'image'],
    num_rows: 21
})
```

The dataset contains the image of a butterfly landing on a cat’s nose. We can view it in a Jupyter notebook with the following:

```python
data[2]['image']
```

![The image we will be using for image localization and detection.](https://cdn.sanity.io/images/vr8gru94/production/586f7dd4910e8845c38e38669be5b6d743f71ad7-921x1362.png)


We have downloaded the image, but it is not in the format we need for localization. For that, we must break the image into smaller patches.

#### Creating Patches

To create the patches, we must first convert our PIL image object into a PyTorch tensor. We can do this using `torchvision.transforms`.

```python
from torchvision import transforms

# transform the image into tensor
transt = transforms.ToTensor()

img = transt(data[2]["image"])
img.data.shape
```

`[Out]: torch.Size([3, 5184, 3456])`

Our tensor has `3` color channels (RGB), a height of `5184` pixels, and a width of `3456` pixels.

Assuming each patch has an equal height and width of 256 pixels, we must reshape this tensor into a tensor of shape `(1, 20, 13, 3, 256, 256)`, where _20_ and _13_ are the number of patches along the height and width of the image and _1_ represents the batch dimension.
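As a quick check on where _20_ and _13_ come from: `unfold` keeps only complete patches, so any remainder pixels at the right and bottom edges are dropped.

```python
# patch counts come from integer division of the image dimensions
img_h, img_w, patch = 5184, 3456, 256

print(img_h // patch, img_h % patch)  # 20 patches high, 64 pixels dropped
print(img_w // patch, img_w % patch)  # 13 patches wide, 128 pixels dropped
```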

We first add the batch dimension and move the color channels' dimension behind the height and width dimensions.

```python
# add batch dimension and shift color channels
patches = img.data.unfold(0, 3, 3)
patches.shape
```

`[Out]: torch.Size([1, 5184, 3456, 3])`

Next, we break the image into horizontal strips. All patches will be square with dimensions of _256x256_ pixels, so each strip is _256_ pixels high.

```python
# break the image into patches (in height dimension)
patch = 256

patches = patches.unfold(1, patch, patch)
patches.shape
```

`[Out]: torch.Size([1, 20, 3456, 3, 256])`

![Cat image split into 256 pixel high strips.](https://cdn.sanity.io/images/vr8gru94/production/6b65fdbc7c1878d8a3cbfe510638aa9589c3da35-748x1244.png)


We need one more unfold to split these strips along the width dimension, giving us square patches.

```python
# break the image into patches (in width dimension)
patches = patches.unfold(2, patch, patch)
patches.shape
```

`[Out]: torch.Size([1, 20, 13, 3, 256, 256])`

![Cat image split into small 256x256 pixel patches.](https://cdn.sanity.io/images/vr8gru94/production/a2bd725e4d4e84fc99d2113216dfd9076bd3dda3-930x1434.png)


Every patch is tiny, and looking at a single patch gives us little-to-no information about the image’s content. Rather than feeding single patches to CLIP, we merge multiple patches to create a big patch passed to CLIP.

![The first 6x6 window viewed by CLIP.](https://cdn.sanity.io/images/vr8gru94/production/a8b3638cd76b8c92f728f71bb81b62bb584e4beb-1171x1784.png)


We call this grouping of patches a `window`. A larger `window` size captures more global views of the image, whereas a smaller `window` can produce a more precise map at the risk of missing larger objects. To slide across the image and create a `big_patch` at each step, we do the following:

```python
import torch
import matplotlib.pyplot as plt

window = 6
stride = 1

# window slides from top to bottom
for Y in range(0, patches.shape[1]-window+1, stride):
    # window slides from left to right
    for X in range(0, patches.shape[2]-window+1, stride):
        # initialize an empty big_patch array
        big_patch = torch.zeros(patch*window, patch*window, 3)
        # get the current batch of patches that will make big_patch
        patch_batch = patches[0, Y:Y+window, X:X+window]
        # loop through each patch in the current batch
        for y in range(patch_batch.shape[0]):
            for x in range(patch_batch.shape[1]):
                # add patch to big_patch
                big_patch[
                    y*patch:(y+1)*patch, x*patch:(x+1)*patch, :
                ] = patch_batch[y, x].permute(1, 2, 0)
        # display current big_patch
        plt.imshow(big_patch)
        plt.show()
```

[Video](https://d33wubrfki0l68.cloudfront.net/d8705cf07458a52951fc09e3094ecbaae6406a30/448d7/images/zero-shot-object-detection-clip-14.mp4)


We will re-use this logic later when creating our patch image embeddings. Before we do that, we must initialize CLIP.

#### CLIP and Localization

The Hugging Face _transformers_ library contains an implementation of CLIP named [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32). We can download and initialize it like so:

```python
from transformers import CLIPProcessor, CLIPModel
import torch

# define processor and model
model_id = "openai/clip-vit-base-patch32"

processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

# move model to device if possible
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)
```

Note that we also move the model to a CUDA-enabled GPU _if possible_ to reduce inference times.

With CLIP initialized, we can rerun the patch sliding logic, but this time we will calculate the similarity between each `big_patch` and the text label `"a fluffy cat"`.

```python
window = 6
stride = 1

scores = torch.zeros(patches.shape[1], patches.shape[2])
# initialized to ones so we never divide by zero for unvisited patches
runs = torch.ones(patches.shape[1], patches.shape[2])

for Y in range(0, patches.shape[1]-window+1, stride):
    for X in range(0, patches.shape[2]-window+1, stride):
        big_patch = torch.zeros(patch*window, patch*window, 3)
        patch_batch = patches[0, Y:Y+window, X:X+window]
        for y in range(window):
            for x in range(window):
                big_patch[
                    y*patch:(y+1)*patch, x*patch:(x+1)*patch, :
                ] = patch_batch[y, x].permute(1, 2, 0)
        # we preprocess the image and class label with the CLIP processor
        inputs = processor(
            images=big_patch,  # big patch image sent to CLIP
            return_tensors="pt",  # tell CLIP to return pytorch tensor
            text="a fluffy cat",  # class label sent to CLIP
            padding=True
        ).to(device)  # move to device if possible

        # calculate and retrieve similarity score (no gradients needed)
        with torch.no_grad():
            score = model(**inputs).logits_per_image.item()
        # sum up similarity scores from current and previous big patches
        # that were calculated for patches within the current window
        scores[Y:Y+window, X:X+window] += score
        # count the number of times each patch inside the window is scored
        runs[Y:Y+window, X:X+window] += 1
```

Here we have also added `scores` and `runs` that we will use to calculate the _mean_ score for each patch. We calculate the `scores` tensor as the sum of every `big_patch` score calculated while the patches were within the `window`.

Some patches will be seen more often than others (for example, the top-left patch is seen once), so the scores will be much greater for patches viewed more frequently. That is why we use the `runs` tensor to keep track of the “visit frequency” for each patch. With both tensors populated, we calculate the mean score:

```python
scores /= runs
```

The `scores` tensor typically contains a smooth gradient of values as a byproduct of the scoring function sliding over each window. This means the scores gradually fade to `0.0` the further they are from the object of interest.

We cannot accurately visualize the object location with the current scores. Ideally, we should push low scores to zero while maintaining a range of values for higher scores. We can do this by clipping our outputs and normalizing the remaining values.

```python
import numpy as np

# zero out any scores below the mean
scores = np.clip(scores - scores.mean(), 0, np.inf)

# min-max normalize the remaining scores to the range [0, 1]
scores = (scores - scores.min()) / (scores.max() - scores.min())
```

![After clipping and normalization we return a more useful visual (right).](https://cdn.sanity.io/images/vr8gru94/production/a7a86913765c70dfa415ffbf28bdedaf49a89699-1706x913.png)


With that, our patch scores are ready, and we can move on to visualizing the results.

#### Visualize Localization

Each patch in the `(20, 13)` patches tensor is assigned a similarity score within the range of $0$ (not similar) to $1$ (perfect match).

If we can align the scores with the original image pixels, we can multiply each pixel by its corresponding similarity score. Pixels with scores near $0$ will be dark, and those near $1$ will maintain their original brightness.

The only problem is that these two tensors are _not_ the same shape:

```python
scores.shape, patches.shape
```

`[Out]: (torch.Size([20, 13]), torch.Size([1, 20, 13, 3, 256, 256]))`

We need to reshape `patches` to align with scores. To do that, we use `squeeze` to remove the batch dimension at position `0` and then re-order the dimensions using `permute`.

```python
# transform the patches tensor
adj_patches = patches.squeeze(0).permute(3, 4, 2, 0, 1)
adj_patches.shape
```

`[Out]: torch.Size([256, 256, 3, 20, 13])`

From there, we multiply the adjusted patches and `scores` to return the brightness-adjusted patches. These need to be permuted again to be visualized with `matplotlib`.

```python
# multiply patches by scores
adj_patches = adj_patches * scores

# rotate patches to visualize
adj_patches = adj_patches.permute(3, 4, 2, 0, 1)
adj_patches.shape
```

`[Out]: torch.Size([20, 13, 3, 256, 256])`

Now we’re ready to visualize:

```python
Y = adj_patches.shape[0]
X = adj_patches.shape[1]

fig, ax = plt.subplots(Y, X, figsize=(X*.5, Y*.5))
for y in range(Y):
    for x in range(X):
        ax[y, x].imshow(adj_patches[y, x].permute(1, 2, 0))
        ax[y, x].axis("off")
        ax[y, x].set_aspect('equal')
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
```

![Object localization for “a fluffy cat”.](https://cdn.sanity.io/images/vr8gru94/production/b8e29ccaf47e9ac9c6d68fdda15a7f82299412e5-894x1328.png)


That works well. We can repeat the same process with the prompt `"a butterfly"` to return:

![Object localization for “a butterfly”.](https://cdn.sanity.io/images/vr8gru94/production/df6a8d683c8d42ffc3dc8643506a91c4704fa348-894x1328.png)


Another good result from CLIP, demonstrating how easy it is to add new labels to classification and localization tasks.

#### Bounding Box

Before moving on to object detection, we need to rework the visualization to handle multiple objects.

The standard way to outline objects for localization and detection is to use a bounding box. We will do the same using the `scores` calculated previously for the `"a butterfly"` prompt.

The bounding box requires a defined edge, unlike our previous visual, which had a more continuous fade to black. To do this, we need to set a threshold for what is positive or negative, and we will use `0.5`.

```python
# scores higher than 0.5 are positive
detection = scores > 0.5
```

We can now detect the non-zero positions with the `np.nonzero` function. The output values represent the (y, x) coordinates of patches with `scores > 0.5`.

```python
# non-zero positions
np.nonzero(detection)
```

```
[Out]: tensor([[ 3,  2],
        [ 3,  3],
        [ 4,  1],
        [ 4,  2],
        [ 4,  3],
        [ 4,  4],
        [ 5,  1],
        [ 5,  2],
        [ 5,  3],
        [ 5,  4],
        [ 6,  1],
        [ 6,  2],
        [ 6,  3],
        [ 6,  4],
        [ 7,  1],
        [ 7,  2],
        [ 7,  3],
        [ 7,  4],
        [ 8,  1],
        [ 8,  2],
        [ 8,  3],
        [ 9,  1],
        [ 9,  2],
        [ 9,  3],
        [10,  2]])
```

The first column contains the _y-coordinates_ of the non-zero positions, and the second column contains the respective _x-coordinates_.

![Detection coordinates created by detection = scores > 0.5.](https://cdn.sanity.io/images/vr8gru94/production/a7569bf6103de06af871b8595212d5ca71876972-1708x2554.png)


Our bounding box spans the outer edges of these non-zero coordinates.

![Bounding box](https://cdn.sanity.io/images/vr8gru94/production/c93b4ce33239fe41b2d0fbc96ea397efabe5fb7c-1740x2580.png)


We need the minimum and maximum x and y coordinates to find the box corners.

```python
y_min, y_max = (
    np.nonzero(detection)[:, 0].min().item(),
    np.nonzero(detection)[:, 0].max().item() + 1
)
y_min, y_max
```

`[Out]: (3, 11)`

```python
x_min, x_max = (
    np.nonzero(detection)[:, 1].min().item(),
    np.nonzero(detection)[:, 1].max().item() + 1
)
x_min, x_max
```

`[Out]: (1, 5)`

![The min and max values of  x , y  coordinates give us our bounding box corners within the patches array.](https://cdn.sanity.io/images/vr8gru94/production/61d89118bc785e1e41b4dfe07fd0dad975de38c7-1708x2554.png)


These give us the bounding box coordinates based on patches rather than pixels. To get the pixel coordinates (for the visual), we multiply the coordinates by `patch`. After that, we calculate the box `height` and `width`.

```python
y_min *= patch
y_max *= patch
x_min *= patch
x_max *= patch
x_min, y_min
```

`[Out]: (256, 768)`

```python
height = y_max - y_min
width = x_max - x_min

height, width
```

`[Out]: (2048, 1024)`

![Bounding box coordinates](https://cdn.sanity.io/images/vr8gru94/production/403a43ccf8ae3444ff2c0b5bdd65fb081975448b-1708x2554.png)


With the `x_min`, `y_min`, `width`, and `height` values, we can use `matplotlib.patches` to create the bounding box. Before we do that, we convert the image tensor into a `matplotlib`-friendly NumPy array by moving the color channel to the final dimension.

```python
# image shape
img.data.numpy().shape  # (3, 5184, 3456)
```

```python
# move color channel to final dim
image = np.moveaxis(img.data.numpy(), 0, -1)
image.shape  # (5184, 3456, 3)
```

Now we visualize everything together:

```python
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(Y*0.5, X*0.5))

ax.imshow(image)

# Create a Rectangle patch
rect = patches.Rectangle(
    (x_min, y_min), width, height,
    linewidth=3, edgecolor='#FAFF00', facecolor='none'
)

# Add the patch to the Axes
ax.add_patch(rect)

plt.show()
```

![Bounding box visual on a butterfly](https://cdn.sanity.io/images/vr8gru94/production/c01610aed767d1e42eb01575a71dfacc1d7f7097-409x578.png)


There we have our bounding box visual.

### Object Detection

We finally have everything we need to perform **object detection** for multiple object classes within the same image. The logic is a loop over what we have already built, and we can package it into a neater function like so:

```python
def detect(prompts, img, patch_size=256, window=6, stride=1, threshold=0.5):
    # build image patches for detection
    img_patches = get_patches(img, patch_size)
    # convert image to format for displaying with matplotlib
    image = np.moveaxis(img.data.numpy(), 0, -1)
    # initialize plot to display image + bounding boxes
    fig, ax = plt.subplots(figsize=(Y*0.5, X*0.5))
    ax.imshow(image)
    # process image through object detection steps
    for i, prompt in enumerate(tqdm(prompts)):
        scores = get_scores(img_patches, prompt, window, stride)
        x, y, width, height = get_box(scores, patch_size, threshold)
        # create the bounding box (colors is a list of edge colors defined earlier)
        rect = patches.Rectangle(
            (x, y), width, height,
            linewidth=3, edgecolor=colors[i], facecolor='none'
        )
        # add the patch to the Axes
        ax.add_patch(rect)
    plt.show()
```

_(Find the_ _[full code here](https://github.com/pinecone-io/examples/blob/master/learn/search/image/image-retrieval-ebook/clip-object-detection/zero-shot-object-detection-clip.ipynb))_

Now we pass a list of class labels and the image to `detect`. The function will display our image with each detected object annotated with a bounding box.

```python
detect(["a cat", "a butterfly"], img, window=4, stride=1)
```

![Object detection](https://cdn.sanity.io/images/vr8gru94/production/96fc7bb855f240f6f082e17c5d4b3873b9889dd3-964x1378.png)


The current implementation is limited to displaying a single object from each class, but this can be solved with a small amount of additional logic.
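One way to remove that limitation is to draw a box for every *connected region* of high-scoring patches rather than only the single best one. The sketch below is a hypothetical `get_boxes` helper (not part of the chapter's code); it assumes `scores` is the 2D patch-level score map built earlier and uses `scipy.ndimage` to label each region separately:

```python
import numpy as np
from scipy import ndimage

def get_boxes(scores, patch_size, threshold=0.5):
    """Return one (x_min, y_min, width, height) box per detected region."""
    # threshold the patch-level score map into a binary detection mask
    mask = scores > threshold
    # label each connected region of high-scoring patches
    labeled, num_regions = ndimage.label(mask)
    boxes = []
    for region in ndimage.find_objects(labeled):
        y_slice, x_slice = region
        # convert patch indices back to pixel coordinates
        x_min = x_slice.start * patch_size
        y_min = y_slice.start * patch_size
        width = (x_slice.stop - x_slice.start) * patch_size
        height = (y_slice.stop - y_slice.start) * patch_size
        boxes.append((x_min, y_min, width, height))
    return boxes

# two separate high-scoring regions in a 6x6 patch grid
scores = np.zeros((6, 6))
scores[0:2, 0:2] = 1.0
scores[4:6, 3:6] = 1.0
print(get_boxes(scores, patch_size=256))
# → [(0, 0, 512, 512), (768, 1024, 768, 512)]
```

Swapping this in for `get_box` inside `detect`, and drawing one rectangle per returned box, would annotate every instance of a class rather than only the first.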

---

That’s it for this walkthrough of _zero-shot_ object localization and detection with OpenAI’s CLIP. Zero-shot learning opens the door for organizations and domains that previously could not implement effective object detection due to a lack of training data or compute resources, which is the case for the vast majority of companies.

Multi-modality and CLIP are just part of a trend towards more broadly applicable ML with a much lower barrier to entry. Zero-to-few-shot learning unlocks those previously inaccessible projects and presents us with what will undoubtedly be a giant leap forward in ML capability and adoption across the globe.

## Resources

[Code Notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/image/image-retrieval-ebook/clip-object-detection/zero-shot-object-detection-clip.ipynb)

[1] O. Russakovsky et al., [ImageNet Large Scale Visual Recognition Challenge](https://arxiv.org/abs/1409.0575) (2014)

[2] A. Krizhevsky et al., [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) (2012), NeurIPS

[3] A. Radford, J. Kim, et al., [Learning Transferable Visual Models From Natural Language Supervision](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf) (2021)

[4] F. Bianchi, [Domain-Specific Multi-Modal Machine Learning with CLIP](https://youtu.be/uqRSc-KSA1Y?t=1841) (2022), Pinecone Workshop

[5] R. Pisoni, [Searching Across Images and Text: Intro to OpenAI’s CLIP](https://youtu.be/W11lSifqJDs?t=1690) (2022), Pinecone Workshop