Image Similarity Search

In this example, we will walk you through how to create an image similarity search backend service. The structure of this notebook is as follows:

  1. Install dependencies and set up Pinecone.
  2. Download an image dataset (e.g., Tiny ImageNet).
  3. Download a pre-trained computer vision (CV) model (e.g., squeezenet).
  4. Using the CV model, convert images to their vector embeddings.
  5. Create a Pinecone vector index service, which serves as the backend service.
  6. Upload and index your image vectors.
  7. Perform image similarity search and review the results.
  8. Stop the Pinecone service.

Open Notebook in Google Colab

Prerequisites

Install dependencies.

!pip install -qU pip pinecone-client torchvision matplotlib==3.1.3 tqdm requests pandas ipywidgets

Set up Pinecone.

import pinecone
import os

api_key = os.getenv("PINECONE_API_KEY") or "USE_YOUR_API_KEY"
pinecone.init(api_key=api_key)

(Need an API key? Get it here.)

Check Pinecone version compatibility.

import pinecone.info

version_info = pinecone.info.version()
server_version = ".".join(version_info.server.split(".")[:2])
client_version = ".".join(version_info.client.split(".")[:2])
notebook_version = "0.8"

assert (
    notebook_version == server_version
), "This notebook is outdated. Consider using the latest version of the notebook."
assert client_version == server_version, "Please upgrade pinecone-client."

Data preparation

In this example, we will use the Tiny ImageNet dataset, which has 100,000 training images across 200 classes. Each class has 500 training images.

import requests, os, zipfile

DATA_DIR = "tmp"
IMAGENET_DIR = f"{DATA_DIR}/tiny-imagenet-200"
IMAGENET_ZIP = f"{DATA_DIR}/tiny-imagenet-200.zip"
IMAGENET_URL = "http://cs231n.stanford.edu/tiny-imagenet-200.zip"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(IMAGENET_DIR):
        if not os.path.exists(IMAGENET_ZIP):
            r = requests.get(IMAGENET_URL)  # download the zip archive
            with open(IMAGENET_ZIP, "wb") as f:
                f.write(r.content)

        with zipfile.ZipFile(IMAGENET_ZIP, "r") as zip_ref:
            zip_ref.extractall(DATA_DIR)


download_data()

torchvision is a popular library from PyTorch (torch) that provides convenient functions for loading and transforming images. Here we use torchvision to select images that belong to a random subset of image classes.

from torchvision import datasets
import random

random_seed = 123
random.seed(random_seed)

# Select a random sample of image classes
image_classes = set(random.sample(range(200), 5))

# Get the corresponding image file names
image_file_names = [
    file_name
    for file_name, label in datasets.ImageFolder(f"{IMAGENET_DIR}/train").imgs
    if label in image_classes
]

Let’s look at some of the selected images. We’ll write two utility functions to display the images, which will also be useful later.

import matplotlib.pyplot as plt
from PIL import Image


def show_images_horizontally(file_names):
    """Display the given images side by side in a single row."""
    m = len(file_names)
    fig, ax = plt.subplots(1, m)
    fig.set_figwidth(1.5 * m)
    for a, f in zip(ax, file_names):
        a.imshow(Image.open(f))
        a.axis("off")
    plt.show()


def show_image(file_name):
    """Display a single image at a small fixed width."""
    fig, ax = plt.subplots(1, 1)
    fig.set_figwidth(1.3)
    ax.imshow(Image.open(file_name))
    ax.axis("off")

for i in range(5):
    show_images_horizontally(random.sample(image_file_names, 10))


Convert images to embeddings using a pre-trained neural network model

Vector embeddings for images are representations of images as high dimensional vectors.

A typical image file consists of pixel intensity values in three channels: red, green, and blue. A naïve way of converting an image file to a vector is to “flatten” it, that is, arrange all pixel values in a single row. Sadly, this is practically useless for most machine learning applications.
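To make the flattening idea concrete, here is a minimal sketch assuming a 64×64 RGB image (the size of Tiny ImageNet images) stored as a NumPy array:

```python
import numpy as np

# A hypothetical 64x64 RGB image: height x width x color channels
image = np.zeros((64, 64, 3), dtype=np.uint8)

# "Flattening" arranges all pixel values in a single row
flat_vector = image.flatten()

# Even this tiny image becomes a 12,288-dimensional vector,
# and raw pixel values carry little semantic meaning
print(flat_vector.shape)  # (12288,)
```

Two photos of the same object, shifted by just a few pixels, can produce wildly different flattened vectors, which is why this representation is of little use for similarity search.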

A much better way of creating a vector representation of an image is to apply a computer vision (CV) model to the image. The output of the CV model is a vector embedding of the image optimized for assessing the image similarity. In other words, the CV model maps similar images to vectors that are close to each other, and dissimilar images to vectors that are far apart. By converting images to vector embeddings, we can simplify the problem of image similarity search to that of vector proximity search.
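“Close” and “far apart” can be made concrete with Euclidean distance, the same metric the index below is configured with. A small sketch with made-up embeddings (the vectors and names are illustrative, not real model outputs):

```python
import numpy as np

def euclidean_distance(a, b):
    # Euclidean (L2) distance between two embedding vectors
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Made-up embeddings: two "similar" images and one "dissimilar" image
cat_a = [1.0, 2.0, 3.0]
cat_b = [1.1, 2.1, 2.9]
truck = [9.0, 0.5, 7.0]

# A good CV model maps similar images to nearby vectors
assert euclidean_distance(cat_a, cat_b) < euclidean_distance(cat_a, truck)
```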

In practice, we don’t always have to train a new CV model. In this example, we will use a pre-trained model from torchvision called squeezenet to convert the images. Feel free to experiment with other models provided by torchvision to see which model best fits your application. See torchvision examples page.

from torchvision import transforms as ts
import torchvision.models as models


class ImageEmbedder:
    def __init__(self):
        self.normalize = ts.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        )
        # see https://pytorch.org/vision/0.8/models.html for many more model options
        self.model = models.squeezenet1_0(pretrained=True)  # squeezenet

    def embed(self, image_file_name):
        image = Image.open(image_file_name).convert("RGB")
        image = ts.Resize(256)(image)
        image = ts.CenterCrop(224)(image)
        tensor = ts.ToTensor()(image)
        tensor = self.normalize(tensor).reshape(1, 3, 224, 224)
        vector = self.model(tensor).cpu().detach().numpy().flatten()
        return vector


image_embedder = ImageEmbedder()

Once we have the image embedder class, we can use it to convert image files to vector embeddings.

from tqdm.autonotebook import tqdm
import pandas as pd

df = pd.DataFrame()
df["image_file_name"] = image_file_names
df["embedding_id"] = [
    file_name.split(IMAGENET_DIR)[-1] for file_name in image_file_names
]
df["embedding"] = [
    image_embedder.embed(file_name)
    for file_name in tqdm(image_file_names)
]
df = df.sample(frac=1)  # shuffle the images
100% (2500 of 2500) |####################| Elapsed Time: 0:02:26 Time:  0:02:26
df.head(2)
      image_file_name                                    embedding_id                                embedding
1711  tmp/tiny-imagenet-200/train/n03400231/images/n...  /train/n03400231/images/n03400231_289.JPEG  [8.216192, 7.0965824, 3.269934, 4.0712285, 3.7...
558   tmp/tiny-imagenet-200/train/n02058221/images/n...  /train/n02058221/images/n02058221_150.JPEG  [10.222655, 8.227736, 6.144992, 7.0665154, 5.9...

Split the data into an items dataset and a queries dataset. Because the data is already shuffled, we can split it directly.

cutoff = int(len(df) * 0.8)
item_df, query_df = df[:cutoff], df[cutoff:]
(item_df.shape, query_df.shape)
((2000, 3), (500, 3))

Create a Pinecone vector index service

Now that we have our vector embeddings we can create our Pinecone service and upload the data to it.

import pinecone.graph
import pinecone.service
import pinecone.connector

Below we spin up the Pinecone service. The graph object is simply a configuration object that defines the service. Here we use two arguments:

  • metric="euclidean" means the vector search will use Euclidean distance as the measure of similarity.
  • shards=1 means the service will run on a single node. Since we only index at most 100k vectors in 1000 dimensions, a single shard is more than enough. If you have more than 1M vectors, you should increase the number of shards. Refer to the documentation for guidelines on how to choose the number of shards.
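As a rough sanity check on the single-shard choice, the raw vector data here is small (assuming 4-byte float32 components; actual index overhead varies with configuration):

```python
# Back-of-the-envelope size of the raw vector data
num_vectors = 100_000       # upper bound used in this example
dimensions = 1_000          # dimensionality of the embeddings
bytes_per_component = 4     # float32

raw_bytes = num_vectors * dimensions * bytes_per_component
print(raw_bytes / 1e9)  # 0.4 (GB), comfortably within a single shard
```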

# Choosing an arbitrary name for my service
service_name = "simple-pytorch-image-search"

# Checking whether the service is already deployed.
if service_name not in pinecone.service.ls():
    graph = pinecone.graph.IndexGraph(metric="euclidean", shards=1)
    pinecone.service.deploy(service_name, graph)

Now that the service is up, we can upload vectorized images. To do that, we create a connection. The connection is your communication channel to the service. You only need to do this once.

conn = pinecone.connector.connect(service_name)
conn.info()
InfoResult(index_size=0)

upsert uploads and indexes items into your remote vector similarity search service. If an item with the same ID already exists, it will be overwritten with the new value.
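The overwrite behavior can be pictured as a key-value store keyed on embedding ID. A toy sketch of the semantics (not the actual Pinecone implementation):

```python
# Toy model of upsert semantics: the index acts like a dict keyed by ID
index = {}

def upsert(items):
    for embedding_id, vector in items:
        index[embedding_id] = vector  # an existing ID is overwritten

upsert([("img-1", [0.1, 0.2]), ("img-2", [0.3, 0.4])])
upsert([("img-1", [0.9, 0.9])])  # same ID: the old vector is replaced

print(len(index))      # 2
print(index["img-1"])  # [0.9, 0.9]
```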

collect() is needed because upsert is a lazily evaluated operation. The data is not sent until you call collect(). See the quick tour for an example and the connection documentation for more details.

acks = conn.upsert(items=zip(item_df.embedding_id, item_df.embedding)).collect()
conn.info()
InfoResult(index_size=2000)

Querying

Let’s use the query dataset as test queries.

len(query_df)
500

Pinecone supports running multiple queries in a minibatch.

import time


start = time.perf_counter()

res = conn.query(query_df.embedding, batch_size=100).collect()  # issuing queries

end = time.perf_counter()
print("Run this test on a fast network to get the best performance.")
# print(f"Service QPS is {int(len(query_df)/(end-start))} queries/second.")
Run this test on a fast network to get the best performance.

Of course, running single queries one at a time is also an option.

start = time.perf_counter()

for query in query_df.embedding:
    _ = conn.unary_query(query)  # issuing one query

end = time.perf_counter()
print("Run this test on a fast network to get the best performance.")
# print(f"Service QPS is {int(len(query_df)/(end-start))} queries/second.")
Run this test on a fast network to get the best performance.

Now that we have our results, we can look at them.

for i in range(100, 110):
    print(f"Query {i+1} and search results")
    show_image(query_df.image_file_name.iloc[i])
    show_images_horizontally(
        [IMAGENET_DIR + embedding_id for embedding_id in res[i].ids]
    )

    print("-" * 80)
Query 101 and search results

[Query image, followed by its most similar images from the index]

(The output repeats in the same format for queries 102 through 110.)

As shown above, image similarity search works relatively well with the pre-trained CV model: when we query with images that are not in the index, most similarity searches return images from the same class as the query images.

Wrap up

You can use pinecone.service.stop to stop the service and free all resources dedicated to it. Once a service is stopped, you cannot restart the service, and all resources need to be recreated. We suggest that you only stop a service if no application is using it.

pinecone.service.stop(service_name)
{'success': True}