Search Result Diversification With Post-Processors

This notebook shows how Pinecone’s API lets you control the way your service handles requests. Specifically, we use Pinecone’s powerful Post-processing API to define a post-processing function that performs search result diversification.

For some search applications, you may want to exclude results that are near-duplicates of one another. For example, in a product search application, we often want to retrieve a diverse set of products rather than the same product with slight variations.
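To make “near-duplicate” concrete: embeddings of slight variations of the same item sit very close together in vector space, while genuinely different items sit far apart. A minimal sketch with made-up toy vectors (no Pinecone calls; the item names are purely illustrative):

```python
import numpy as np

# Toy embeddings: two "variations" of the same product plus one distinct item.
red_shirt_a = np.array([1.00, 0.00, 0.00])
red_shirt_b = np.array([0.99, 0.01, 0.00])  # near-duplicate of red_shirt_a
blue_shirt  = np.array([0.00, 1.00, 0.00])

# Euclidean distance separates near-duplicates from genuinely different items.
d_dup  = np.linalg.norm(red_shirt_a - red_shirt_b)
d_diff = np.linalg.norm(red_shirt_a - blue_shirt)
print(d_dup, d_diff)  # the duplicate pair is far closer
```

A plain nearest-neighbor index ranks purely by such distances, so all the near-duplicates crowd the top of the result list — which is exactly what the diversification post-processor below counteracts.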

This demo notebook will walk you through building an image search application that achieves that. We will use Pinecone to tie everything together and expose the image search as a real-time service that will take any fashion article image and return a diverse set of similar fashion article images.

We will:

  1. implement a simple diversification filter as a Pinecone’s post-processing function;
  2. upload the post-processing function to Pinecone’s Model Hub;
  3. launch an image search service that includes a vector index backend and a diversification filter post-processor function;
  4. upload and index our image vectors;
  5. query our deployed service;
  6. and compare the service with a baseline service that does not include diversification functionality.

Install and Set Up the Pinecone Client

First, let’s install the Pinecone client and set up its API key. Here you can obtain an API key.

!pip install --quiet -U numpy pinecone-client python-mnist matplotlib progressbar2 pandas ipywidgets
import pinecone.graph
import pinecone.service
import pinecone.connector
import pinecone.hub

pinecone.init(api_key='FILL-YOUR-API-KEY')

Define a Search Result Diversification Postprocessor

The following code computes a heterogeneous “top-five” subset of a query result set. It does so simply by clustering the results into five clusters and choosing a representative from each cluster. We use the k-means clustering algorithm, which minimizes within-cluster variance while maximizing between-cluster variance.

Recall that the focus of this demo is Pinecone’s post-processing API. Therefore, we apply a simple diversification idea and assess it subjectively. For more rigorous search results diversification ideas, see for example this work.
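Before wiring it into Pinecone, the core idea can be sketched standalone on toy data (plain NumPy and scikit-learn; the blob layout is fabricated for illustration). This variant picks each cluster’s member closest to its centroid; the postprocessor we define below instead simply takes the cluster’s median-index member, but the effect — one representative per group of near-duplicates — is the same:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 30 toy "result" vectors drawn from 5 tight blobs, i.e., 5 groups of near-duplicates.
centers = rng.normal(size=(5, 8)) * 10
data = np.vstack([c + rng.normal(scale=0.1, size=(6, 8)) for c in centers])

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(data)

# One representative index per cluster: the member closest to its centroid.
reps = []
for c in range(5):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(data[members] - kmeans.cluster_centers_[c], axis=1)
    reps.append(int(members[np.argmin(dists)]))

print(sorted(reps))  # 5 indices, one per blob
```

With well-separated blobs, the five representatives land in five distinct blobs, collapsing each group of near-duplicates to a single result.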

Our diversification postprocessor is a Python class that follows Pinecone Model Hub’s Postprocessor API. In short, we implement a transform function that receives the query results and manipulates them.

Note that we save the code as a file because we will later package it as a Docker image and upload it to Pinecone’s Model Hub. This way, we can define a search service with built-in search result diversification functionality.

%%writefile diversity_postprocessor.py

import numpy as np
from sklearn.cluster import KMeans

from pinecone.hub import postprocessor, QueryResult

@postprocessor
class DiversityPostprocessor:
    def __init__(self):
        self._k = 5  # top k

    def _diversity_filter(self, data):
        kmeans = KMeans(n_clusters=self._k, random_state=0).fit(data)

        inxs_per_cluster = [[i for i, value in enumerate(kmeans.labels_) if value == c] for c in range(self._k)]   # group cluster indices
        results = set([inxs[ int(len(inxs)/2) ] for inxs in inxs_per_cluster]) # from each cluster take the "median" index

        return results

    def transform(self, queries, matches):
        """This is the postprocessor relevant function"""
        output = []
        for q, match in zip(queries, matches):
            # Filter data
            res = self._diversity_filter(match.data)

            # Then rearrange results
            new_scores = [s for i,s in enumerate(match.scores) if i in res]
            new_ids = [id_ for i,id_ in enumerate(match.ids) if i in res]
            new_data = np.array([x for i,x in enumerate(match.data) if i in res])
            output.append(QueryResult(ids=new_ids, scores=new_scores, data=new_data))

        return output
Overwriting diversity_postprocessor.py
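As a quick sanity check outside Pinecone, we can replicate the `_diversity_filter` logic on synthetic data and confirm it keeps exactly five indices. (`diversity_filter` below is a standalone re-implementation for testing, with the Pinecone imports dropped; the fake 784-dimensional matches mimic the Fashion-MNIST vectors we index later.)

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_filter(data, k=5):
    """Standalone copy of DiversityPostprocessor._diversity_filter."""
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(data)
    inxs_per_cluster = [
        [i for i, label in enumerate(kmeans.labels_) if label == c] for c in range(k)
    ]
    # From each cluster, keep the member at the median position of its index list.
    return set(inxs[len(inxs) // 2] for inxs in inxs_per_cluster)

rng = np.random.default_rng(42)
fake_matches = rng.normal(size=(30, 784))  # 30 matches, Fashion-MNIST-sized vectors
kept = diversity_filter(fake_matches)
print(len(kept))  # one surviving index per cluster
```

Because the five clusters have disjoint, non-empty index lists, their median indices are always distinct, so the filter returns exactly five positions out of the 30 matches.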

Create the Post-Processor Docker Image And Push It to Pinecone’s Model Hub

diversity_filter_image_builder = pinecone.hub.ImageBuilder(
    image="diversity_filter:v1",  # The name of the docker image (you should also tag the image
    build_path="./docker_build/diversity_filter/v1",  # Path to which docker build artifacts are saved
    model_path='./diversity_postprocessor.py', # Main model file
    pip=['numpy', 'scikit-learn'],  # Additional pip packages needed
    data_paths=[],  # Additional files or directories needed
)

# Log into Pinecone's Model Hub
login_cmd = pinecone.hub.get_login_cmd()
!{login_cmd}
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/jupyter/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
diversity_filter_image_builder.package(exist_ok=True)
!{diversity_filter_image_builder.get_build_cmd()}
!{diversity_filter_image_builder.get_push_cmd()}
~/docker_build/diversity_filter/v1 ~
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM hub.beta.pinecone.io/pinecone/base:0.8.34
 ---> e988c545396e
Step 2/4 : RUN pip3 install --quiet --upgrade pip
 ---> Using cache
 ---> 5ac13fbcd300
Step 3/4 : RUN pip3 install --quiet --no-cache-dir numpy scikit-learn
 ---> Using cache
 ---> f9b753cff6cd
Step 4/4 : COPY model.py ./model.py
 ---> Using cache
 ---> 1129e677325b
Successfully built 1129e677325b
Successfully tagged diversity_filter:v1
~

Set Up a Pinecone Service

Define How the Service Handles Requests

Here we define how the service handles requests. We want to store and retrieve a diverse set of images. We store the vector embeddings in Pinecone’s vector index and rank and retrieve them using the Euclidean distance measure. Finally, we apply our search result diversification function to the top matched vectors to produce the index’s final results.

Let’s deploy these computation steps using Pinecone’s Index Graph. Observe that we instantiate a Vector Index with Euclidean distance and attach our postprocessor function. The resulting graph defines how we set (i.e., write) and retrieve (i.e., read) data.

graph = pinecone.graph.IndexGraph(metric='euclidean')

# Name of the hub images
diversity_filter_image_name = pinecone.hub.as_user_image(diversity_filter_image_builder.image)

# Add to the graph a function that will diversify the results
diversity_filter_postprocessor = pinecone.hub.HubFunction(name='diversity-postprocessor', image=diversity_filter_image_name)

graph.add_postprocessor(fn=diversity_filter_postprocessor)

# View the updated graph
graph.view()

svg

Deploy the Service and Set a Connection

service_name = "diversity-postprocessor-demo"
pinecone.service.deploy(service_name, graph, timeout=300)
conn = pinecone.connector.connect(service_name)
conn.info()
InfoResult(index_size=0)

Upload Vectors to the Service

Let’s upload real image vector embeddings into the service!

We use the Fashion-MNIST dataset, which contains fashion item images. For the sake of simplicity, we will use the raw grayscale pixel values as our vector embedding. Note that this choice is not optimal; therefore, we expect it to produce only reasonable search results. (Recall, the focus of the demo is on the post-processing API.)

First, let’s download the dataset.

!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
!gunzip -f train-images-idx3-ubyte.gz
!gunzip -f train-labels-idx1-ubyte.gz
--2021-03-21 11:28:42--  http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Resolving fashion-mnist.s3-website.eu-central-1.amazonaws.com (fashion-mnist.s3-website.eu-central-1.amazonaws.com)... 52.219.75.48
Connecting to fashion-mnist.s3-website.eu-central-1.amazonaws.com (fashion-mnist.s3-website.eu-central-1.amazonaws.com)|52.219.75.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26421880 (25M) [binary/octet-stream]
Saving to: ‘train-images-idx3-ubyte.gz’

train-images-idx3-u 100%[===================>]  25.20M  11.8MB/s    in 2.1s

2021-03-21 11:28:45 (11.8 MB/s) - ‘train-images-idx3-ubyte.gz’ saved [26421880/26421880]

--2021-03-21 11:28:45--  http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Resolving fashion-mnist.s3-website.eu-central-1.amazonaws.com (fashion-mnist.s3-website.eu-central-1.amazonaws.com)... 52.219.75.48
Connecting to fashion-mnist.s3-website.eu-central-1.amazonaws.com (fashion-mnist.s3-website.eu-central-1.amazonaws.com)|52.219.75.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29515 (29K) [binary/octet-stream]
Saving to: ‘train-labels-idx1-ubyte.gz’

train-labels-idx1-u 100%[===================>]  28.82K  --.-KB/s    in 0.1s

2021-03-21 11:28:45 (199 KB/s) - ‘train-labels-idx1-ubyte.gz’ saved [29515/29515]
from mnist import MNIST
import numpy as np
images, labels = MNIST('.').load_training()
images = np.array(images)
images.shape
(60000, 784)

Upload the Vectors

import progressbar

upsert_acks = conn.upsert(items=( (f"img-{i}", img) for i,img in progressbar.progressbar(enumerate(images)))).collect()
59999 Elapsed Time: 0:00:03

Count the Index’s New Size

conn.info()
InfoResult(index_size=60000)

Search Example

Let’s try our new service! We query the service with an arbitrary image. Observe that we set the desired number of matches to 30. Our search result diversification filter will reduce these matches to five results only. Hence, our service will return five final results.

import matplotlib.pyplot as plt

Choose an Arbitrary Query Image

Here we choose our query vector. This is just an arbitrary image from our dataset.

query_id = np.random.randint(images.shape[0])
query = images[query_id]

fig = plt.figure(figsize=(3, 3))
img = query.reshape([28, 28])
plt.imshow(img, cmap='gray')
plt.axis('off')
plt.tight_layout()
plt.show()

png

Query the Service

Here we query the service. Observe that we set the requested (maximal) number of matches to 30; we expect to retrieve only five results. Also, note that we require the results to contain the vector data (include_data=True). Recall that our post-processor function needs this data to cluster the vectors.

res = conn.query(queries=[query], top_k=30, include_data=True).collect()[0]

Retrieved Items

Let’s visualize the retrieved items.

columns = 5
rows = int(np.ceil(len(res.ids)/columns))

fig = plt.figure(figsize=(8, 8))

for i in range(1, len(res.ids)+1):
    data_idx = int(res.ids[i-1].split('-')[-1])
    img = images[data_idx].reshape([28, 28])
    lbl = labels[data_idx]
    fig.add_subplot(rows, columns, i)
    plt.imshow(img, cmap='gray')
    plt.axis('off')
plt.tight_layout()
plt.show()

png


Compare With a Service Without Search Results Diversification Functionality

Does our search result diversification filter work well? Let’s compare it with a service that does not include such post-processing functionality.

First, let’s deploy a service with vector index only. This is the most basic Pinecone service functionality. Observe how simple and easy it is to deploy such a service.

graph = pinecone.graph.IndexGraph(metric='euclidean')  # same metric as the diversification service, for a fair comparison
graph.view()

svg

Deploy the Baseline Service

Here we deploy, fill in, and query the simple baseline service. We use the same query image as above. Then, we compare this baseline service vs. our search results diversification service.

baseline_service_name = "diversity-postprocessor-demo-baseline"
pinecone.service.deploy(baseline_service_name, graph, timeout=300)
baseline_conn = pinecone.connector.connect(baseline_service_name)

upsert_acks = baseline_conn.upsert(items=( (f"img-{i}", img) for i,img in progressbar.progressbar(enumerate(images)))).collect()
59999 Elapsed Time: 0:00:03
def show_results(res):
    columns = 5
    rows = int(np.ceil(len(res.ids)/columns))

    fig = plt.figure(figsize=(5, 5))

    for i in range(1, len(res.ids)+1):
        data_idx = int(res.ids[i-1].split('-')[-1])
        img = images[data_idx].reshape([28, 28])
        lbl = labels[data_idx]
        fig.add_subplot(rows, columns, i)
        plt.imshow(img, cmap='gray')
        plt.axis('off')
    plt.tight_layout()
    plt.show()

def compare(query):
    print("Query")
    fig = plt.figure(figsize=(2, 2))
    img = query.reshape([28, 28])
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    plt.tight_layout()
    plt.show()
    print()

    print("Baseline without a Diversirty Filter")
    res = baseline_conn.query(queries=[query], top_k=5, include_data=True).collect()[0]
    show_results(res)
    print("-"*20)
    print("Service with a Diversity Filter")
    res = conn.query(queries=[query], top_k=30, include_data=True).collect()[0]
    show_results(res)
    print("\n\n")

Cherry-Picked Examples

Let’s examine a few cherry-picked query examples.

Diversity in Action

Observe that all the baseline results (upper row) are near-duplicates, while our diversification-service results exhibit variance.

compare(images[27697])
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

Diversity Adds Noise

Adding variance to the results comes with the risk of adding non-relevant results. In this example, the lower row’s last match is less relevant. (Although it indeed eliminates a baseline duplicate match.)

compare(images[4647])
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

Baseline Results are Already Diverse

Sometimes it is hard to say which option is best.

compare(images[6999])
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

Try It Yourself

Let’s run a quick subjective comparison. We pick ten query images uniformly at random and compare the diversification-filter results against the baseline.

for _ in range(10):
    query_id = np.random.randint(images.shape[0])
    print(f"qid {query_id}")
    compare(images[query_id])
qid 34671
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 42726
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 54686
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 17412
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 49222
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 36067
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 22519
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 56056
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 21216
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

qid 54265
Query

png

Baseline without a Diversity Filter

png

--------------------
Service with a Diversity Filter

png

Conclusion

Although we defined a simple search result diversification function and used raw pixel values as a rudimentary vector embedding, the cherry-picked examples demonstrate diversification’s usefulness. Search results diversification is an active research area. If you seek more rigorous search results diversification ideas and evaluation methods, then this book might be a good starting point.

Besides that, the comparison demonstrates the ease of use and flexibility of Pinecone’s Graph API. Observe how easily we could launch two different service flavors and compare them with live data. This gives a taste of what you can do with Pinecone: rapid model development, live experiments (e.g., A/B tests), complex ETL and post-processing steps, and more.


Shut down the Services

We will not use the services anymore. Let’s shut them down.

Note that this permanently shuts down the services. You will not be able to restart them, and all resources will need to be recreated. We suggest stopping a service only when no application is using it.

for srv in pinecone.service.ls():
    pinecone.service.stop(srv)