Semantic Search with Cohere and Pinecone

In this guide we use Cohere to generate language embeddings, store them in Pinecone, and use them for semantic search.

Environment Setup

We start by installing the Cohere and Pinecone clients. We will also need Hugging Face Datasets to download the TREC dataset used in this guide.

pip install -U cohere pinecone-client datasets

Creating Embeddings

To create embeddings we must first initialize our connection to Cohere. This requires an API key, which we get by signing up at Cohere.

import cohere

co = cohere.Client("<<YOUR_API_KEY>>")
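Rather than pasting the key into the source, it can be read from the environment. The following is a minimal sketch; the helper name get_api_key and the variable name COHERE_API_KEY are assumptions made for illustration, not something either client requires.

```python
import os


def get_api_key(var_name: str) -> str:
    """Return the API key stored in the environment variable `var_name`."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a readable message instead of a confusing auth error later
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Hypothetical usage (COHERE_API_KEY is an assumed variable name):
# co = cohere.Client(get_api_key("COHERE_API_KEY"))
```

The same helper could be reused for the Pinecone key later in the guide.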

We will load the Text REtrieval Conference (TREC) question classification dataset, which contains 5.5K labeled questions. We will take the first 1K samples for this walkthrough, but the same approach can be scaled to millions or even billions of samples.

from datasets import load_dataset

# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')

Each sample in trec contains two label features and the text feature, which we will be using. We can pass the questions from the text feature to Cohere to create embeddings.

embeds = co.embed(
    texts=trec['text'],
    model='small',
    truncate='LEFT'
).embeddings

We can check the dimensionality of the returned vectors by converting them from a list of lists to a NumPy array. We will need this embedding dimensionality when initializing our Pinecone index later.

import numpy as np

shape = np.array(embeds).shape
print(shape)
[Out]:
(1000, 1024)

Here we can see the 1024 embedding dimensionality produced by Cohere's small model, and the 1000 samples we built embeddings for.
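Because the index is created with a single fixed dimension, it can help to confirm that every vector has the same length before moving on. The sketch below uses a toy list in place of embeds, and embedding_dim is a hypothetical helper name.

```python
import numpy as np


def embedding_dim(vectors):
    """Return the shared length of all vectors, or raise if the lengths differ."""
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(lengths)}")
    return lengths.pop()


# Toy stand-in for `embeds`: 3 vectors of length 4 (not real Cohere output)
toy_embeds = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

print(embedding_dim(toy_embeds))    # 4
print(np.array(toy_embeds).shape)   # (3, 4)
```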

Storing the Embeddings

Now that we have our embeddings we can move on to indexing them in the Pinecone vector database. For this we need a Pinecone API key, which requires signing up for a Pinecone account.

We first initialize our connection to Pinecone, and then create a new index for storing the embeddings (we will call it "cohere-pinecone-trec"). When creating the index we specify that we would like to use the cosine similarity metric to align with Cohere's embeddings, and also pass the embedding dimensionality of 1024.

import pinecone

pinecone.init("<<YOUR_API_KEY>>", environment='us-west1-gcp')

index_name = 'cohere-pinecone-trec'

# if the index does not exist, we create it
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=shape[1],
        metric='cosine'
    )

# connect to index
index = pinecone.Index(index_name)
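For reference, the cosine metric chosen above compares the direction of two vectors and ignores their length. The NumPy sketch below shows the calculation on toy vectors; it illustrates the metric only and is not how Pinecone computes it internally.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1, 0], [2, 0]))    # 1.0  (same direction, different length)
print(cosine_similarity([1, 0], [0, 3]))    # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0 (opposite direction)
```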

Now we can begin populating the index with our embeddings. Pinecone expects us to provide a list of tuples in the format (id, vector, metadata), where metadata is an optional dictionary in which we can store anything we want. For this example, we will store the original text behind each embedding.

Warning: High-cardinality metadata values (like the unique text values we use here) can reduce the number of vectors that fit on a single pod. See Limits for more.

While uploading our data, we will batch everything to avoid pushing too much data in one go.

batch_size = 128

ids = [str(i) for i in range(shape[0])]
# create list of metadata dictionaries
meta = [{'text': text} for text in trec['text']]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
print(index.describe_index_stats())
[Out]:
{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}}}

We can see from the output of index.describe_index_stats() that we have a 1024-dimensional index populated with 1000 embeddings. The index_fullness value tells us how full our index is; at 0.0 it is effectively empty. With the default of one p1 pod we can fit around 750K embeddings before index_fullness reaches capacity. The Usage Estimator can be used to identify the number of pods required for a given number of n-dimensional embeddings.
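The batched upsert shown earlier can also be written with a small reusable generator. The sketch below runs on toy data so it needs no Pinecone connection; chunked is a hypothetical helper name, and the real upsert call is left as a comment.

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Toy (id, vector, metadata) tuples in the same layout as `to_upsert`
ids = [str(i) for i in range(5)]
vectors = [[float(i), 0.0] for i in range(5)]
meta = [{'text': f'question {i}'} for i in range(5)]
toy_upsert = list(zip(ids, vectors, meta))

for batch in chunked(toy_upsert, 2):
    # With a real index this would be: index.upsert(vectors=batch)
    print(len(batch), [item[0] for item in batch])
```

With five items and a batch size of two, the loop produces batches of two, two, and one item, so the final partial batch is not dropped.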

Semantic Search

Now that we have our indexed vectors we can perform a few search queries. When searching we will first embed our query using Cohere, and then search using the returned vector in Pinecone.

query = "What caused the 1929 Great Depression?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

print(np.array(xq).shape)

# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)

The response from Pinecone includes our original text in the metadata field. Let's print out the top_k most similar questions and their respective similarity scores.

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.83: Why did the world enter a global depression in 1929 ?
0.75: When was `` the Great Depression '' ?
0.50: What crop failure caused the Irish Famine ?
0.34: What war did the Wanna-Go-Home Riots occur after ?
0.34: What were popular songs and types of songs in the 1920s ?
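Conceptually, the query step ranks the stored vectors by cosine similarity to the query vector and returns the best top_k. The brute-force NumPy sketch below illustrates that ranking on toy vectors; it is only an illustration and makes no claim about how Pinecone implements search. The name top_k_cosine is hypothetical.

```python
import numpy as np


def top_k_cosine(query, vectors, k):
    """Return (row index, cosine score) for the k rows most similar to query."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(vectors, dtype=float)
    # Cosine similarity of every stored row against the query
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    # Sort by descending score and keep the first k
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins for indexed embeddings (not real Cohere output)
stored = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]

for idx, score in top_k_cosine([1.0, 0.0], stored, k=2):
    print(idx, f"{score:.2f}")
```

Here rows 0 and 1 come back first because they point most nearly in the same direction as the query.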

Looks good. Let's make it harder and replace "depression" with the incorrect term "recession".

query = "What was the cause of the major recession in the early 20th century?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.66: Why did the world enter a global depression in 1929 ?
0.61: When was `` the Great Depression '' ?
0.43: What are some of the significant historical events of the 1990s ?
0.43: What crop failure caused the Irish Famine ?
0.37: What were popular songs and types of songs in the 1920s ?

Let's perform one final search using the definition of depression rather than the word or related words.

query = "Why was there a long-term economic downturn in the early 20th century?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.71: Why did the world enter a global depression in 1929 ?
0.62: When was `` the Great Depression '' ?
0.40: What crop failure caused the Irish Famine ?
0.38: What are some of the significant historical events of the 1990s ?
0.38: When did the Dow first reach ?
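The three searches above repeat the same embed, query, and print steps, so they can be wrapped in one function. The sketch below assumes the co and index objects created earlier and reuses the exact calls and response layout shown in this guide; search is a hypothetical helper name, and other client versions may use a different call signature or response shape.

```python
def search(co, index, query, top_k=5, model='small'):
    """Embed `query` with Cohere, query Pinecone, and return (score, text) pairs."""
    # Embed the query with the same model used for the indexed questions
    xq = co.embed(texts=[query], model=model, truncate='LEFT').embeddings
    # Retrieve the most similar vectors together with their stored metadata
    res = index.query(xq, top_k=top_k, include_metadata=True)
    return [
        (match['score'], match['metadata']['text'])
        for match in res['results'][0]['matches']
    ]


# Hypothetical usage with the objects created earlier in this guide:
# for score, text in search(co, index, "What caused the 1929 Great Depression?"):
#     print(f"{score:.2f}: {text}")
```

Keeping the model name as a parameter makes it harder to accidentally embed queries with a different model than the one used for the indexed data.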

These examples show that the semantic search pipeline can identify the shared meaning behind each of our queries, even when the wording changes. Using these embeddings with Pinecone allows us to return the most semantically similar questions from the already indexed TREC dataset.