Question Answering with Similarity Search

This notebook demonstrates how Pinecone's similarity search as a service helps you build a question answering application. We will index a set of questions and retrieve the most similar stored questions for a new (unseen) question. That way, we can link a new question to answers we might already have.

You can build a question answering application with Pinecone in three steps:

  • Represent questions as vector embeddings so that semantically similar questions are in close proximity within the same vector space.
  • Index vectors using Pinecone.
  • Given a new question, query the index to fetch similar questions. If answers are stored for those questions, they can then be served for the new question.

In this notebook, we index a set of questions and retrieve similar questions for a new, unseen question.
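The notebook stops at retrieval. As a rough sketch of how retrieved questions could be linked to existing answers, the snippet below picks the stored answer of the best-scoring match. The `answers` dictionary, the `answer_for` helper, and the `min_score` threshold are all hypothetical and not part of the notebook or of Pinecone; matches are modelled as plain `(id, score)` tuples.

```python
from typing import Optional

# Hypothetical answer store keyed by question id (placeholder text, not real data).
answers = {
    "57": "<stored answer for question 57>",
}

def answer_for(matches, answers, min_score=0.9) -> Optional[str]:
    """Return the stored answer of the best-scoring match that is similar enough."""
    # Walk the matches from most to least similar.
    for question_id, score in sorted(matches, key=lambda m: m[1], reverse=True):
        if score < min_score:
            break  # every remaining match is even less similar
        if question_id in answers:
            return answers[question_id]
    return None

# (id, score) pairs shaped like the query results shown later in the notebook.
matches = [("55585", 0.98993), ("57", 1.0), ("28280", 0.981526)]
print(answer_for(matches, answers))
```

The threshold is a design choice: set too low, it returns answers to unrelated questions; set too high, it rarely returns anything.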


!pip install -qU matplotlib pinecone-client ipywidgets
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np

%matplotlib inline

Pinecone Installation and Setup

import pinecone
import os

# load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

Get a Pinecone API key if you don’t have one already.

Create a New Pinecone Index

# pick a name for the new index
index_name = "question-answering"
# check whether an index with the same name already exists
if index_name in pinecone.list_indexes():
    # delete it so that we start from a clean index
    pinecone.delete_index(index_name)

Create index

pinecone.create_index(name=index_name, dimension=300)

Connect to the index

The index object, an instance of pinecone.Index, will be reused for optimal performance.

index = pinecone.Index(index_name=index_name)

Uploading Questions

The dataset used in this notebook is the Quora Question Pairs Dataset.

Let's download the dataset and load the data.

# download dataset from the url
import requests

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/quora_duplicate_questions.tsv"

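def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        # DATA_URL must hold the dataset's download URL; it is not defined in this excerpt
        r = requests.get(DATA_URL)  # create HTTP response object
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)

download_data()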

pd.set_option("display.max_colwidth", 500)

df = pd.read_csv(
    DATA_FILE, sep="\t", usecols=["qid1", "question1"], index_col=False
)
# shuffle the rows and store the question ids as strings
df = df.sample(frac=1).reset_index(drop=True)
df['qid1'] = df['qid1'].apply(str)
print(df.head())
     qid1                                                                                     question1
0   20221                                 What is the best editor to write React and React-Native code?
1   88042  I want to live the rest of my life alone and without working. Is jail an appropriate option?
2  269899                                                           What do you think of Chinese girls?
3  428133        What is the most alarming thing you see in today's children between age group of 1-10?
4  297135                        Why should we square the distance in the universal law of gravitation?

Define the model

We will use the Average Word Embeddings Model for this example. This model is fast to compute but produces relatively low-quality embeddings. For better embedding quality, consider other sentence embedding models, such as those trained on paraphrases.
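To make the idea of averaged word embeddings concrete, here is a minimal sketch with toy 3-dimensional vectors. It is a simplification under stated assumptions: the `word_vectors` table and the `average_embedding` helper are made up for illustration, and the real model's tokenization and vocabulary handling may differ.

```python
# Toy word vectors (hypothetical values); the model used in this notebook has 300 dimensions.
word_vectors = {
    "make": [0.2, 0.1, 0.0],
    "money": [0.9, 0.3, 0.1],
    "online": [0.4, 0.8, 0.2],
}

def average_embedding(sentence, word_vectors, dim=3):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(average_embedding("Make money online", word_vectors))
```

Because the vectors are simply averaged, word order is ignored. This is one reason the approach is fast but loses information that stronger sentence embedding models keep.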

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")

Creating Vector Embeddings

# create embedding for each question
question_vectors = model.encode(list(df.question1), show_progress_bar=True).tolist()

# add question embeddings to dataframe
df["question_vector"] = question_vectors

Index the Vectors

import itertools

def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# upsert the (id, vector) pairs in batches
for batch in chunks(zip(df.qid1, df.question_vector)):
    index.upsert(vectors=batch)
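
To see what the batching helper does without sending anything to Pinecone, the sketch below runs the same `chunks` function over 250 hypothetical (id, vector) pairs and reports the size of each batch. The `pairs` list is made-up placeholder data.

```python
import itertools

def chunks(iterable, batch_size=100):
    """Yield successive tuples of at most batch_size items from iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Hypothetical (id, vector) pairs standing in for the question embeddings.
pairs = [(str(i), [0.0, 0.0]) for i in range(250)]
batch_sizes = [len(batch) for batch in chunks(pairs)]
print(batch_sizes)  # [100, 100, 50]
```

Because the helper consumes an iterator lazily, it never materializes all batches at once, which keeps memory use flat for large datasets.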


Once you have indexed the vectors it is very straightforward to query the index. These are the steps you need to follow:

  • Select a set of questions you want to query with.
  • Use the Average Word Embeddings Model to transform the questions into embeddings.
  • Send each question vector to the Pinecone index and retrieve the most similar indexed questions.
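
Conceptually, a query ranks the stored vectors by their similarity to the query vector and returns the best `top_k`. The brute-force sketch below illustrates this, assuming the index uses cosine similarity. It is not how Pinecone is implemented; `cosine_similarity`, `top_k`, and the `stored` vectors are hypothetical names and toy data.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, stored, k=5):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = ((vec_id, cosine_similarity(query, vec)) for vec_id, vec in stored.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional index: id -> vector.
stored = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
for vec_id, score in top_k([1.0, 0.1], stored, k=2):
    print(vec_id, round(score, 3))
```

A brute-force scan like this compares the query against every stored vector, which is the cost a dedicated vector index is meant to avoid at scale.
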
# define questions to query the vector index
query_questions = [
    "What is best way to make money online?",
    "How can i build an e-commerce website?"
]

# extract embeddings for the questions
query_vectors = model.encode(query_questions).tolist()

# query pinecone
query_results = [index.query(xq, top_k=5) for xq in query_vectors]

# show the results
for question, res in zip(query_questions, query_results):
    print("\n\n\n Original question : " + str(question))
    print("\n Most similar questions based on pinecone vector search: \n")

    ids = [match.id for match in res.matches]
    scores = [match.score for match in res.matches]
    # look up the question text for each matched id
    df_result = pd.DataFrame(
        {
            "id": ids,
            "question": [
                df[df.qid1 == _id].question1.values[0] for _id in ids
            ],
            "score": scores,
        }
    )
    print(df_result)
 Original question : What is best way to make money online?

 Most similar questions based on pinecone vector search: 

       id                                             question     score
0      57               What is best way to make money online?  1.000000
1  297469           What is the best way to make money online?  1.000000
2   55585        What is the best way for making money online?  0.989930
3   28280         What are the best ways to make money online?  0.981526
4  157045  What is the best way to make money on the internet?  0.978538

 Original question : How can i build an e-commerce website?

 Most similar questions based on pinecone vector search: 

       id                                                   question     score
0  119383                   How can I develop an e-commerce website?  0.925466
1    1713                 How would I develop an e-commerce website?  0.925466
2    1714                     How do I create an e-commerce website?  0.919407
3   79063             How do I build and host an e-commerce website?  0.918379
4  245780  What is the best platform to build an e-commerce website?  0.894444

Delete the Index

Delete the index once you are sure that you no longer need it. Once it is deleted, you cannot use it again.

pinecone.delete_index(index_name)