Question Answering

This notebook demonstrates how Pinecone's similarity search as a service helps you build a question answering application. We will index a set of questions and retrieve the most similar stored questions for a new (unseen) question. That way, we can link a new question to answers we might already have.

You can build a question answering application with Pinecone in three steps:

  1. Represent questions as vector embeddings so that semantically similar questions are in close proximity within the same vector space.
  2. Index vectors using Pinecone.
  3. Given a new question, query the index to fetch the most similar stored questions, then surface the answers already associated with them.

In this notebook we will index a set of questions and retrieve similar questions for a new, unseen question.
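The intuition behind step 1 can be sketched with plain numpy: two paraphrases should score a higher cosine similarity than two unrelated sentences. The tiny 3-d "embeddings" below are hand-made stand-ins for illustration, not real model output.

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot product of the two vectors, divided by their norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-d "embeddings" (illustrative only)
q_money_1 = [0.9, 0.1, 0.0]   # "How do I earn money online?"
q_money_2 = [0.8, 0.2, 0.1]   # "What is the best way to make money online?"
q_cooking = [0.0, 0.2, 0.9]   # "How do I cook rice?"

# paraphrases land close together; unrelated questions do not
assert cosine_similarity(q_money_1, q_money_2) > cosine_similarity(q_money_1, q_cooking)
```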

Open Notebook in Google Colab

Dependencies

!pip install -qU matplotlib pinecone-client ipywidgets
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np

%matplotlib inline

Pinecone Installation and Setup

import pinecone
import os

# load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.

Create a New Pinecone Index

# pick a name for the new index
index_name = "question-answering"
# check whether an index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create index

pinecone.create_index(name=index_name, metric="cosine", shards=1)

{'msg': '', 'success': True}

Connect to the index

The index object, a class instance of pinecone.Index, will be reused for optimal performance.

index = pinecone.Index(name=index_name)

Uploading Questions

The dataset used in this notebook is the Quora Question Pairs Dataset.

Let's download the dataset and load the data.

# download dataset from the url
import requests

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/quora_duplicate_questions.tsv"
DATA_URL = "https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL)  # create HTTP response object
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)


download_data()
pd.set_option("display.max_colwidth", 500)

df = pd.read_csv(
    f"{DATA_FILE}", sep="\t", usecols=["qid1", "question1"], index_col=False
)
df = df.sample(frac=1).reset_index(drop=True)
df.drop_duplicates(inplace=True)
print(df.head())
     qid1  \
0  198665
1   70104
2  121939
3   61014
4  164230

                                                                  question1
0                         What are some advantages of the informal economy?
1  What would cause a popping/crackling sound in one of my stereo speakers?
2                                               How much does cocaine cost?
3        How many devices can one Netflix account simultaneously stream on?
4                                           Do jio sims works in iPhone 5s?
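The loading pattern above (read only the columns you need, shuffle, then drop duplicate rows) can be exercised on a small in-memory TSV. The rows here are made up for illustration.

```python
import io
import pandas as pd

# a tiny stand-in for the Quora TSV (fabricated rows; the third column is ignored)
tsv = (
    "qid1\tquestion1\tother\n"
    "1\tHow do I learn Python?\tx\n"
    "2\tWhat is Pinecone?\ty\n"
    "1\tHow do I learn Python?\tz\n"
)

df = pd.read_csv(io.StringIO(tsv), sep="\t", usecols=["qid1", "question1"], index_col=False)
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle the rows
df = df.drop_duplicates()  # the repeated (qid1, question1) pair collapses to one row
```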

Define the model

We will use the Average Word Embeddings Model for this example. This model is fast but produces relatively low-quality embeddings. For better quality, consider other sentence embedding models such as the Sentence Embeddings Models trained on Paraphrases.
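The average word embeddings model effectively mean-pools per-token GloVe vectors into a single sentence vector. A schematic numpy version of that pooling, with random 4-d stand-ins in place of the real 300-d GloVe weights:

```python
import numpy as np

# toy word vectors (random stand-ins for the real 300-d GloVe embeddings)
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["what", "is", "the", "best", "way"]}

def encode(sentence):
    # average the vectors of the words we know; out-of-vocabulary words are skipped
    vectors = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vectors, axis=0)

v = encode("What is the best way")  # one fixed-size vector per sentence
```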

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")

Creating Vector Embeddings

# create embedding for each question
df["question_vector"] = df.question1.apply(lambda x: model.encode(str(x)))

Index the Vectors

acks = index.upsert(items=zip(df.qid1, df.question_vector))
print(index.info())

InfoResult(index_size=290654)

Once you have indexed the vectors, querying the index is straightforward. These are the steps you need to follow:

  1. Select a set of questions you want to query with.
  2. Use the Average Word Embeddings Model to transform the questions into embeddings.
  3. Send each question vector to the Pinecone index and retrieve the most similar indexed questions.

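Conceptually, a top_k query ranks the stored vectors by cosine similarity to the query vector. A minimal numpy sketch of that ranking, using random stand-in vectors rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
stored = rng.normal(size=(100, 8))   # 100 indexed vectors
query = rng.normal(size=8)           # one query vector

# cosine similarity = dot product after L2-normalizing both sides
stored_n = stored / np.linalg.norm(stored, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = stored_n @ query_n

top_k = 5
top_ids = np.argsort(scores)[::-1][:top_k]   # indices of the 5 most similar vectors
```

A production index avoids this brute-force scan with approximate nearest neighbor search, but the returned ids and scores mean the same thing.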
# define questions to query the vector index
query_questions = [
    "What is best way to make money online?",
]

# extract embeddings for the questions
query_vectors = [model.encode(str(question)) for question in query_questions]

# query pinecone
query_results = index.query(queries=query_vectors, top_k=5)

# show the results
for question, res in zip(query_questions, query_results):
    print("\n\n\n Original question : " + str(question))
    print("\n Most similar questions based on Pinecone vector search: \n")

    df_result = pd.DataFrame(
        {
            "id": res.ids,
            "question": [
                df[df.qid1 == int(_id)].question1.values[0] for _id in res.ids
            ],
            "score": res.scores,
        }
    )
    print(df_result)

Original question : What is best way to make money online?

Most similar questions based on Pinecone vector search:

       id                                             question     score
0      57               What is best way to make money online?  1.000000
1  297469           What is the best way to make money online?  1.000000
2   55585        What is the best way for making money online?  0.989930
3   28280         What are the best ways to make money online?  0.981526
4  157045  What is the best way to make money on the internet?  0.978538
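A note on the result loop above: `df[df.qid1 == int(_id)]` scans the whole DataFrame once per returned id. For larger result sets, pre-building an id-to-question mapping (a hypothetical optimization, not part of the original notebook) makes each lookup a constant-time dict access:

```python
import pandas as pd

# toy frame standing in for the Quora questions
df = pd.DataFrame({
    "qid1": [57, 297469, 55585],
    "question1": [
        "What is best way to make money online?",
        "What is the best way to make money online?",
        "What is the best way for making money online?",
    ],
})

# build the mapping once...
qid_to_question = dict(zip(df.qid1, df.question1))

# ...then each returned id is resolved without scanning the frame
ids = ["57", "55585"]
questions = [qid_to_question[int(_id)] for _id in ids]
```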

Delete the Index

Delete the index once you are sure that you do not want to use it anymore. Once it is deleted, you cannot reuse it.

pinecone.delete_index(index_name)

{'success': True}