# Generative Question-Answering with Long-Term Memory


James Briggs · 2023-06-30

Generative AI sparked several _“wow”_ moments in 2022, from generative art tools like OpenAI’s DALL-E 2, Midjourney, and Stable Diffusion to the next generation of **L**arge **L**anguage **M**odels like OpenAI’s GPT-3.5 generation models and BLOOM, and chatbots like LaMDA and ChatGPT.

It’s hardly surprising that Generative AI is experiencing a boom in interest and innovation [1]. Yet, this marks _just_ the first year of generative AI’s widespread adoption: the early days of a new field poised to disrupt how we interact with machines.

One of the most thought-provoking use cases belongs to **G**enerative **Q**uestion-**A**nswering (GQA). Using GQA, we can sculpt human-like interaction with machines for information retrieval (IR).

We all use IR systems every day. Google search indexes the web and retrieves information relevant to your search terms. Netflix uses your behavior and history on the platform to recommend new TV shows and movies, and Amazon does the same with products [2].

These applications of IR are world-changing. Yet, they may be little more than a faint echo of what we will see in the coming months and years with the combination of IR and GQA.

Imagine a Google that can answer your queries with an intelligent and insightful summary based on the top 20 pages — highlighting key points and information sources.

The technology available today already makes this possible and surprisingly easy. This article will look at retrieval-augmented GQA and how to implement it with Pinecone and OpenAI.

[Video](https://www.youtube.com/watch?v=dRUIGgNBvVk)


---

## Generative Question-Answering

The most straightforward GQA system requires nothing more than a user text query and a large language model (LLM).

![Simple generative question answering](https://cdn.sanity.io/images/vr8gru94/production/b624418ee01e8a66edb84e0dec68157dec530e70-1721x536.png)


We can access one of the most advanced LLMs in the world via OpenAI. To start, we sign up for an [API key](https://beta.openai.com/).

![OpenAI API Key](https://cdn.sanity.io/images/vr8gru94/production/f0681043e50c33904bcc6adba6d7df56b4aac0fd-4083x1955.png)


Then we switch to a Python file or notebook, install some prerequisites, and initialize our connection to OpenAI.

```json
{
  "_key": "96e89d58d5b5",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 1,\n      \"metadata\": {\n        \"id\": \"VpMvHAYRQf9N\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"!pip install -qU openai pinecone-client datasets\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"id\": \"aEreHNxYkDbK\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import openai\\n\",\n        \"\\n\",\n        \"# get API key from top-right dropdown on OpenAI website\\n\",\n        \"openai.api_key = \\\"OPENAI_API_KEY\\\"\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```
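
Hardcoding the key in a notebook makes it easy to leak by accident. As a minimal sketch of one alternative, assuming the key has been exported as an environment variable named `OPENAI_API_KEY` (both that variable name and the `read_api_key` helper are illustrative choices, not part of the original notebook):

```python
import os


def read_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the API key stored under `name`, or raise if it is missing."""
    # `environ` defaults to the real process environment; passing a dict
    # makes the helper easy to try out without touching real settings
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {name} is not set")
    return key


# usage (assumes `import openai` as in the cell above):
# openai.api_key = read_api_key()
```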

From here, we can use the OpenAI completion endpoint to ask a question like _“who was the 12th person on the moon and when did they land?”_:

```json
{
  "_key": "f9e96772f90e",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 3,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 35\n        },\n        \"id\": \"9FEDn7LvkDYj\",\n        \"outputId\": \"dea469a8-55ab-491f-f645-356e86d361ac\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"application/vnd.google.colaboratory.intrinsic+json\": {\n              \"type\": \"string\"\n            },\n            \"text/plain\": [\n              \"'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'\"\n            ]\n          },\n          \"execution_count\": 3,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"query = \\\"who was the 12th person on the moon and when did they land?\\\"\\n\",\n        \"\\n\",\n        \"# now query text-davinci-003 WITHOUT context\\n\",\n        \"res = openai.Completion.create(\\n\",\n        \"    engine='text-davinci-003',\\n\",\n        \"    prompt=query,\\n\",\n        \"    temperature=0,\\n\",\n        \"    max_tokens=400,\\n\",\n        \"    top_p=1,\\n\",\n        \"    frequency_penalty=0,\\n\",\n        \"    presence_penalty=0,\\n\",\n        \"    stop=None\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"res['choices'][0]['text'].strip()\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

We get an accurate answer immediately. Yet this question is relatively easy. What happens if we ask about a lesser-known topic?

```json
{
  "_key": "f71034c1e228",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 4,\n      \"metadata\": {\n        \"id\": \"SczFSfnjmNji\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# first let's make it simpler to get answers\\n\",\n        \"def complete(prompt):\\n\",\n        \"    # query text-davinci-003\\n\",\n        \"    res = openai.Completion.create(\\n\",\n        \"        engine='text-davinci-003',\\n\",\n        \"        prompt=prompt,\\n\",\n        \"        temperature=0,\\n\",\n        \"        max_tokens=400,\\n\",\n        \"        top_p=1,\\n\",\n        \"        frequency_penalty=0,\\n\",\n        \"        presence_penalty=0,\\n\",\n        \"        stop=None\\n\",\n        \"    )\\n\",\n        \"    return res['choices'][0]['text'].strip()\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 5,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 89\n        },\n        \"id\": \"H2fUC8BtxCt_\",\n        \"outputId\": \"01beb42c-1f32-4e08-afc5-127e2dc5597a\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"application/vnd.google.colaboratory.intrinsic+json\": {\n              \"type\": \"string\"\n            },\n            \"text/plain\": [\n              \"'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. 
This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'\"\n            ]\n          },\n          \"execution_count\": 5,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"query = (\\n\",\n        \"    \\\"Which training method should I use for sentence transformers when \\\" +\\n\",\n        \"    \\\"I only have pairs of related sentences?\\\"\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"complete(query)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

Although this answer is technically correct, it doesn’t really answer the question. It tells us to use a supervised training method and learn the relationships between sentences. Both statements are true, but the answer never names a concrete training method, which is what the question asked for.

There are two options for helping our LLM better understand the topic and answer the question more precisely:

1. We fine-tune the LLM on text data covering the domain of fine-tuning sentence transformers.
2. We use _retrieval-augmented generation_, meaning we add an information retrieval component to our GQA process. Adding a retrieval step allows us to retrieve relevant information and feed this into the LLM as a _secondary source_ of information.

In the following sections, we will outline how to implement option **two**.
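
As a rough preview of what "feeding retrieved information into the LLM" can look like, here is a minimal sketch of the prompt-assembly step. It assumes the relevant passages have already been retrieved as plain strings. The `build_prompt` name, the toy passages, and the exact budgeting logic are illustrative assumptions; the instruction wording and the 3750-character limit mirror the prompt used later in this article.

```python
def build_prompt(query, contexts, limit=3750):
    """Assemble a prompt from retrieved passages, within a character budget."""
    separator = "\n\n---\n\n"
    kept = []
    for context in contexts:
        candidate = separator.join(kept + [context])
        if len(candidate) > limit:
            break  # adding this passage would exceed the budget
        kept.append(context)
    return (
        "Answer the question based on the context below.\n\n"
        "Context:\n"
        + separator.join(kept)
        + f"\n\nQuestion: {query}\nAnswer:"
    )


# toy passages standing in for text returned by the retrieval step
prompt = build_prompt(
    "Which loss works with pairs of related sentences?",
    ["Passage one about training losses.", "Passage two about pooling."],
)
print(prompt)
```

The resulting string is what would be sent to the completion endpoint in place of the bare question.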

---

## Building a Knowledge Base

With option **two**, which adds retrieval, we need an external _“knowledge base”_: a store of information that acts as an external reference for GQA models, together with a system that retrieves this information effectively. We can think of it as the _“long-term memory”_ for AI systems.

We refer to knowledge bases that can enable the retrieval of semantically relevant information as _vector databases_.

A vector database stores vector representations of information encoded using specific ML models. These models have an “understanding” of language and can encode passages with similar meanings into nearby points in a vector space, and passages with dissimilar meanings into points that are far apart.

![Similar v Dissimilar](https://cdn.sanity.io/images/vr8gru94/production/7605afe2a111cdcd6372af6287293a1ae6dacd85-2601x682.png)
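
To make "nearby" concrete, the sketch below computes cosine similarity, the metric used for the index later in this article, between made-up three-dimensional vectors. The vectors and their labels are invented for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up 3-dimensional "embeddings", invented for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, invoice))  # close to 0: nearly orthogonal
```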


We can create these embeddings with OpenAI via the `Embedding` endpoint:

```json
{
  "_key": "4668f74cb9a5",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 6,\n      \"metadata\": {\n        \"id\": \"EI2iYxq16or9\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"embed_model = \\\"text-embedding-ada-002\\\"\\n\",\n        \"\\n\",\n        \"res = openai.Embedding.create(\\n\",\n        \"    input=[\\n\",\n        \"        \\\"Sample document text goes here\\\",\\n\",\n        \"        \\\"there will be several phrases in each batch\\\"\\n\",\n        \"    ], engine=embed_model\\n\",\n        \")\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 7,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"57smZFmz61tj\",\n        \"outputId\": \"30745411-1f44-4abb-ac36-20abcfdbb343\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"dict_keys(['object', 'data', 'model', 'usage'])\"\n            ]\n          },\n          \"execution_count\": 7,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"# vector embeddings are stored within the 'data' key\\n\",\n        \"res.keys()\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 8,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"36D4ipOR63AW\",\n        \"outputId\": \"10a3d6ba-a646-4ebd-d74f-90868d04a6f6\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"2\"\n            ]\n          },\n          \"execution_count\": 8,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"# we have created two vectors (one for each sentence input)\\n\",\n        
\"len(res['data'])\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 9,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"dPyGLhDX62t4\",\n        \"outputId\": \"f5d38bb2-f863-4d39-c8f6-d75579634ec9\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"(1536, 1536)\"\n            ]\n          },\n          \"execution_count\": 9,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"# we have created two 1536-dimensional vectors\\n\",\n        \"len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

We’ll need to repeat this embedding process over many records that will act as our pipeline’s external source of information. These records still need to be downloaded and prepared for embedding.

## Data Preparation

The dataset we will use in our knowledge base is the `jamescalam/youtube-transcriptions` dataset hosted on Hugging Face _Datasets_. It contains transcribed audio from several ML and tech YouTube channels. We download it with the following:

```json
{
  "_key": "e55b0b35d642",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"t7uzxkGz73Ov\",\n        \"outputId\": \"995123eb-8f78-44b0-b325-e0ce2284b168\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"Dataset({\\n\",\n              \"    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],\\n\",\n              \"    num_rows: 208619\\n\",\n              \"})\"\n            ]\n          },\n          \"execution_count\": 2,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"from datasets import load_dataset\\n\",\n        \"\\n\",\n        \"data = load_dataset('jamescalam/youtube-transcriptions', split='train')\\n\",\n        \"data\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 3,\n      \"metadata\": {},\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',\\n\",\n              \" 'published': '2021-07-06 13:00:03 UTC',\\n\",\n              \" 'url': 'https://youtu.be/35Pdoyi6ZoQ',\\n\",\n              \" 'video_id': '35Pdoyi6ZoQ',\\n\",\n              \" 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\\n\",\n              \" 'id': '35Pdoyi6ZoQ-t0.0',\\n\",\n              \" 'text': 'Hi, welcome to the video.',\\n\",\n              \" 'start': 0.0,\\n\",\n              \" 'end': 9.36}\"\n            ]\n          },\n          \"execution_count\": 3,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"data[0]\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      
\"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

The dataset contains many small snippets of text data. We need to merge several snippets to create more substantial chunks of text that contain more meaningful information.

```json
{
  "_key": "a13f8357ee7f",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 11,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 49,\n          \"referenced_widgets\": [\n            \"8b7d062ee1c14bf6b0c55da89ff4b551\",\n            \"8f03d894148346bb90897fb39d6ec686\",\n            \"a115589785f34bc38e1730e8b497eef6\",\n            \"f205f29abe8d47f6b28627684f947bcd\",\n            \"d5aeb124d44744d2aaaa7d5b213caca7\",\n            \"2af9f8bae68d406d8cd4f56acf3db9e4\",\n            \"88f0b5625c9a4ce89d8a30fdf28efd90\",\n            \"a21a6992c8a744d49826ab0f56b867ed\",\n            \"6aa795a589714b058783f5f3eb5983e1\",\n            \"1d70ba5c815a4473939665061e52ae6e\",\n            \"fbd86b292a484498a61acf0ea7f5e814\"\n          ]\n        },\n        \"id\": \"uG9ZTI0o-9cJ\",\n        \"outputId\": \"30b65907-eea0-4de0-c457-d69531e388c3\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from tqdm.auto import tqdm\\n\",\n        \"\\n\",\n        \"new_data = []\\n\",\n        \"\\n\",\n        \"window = 20  # number of sentences to combine\\n\",\n        \"stride = 4  # number of sentences to 'stride' over, used to create overlap\\n\",\n        \"\\n\",\n        \"for i in tqdm(range(0, len(data), stride)):\\n\",\n        \"    i_end = min(len(data)-1, i+window)\\n\",\n        \"    if data[i]['title'] != data[i_end]['title']:\\n\",\n        \"        # in this case we skip this entry as we have start/end of two videos\\n\",\n        \"        continue\\n\",\n        \"    text = ' '.join(data[i:i_end]['text'])\\n\",\n        \"    # create the new merged dataset\\n\",\n        \"    new_data.append({\\n\",\n        \"        'start': data[i]['start'],\\n\",\n        \"        'end': data[i_end]['end'],\\n\",\n        \"        'title': data[i]['title'],\\n\",\n        \"        'text': text,\\n\",\n        \"        'id': data[i]['id'],\\n\",\n  
      \"        'url': data[i]['url'],\\n\",\n        \"        'published': data[i]['published'],\\n\",\n        \"        'channel_id': data[i]['channel_id']\\n\",\n        \"    })\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 12,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"wN0BuMWSnqId\",\n        \"outputId\": \"2b733986-c26b-487b-a4cb-336602a1a3dc\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"{'start': 0.0,\\n\",\n              \" 'end': 74.12,\\n\",\n              \" 'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',\\n\",\n              \" 'text': \\\"Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. 
So I mean, first, let's just have a quick look at the structure of our data.\\\",\\n\",\n              \" 'id': '35Pdoyi6ZoQ-t0.0',\\n\",\n              \" 'url': 'https://youtu.be/35Pdoyi6ZoQ',\\n\",\n              \" 'published': '2021-07-06 13:00:03 UTC',\\n\",\n              \" 'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}\"\n            ]\n          },\n          \"execution_count\": 12,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"new_data[0]\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```
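
The effect of `window` and `stride` is easier to see on a toy input. The sketch below is a simplified version of the merging loop above: the `merge_snippets` helper is hypothetical, it works on plain strings, and it leaves out the check that skips chunks spanning two videos.

```python
def merge_snippets(snippets, window=20, stride=4):
    """Join `window` consecutive snippets, starting a new chunk every `stride`."""
    chunks = []
    for start in range(0, len(snippets), stride):
        chunks.append(" ".join(snippets[start:start + window]))
    return chunks


# ten toy "sentences" with a small window so the overlap is easy to see
sentences = [f"s{n}" for n in range(10)]
for chunk in merge_snippets(sentences, window=4, stride=2):
    print(chunk)
```

Because the stride is smaller than the window, consecutive chunks share text, so a passage cut off at the end of one chunk still appears intact in a neighbouring one.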

With the text chunks created, we can begin initializing our knowledge base and populating it with our data.

## Creating the Vector Database

The vector database is the storage and retrieval component in our pipeline. We use Pinecone as our vector database. For this, we need to sign up for a [free API key](https://app.pinecone.io/) and enter it below, where we create the index for storing our data.

```json
{
  "_key": "2efef44e5d59",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 13,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"UPNwQTH0RNcl\",\n        \"outputId\": \"c5d22baf-0e69-4039-fda0-624ce22cd740\"\n      },\n      \"outputs\": [\n        {\n          \"data\": {\n            \"text/plain\": [\n              \"{'dimension': 1536,\\n\",\n              \" 'index_fullness': 0.0,\\n\",\n              \" 'namespaces': {},\\n\",\n              \" 'total_vector_count': 0}\"\n            ]\n          },\n          \"execution_count\": 13,\n          \"metadata\": {},\n          \"output_type\": \"execute_result\"\n        }\n      ],\n      \"source\": [\n        \"import pinecone\\n\",\n        \"\\n\",\n        \"index_name = 'openai-youtube-transcriptions'\\n\",\n        \"\\n\",\n        \"# initialize connection (get API key at app.pinecone.io)\\n\",\n        \"pinecone.init(\\n\",\n        \"    api_key=\\\"YOUR_API_KEY\\\",\\n\",\n        \"    environment=\\\"YOUR_ENV\\\"  # find next to API key\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"# check if index already exists (it shouldn't if this is first time)\\n\",\n        \"if index_name not in pinecone.list_indexes():\\n\",\n        \"    # if does not exist, create index\\n\",\n        \"    pinecone.create_index(\\n\",\n        \"        index_name,\\n\",\n        \"        dimension=len(res['data'][0]['embedding']),\\n\",\n        \"        metric='cosine',\\n\",\n        \"        metadata_config={\\n\",\n        \"            'indexed': ['channel_id', 'published']\\n\",\n        \"        }\\n\",\n        \"    )\\n\",\n        \"# connect to index\\n\",\n        \"index = pinecone.Index(index_name)\\n\",\n        \"# view index stats\\n\",\n        \"index.describe_index_stats()\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    
\"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

Then we embed and index the dataset like so:

```json
{
  "_key": "a256ac28905e",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 14,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 49,\n          \"referenced_widgets\": [\n            \"bb392ce2d1e047daa1747c4a0f5e89b7\",\n            \"c298c3dc46ed4f2e85e34a9972b3faf4\",\n            \"b335ce0994e045df8a886ca32e3ebb76\",\n            \"b3f2dde1b97c4989b0e6e5ea3365a270\",\n            \"1cbbf96f9b7f46c29ebbd696e5777e82\",\n            \"3172ad39260d41aea64fba5df2c13961\",\n            \"7c20e179ec504d4caed56e17e3f53e02\",\n            \"1a7b9a94c88a4496a24d4e57d4801047\",\n            \"2b2409a6c2024d57b2b6ddb0e76c7068\",\n            \"d26a5434beda435aa5d62c5c11de2bb8\",\n            \"c56384bfac4c4f5596787f04fd76b86c\"\n          ]\n        },\n        \"id\": \"vPb9liovzrc8\",\n        \"outputId\": \"bb69dbce-c140-49af-840f-2c03dd940e2a\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from tqdm.auto import tqdm\\n\",\n        \"import datetime\\n\",\n        \"from time import sleep\\n\",\n        \"\\n\",\n        \"batch_size = 100  # how many embeddings we create and insert at once\\n\",\n        \"\\n\",\n        \"for i in tqdm(range(0, len(new_data), batch_size)):\\n\",\n        \"    # find end of batch\\n\",\n        \"    i_end = min(len(new_data), i+batch_size)\\n\",\n        \"    meta_batch = new_data[i:i_end]\\n\",\n        \"    # get ids\\n\",\n        \"    ids_batch = [x['id'] for x in meta_batch]\\n\",\n        \"    # get texts to encode\\n\",\n        \"    texts = [x['text'] for x in meta_batch]\\n\",\n        \"    # create embeddings (try-except added to avoid RateLimitError)\\n\",\n        \"    try:\\n\",\n        \"        res = openai.Embedding.create(input=texts, engine=embed_model)\\n\",\n        \"    except:\\n\",\n        \"        done = False\\n\",\n        \"        while not done:\\n\",\n        \"            
sleep(5)\\n\",\n        \"            try:\\n\",\n        \"                res = openai.Embedding.create(input=texts, engine=embed_model)\\n\",\n        \"                done = True\\n\",\n        \"            except:\\n\",\n        \"                pass\\n\",\n        \"    embeds = [record['embedding'] for record in res['data']]\\n\",\n        \"    # cleanup metadata\\n\",\n        \"    meta_batch = [{\\n\",\n        \"        'start': x['start'],\\n\",\n        \"        'end': x['end'],\\n\",\n        \"        'title': x['title'],\\n\",\n        \"        'text': x['text'],\\n\",\n        \"        'url': x['url'],\\n\",\n        \"        'published': x['published'],\\n\",\n        \"        'channel_id': x['channel_id']\\n\",\n        \"    } for x in meta_batch]\\n\",\n        \"    to_upsert = list(zip(ids_batch, embeds, meta_batch))\\n\",\n        \"    # upsert to Pinecone\\n\",\n        \"    index.upsert(vectors=to_upsert)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```
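
The loop above retries indefinitely and catches every exception, which can hide real errors such as an invalid API key. A bounded alternative might look like the following sketch. The `with_retries` helper, its parameters, and the simulated failure are assumptions for illustration; in practice you would likely catch only the rate-limit error mentioned in the notebook comment rather than `Exception`.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn()`, retrying with exponential backoff when it raises."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# toy stand-in for an API call that fails twice before succeeding
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


# sleep is stubbed out here so the example runs instantly
result = with_retries(flaky_call, sleep=lambda seconds: None)
print(result, calls["count"])
```

In the indexing loop, the embedding request would be wrapped in a small function or lambda and passed as `fn`.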

We’re ready to combine OpenAI’s `Completion` and `Embedding` endpoints with our Pinecone vector database to create a retrieval-augmented GQA system.

---

## OP Stack

The OpenAI Pinecone (OP) stack is an increasingly popular choice for building high-performance AI apps, including retrieval-augmented GQA.

Our pipeline during _query time_ consists of the following:

1. OpenAI `Embedding` endpoint to create vector representations of each query.
2. Pinecone vector database to search for relevant passages from the database of previously indexed contexts.
3. OpenAI `Completion` endpoint to generate a natural language answer considering the retrieved contexts.

![Retrieval Augmented GQA Query](https://cdn.sanity.io/images/vr8gru94/production/400ccea8c22458fa48d772573812e1d161593cbe-2750x1165.png)


We start by encoding queries using the same encoder model we used to index the data, creating a query vector `xq`.

```json
{
  "_key": "6384d554cd8f",
  "_type": "colabBlock",
  "jsonContent": "{\n    \"cells\": [\n      {\n        \"cell_type\": \"code\",\n        \"execution_count\": 15,\n        \"metadata\": {\n          \"id\": \"LF1U_yZGojRJ\"\n        },\n        \"outputs\": [],\n        \"source\": [\n          \"res = openai.Embedding.create(\\n\",\n          \"    input=[query],\\n\",\n          \"    engine=embed_model\\n\",\n          \")\\n\",\n          \"\\n\",\n          \"# retrieve from Pinecone\\n\",\n          \"xq = res['data'][0]['embedding']\\n\",\n          \"\\n\",\n          \"# get relevant contexts (including the questions)\\n\",\n          \"res = index.query(xq, top_k=2, include_metadata=True)\"\n        ]\n      },\n      {\n        \"cell_type\": \"code\",\n        \"execution_count\": 16,\n        \"metadata\": {\n          \"colab\": {\n            \"base_uri\": \"https://localhost:8080/\"\n          },\n          \"id\": \"GH_DkmsNomww\",\n          \"outputId\": \"fc84b83b-164d-45a7-9e80-9b11519bf25b\"\n        },\n        \"outputs\": [\n          {\n            \"data\": {\n              \"text/plain\": [\n                \"{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',\\n\",\n                \"     'metadata': {\\n\",\n                \"         'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\\n\",\n                \"         'end': 568.4,\\n\",\n                \"         'published': datetime.date(2021, 11, 24),\\n\",\n                \"         'start': 418.88,\\n\",\n                \"         'text': 'pairs of related sentences you can go '\\n\",\n                \"                 'ahead and actually try training or '\\n\",\n                \"                 'fine-tuning using NLI with multiple '\\n\",\n                \"                 \\\"negative ranking loss...\\\"\\n\",\n                \"         'title': 'Today Unsupervised Sentence Transformers, '\\n\",\n                \"                  'Tomorrow Skynet (how TSDAE works)',\\n\",\n                \"         'url': 
'https://youtu.be/pNvujJ1XyeQ'\\n\",\n                \"     },\\n\",\n                \"     'score': 0.865277052,\\n\",\n                \"     'sparseValues': {},\\n\",\n                \"     'values': []},\\n\",\n                \"    {'id': 'WS1uVMGhlWQ-t737.28',\\n\",\n                \"     'metadata': {\\n\",\n                \"         'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\\n\",\n                \"         'end': 900.72,\\n\",\n                \"         'published': datetime.date(2021, 10, 20),\\n\",\n                \"         'start': 737.28,\\n\",\n                \"         'text': \\\"were actually more accurate. So we can't \\\"\\n\",\n                \"                 \\\"really do that. We can't use this what is \\\"\\n\",\n                \"                 'called a mean pooling approach. Or we '\\n\",\n                \"                 \\\"can't use it in its current form...\\\"\\n\",\n                \"         'title': 'Intro to Sentence Embeddings with '\\n\",\n                \"                  'Transformers',\\n\",\n                \"         'url': 'https://youtu.be/WS1uVMGhlWQ'\\n\",\n                \"     },\\n\",\n                \"     'score': 0.85855335,\\n\",\n                \"     'sparseValues': {},\\n\",\n                \"     'values': []}],\\n\",\n                \" 'namespace': ''}\"\n              ]\n            },\n            \"execution_count\": 16,\n            \"metadata\": {},\n            \"output_type\": \"execute_result\"\n          }\n        ],\n        \"source\": [\n          \"res\"\n        ]\n      }\n    ],\n    \"metadata\": {\n      \"colab\": {\n        \"provenance\": []\n      },\n      \"kernelspec\": {\n        \"display_name\": \"ml\",\n        \"language\": \"python\",\n        \"name\": \"python3\"\n      },\n      \"language_info\": {\n        \"codemirror_mode\": {\n          \"name\": \"ipython\",\n          \"version\": 3\n        },\n        \"file_extension\": \".py\",\n        
\"mimetype\": \"text/x-python\",\n        \"name\": \"python\",\n        \"nbconvert_exporter\": \"python\",\n        \"pygments_lexer\": \"ipython3\",\n        \"version\": \"3.9.12\"\n      },\n      \"vscode\": {\n        \"interpreter\": {\n          \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n        }\n      },\n      \"widgets\": {}\n    },\n    \"nbformat\": 4,\n    \"nbformat_minor\": 0\n  }"
}
```

The query vector `xq` is passed to Pinecone via `index.query`, which compares it against the previously indexed passage vectors and returns the most similar matches in `res` above.
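To make "most similar" concrete, here is a minimal sketch of what a similarity search does conceptually. It assumes a cosine metric and uses brute-force comparison over toy 3-dimensional vectors. Pinecone itself uses approximate search at scale, and the metric depends on how the index was created, so treat this as an illustration only. The names `cosine_similarity`, `top_k_matches`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vec, indexed, k=2):
    """Score every indexed vector against the query and keep the k best."""
    scored = [
        (vec_id, cosine_similarity(query_vec, vec))
        for vec_id, vec in indexed.items()
    ]
    # highest similarity first, mirroring the ranked 'matches' list
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy vectors standing in for real embeddings (assumption: not real data)
toy_index = {
    "passage-a": [1.0, 0.0, 0.0],
    "passage-b": [0.7, 0.7, 0.0],
    "passage-c": [0.0, 0.0, 1.0],
}
matches = top_k_matches([1.0, 0.1, 0.0], toy_index, k=2)
```

Here `matches` holds the two passages closest in direction to the query vector, which plays the same role as `top_k=2` in the `index.query` call.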

Using these returned contexts, we can construct a prompt instructing the generative LLM to answer the question based on the retrieved contexts. To keep things simple, we will do all this in a function called `retrieve`.

```json
{
  "_key": "c24b4ac84884",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 30,\n      \"metadata\": {\n        \"id\": \"92NmGGJ1TKQp\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"limit = 3750\\n\",\n        \"\\n\",\n        \"def retrieve(query):\\n\",\n        \"    res = openai.Embedding.create(\\n\",\n        \"        input=[query],\\n\",\n        \"        engine=embed_model\\n\",\n        \"    )\\n\",\n        \"\\n\",\n        \"    # retrieve from Pinecone\\n\",\n        \"    xq = res['data'][0]['embedding']\\n\",\n        \"\\n\",\n        \"    # get relevant contexts\\n\",\n        \"    res = index.query(xq, top_k=3, include_metadata=True)\\n\",\n        \"    contexts = [\\n\",\n        \"        x['metadata']['text'] for x in res['matches']\\n\",\n        \"    ]\\n\",\n        \"\\n\",\n        \"    # build our prompt with the retrieved contexts included\\n\",\n        \"    prompt_start = (\\n\",\n        \"        \\\"Answer the question based on the context below.\\\\n\\\\n\\\"+\\n\",\n        \"        \\\"Context:\\\\n\\\"\\n\",\n        \"    )\\n\",\n        \"    prompt_end = (\\n\",\n        \"        f\\\"\\\\n\\\\nQuestion: {query}\\\\nAnswer:\\\"\\n\",\n        \"    )\\n\",\n        \"    # append contexts until hitting limit\\n\",\n        \"    for i in range(1, len(contexts)):\\n\",\n        \"        if len(\\\"\\\\n\\\\n---\\\\n\\\\n\\\".join(contexts[:i])) >= limit:\\n\",\n        \"            prompt = (\\n\",\n        \"                prompt_start +\\n\",\n        \"                \\\"\\\\n\\\\n---\\\\n\\\\n\\\".join(contexts[:i-1]) +\\n\",\n        \"                prompt_end\\n\",\n        \"            )\\n\",\n        \"            break\\n\",\n        \"        elif i == len(contexts)-1:\\n\",\n        \"            prompt = (\\n\",\n        \"                prompt_start +\\n\",\n        \"                \\\"\\\\n\\\\n---\\\\n\\\\n\\\".join(contexts) +\\n\",\n       
 \"                prompt_end\\n\",\n        \"            )\\n\",\n        \"    return prompt\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"# first we retrieve relevant items from Pinecone\\n\",\n        \"query_with_contexts = retrieve(query)\\n\",\n        \"query_with_contexts\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 142\n        },\n        \"id\": \"LwsZuxiTvU2d\",\n        \"outputId\": \"7e3acf8b-7356-41bc-8c9e-5405a21153e0\"\n      },\n      \"execution_count\": 31,\n      \"outputs\": [\n        {\n          \"output_type\": \"execute_result\",\n          \"data\": {\n            \"text/plain\": [\n              \"\\\"Answer the question based on the context below.\\\\n\\\\nContext:\\\\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have...\\\\n\\\\n---\\\\n\\\\n...we have the core transform models and what S BERT does is fine tunes on sentence pairs using what is called a Siamese architecture or Siamese network...\\\\n\\\\n---\\\\n\\\\n...we're looking at here is Natural Language Inference or NLI and NLI requires that we have pairs of sentences that are labeled as either contradictory, neutral which means they're not necessarily related or as entailing or as inferring each other. 
So you have pairs that entail each other...\\\\n\\\\nQuestion: Which training method should I use for sentence transformers when I only have pairs of related sentences?\\\\nAnswer:\\\"\"\n            ],\n            \"application/vnd.google.colaboratory.intrinsic+json\": {\n              \"type\": \"string\"\n            }\n          },\n          \"metadata\": {},\n          \"execution_count\": 31\n        }\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

Note that the generated _expanded query_ (`query_with_contexts`) has been shortened for readability.

From `retrieve`, we produce a longer prompt (`query_with_contexts`) containing some instructions, the contexts, and the original question.
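One thing to watch in `retrieve`: the loop runs over `range(1, len(contexts))`, so it never executes when Pinecone returns zero or one match, and `prompt` is then never assigned. Below is a sketch of a slightly more defensive prompt builder that keeps the same prompt text, separator, and character-based `limit`. The function name `build_prompt` is hypothetical and is not part of the original notebook.

```python
def build_prompt(query, contexts, limit=3750):
    """Join as many contexts as fit within `limit` characters, then add the question."""
    separator = "\n\n---\n\n"
    prompt_start = "Answer the question based on the context below.\n\nContext:\n"
    prompt_end = f"\n\nQuestion: {query}\nAnswer:"

    selected = []
    for context in contexts:
        # check the length of the joined contexts *including* the next one
        candidate = separator.join(selected + [context])
        if len(candidate) > limit:
            break
        selected.append(context)

    # always returns a prompt, even when no context fits or none was retrieved
    return prompt_start + separator.join(selected) + prompt_end


example_prompt = build_prompt(
    "Which training method should I use?",
    ["first retrieved passage", "second retrieved passage"],
)
```

Inside `retrieve`, the final loop could be swapped for `return build_prompt(query, contexts, limit)`. Note that `limit` counts characters here, as in the original code, not model tokens.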

The prompt is then fed into the generative LLM via OpenAI’s `Completion` endpoint. As before, we use the `complete` function to handle everything.

```json
{
  "_key": "f3bc4ec031e2",
  "_type": "colabBlock",
  "jsonContent": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"# then we complete the context-infused query\\n\",\n        \"complete(query_with_contexts)\"\n      ],\n      \"metadata\": {\n        \"id\": \"ioDVGF7lkDQL\",\n        \"outputId\": \"88bbbd48-89b1-4485-f511-cc5014bf3a5b\",\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 35\n        }\n      },\n      \"execution_count\": 32,\n      \"outputs\": [\n        {\n          \"output_type\": \"execute_result\",\n          \"data\": {\n            \"text/plain\": [\n              \"'You should use Natural Language Inference (NLI) with multiple negative ranking loss.'\"\n            ],\n            \"application/vnd.google.colaboratory.intrinsic+json\": {\n              \"type\": \"string\"\n            }\n          },\n          \"metadata\": {},\n          \"execution_count\": 32\n        }\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"ml\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.9.12\"\n    },\n    \"vscode\": {\n      \"interpreter\": {\n        \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n      }\n    },\n    \"widgets\": {}\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}"
}
```

Because of the additional _“source knowledge”_ (information fed directly into the model), the LLM no longer hallucinates on this query and instead produces an accurate answer to our question.

Beyond providing more factual answers, we also have the _sources_ of information from Pinecone used to generate our answer. Surfacing these in downstream tools or apps can help improve user trust in the system by allowing users to confirm the reliability of the information presented to them.
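As a rough sketch of how those sources might be surfaced, the helper below pulls the `title` and `url` metadata fields out of the matches and removes duplicate videos. The function name `collect_sources` is hypothetical, and the sample data is a trimmed copy of the query response shown earlier, written as plain dictionaries.

```python
def collect_sources(matches):
    """Return unique (title, url) pairs from the matches, in ranked order."""
    seen = set()
    sources = []
    for match in matches:
        metadata = match["metadata"]
        url = metadata["url"]
        # several passages can come from the same video, so list each URL once
        if url in seen:
            continue
        seen.add(url)
        sources.append((metadata["title"], url))
    return sources


# trimmed copy of the matches returned by the earlier query
sample_matches = [
    {
        "id": "pNvujJ1XyeQ-t418.88",
        "score": 0.865277052,
        "metadata": {
            "title": "Today Unsupervised Sentence Transformers, "
                     "Tomorrow Skynet (how TSDAE works)",
            "url": "https://youtu.be/pNvujJ1XyeQ",
        },
    },
    {
        "id": "WS1uVMGhlWQ-t737.28",
        "score": 0.85855335,
        "metadata": {
            "title": "Intro to Sentence Embeddings with Transformers",
            "url": "https://youtu.be/WS1uVMGhlWQ",
        },
    },
]

for title, url in collect_sources(sample_matches):
    print(f"- {title}: {url}")
```

In practice we would pass `res['matches']` from the Pinecone query, and the resulting list could be displayed alongside the generated answer.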

That’s it for this walkthrough of retrieval-augmented **G**enerative **Q**uestion-**A**nswering (GQA) systems.

As demonstrated, LLMs alone work incredibly well but struggle with more niche or specific questions. This often leads to _hallucinations_ that are rarely obvious and likely to go undetected by system users.

By adding a _“long-term memory”_ component to our GQA system, we benefit from an external knowledge base to improve system factuality and user trust in generated outputs.

Naturally, there is vast potential for this type of technology. Although it is new, we are already seeing its use in [YouChat](https://blog.you.com/introducing-youchat-the-ai-search-assistant-that-lives-in-your-search-engine-eff7badcd655) and several [podcast search apps](https://huberman.rile.yt/), and there are rumors of its upcoming use as a challenger to Google itself [3].

There is potential for disruption wherever the need for information exists, and retrieval-augmented GQA represents one of the best opportunities to improve on the outdated information retrieval systems in use today.

## References

[1] E. Griffith, C. Metz, [A New Area of A.I. Booms, Even Amid the Tech Gloom](https://www.nytimes.com/2023/01/07/technology/generative-ai-chatgpt-investments.html) (2023), NYTimes

[2] G. Linden, B. Smith, J. York, [Amazon.com Recommendations: Item-to-Item Collaborative Filtering](https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf) (2003), IEEE

[3] T. Warren, [Microsoft to challenge Google by integrating ChatGPT with Bing search](https://www.theverge.com/2023/1/4/23538552/microsoft-bing-chatgpt-search-google-competition) (2023), The Verge