# Retriever Models for Open Domain Question-Answering

It’s a sci-fi staple. A vital component of the legendary Turing test. The dream of many across the world. And, until recently, impossible.

We are talking about the ability to ask a machine a question and receive a genuinely intelligent, insightful answer.

Until recently, technology like this existed only in books, Hollywood, and our collective imagination. Now, it is everywhere. Most of us use this technology every day, and we often don’t even notice it.

[Video](https://www.youtube.com/watch?v=w1dMEWm7jBc)


Google is just one example. Over the last few years, Google has gradually introduced an intelligent question-answering angle to search. When we now ask _“how do I tie my shoelaces?”_ Google gives us the ‘exact answer’ alongside the _context_ or video this answer came from:

![In response to our question, Google finds the exact (audio-to-text) answer to be “Start by taking the first lace. And place it behind the second one…", and highlights the exact part of the video that contains this extracted answer.](https://cdn.sanity.io/images/vr8gru94/production/c3444481f371b054541a980eaaca566f086b8463-1894x1518.png)


We can ask other questions like _“Is Google Skynet?”_ and this time receive an even more precise answer: **“Yes”**.

![At least Google is honest.](https://cdn.sanity.io/images/vr8gru94/production/8039f56624a4cf4f29df539916f4f84856fbc7a6-1902x900.png)


In this example, Google returns an exact answer and the _context_ (paragraph) from where the answer is extracted.

How does Google do this? And more importantly, why should we care?

This search style emulates a human-like interaction. We’re asking a question in natural language as if we were speaking to another person. This natural language Q&A creates a very different search experience from traditional search.

Imagine you find yourself in the world’s biggest warehouse. You have no idea how the place is organized. All you know is that your task is to find some round marble-like objects.

Where do you start? Well, first you need to figure out how the warehouse is organized. Maybe everything is stored alphabetically, categorized by industry, or grouped by intended use. The traditional search interface requires that we understand how the warehouse is structured before we begin searching. Often, there is a specific ‘query language’, such as:

```
SELECT * WHERE 'material' == 'marble'

or

("marble" | "stone") & "product"
```

Our first task is to learn this query language so we can search. Once we understand how the warehouse is structured, we use that knowledge to begin our search. How do we find _“round marble-like objects”_? We can narrow our search down using similar _queries_ to those above, but we are in the world’s biggest warehouse, so this will take a _very_ long time.

Without a natural Q&A-style interface, this is your search. Unless your users know the ins and outs of the warehouse and its contents, they’re going to struggle.

What happens if we add a natural Q&A-style interface to the warehouse? Imagine we now have people in the warehouse whose entire purpose is to guide us through the warehouse. These people know exactly where everything is.

Those people can understand our question of _“where can I find the round marble-like objects?”_. It may take a few tries until we find the exact object we’re looking for, but we now have a guide who understands our question. There is no longer any need to understand how the warehouse is organized, nor to _know_ the exact name of what we’re trying to find.

With this natural Q&A-style interface, your users now have a guide. They just need to be able to ask a question.

## Answering Questions

How can we design these natural, human-like Q&A interfaces? The answer is **o**pen-**d**omain **q**uestion-**a**nswering (ODQA). ODQA allows us to use natural language to query a database.

That means that, given a dataset like a set of internal company documents, online documentation, or, as is the case with Google, the entire web, we can retrieve relevant information in a natural, more human way.

However, ODQA is not a single model. It is a pipeline built from three primary components:

- A [vector database](https://www.pinecone.io/learn/vector-database/) to store information-rich vectors that numerically represent the _meaning_ of _contexts_ (paragraphs that we use to extract answers to our questions).
- A **retriever** model that encodes questions and contexts into the same vector space. The context vectors are what we later store in the vector database; at query time, the retriever encodes the question and compares it against those stored context vectors to _retrieve_ the most relevant contexts.
- A **reader** model takes a question and context and attempts to identify a _span_ (sub-section) from the context which answers the question.

Building a retriever model is our focus here. Without it, there is no ODQA; it is arguably the most critical component in the whole process. We _need_ our retriever model to return relevant results; otherwise, the reader model will receive and output garbage.

If we instead had a mediocre reader model, it may still return garbage to us, but it has a much smaller negative impact on the ODQA pipeline. A good retriever means we can at least retrieve relevant contexts, therefore successfully returning relevant information to the user. A paragraph-long context isn’t as clean-cut as a perfectly framed two or three-word answer, but it’s better than nothing.

Our focus in this article is on building a _retriever_ model, of which the _vector database_ is a crucial component, as we will see later.

### Train or Not?

Do we need to fine-tune our retriever models? Or can we use pretrained models like those in the [HuggingFace model hub](https://huggingface.co/models)?

The answer is: _it depends_. An excellent concept from Nils Reimers describes the difficulty of benchmarking models whose use case sits within a niche domain that very few people understand. The idea is that most benchmarks and datasets focus on the short head of knowledge (topics most people understand), whereas the most interesting use cases belong in the long-tail portion of the graph [1].

![Nils Reimers’ long tail of semantic relatedness [1]. The more people that know about something (y-axis), the easier it is to find benchmarks and labeled data (x-axis), but the most interesting use cases belong in the long-tail region.](https://cdn.sanity.io/images/vr8gru94/production/2a23e7746b7536352c3dc38b7672d3d5931dbb81-1920x1080.png)


We can take the same idea and modify the x-axis to indicate whether we should be able to take a pretrained model or fine-tune our own.

![The more something is common knowledge (y-axis), the easier it is to find pretrained models that excel in the broader, more general scope. However, as before, most interesting use cases belong in the long-tail, and here is where we would need to fine-tune our own model.](https://cdn.sanity.io/images/vr8gru94/production/54d577568d09eebab7c35f7ae9d623b11b560cf1-1920x820.png)


Imagine you are walking down your local high street. You pick a stranger at random and ask them the sort of question that you would expect from your use case. Do you think they would get the answer? If there’s a good chance they will, you might be able to get away with a pretrained model.

On the other hand, if you ask this stranger what the difference is between RoBERTa and DeBERTa, there is a very high chance that they will have no idea what you’re asking. In this case, you will probably need to fine-tune a retriever model.

## Fine-Tuning a Retriever

Let’s assume the strangers on the street have no chance of answering our questions. Most likely, a custom retriever model is our best bet. But, how do we train/fine-tune a custom retriever model?

The very first ingredient is _data_. Our retriever consumes a question and returns relevant contexts to us. For it to do this, it must learn to encode similar question-context pairs into the same vector space.

![The retriever model must learn to encode similar question-context pairs into a similar vector space.](https://cdn.sanity.io/images/vr8gru94/production/fa846bd026f164e329c8dff9b57b8fa15ab226eb-1920x760.png)


Our first task is to find and create a set of question-context pairs. One of the best-known datasets for this is the **S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD).

### Step One: Data

SQuAD is a reading comprehension dataset built from question, context, and answer triplets, with information drawn from Wikipedia articles. Let’s take a look at an example.

```python
# install the HF datasets library if needed
!pip install datasets
```

```python
from datasets import load_dataset

squad = load_dataset('squad_v2', split='train')
squad
```

```
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})
```

```python
squad[0]
```

```
{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s... featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}
```

We first download the `squad_v2` dataset via 🤗 _Datasets_. In the first sample, we can see:

- the `title` (or topic) of _Beyoncé_
- the `context`, a short paragraph from Wikipedia about Beyoncé
- a `question`, _“When did Beyonce start becoming popular?"_
- the answer `text`, _“in the late 1990s”_, which is extracted from the _context_
- the `answer_start`, which is the starting position of the answer within the _context_ string.
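As a quick illustration of how `answer_start` works, the answer span can be recovered by slicing the context string. The sample below is made up and shortened; real SQuAD contexts are full paragraphs.

```python
# made-up, shortened sample for illustration only
context = 'Born and raised in Houston, Texas, she rose to fame in the late 1990s.'
answers = {'text': ['in the late 1990s'], 'answer_start': [52]}

# the answer text sits at context[answer_start : answer_start + len(text)]
start = answers['answer_start'][0]
end = start + len(answers['text'][0])
context[start:end]  # → 'in the late 1990s'
```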

The SQuAD v2 dataset contains _130,319_ of these samples, more than enough for us to train a good retriever model.

We will be using the _Sentence Transformers_ library to train our retriever model. When using this library, we must format our training data into a list of `InputExample` objects.

```python
from sentence_transformers import InputExample
from tqdm.auto import tqdm

train = []
for row in tqdm(squad):
    train.append(InputExample(
        texts=[row['question'], row['context']]
    ))
```

```
100%|██████████| 130319/130319 [00:08<00:00, 16011.35it/s]
```

After creating this list of `InputExample` objects, we need to load them into a data loader. A data loader is commonly used with PyTorch, which Sentence Transformers uses under the hood. Because of this, we can often use the PyTorch `DataLoader` class.

However, we need to do something slightly different. Our training data consists of positive question-context pairs; positive meaning that every sample in our dataset can be viewed as having a positive or _high_ similarity. There are no negative or dissimilar pairs.

When our data looks like this, one of the most effective training approaches uses the **M**ultiple **N**egatives **R**anking (MNR) loss function. We will not explain MNR loss in detail in this article, but you [can learn about it here](https://www.pinecone.io/learn/series/nlp/fine-tune-sentence-transformers-mnr/).
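In brief, MNR loss treats each question’s own context as the positive and every other context in the batch as a negative. A minimal pure-Python sketch, using unscaled dot-product similarity (library implementations typically use cosine similarity with a scale factor):

```python
import math

def mnr_loss(question_vecs, context_vecs):
    """Sketch of MNR loss: pair i is the positive; every other in-batch
    context acts as a negative (softmax cross-entropy over similarities)."""
    n = len(question_vecs)
    total = 0.0
    for i, q in enumerate(question_vecs):
        # similarity of question i against every context in the batch
        scores = [sum(qd * cd for qd, cd in zip(q, c)) for c in context_vecs]
        # cross-entropy with the matching context (index i) as the true label
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += log_denom - scores[i]
    return total / n

# matched question-context pairs score a lower (better) loss than shuffled pairs
questions = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[1.0, 0.0], [0.0, 1.0]]
mnr_loss(questions, contexts)  # ≈ 0.313
```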

One crucial property of training with MNR loss is that each training batch does _not_ contain duplicate questions or contexts. This is a problem, as the SQuAD data includes several questions for each context. Because of this, if we used the standard `DataLoader`, there is a high probability that we would find duplicate contexts in our batches.

![Screenshot from HuggingFace’s dataset viewer for the squad_v2 dataset. Each row represents a different question, but they all map to the same context.](https://cdn.sanity.io/images/vr8gru94/production/28bcc8edba06562fe20239c21679abb0fdd897c0-2554x1334.png)


Fortunately, there is an easy solution to this. _Sentence Transformers_ provides a set of modified data loaders. One of those is the `NoDuplicatesDataLoader`, which ensures our batches contain _no_ duplicates.

```python
from sentence_transformers import datasets

batch_size = 24

loader = datasets.NoDuplicatesDataLoader(
    train, batch_size=batch_size
)
```

With that, our training data is fully prepared, and we can move on to initializing and training our retriever model.

### Step Two: Initialize and Train

Before training our model, we need to initialize it. For this, we begin with a pretrained transformer model from the [HuggingFace model hub](https://huggingface.co/models). A popular choice for sentence transformers is Microsoft’s MPNet model, which we access via `microsoft/mpnet-base`.

There is one problem with our pretrained transformer model. It outputs many word/token-level vector embeddings. We don’t want token vectors; we need _sentence vectors_.

We need a way to transform the many token vectors output by the model into a _single_ sentence vector.

![Transformation of the many token vectors output by a transformer model into a single sentence vector.](https://cdn.sanity.io/images/vr8gru94/production/ffe72973d614384414e005820446942768827c9d-1920x860.png)


To perform this transformation, we add a _pooling layer_ to process the outputs of the transformer model. There are a few different pooling techniques; the one we will use is _mean pooling_, which takes the many token vectors output by the model and averages the activations across each vector dimension to create a single sentence vector.
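As a toy illustration of the idea in pure Python (ignoring the attention mask that real pooling layers use to skip padding tokens):

```python
# Toy mean pooling: element-wise average of token vectors into one sentence vector.
def mean_pool(token_vecs):
    dim = len(token_vecs[0])
    n = len(token_vecs)
    # average the activations across each vector dimension
    return [sum(vec[d] for vec in token_vecs) / n for d in range(dim)]

# three token vectors of dimension 2 -> one sentence vector of dimension 2
mean_pool([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])  # → [2.0, 5.0]
```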

We can do this via the `models` and `SentenceTransformer` utilities of the _Sentence Transformers_ library.

```python
from sentence_transformers import models, SentenceTransformer

bert = models.Transformer('microsoft/mpnet-base')
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[bert, pooler])

model
```

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

We have a `SentenceTransformer` object; a pretrained `microsoft/mpnet-base` model followed by a mean pooling layer.

With our model defined, we can initialize our MNR loss function.

```python
from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)
```

That is everything we need to fine-tune the model. We set the number of training epochs to `1`; more than this often leads to overfitting for sentence transformers. Another way to reduce the likelihood of overfitting is to add a learning rate warmup. Here, we warm up over the first 10% of our training steps (10% is a common default for warmup steps; if you find the model is overfitting, try increasing it).

```python
epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='mpnet-mnr-squad2',
    show_progress_bar=True
)
```

```
Iteration: 100%|██████████| 5429/5429 [28:42<00:00,  3.15it/s]
Epoch: 100%|██████████| 1/1 [28:42<00:00, 1722.72s/it]
```

We now have an ODQA retriever model saved to the local `./mpnet-mnr-squad2` directory. That’s great, but we have no idea how well the model performs, so our next step is to evaluate model performance.

## Retriever Evaluation

Evaluation of retriever models is slightly different from the evaluation of most language models. Typically, we input some text and calculate the error between clearly defined predicted and true values.

For information retrieval (IR), we need a metric that measures the rate of successful vs. unsuccessful retrievals. A popular metric for this is [mAP@K](https://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html). In short, this is an averaged precision value (the fraction of retrieved contexts that are relevant) computed over the top _K_ retrieved results.
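A minimal sketch of one common formulation of this metric (normalization conventions vary between implementations):

```python
def average_precision_at_k(retrieved, relevant, k):
    """AP@K for one query: precision accumulated at each relevant hit,
    normalized by the maximum possible number of hits within the top K."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(retrieved_per_query, relevant_per_query, k):
    """mAP@K: mean of AP@K over all queries."""
    aps = [
        average_precision_at_k(ret, rel, k)
        for ret, rel in zip(retrieved_per_query, relevant_per_query)
    ]
    return sum(aps) / len(aps)

# one relevant context at rank 1, another at rank 3 (K=3)
average_precision_at_k(['c1', 'c9', 'c2'], {'c1', 'c2'}, k=3)  # → (1/1 + 2/3) / 2 ≈ 0.833
```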

The setup for IR evaluation is a little more involved than with other evaluators in the _Sentence Transformers_ library. We will be using the `InformationRetrievalEvaluator`, and this requires three inputs:

- `ir_queries` is a dictionary mapping question IDs to question text
- `ir_corpus` maps context IDs to context text
- `ir_relevant_docs` maps question IDs to their relevant context IDs
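As a toy illustration of the shapes these three mappings take (the IDs and text below are invented; note that `ir_relevant_docs` maps each question ID to a _set_ of context IDs):

```python
# hypothetical IDs and text, for illustration only
ir_queries = {'q1': 'In what country is Normandy located?'}
ir_corpus = {'q1con': 'The Normans gave their name to Normandy, a region in France.'}
ir_relevant_docs = {'q1': {'q1con'}}  # set of relevant context IDs per question
```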

Before we initialize the evaluator, we need to download a new set of samples that our model has not seen before and format them into the three dictionaries above. We will use the SQuAD _validation_ set.

```python
squad_dev = load_dataset('squad_v2', split='validation')
squad_dev[0]
```

```
{'id': '56ddde6b9a695914005b9628',
 'title': 'Normans',
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France... it continued to evolve over the succeeding centuries.',
 'question': 'In what country is Normandy located?',
 'answers': {'text': ['France', 'France', 'France', 'France'],
  'answer_start': [159, 159, 159, 159]}}
```

To create the dictionary objects required by the `InformationRetrievalEvaluator`, we must assign unique IDs to both contexts and questions, and we need to ensure that duplicate contexts are _not_ assigned different IDs. To handle this, we will first convert our dataset object into a Pandas DataFrame.

```python
import pandas as pd

squad_df = pd.DataFrame()
for row in tqdm(squad_dev):
    squad_df = squad_df.append({
        'question': row['question'],
        'context': row['context'],
        'id': row['id']
    }, ignore_index=True)
squad_df.head()
```

```
100%|██████████| 11873/11873 [00:20<00:00, 576.84it/s]

                                             context  \
0  The Normans (Norman: Nourmands; French: Norman...   
1  The Normans (Norman: Nourmands; French: Norman...   
2  The Normans (Norman: Nourmands; French: Norman...   
3  The Normans (Norman: Nourmands; French: Norman...   
4  The Normans (Norman: Nourmands; French: Norman...   

                         id                                           question  
0  56ddde6b9a695914005b9628               In what country is Normandy located?  
1  56ddde6b9a695914005b9629                 When were the Normans in Normandy?  
2  56ddde6b9a695914005b962a      From which countries did the Norse originate?  
3  56ddde6b9a695914005b962b                          Who was the Norse leader?  
4  56ddde6b9a695914005b962c  What century did the Normans first gain their ...  
```

From here, we can quickly drop duplicate contexts with the `drop_duplicates` method. Now that the remaining contexts are unique, we can append `'con'` to each remaining ID, giving every context an ID distinct from any question ID.

```python
no_dupe = squad_df.drop_duplicates(
    subset='context',
    keep='first'
)
# also drop question column
no_dupe = no_dupe.drop(columns=['question'])
# and give each context a slightly unique ID
no_dupe['id'] = no_dupe['id'] + 'con'
no_dupe.head()
```

```
                                              context  \
0   The Normans (Norman: Nourmands; French: Norman...   
9   The Norman dynasty had a major political, cult...   
17  The English name "Normans" comes from the Fren...   
21  In the course of the 10th century, the initial...   
28  Before Rollo's arrival, its populations did no...   

                             id  
0   56ddde6b9a695914005b9628con  
9   56dddf4066d3e219004dad5fcon  
17  56dde0379a695914005b9636con  
21  56dde0ba66d3e219004dad75con  
28  56dde1d966d3e219004dad8dcon  
```
}
```

We now have unique question IDs in the `squad_df` dataframe and unique context IDs in the `no_dupe` dataframe. Next, we perform an inner join on the `context` feature to bring these two sets of IDs together and find our question ID to context ID mappings.

```json
{
  "_key": "5d31021658c0",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>context</th>\\n\",\n       \"      <th>id_x</th>\\n\",\n       \"      <th>question</th>\\n\",\n       \"      <th>id_y</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>The Normans (Norman: Nourmands; French: Norman...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628</td>\\n\",\n       \"      <td>In what country is Normandy located?</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628con</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>The Normans (Norman: Nourmands; French: Norman...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9629</td>\\n\",\n       \"      <td>When were the Normans in Normandy?</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628con</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>The Normans (Norman: Nourmands; French: Norman...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b962a</td>\\n\",\n       \"      <td>From which countries did the 
Norse originate?</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628con</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>The Normans (Norman: Nourmands; French: Norman...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b962b</td>\\n\",\n       \"      <td>Who was the Norse leader?</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628con</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>The Normans (Norman: Nourmands; French: Norman...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b962c</td>\\n\",\n       \"      <td>What century did the Normans first gain their ...</td>\\n\",\n       \"      <td>56ddde6b9a695914005b9628con</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                                             context  \\\\\\n\",\n       \"0  The Normans (Norman: Nourmands; French: Norman...   \\n\",\n       \"1  The Normans (Norman: Nourmands; French: Norman...   \\n\",\n       \"2  The Normans (Norman: Nourmands; French: Norman...   \\n\",\n       \"3  The Normans (Norman: Nourmands; French: Norman...   \\n\",\n       \"4  The Normans (Norman: Nourmands; French: Norman...   \\n\",\n       \"\\n\",\n       \"                       id_x  \\\\\\n\",\n       \"0  56ddde6b9a695914005b9628   \\n\",\n       \"1  56ddde6b9a695914005b9629   \\n\",\n       \"2  56ddde6b9a695914005b962a   \\n\",\n       \"3  56ddde6b9a695914005b962b   \\n\",\n       \"4  56ddde6b9a695914005b962c   \\n\",\n       \"\\n\",\n       \"                                            question  \\\\\\n\",\n       \"0               In what country is Normandy located?   \\n\",\n       \"1                 When were the Normans in Normandy?   \\n\",\n       \"2      From which countries did the Norse originate?   
\\n\",\n       \"3                          Who was the Norse leader?   \\n\",\n       \"4  What century did the Normans first gain their ...   \\n\",\n       \"\\n\",\n       \"                          id_y  \\n\",\n       \"0  56ddde6b9a695914005b9628con  \\n\",\n       \"1  56ddde6b9a695914005b9628con  \\n\",\n       \"2  56ddde6b9a695914005b9628con  \\n\",\n       \"3  56ddde6b9a695914005b9628con  \\n\",\n       \"4  56ddde6b9a695914005b9628con  \"\n      ]\n     },\n     \"execution_count\": 14,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"squad_df = squad_df.merge(no_dupe, how='inner', on='context')\\n\",\n    \"squad_df.head()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
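Conceptually, the inner join attaches each question's ID to the ID of its deduplicated context by matching on the shared `context` text. A minimal pure-Python sketch of the same logic (the sample rows here are hypothetical stand-ins, not the real SQuAD data):

```python
# hypothetical rows standing in for squad_df and no_dupe
questions = [
    {"id": "q1", "context": "The Normans...", "question": "Where is Normandy?"},
    {"id": "q2", "context": "The Normans...", "question": "Who led the Norse?"},
]
contexts = [{"id": "q1con", "context": "The Normans..."}]

# index contexts by their text, then join on the shared 'context' field
ctx_lookup = {c["context"]: c["id"] for c in contexts}
merged = [
    {**q, "context_id": ctx_lookup[q["context"]]}
    for q in questions
    if q["context"] in ctx_lookup  # inner join: keep matching rows only
]
```

Because both questions share the same context text, both rows survive the join and point at the same context ID, exactly as in the pandas output above.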

We’re now ready to build the three mapping dictionaries for the `InformationRetrievalEvaluator`. First, we map question/context IDs to questions/contexts.

```json
{
  "_key": "ccf3b01d8913",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'56ddde6b9a695914005b9628': 'In what country is Normandy located?',\\n\",\n       \" '56ddde6b9a695914005b9629': 'When were the Normans in Normandy?',\\n\",\n       \" '56ddde6b9a695914005b962a': 'From which countries did the Norse originate?',\\n\",\n       \" ...}\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ir_queries = {\\n\",\n    \"    row['id_x']: row['question'] for i, row in squad_df.iterrows()\\n\",\n    \"}\\n\",\n    \"ir_queries\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'56ddde6b9a695914005b9628con': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people... to evolve over the succeeding centuries.',\\n\",\n       \" '56dddf4066d3e219004dad5fcon': 'The Norman dynasty had a major political, cultural and military impact on medieval... north Africa and the Canary Islands.',\\n\",\n       \" '56dde0379a695914005b9636con': 'The English name \\\"Normans\\\" comes from the French words Normans/Normanz, plural of... 
9th century) to mean \\\"Norseman, Viking\\\".',\\n\",\n       \" ...}\"\n      ]\n     },\n     \"execution_count\": 17,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ir_corpus = {\\n\",\n    \"    row['id_y']: row['context'] for i, row in squad_df.iterrows()\\n\",\n    \"}\\n\",\n    \"ir_corpus\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

And then map question IDs to a _set_ of relevant context IDs. For the SQuAD data, we only have _many-to-one_ or _one-to-one_ question ID to context ID mappings, but we will write our code to _additionally_ handle _one-to-many_ mappings (so we can handle other, non-SQuAD datasets).

```json
{
  "_key": "553fb8be7c02",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'56ddde6b9a695914005b9628': {'56ddde6b9a695914005b9628con'},\\n\",\n       \" '56ddde6b9a695914005b9629': {'56ddde6b9a695914005b9628con'},\\n\",\n       \" '56ddde6b9a695914005b962a': {'56ddde6b9a695914005b9628con'},\\n\",\n       \" ...}\"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ir_relevant_docs = {key: [] for key in squad_df['id_x'].unique()}\\n\",\n    \"for i, row in squad_df.iterrows():\\n\",\n    \"    # we append in the case of a question ID being connected to\\n\",\n    \"    # multiple context IDs\\n\",\n    \"    ir_relevant_docs[row['id_x']].append(row['id_y'])\\n\",\n    \"# this must be in format {question_id: {set of context_ids}}\\n\",\n    \"ir_relevant_docs = {key: set(values) for key, values in ir_relevant_docs.items()}\\n\",\n    \"ir_relevant_docs\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
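The grouping step above can also be sketched without pandas. Because `ir_relevant_docs` maps each question ID to a *set* of context IDs, one-to-many mappings are handled for free (the sample rows below are hypothetical):

```python
# hypothetical merged rows: (question_id, question, context_id, context)
rows = [
    ("q1", "Where is Normandy?", "c1", "The Normans..."),
    ("q2", "Who led the Norse?", "c1", "The Normans..."),
]

# map question IDs to questions, and context IDs to contexts
ir_queries = {qid: question for qid, question, _, _ in rows}
ir_corpus = {cid: context for _, _, cid, context in rows}

# map each question ID to a set of relevant context IDs
ir_relevant_docs = {}
for qid, _, cid, _ in rows:
    ir_relevant_docs.setdefault(qid, set()).add(cid)
```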

Our evaluator inputs are ready, so we initialize the evaluator and then evaluate our `model`.

```json
{
  "_key": "3521e4cdce24",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ir_eval = InformationRetrievalEvaluator(\\n\",\n    \"    ir_queries, ir_corpus, ir_relevant_docs\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"0.7414982703270794\"\n      ]\n     },\n     \"execution_count\": 20,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ir_eval(model)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We return a mAP@K score of 0.74, where @K is _100_ by default. This performance is comparable to other state-of-the-art retriever models. Performing the same evaluation with the pretrained `multi-qa-mpnet-base-cos-v1` model returns a mAP@K score of 0.76, just two percentage points higher than our custom model.
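As a quick refresher, mAP@K averages the precision at each rank where a relevant document appears (within the top K results), then averages that over all queries. A toy sketch of the per-query average precision, which the evaluator computes internally alongside other metrics:

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """Mean of precision@rank over the ranks holding relevant hits."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(relevant_ids), k) if relevant_ids else 0.0

# the single relevant context retrieved at rank 1 -> a perfect score of 1.0
score = average_precision_at_k(["c1", "c2", "c3"], {"c1"})
```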

```json
{
  "_key": "a2623d140002",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"0.7610692295248334\"\n      ]\n     },\n     \"execution_count\": 21,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"qa = SentenceTransformer('multi-qa-mpnet-base-cos-v1')\\n\",\n    \"\\n\",\n    \"ir_eval(qa)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Of course, if your target domain was SQuAD data, the pretrained `multi-qa-mpnet-base-cos-v1` model would be the better model. But if you have your own unique dataset and domain, a custom model fine-tuned on that domain will _very likely_ outperform existing models like `multi-qa-mpnet-base-cos-v1` _in that domain_.

## Storing the Vectors

We have our retriever model, we’ve evaluated it, and we’re happy with its performance. But we haven’t yet seen how to actually use it.

When you perform a Google search, Google does _not_ look at the whole internet, encode all of that information into vector embeddings, and then compare all of those vectors to your query vector. We would be waiting a _very_ long time to return results if that were the case.

Instead, Google has already searched for, collected, and encoded all of that data. Google then stores those encoded vectors in some sort of vector database. When you query now, the only thing Google needs to encode is your question.

Taking this a step further, comparing your query vector to _all_ vectors indexed by Google (which represent the entire Google-accessible internet) would still take an incredibly long time. We refer to this accurate but inefficient comparison of every single vector as an _exhaustive search_.

For big datasets, an exhaustive search is too slow. The solution to this is to perform an _approximate search_. An approximate search allows us to massively reduce our search scope to a smaller but (hopefully) more relevant sub-section of the index, making our search times much more manageable.
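To make the distinction concrete, here is what an exhaustive search does: score the query against *every* stored vector, then sort. Approximate indexes exist precisely to avoid this full scan. A toy sketch with tiny hypothetical vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

index = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
query = [1.0, 0.1]

# exhaustive search: compare the query against every vector in the index
ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
```

With millions of vectors, this linear scan dominates query latency, which is exactly the cost an approximate index trades a small amount of recall to avoid.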

The [Pinecone vector database](https://www.pinecone.io/) is a straightforward and robust solution that allows us to (1) store our context vectors and (2) perform an _accurate and fast_ approximate search. These are the two components we need for an effective ODQA pipeline.

Again, we need to work through a few steps to set up our vector database.

![Steps from retriever and context preparation (top-right) that allow us to encode contexts into context vectors. After initializing a vector database index, we can populate the index with the context vectors.](https://cdn.sanity.io/images/vr8gru94/production/39aad764e0d6a87a31ae323ab3bc1fa396064b48-1920x1080.png)


After working through each of those steps, we will be ready to begin retrieving relevant contexts.

### Encoding Contexts

We have already created our retriever model, and during the earlier evaluation step, we downloaded the SQuAD validation data. We can use this same validation data and encode all _unique_ contexts.

```json
{
  "_key": "cd0fe690cd71",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'title', 'context', 'question', 'answers'],\\n\",\n       \"    num_rows: 1204\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"unique_contexts = []\\n\",\n    \"unique_ids = []\\n\",\n    \"\\n\",\n    \"# make list of IDs that represent only first instance of\\n\",\n    \"# each context\\n\",\n    \"for row in squad_dev:\\n\",\n    \"    if row['context'] not in unique_contexts:\\n\",\n    \"        unique_contexts.append(row['context'])\\n\",\n    \"        unique_ids.append(row['id'])\\n\",\n    \"\\n\",\n    \"# now filter out any samples that aren't included in unique IDs\\n\",\n    \"squad_dev = squad_dev.filter(lambda x: True if x['id'] in unique_ids else False)\\n\",\n    \"squad_dev\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 301/301 [20:18<00:00,  4.05s/ba]\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['answers', 'context', 'encoding', 'id', 'question', 'title'],\\n\",\n       \"    num_rows: 1204\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# and now encode the unique contexts\\n\",\n    \"squad_dev = squad_dev.map(lambda x: {\\n\",\n    \"    'encoding': model.encode(x['context']).tolist()\\n\",\n    \"}, batched=True, batch_size=4)\\n\",\n    \"squad_dev\"\n   ]\n  }\n ],\n \"metadata\": {\n  
\"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

After removing duplicate contexts, we’re left with 1,204 samples. It is a tiny dataset but large enough for our example.
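The deduplication pattern used above, keeping only the first sample for each distinct context, can be sketched without the `datasets` library (the rows below are hypothetical):

```python
rows = [
    {"id": "q1", "context": "The Normans..."},
    {"id": "q2", "context": "The Normans..."},  # duplicate context, dropped
    {"id": "q3", "context": "A function problem..."},
]

seen = set()
unique_rows = []
for row in rows:
    # keep only the first row carrying each distinct context
    if row["context"] not in seen:
        seen.add(row["context"])
        unique_rows.append(row)
```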

### Initializing the Index

Before adding the context vectors to our index, we need to initialize it. Fortunately, Pinecone makes this very easy. We start by installing the Pinecone client if required:

`!pip install pinecone-client`

Then we initialize a connection to Pinecone. For this, we need a [free API key](https://app.pinecone.io/).

```json
{
  "_key": "f12df2968af0",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pinecone\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"API_KEY = 'YOUR_API_KEY'\\n\",\n    \"\\n\",\n    \"pinecone.init(api_key=API_KEY, environment='YOUR_ENV')\\n\",\n    \"# (find env next to API key in console\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We then create a new index with `pinecone.create_index`. Before creating the index, we should check that the index name does not already exist (which it will not if this is your first time creating the index).

```json
{
  "_key": "c5b007047fa5",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# check if index already exists, if not we create it\\n\",\n    \"if 'squad-index' not in pinecone.list_indexes():\\n\",\n    \"    pinecone.create_index(\\n\",\n    \"        name='squad-index', dimension=model.get_sentence_embedding_dimension(), metric='cosine'\\n\",\n    \"    )\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"768\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# we use this to get required index dims\\n\",\n    \"model.get_sentence_embedding_dimension()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

When creating a new index, we need to specify the index `name` and the dimensionality of the vectors to be added. We can either check our encoded context vectors’ dimensions directly or read the dimension from the retriever model with `get_sentence_embedding_dimension` (as shown above).

### Populating the Index

After creating both our index and the context vectors, we can go ahead and _upsert_ (insert, or update if already present) the vectors into our index.

```json
{
  "_key": "513df2fbf78a",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# initialize connection to the new index\\n\",\n    \"index = pinecone.Index('squad-index')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 25/25 [00:13<00:00,  1.91it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from tqdm.auto import tqdm  # progress bar\\n\",\n    \"\\n\",\n    \"upserts = [(v['id'], v['encoding'], {'text': v['context']}) for v in squad_dev]\\n\",\n    \"# now upsert in chunks\\n\",\n    \"for i in tqdm(range(0, len(upserts), 50)):\\n\",\n    \"    i_end = i + 50\\n\",\n    \"    if i_end > len(upserts): i_end = len(upserts)\\n\",\n    \"    index.upsert(vectors=upserts[i:i_end])\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Pinecone expects us to [upsert data](https://www.pinecone.io/docs/insert-data/) in the format:

```python
vectors = [
    (id_0, vector_0, metadata_0),
    (id_1, vector_1, metadata_1)
]
```

Our IDs are the unique alphanumeric identifiers that we saw earlier in the SQuAD data. The vectors are our encoded context vectors formatted as lists; the metadata is a dictionary that allows us to store extra information in a key-value format.
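The batching loop in the notebook above boils down to slicing the upsert list into fixed-size chunks so each request stays small. A standalone sketch of that chunking (the batch size of 50 mirrors the notebook):

```python
def chunked(items, size=50):
    """Yield successive fixed-size slices; the last chunk may be smaller."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. 120 (id, vector, metadata) tuples -> batches of 50, 50, and 20
batches = list(chunked(list(range(120))))
sizes = [len(b) for b in batches]
```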

---

_Using the metadata field, Pinecone allows us to_ [create complex or straightforward metadata filters](https://www.pinecone.io/learn/vector-search-filtering/) _to target our search scope to specific numeric ranges, categories, and more._

---

Once the upsert is complete, the retrieval components of our ODQA pipeline are ready to go, and we can begin asking questions.

## Making Queries

With everything set up, querying our retriever-vector database pipeline is pretty straightforward. We first define a question and encode it as we did for our context vectors before.

```json
{
  "_key": "7c6c05f29abe",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"query = \\\"When were the Normans in Normandy?\\\"\\n\",\n    \"xq = model.encode([query]).tolist()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

After creating our query vector, we pass it to Pinecone via the `index.query` method, specify how many results we’d like to return with `top_k`, and set `include_metadata=True` so that we can see the text associated with each returned vector.

```json
{
  "_key": "df91b975cf25",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'results': [{'matches': [{'id': '56dddf4066d3e219004dad5f',\\n\",\n       \"    'metadata': {'text': 'The Norman dynasty had a '\\n\",\n       \"        'major political, cultural and '\\n\",\n       \"        'military impact on medieval '\\n\",\n       \"        'Europe and even the Near '\\n\",\n       \"        'East. The Normans were famed '\\n\",\n       \"        'for their martial spirit and '\\n\",\n       \"        'eventually for their '\\n\",\n       \"        '...'\\n\",\n       \"        'Ireland, and to the coasts of '\\n\",\n       \"        'north Africa and the Canary '\\n\",\n       \"        'Islands.'},\\n\",\n       \"    'score': 0.678345382,\\n\",\n       \"    'values': []},\\n\",\n       \"    {'id': '56ddde6b9a695914005b9628',\\n\",\n       \"    'metadata': {'text': 'The Normans (Norman: '\\n\",\n       \"        'Nourmands; French: Normands; '\\n\",\n       \"        'Latin: Normanni) were the '\\n\",\n       \"        'people who in the 10th and '\\n\",\n       \"        '11th centuries gave their '\\n\",\n       \"        'name to Normandy, a region in '\\n\",\n       \"        'France. 
They were descended '\\n\",\n       \"        '...'\\n\",\n       \"        'of the 10th century, and it '\\n\",\n       \"        'continued to evolve over the '\\n\",\n       \"        'succeeding centuries.'},\\n\",\n       \"    'score': 0.667023182,\\n\",\n       \"    'values': []}],\\n\",\n       \"'namespace': ''}]}\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"xc = index.query(xq, top_k=2, include_metadata=True)\\n\",\n    \"xc\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We return the correct context as the second-ranked result in this example. The first result is relevant in the context of Normans and Normandy, but it does not answer the specific question of _when_ the Normans were in Normandy.

Let’s try a couple more questions.

```python
xq = model.encode([
    "How many outputs are expected for each input in a function problem?"
]).tolist()
index.query(xq, top_k=5, include_metadata=True)
```

```
{'results': [{'matches': [{'id': '56e19724cd28a01900c679f6',
     'metadata': {'text': 'A function problem is a computational problem '
                          'where a single output (of a total function) is '
                          'expected for every input, but the output ... '
                          'and the integer factorization problem.'},
     'score': 0.7924698,
     'values': []},
    {'id': '56e17a7ccd28a01900c679a1',
     'metadata': {'text': 'A computational problem can be viewed as an '
                          'infinite ...'},
     'score': 0.662115633,
     'values': []},
    {'id': '56e1a0dccd28a01900c67a2e',
     'metadata': {'text': 'It is tempting to think that the notion of '
                          'function ...'},
     'score': 0.615972638,
     'values': []},
    {'id': '56e19557e3433e1400422fee',
     'metadata': {'text': 'An example of a decision problem is the '
                          'following. The ...'},
     'score': 0.599050403,
     'values': []},
    {'id': '56e190bce3433e1400422fc8',
     'metadata': {'text': 'Decision problems are one of the central '
                          'objects of study ...'},
     'score': 0.593822241,
     'values': []}],
   'namespace': ''}]}
```

For this question, the correct context is returned as the top result, with a much higher score than the remaining matches.
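That separation can be made concrete by measuring the gap between the top two scores. The helper below is a hypothetical illustration, not part of the original notebooks, applied to the scores returned for the question above:

```python
def top_margin(matches):
    """Gap between the best and second-best match scores."""
    scores = sorted((m["score"] for m in matches), reverse=True)
    return scores[0] - scores[1]

# Scores from the function-problem query above.
matches = [
    {"id": "56e19724cd28a01900c679f6", "score": 0.7924698},
    {"id": "56e17a7ccd28a01900c679a1", "score": 0.662115633},
    {"id": "56e1a0dccd28a01900c67a2e", "score": 0.615972638},
]
print(round(top_margin(matches), 3))  # ~0.13
```

A larger margin between the correct context and the runner-up is a useful quick signal that the retriever is confident, not just lucky.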

```python
xq = model.encode([
    "Who used Islamic, Lombard, etc construction techniques in the Mediterranean?"
]).tolist()
index.query(xq, top_k=5, include_metadata=True)
```

```
{'results': [{'matches': [{'id': '56de4b074396321400ee2793',
     'metadata': {'text': 'In England, the period of ... the Early '
                          'Gothic. In southern Italy, the Normans '
                          'incorporated elements of Islamic, Lombard, '
                          'and Byzantine building techniques ...'},
     'score': 0.604390621,
     'values': []},
    {'id': '56de51244396321400ee27ef',
     'metadata': {'text': 'In Britain, Norman art primarily survives '
                          'as ...'},
     'score': 0.487686485,
     'values': []},
    {'id': '56de4a89cffd8e1900b4b7bd',
     'metadata': {'text': 'Norman architecture typically stands out as '
                          'a new stage in ...'},
     'score': 0.451720327,
     'values': []},
    {'id': '56de4b5c4396321400ee2799',
     'metadata': {'text': 'In the visual arts, the Normans did not '
                          'have the rich ...'},
     'score': 0.343783677,
     'values': []},
    {'id': '57287c142ca10214002da3d0',
     'metadata': {'text': 'The Yuan undertook extensive public works. '
                          'Among Kublai ...'},
     'score': 0.335578799,
     'values': []}],
   'namespace': ''}]}
```

We return the correct context in the first position. Again, there is good separation between the score of the correct context and the scores of the other contexts.
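Every query above follows the same pattern: encode the question, then search the index. That pattern can be wrapped in a small helper. This is a hypothetical convenience function, not part of the original notebooks; it assumes `model` is the fine-tuned sentence-transformers retriever and `index` is the Pinecone index built earlier, and that `index.query` returns the `{'results': [{'matches': [...]}]}` structure shown above.

```python
def retrieve(question, model, index, top_k=5):
    """Encode a question with the retriever, search the vector index,
    and return the top_k (score, context) pairs."""
    xq = model.encode([question]).tolist()
    res = index.query(xq, top_k=top_k, include_metadata=True)
    return [
        (match["score"], match["metadata"]["text"])
        for match in res["results"][0]["matches"]
    ]
```

With this in place, each query reduces to a single call, e.g. `retrieve("Who used Islamic, Lombard, etc construction techniques in the Mediterranean?", model, index)`.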

That’s it for this guide to fine-tuning and implementing a custom retriever model in an ODQA pipeline. We can now implement two of the most crucial components of ODQA, the retriever model and the vector index it searches, enabling a more human and natural approach to information retrieval.

One of the most incredible things about ODQA is how widely applicable it is. Organizations across almost every industry have the opportunity to benefit from more intelligent and efficient information retrieval.

Any organization that handles unstructured information such as Word documents, PDFs, emails, and more has a clear use case: freeing this information and enabling easy, natural access through QA systems.

Although this is the most apparent use case, there are many more, whether it be an internal efficiency speedup or a key component in a product (as with Google search). The opportunities are both broad and highly impactful.

## References

[1] N. Reimers, [Neural Search for Low Resource Scenarios](https://www.youtube.com/watch?v=XNJThigyvos) (2021), YouTube

[2] S. Sawtelle, [Mean Average Precision (MAP) For Recommender Systems](https://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html) (2016), GitHub