# Reader Models for Open Domain Question-Answering

Open-domain question-answering (ODQA) is a wildly popular _pipeline_ of databases and language models that allows us to ask a machine human-like questions and get back comprehensible and even intelligent answers.

Despite the outward guise of simplicity, ODQA requires a reasonably advanced set of components placed together to enable the _extractive_ Q&A functionality.

We call this _extractive_ Q&A because the models are not generating an answer. Instead, the answer already exists but is hidden somewhere within potentially thousands, millions, or even more data sources.

By enabling extractive Q&A, we enable a more _intelligent_ and _efficient_ way to retrieve information from what can be massive stores of data.

[Video](https://www.youtube.com/watch?v=-fzCSPsfMic)


ODQA relies on three components: a vector database that stores encoded vector representations of the data we will search, a retriever that handles context and question encoding, and a reader model that consumes the relevant _retrieved_ contexts and identifies a shorter, more specific answer.

The reader is the final act in an ODQA pipeline; it takes the contexts returned by the vector database and retriever components and _reads_ them. Our reader will then return what it believes to be the _specific answer_ to our question.

To be exact, we don’t get the ‘specific answer’. The model is reading _input IDs_, which are integers representing words or subwords. So, rather than returning a human-readable text answer, it actually returns a _span_ of input ID positions.

![Question (grey), context (cyan), and answer (blue). The model doesn’t read the strings. It reads token IDs, and so when outputting a prediction for the answer, it outputs a span of token IDs that it believes represent the answer.](https://cdn.sanity.io/images/vr8gru94/production/7e83834ab5f8297b5d22762fa2eaa72d72800224-1920x860.png)
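To make the idea of a predicted span concrete, here is a minimal sketch using a made-up vocabulary. The token IDs and the `decode_span` helper are illustrative assumptions, not the real BERT vocabulary or a library function. The point is only that the reader's output is a pair of positions, and turning it into text is a lookup over the tokens in between (the end position is treated as inclusive here).

```python
# Toy vocabulary: these IDs are invented for illustration only.
id_to_token = {
    101: "[CLS]", 102: "[SEP]", 7: "when", 8: "was", 9: "it", 10: "built",
    11: "?", 20: "the", 21: "bridge", 22: "opened", 23: "in", 24: "1932",
}

# Layout: [CLS] question [SEP] context [SEP]
input_ids = [101, 7, 8, 9, 10, 11, 102, 20, 21, 22, 23, 24, 102]


def decode_span(input_ids, start, end):
    """Return the text for token positions start..end (inclusive)."""
    return " ".join(id_to_token[i] for i in input_ids[start:end + 1])


# Suppose the reader predicts the span (10, 11).
answer = decode_span(input_ids, 10, 11)
print(answer)  # in 1932
```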


To fine-tune a model, we need two inputs and two labels. The inputs are the question and a relevant context, and the labels are the answer’s start and end positions.

![Inputs (cyan) and target labels (answer start and end positions, blue). The start and end positions are the token positions from the encoded question-context input_ids tensor that represent the start and end of the answer (extracted from the context).](https://cdn.sanity.io/images/vr8gru94/production/0ace6654fbbb4aa018a29849d99a602687749b66-1920x980.png)


There isn’t much more to fine-tuning a reader model. It’s a relatively straightforward process. The most complex part is pre-processing the training data.

With our overview complete, let’s dive into the details and work through an actual training example.

## Implementation

There are more steps when training a reader model than just _train the model_. As mentioned, these other steps can prove to be the tricky part. In our case, we have three distinct steps.

1. Download and pre-process Q&A dataset
2. Fine-tune the model
3. Evaluation

Without any further ado, let’s begin with the data.

### Download and Pre-process

We will be using the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD), specifically version 2.0 (`squad_v2`), for fine-tuning. We can download it with HuggingFace _Datasets_.

```python
from datasets import load_dataset

squad = load_dataset('squad_v2', split='train')
squad
# Output:
# Dataset({
#     features: ['id', 'title', 'context', 'question', 'answers'],
#     num_rows: 130319
# })
```

Looking at this, we have _five_ features, of which we only care about `question` and `context` for the inputs, and `answers` for the labels.

```python
squad[0]
# Output:
# {'id': '56be85543aeaaa14008c9063',
#  'title': 'Beyoncé',
#  'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer...',
#  'question': 'When did Beyonce start becoming popular?',
#  'answers': {'answer_start': 269,
#   'text': 'in the late 1990s'}}
```

We must make a few transformations to format the `answers` into the start and end token ID positions we need. We have `answer_start`, but this gives us the position within the context _string_ at which the answer begins. That character position is not what the model needs. Instead, it requires the start position as a token ID index.

```python
# we can get the end position of the answer
squad[0]['answers']['answer_start'][0] + len(squad[0]['answers']['text'][0])
# Output: 286

squad[0]['context'][269:286]
# this works, but only for strings, not for the token IDs that we need for BERT
# Output: 'in the late 1990s'
```

That is our main hurdle. To push through it, we will take three steps:

1. Tokenize the context.
2. Convert `answer_start` to a token ID index.
3. Find the end token index using the starting position and answer `text`.
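The three steps above can be sketched without any library. This is a simplified illustration under stated assumptions: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for BERT's subword tokenizer, and `char_span_to_token_span` is an illustrative helper, not the function used later in this article. The mechanics are the same, though: each token carries its character span, and we look for the tokens whose spans contain the answer's first and last characters.

```python
import re


def toy_tokenize(text):
    """Split on whitespace and record (char_start, char_end) per token.

    A stand-in for a real tokenizer's offset mapping; real subword
    tokenizers split text differently.
    """
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, char_start, char_end):
    """Map the character span [char_start, char_end) to inclusive token indexes."""
    token_start = token_end = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if token_start is None and tok_start <= char_start < tok_end:
            token_start = i
        if tok_start < char_end <= tok_end:
            token_end = i
    return token_start, token_end


context = "she rose to fame in the late 1990s as lead singer"
answer = "in the late 1990s"

# Step 1: tokenize the context, keeping character offsets.
tokens = toy_tokenize(context)
offsets = [span for _, span in tokens]

# Steps 2 and 3: character start, then character end from start + answer text.
char_start = context.index(answer)
char_end = char_start + len(answer)
token_start, token_end = char_span_to_token_span(offsets, char_start, char_end)

print(token_start, token_end)
print([tok for tok, _ in tokens[token_start:token_end + 1]])
```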

Starting with **tokenize the context**, we first initialize a tokenizer using the HuggingFace _Transformers_ library.

```python
from transformers import BertTokenizerFast
# initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# tokenize our question-context pairs
squad = squad.map(lambda x: tokenizer(
    x['question'], x['context'], max_length=384,
    padding='max_length', truncation=True,
    return_offsets_mapping=True
))
```

Then we tokenize our question-context pairs, and this returns _three_ tensors by default:

- `input_ids`, the token ID representation of our text.
- `attention_mask`, a list of values telling our model whether to apply the attention mechanism to the respective token embeddings (`1`) or to ignore padding token positions (`0`).
- `token_type_ids` indicates sentence A (the question) with the first set of `0` values, sentence B (the context) with `1` values, and remaining padding tokens with the trailing `0` values.

We have added another tensor called `offset_mapping` by setting `return_offsets_mapping=True`. This tensor is very important for finding our label values for training our model.
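As a rough picture of how these four fields line up, the sketch below builds them by hand for a tiny question-context pair. Everything here is an assumption made for illustration: `toy_encode_pair` is a hypothetical helper, tokens are split on whitespace and kept as strings rather than IDs, and the sequence length is far shorter than the `384` used above. Note that special and padding tokens get the empty span `(0, 0)`, and that offsets restart from zero for the context.

```python
import re


def spans(text):
    """Whitespace tokens with their (char_start, char_end) spans."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def toy_encode_pair(question, context, max_length):
    """Build BERT-style fields for one question-context pair."""
    q, c = spans(question), spans(context)
    tokens = ["[CLS]"] + [t for t, _ in q] + ["[SEP]"] + [t for t, _ in c] + ["[SEP]"]
    # Special tokens map to (0, 0); offsets are relative to each text separately.
    offsets = [(0, 0)] + [s for _, s in q] + [(0, 0)] + [s for _, s in c] + [(0, 0)]
    token_type_ids = [0] * (len(q) + 2) + [1] * (len(c) + 1)
    attention_mask = [1] * len(tokens)

    n_pad = max_length - len(tokens)
    if n_pad < 0:
        raise ValueError("max_length is too small for this toy example")
    return {
        "tokens": tokens + ["[PAD]"] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "offset_mapping": offsets + [(0, 0)] * n_pad,
    }


enc = toy_encode_pair("who sang it ?", "beyonce sang it", max_length=12)
for name, values in enc.items():
    print(name, values)
```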

Earlier, we found the start and end _character_ positions of the answer within our _context_ string. As mentioned, we cannot use these directly. We need the token positions, and the `offset_mapping` tensor is essential for finding them.

```python
tokenizer.decode(squad[0]['input_ids'])
# Output:
# '[CLS] when did beyonce start becoming popular? [SEP] beyonce giselle knowles - carter ( / biːˈjɒnseɪ / bee - yon - say ) ( born september 4, 1981 ) is an american singer... singles " crazy in love " and " baby boy ". [SEP] [PAD] [PAD] [PAD] [PAD]...'
```

Another consideration when finding the token position is that we tokenized both the question _and_ the context. As shown above, the result follows the format `[CLS] question [SEP] context [SEP] padding`. To find the answer start and end positions, we must shift the values by the length of the question segment.

To find the question and context segment lengths, we use the `token_type_ids` tensor.

```python
question_len = 0
# get question length by identifying where 0 tokens first stop
for x in squad[0]['token_type_ids']:
    if x != 1:
        question_len += 1
    else:
        break
# context is represented by 1s, so we take a sum to get context len
context_len = sum(squad[0]['token_type_ids'])
question_len, context_len
# Output: (9, 165)
```

We also need to handle two edge cases: the answer has been truncated, or it never existed (some records have no answer). In both of these scenarios, we set the start and end positions to `0`.
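A small sketch of that fallback rule, kept separate from the tokenizer so it can be run on its own. The helper names and the `last_context_char` parameter are hypothetical; in practice that value would come from the last context entry of the offset mapping. The sketch relies on SQuAD v2 storing `text` and `answer_start` as lists, which are empty for unanswerable questions.

```python
def has_usable_answer(answers, last_context_char):
    """True if an answer exists and ends inside the (possibly truncated) context."""
    if len(answers["text"]) == 0:       # unanswerable record: empty lists
        return False
    end = answers["answer_start"][0] + len(answers["text"][0])
    return end <= last_context_char     # False when truncation cut the answer off


def labels_with_fallback(answers, last_context_char, token_span):
    """Keep the token span when the answer is usable, else fall back to (0, 0)."""
    if has_usable_answer(answers, last_context_char):
        return token_span
    return (0, 0)                       # position 0 is the [CLS] token


answerable = {"text": ["in the late 1990s"], "answer_start": [269]}
unanswerable = {"text": [], "answer_start": []}

kept = labels_with_fallback(answerable, last_context_char=700, token_span=(75, 79))
truncated = labels_with_fallback(answerable, last_context_char=280, token_span=(75, 79))
missing = labels_with_fallback(unanswerable, last_context_char=700, token_span=None)
print(kept, truncated, missing)
```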

```python
def char_to_id(sample):
    # assumes `answers` holds scalar character offsets, including an
    # `answer_end` value (as in the output further below)
    char_start = sample['answers']['answer_start']
    char_end = sample['answers']['answer_end']
    # find the question length
    question_len = 0
    for x in sample['token_type_ids']:
        if x != 1:
            question_len += 1
        else:
            break
    # and get the context length
    context_len = sum(sample['token_type_ids'])
    # get offset mappings for context segment
    context_mappings = sample['offset_mapping'][question_len:][:context_len-1]
    for i, mapping in enumerate(context_mappings):
        if char_start >= mapping[0] and char_start <= mapping[1]:
            token_start = question_len + i
        if char_end >= mapping[0] and char_end <= mapping[1]:
            token_end = question_len + i + 1
            return {'start_positions': token_start, 'end_positions': token_end}
        if i == len(context_mappings) - 1:
            # this means the answer tokens are out of range, eg have been truncated
            # and therefore there is no answer
            token_start, token_end = 0, 0
            return {'start_positions': token_start, 'end_positions': token_end}

squad = squad.map(lambda x: char_to_id(x))
# 100%|██████████| 130319/130319 [02:56<00:00, 737.32ex/s]

squad[0]
# Output:
# {'id': '56be85543aeaaa14008c9063',
#  'title': 'Beyoncé',
#  'context': 'Beyoncé Giselle Knowles-Carter... singles "Crazy in Love" and "Baby Boy".',
#  'question': 'When did Beyonce start becoming popular?',
#  'answers': {'answer_end': 286,
#   'answer_start': 269,
#   'text': 'in the late 1990s'},
#  'input_ids': [...],
#  'token_type_ids': [...],
#  'attention_mask': [...],
#  'offset_mapping': [...],
#  'start_positions': 75,
#  'end_positions': 79}
```

Once we have the start and end positions, we need to define how we will load the dataset into our model for training. At the moment, our dataset will return lists of dictionaries for each training batch.

We cannot feed lists of dictionaries into our model. Instead, we need to pull these dictionaries into single batch-size tensors. For that, we use the `default_data_collator` function.
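Conceptually, collation turns a list of per-sample dictionaries into one dictionary of per-field batches. The sketch below shows that reshaping with plain Python lists. `toy_collate` is a hypothetical stand-in written for illustration: the real `default_data_collator` also converts each field into a tensor, which is skipped here.

```python
def toy_collate(batch):
    """Turn a list of per-sample dicts into one dict of per-field lists."""
    keys = batch[0].keys()
    return {key: [sample[key] for sample in batch] for key in keys}


# Two already-tokenized samples with invented values.
batch = [
    {"input_ids": [101, 7, 102, 20, 102], "start_positions": 3, "end_positions": 3},
    {"input_ids": [101, 8, 102, 21, 102], "start_positions": 0, "end_positions": 0},
]

collated = toy_collate(batch)
print(collated["input_ids"])        # one row per sample
print(collated["start_positions"])  # one label per sample
```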

```python
# remove all unnecessary columns (only need input_ids, attention_mask,
# token_type_ids, start_positions, end_positions)
squad = squad.remove_columns(['id', 'title', 'context', 'question', 'answers', 'offset_mapping'])
squad
# Output:
# Dataset({
#     features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
#     num_rows: 130319
# })

from transformers import default_data_collator
# prepare format of data being fed into model
data_collator = default_data_collator
```

We don’t need to do anything else with our dataset or data collator for now, so we move on to the next step of fine-tuning.

### Fine-tuning the Model

We will be fine-tuning the model using the HuggingFace _Transformers_ `Trainer` class. To use this, we first need a model to fine-tune, which we load as usual with Transformers.

```python
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
```

Next, we set up the `Trainer` training parameters.

```python
from transformers import TrainingArguments

batch_size = 24
epochs = 3

args = TrainingArguments(
    'bert-base-uncased-squad2',
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.1,
    warmup_steps=int(len(squad)*epochs*0.1)
)
```

We use tried and tested training parameters taken from the first BERT for QA with SQuADv2 paper _and_ Deepset AI’s BERT training parameters: a learning rate of `2e-5`, `0.1` weight decay, and batches of `24` for `3` epochs [1] [2].
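One detail worth checking is the warmup length. The `TrainingArguments` above derive `warmup_steps` from `len(squad)`, which counts training _examples_ rather than optimizer steps. If the intent was to warm up over roughly the first 10% of training (an assumption on our part), the arithmetic below shows how the step count could be derived instead. It assumes a single device and no gradient accumulation.

```python
import math

num_examples = 130_319   # rows in the SQuAD v2 training split loaded earlier
batch_size = 24
epochs = 3
warmup_fraction = 0.1    # assumption: warm up over the first 10% of training

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_fraction)

# For comparison: the example-based value used in the arguments above.
example_based = int(num_examples * epochs * warmup_fraction)

print(steps_per_epoch, total_steps, warmup_steps, example_based)
```

With these numbers the run has 16,290 optimizer steps in total, so a 10% warmup would be 1,629 steps, whereas the example-based value of 39,095 is larger than the whole run.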

```python
from transformers import Trainer
import torch

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

trainer = Trainer(
    model.to(device),
    args,
    train_dataset=squad,
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()
# Output:
# ***** Running training *****
#   Num examples = 130319
#   Num Epochs = 3
#   Instantaneous batch size per device = 24
#   Total train batch size (w. parallel, distributed & accumulation) = 24
#   Gradient Accumulation steps = 1
#   Total optimization steps = 16290
#   3%|▎         | 500/16290 [02:45<1:27:11,  3.02it/s]Saving model checkpoint to bert-base-uncased-squad2...
#
# Model weights saved in bert-base-uncased-squad2\checkpoint-16000\pytorch_model.bin
# tokenizer config file saved in bert-base-uncased-squad2\checkpoint-16000\tokenizer_config.json
# Special tokens file saved in bert-base-uncased-squad2\checkpoint-16000\special_tokens_map.json
# 100%|██████████| 16290/16290 [1:26:22<00:00,  3.24it/s]
#
# Training completed. Do not forget to share your model on huggingface.co/models =)
#
# {'train_runtime': 5182.0739, 'train_samples_per_second': 75.444, 'train_steps_per_second': 3.144, 'train_loss': 2.052677161566241, 'epoch': 3.0}
#
# TrainOutput(global_step=16290, training_loss=2.052677161566241, metrics={'train_runtime': 5182.0739, 'train_samples_per_second': 75.444, 'train_steps_per_second': 3.144, 'train_loss': 2.052677161566241, 'epoch': 3.0})
```

As we said, fine-tuning the model is the easy part. We can find our model files in the directory defined in the `args` parameter, in this case, `./bert-base-uncased-squad2`. We will see a set of folders named `checkpoint-x` in this directory. The last of those is the _latest_ model checkpoint saved during training.

![Model and tokenizer files in the /bert-reader-squad2 model directory.](https://cdn.sanity.io/images/vr8gru94/production/da108701ac18f64e90a3932a3f643a08ef0ea48d-1420x1040.png)


By default, a new checkpoint is saved every 500 steps. This means the last saved checkpoint is not the _final_ model (at step 16,290) but rather the model at step 16,000.

There is unlikely to be a noticeable difference between these two states, so we either take the model files from `./bert-base-uncased-squad2/checkpoint-16000` or we manually save our model with:

```python
trainer.save_model('bert-reader-squad2')
# Output:
# Saving model checkpoint to bert-reader-squad2
# Configuration saved in bert-reader-squad2\config.json
# Model weights saved in bert-reader-squad2\pytorch_model.bin
# tokenizer config file saved in bert-reader-squad2\tokenizer_config.json
# Special tokens file saved in bert-reader-squad2\special_tokens_map.json
```

We can find the model files in the specified directory.

### Inference

Before moving on to the next step of evaluation, let’s take a look at how we can use this model.

First, we initialize a transformers `pipeline`.

```python
from transformers import pipeline

model_name = 'bert-reader-squad2'

qa = pipeline(
    'question-answering',
    model=model_name,
    tokenizer=model_name
)
```

Next, we prepare the evaluation data. Again we will use the `squad_v2` dataset from HuggingFace, taking the _validation_ split.

```python
from datasets import load_dataset

dev = load_dataset('squad_v2', split='validation')
dev
# Output:
# Dataset({
#     features: ['id', 'title', 'context', 'question', 'answers'],
#     num_rows: 11873
# })
```

The `pipeline` expects each input as a set of key-value pairs where the only keys are `question` and `context`, so we drop the `id`, `title`, and `answers` columns. We still need the true answers for the _evaluation_ step, so before dropping them we store them separately in `ans`.

```json
{
  "_key": "e4892852412c",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ans = dev['answers']\\n\",\n    \"dev = dev.remove_columns(['id', 'title', 'answers'])\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

To make a prediction, we take a single _question_ and _context_ and feed them into our pipeline `qa`:

```json
{
  "_key": "780422b81804",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'score': 0.7120676040649414,\\n\",\n       \" 'start': 94,\\n\",\n       \" 'end': 122,\\n\",\n       \" 'answer': '10th and 11th centuries gave'}\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"qa({\\n\",\n    \"    'question': dev[1]['question'],\\n\",\n    \"    'context': context\\n\",\n    \"})\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We’ll process the whole dataset like this in the next section.

## Evaluation

We’ve technically finished fine-tuning our model, but it’s not of much use until we can validate its performance.

Evaluation of our reader model is a little tricky as we want to identify matches between true and predicted answer labels. The most straightforward approach is to use an **E**xact **M**atch metric. This metric will simply tell us `1` if the true and predicted answers are _precisely_ the same or `0` if not.

There are two reasons we might want to avoid this and try something more flexible. First, we may find that a model predicts the correct answer, but when decoded, the predicted tokens are in a slightly different format.

The second reason is that our model might predict a _partially correct_ answer. Partially correct is better than nothing, but the EM metric gives no credit for it.

```json
{
  "_key": "eaa3deeacc43",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"(0,\\n\",\n       \" [{'rouge-1': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223},\\n\",\n       \"   'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},\\n\",\n       \"   'rouge-l': {'r': 1.0, 'p': 0.5, 'f': 0.6666666622222223}}])\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"a = \\\"the Normans\\\"\\n\",\n    \"b = \\\"Normans\\\"\\n\",\n    \"\\n\",\n    \"exact_match = int(a == b)\\n\",\n    \"rouge_score = rouge.get_scores(a, b)\\n\",\n    \"exact_match, rouge_score\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We can solve the first issue in _most cases_ by normalizing both the true and predicted answers, meaning we lowercase, remove punctuation, and remove any other potential points of conflict.
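
The normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact procedure used for the scores in this article: the helper names `normalize_answer` and `exact_match` are hypothetical, and the optional `strip_articles` flag (dropping _a_, _an_, _the_) is an assumption borrowed from common SQuAD-style scoring.

```python
import re
import string


def normalize_answer(text: str, strip_articles: bool = False) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # drop ASCII punctuation characters
    text = "".join(ch for ch in text if ch not in string.punctuation)
    if strip_articles:
        # assumption: treat English articles as noise, as SQuAD-style scoring often does
        text = re.sub(r"\b(a|an|the)\b", " ", text)
    # collapse runs of whitespace into single spaces
    return " ".join(text.split())


def exact_match(prediction: str, truth: str, strip_articles: bool = False) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    same = normalize_answer(prediction, strip_articles) == normalize_answer(truth, strip_articles)
    return int(same)


print(exact_match("The Normans.", "the normans"))  # casing and punctuation no longer matter
print(exact_match("the Normans", "Normans"))       # still 0: the extra token remains
```

Note that normalization alone still scores `"the Normans"` against `"Normans"` as a miss unless articles are stripped, which is why a partial-match metric is still useful.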

The second problem requires a more sophisticated solution. Rather than the EM metric, we use _ROUGE_, which gives partial credit for partial matches.

There are a few different ROUGE metrics. We will focus on ROUGE-N, which measures the number of matching _n-grams_ between the predicted and true answers, where an n-gram is a grouping of tokens/words.

The _N_ in ROUGE-_N_ stands for the number of tokens/words within a single n-gram. This means that ROUGE-1 compares individual tokens/words (unigrams), ROUGE-2 compares tokens/words in chunks of two (bigrams), and so on.

![Example of unigram, bigram, and trigram which are single-token, double-token, and triple-token groupings respectively.](https://cdn.sanity.io/images/vr8gru94/production/a64a185bbee68f22e5f2ec1cf4f07f27e018b35f-1920x720.png)
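
To make the n-gram comparison concrete, here is a minimal from-scratch sketch of ROUGE-N. It assumes simple whitespace tokenization and clipped n-gram counts, so its numbers may differ slightly from library implementations; the function names `ngrams` and `rouge_n` are hypothetical and not part of any package.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction: str, truth: str, n: int = 1) -> dict:
    """Recall, precision, and F1 over n-grams shared by prediction and truth."""
    pred = ngrams(prediction.split(), n)
    true = ngrams(truth.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & true).values())
    p = overlap / sum(pred.values()) if pred else 0.0
    r = overlap / sum(true.values()) if true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


print(rouge_n("the Normans", "Normans", n=1))  # r = 1.0, p = 0.5, f is roughly 0.667
print(rouge_n("the Normans", "Normans", n=2))  # no shared bigrams, so all zeros
```

Precision is the share of predicted n-grams found in the true answer, recall is the share of true n-grams found in the prediction, and F1 is their harmonic mean.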


Whichever n-gram size we choose, ROUGE returns a score of `1` for an exact match, `0` for no match, or any value in between.

```json
{
  "_key": "da33ce0c91e1",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from rouge import Rouge\\n\",\n    \"\\n\",\n    \"rouge = Rouge()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},\\n\",\n       \"  'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},\\n\",\n       \"  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}}]\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"rouge.get_scores('hello this is an exact match', 'hello this is an exact match')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0},\\n\",\n       \"  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},\\n\",\n       \"  'rouge-l': {'r': 0.0, 'p': 0.0, 'f': 0.0}}]\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"rouge.get_scores('hello this is not a match', 'because nothing matches')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'rouge-1': {'r': 0.5, 'p': 0.4, 'f': 0.4444444395061729},\\n\",\n       \"  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},\\n\",\n       \"  'rouge-l': {'r': 0.25, 'p': 0.2, 'f': 0.22222221728395072}}]\"\n      ]\n     },\n     \"execution_count\": 14,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"rouge.get_scores('this is a half match', 'because 
half is matching')\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

To apply ROUGE-1 for measuring reader model performance, we first need to _predict_ answers using our model. We can then compare these predicted answers to the true answers.

```json
{
  "_key": "caf7b2cb30db",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 11873/11873 [18:40<00:00, 10.59it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"results = []\\n\",\n    \"\\n\",\n    \"for i in tqdm(range(len(dev))):\\n\",\n    \"    out = qa(dev[i])\\n\",\n    \"    results.append({\\n\",\n    \"        **out,\\n\",\n    \"        'true_answer': ans[i]['text']\\n\",\n    \"    })\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Finally, given the two sets of answers, we can call `rouge.get_scores` to return recall `r`, precision `p`, and F1 `f` scores for unigrams (`rouge-1`), bigrams (`rouge-2`), and the longest common subsequence (`rouge-l`).

Two complications remain: some samples have no answer at all, and the SQuAD evaluation set contains multiple possible answers for each answerable sample (four in the first example below).

```json
{
  "_key": "881ac2ec77bf",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '56ddde6b9a695914005b9629',\\n\",\n       \" 'title': 'Normans',\\n\",\n       \" 'context': 'The Normans were the people who in the 10th and 11th centuries gave their name to Normandy, a region... over the succeeding centuries.',\\n\",\n       \" 'question': 'When were the Normans in Normandy?',\\n\",\n       \" 'answers': {'text': ['10th and 11th centuries',\\n\",\n       \"   'in the 10th and 11th centuries',\\n\",\n       \"   '10th and 11th centuries',\\n\",\n       \"   '10th and 11th centuries'],\\n\",\n       \"  'answer_start': [94, 87, 94, 94]}}\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dev[1]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '5ad39d53604f3c001a3fe8d1',\\n\",\n       \" 'title': 'Normans',\\n\",\n       \" 'context': 'The Normans were the people who in the 10th and 11th centuries gave their name to Normandy, a region... 
over the succeeding centuries.',\\n\",\n       \" 'question': \\\"Who gave their name to Normandy in the 1000's and 1100's\\\",\\n\",\n       \" 'answers': {'text': [], 'answer_start': []}}\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dev[5]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

For the ‘no answer’ scenario, we can check whether the model correctly predicted that no answer exists. If it did, we return a score of _1.0_. Otherwise, we return a score of _0.0_.

To deal with multiple answers, we calculate the ROUGE-1 F1 score against every possible answer and take the best score.

After calculating all scores, we take the average value. This average value is the final ROUGE-1 F1 score for the model.
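
The scoring rules above can be sketched as follows. This is an illustration under stated assumptions rather than the exact code behind the tables: `unigram_f1` is a simple stand-in for the ROUGE-1 F1 score, an empty predicted string is assumed to mean "no answer", and each result is assumed to be a dictionary with `answer` and `true_answer` keys, as in the prediction loop earlier.

```python
from collections import Counter


def unigram_f1(prediction: str, truth: str) -> float:
    """Stand-in for ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred = Counter(prediction.lower().split())
    true = Counter(truth.lower().split())
    overlap = sum((pred & true).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(pred.values())
    r = overlap / sum(true.values())
    return 2 * p * r / (p + r)


def score_example(prediction: str, true_answers: list) -> float:
    """Best score over all true answers; 1.0 or 0.0 when no answer exists."""
    if not true_answers:
        # assumption: an empty prediction means the model predicted "no answer"
        return 1.0 if prediction.strip() == "" else 0.0
    return max(unigram_f1(prediction, truth) for truth in true_answers)


def mean_score(results: list) -> float:
    """Average the per-example scores into one final score."""
    scores = [score_example(r["answer"], r["true_answer"]) for r in results]
    return sum(scores) / len(scores) if scores else 0.0


results = [
    {"answer": "10th and 11th centuries gave",
     "true_answer": ["10th and 11th centuries", "in the 10th and 11th centuries"]},
    {"answer": "The Normans", "true_answer": []},  # unanswerable, but the model answered
]
print(mean_score(results))
```

In this toy run the first prediction scores well against its closest true answer, while the second scores zero because the model answered an unanswerable question, which pulls the average down sharply.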

| Model | ROUGE-1 F1 |
| --- | --- |
| bert-reader-squad2 | 0.354 |
| deepset/bert-base-uncased-squad2 | 0.450 |

These scores seem surprisingly low. A big reason for this is the _no answer scenarios_. Let’s take a look at a few.

```json
{
  "_key": "35807a8f4c2d",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'context': 'The Normans were the people who in the 10th and 11th centuries gave their name to Normandy...',\\n\",\n       \" 'question': \\\"Who gave their name to Normandy in the 1000's and 1100's\\\"}\"\n      ]\n     },\n     \"execution_count\": 20,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dev[5]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'context': 'In the course of the 10th century, the initially destructive incursions of Norse war bands into the rivers of France evolved into more permanent encampments...',\\n\",\n       \" 'question': 'when did Nors encampments ivolve into destructive incursions?'}\"\n      ]\n     },\n     \"execution_count\": 21,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dev[24]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'context': '... 
Jerónimo de Ayanz y Beaumont received patents in 1606 for fifty steam powered inventions, including a water pump for draining inundated mines...',\\n\",\n       \" 'question': 'In what year did Jeronimo de Ayanz y Beaumont patent a water pump for draining patients?'}\"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dev[1917]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"5188bc372fa413aa2565ae5d28228f50ad7b2c4ebb4a82c5900fd598adbb6408\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

If, like me, you’re wondering how these are unanswerable, take note of the particular question and context wording. The first example asks about the 1000s and 1100s, but the context refers to the 10th and 11th centuries, i.e., the 900s and 1000s. The second question reverses the context, which says that **destructive incursions** evolved into **encampments**, not the other way around. The third asks about draining _patients_, while the context describes draining **mines**.

Even humans could easily mistake each of these questions for an answerable one. If we remove unanswerable examples, the model scores are less surprising.

| Model | ROUGE-1 F1 |
| --- | --- |
| bert-reader-squad2 | 0.708 |
| deepset/bert-base-uncased-squad2 | 0.901 |
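
As a rough sketch of how the two views of the score relate, the snippet below averages a list of per-example scores once over everything and once over answerable examples only. The function name `split_means` and the toy numbers are hypothetical; they are not the article's data.

```python
def split_means(scores: list, true_answers: list) -> tuple:
    """Return (mean over all examples, mean over answerable examples only)."""
    # an example is answerable when its list of true answers is non-empty
    answerable = [s for s, truths in zip(scores, true_answers) if truths]
    overall = sum(scores) / len(scores) if scores else 0.0
    answerable_only = sum(answerable) / len(answerable) if answerable else 0.0
    return overall, answerable_only


# toy values: two answerable examples and two unanswerable ones the model got wrong
scores = [1.0, 0.0, 0.5, 0.0]
true_answers = [["10th and 11th centuries"], [], ["1606"], []]
print(split_means(scores, true_answers))  # (0.375, 0.75)
```

Reporting both numbers side by side makes it easier to see how much of a low overall score comes from unanswerable questions.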

The importance of identifying unanswerable questions varies between use cases. Many will not need it at all, so consider whether your models should prioritize identifying unanswerable questions or focus on performing well on answerable ones.

That’s it for this walkthrough on fine-tuning reader models for ODQA pipelines. By understanding how to fine-tune and evaluate a QA reader model, we can optimize the final step in the ODQA pipeline for our own specific use cases.

Pairing this with a custom vector database and retriever components allows us to add highly optimized ODQA capabilities to a variety of possible use cases, such as internal document search, e-commerce product discovery, or anything where a more natural information retrieval experience can be beneficial.

## References

[1] Y. Zhang, Z. Xu, [BERT for Question Answering on SQuAD 2.0](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15848021.pdf) (2019)

[2] [Model Card for deepset/bert-base-uncased-squad2](https://huggingface.co/deepset/bert-base-uncased-squad2), HuggingFace Model Hub