# Next-Gen Sentence Embeddings with Multiple Negatives Ranking Loss

Transformer-produced sentence embeddings have come a long way in a very short time. Starting with the slow but accurate similarity prediction of BERT cross-encoders, the world of [sentence embeddings](https://www.pinecone.io/learn/series/nlp/sentence-embeddings/) was ignited with the introduction of SBERT in 2019 [1]. Since then, many more sentence transformers have been introduced. These models quickly made the original SBERT obsolete.

How did these newer sentence transformers manage to outperform SBERT so quickly? The answer is _multiple negatives ranking (MNR) loss_.

This article will cover what MNR loss is, the data it requires, and how to implement it to fine-tune our own high-quality sentence transformers.

Implementation will cover two training approaches. The first is more involved, and outlines the exact steps to fine-tune the model. The second approach makes use of the `sentence-transformers` library’s excellent utilities for fine-tuning.

[Video](https://www.youtube.com/watch?v=or5ew7dqA-c)


## NLI Training

As explained in our article on [softmax loss](https://www.pinecone.io/learn/series/nlp/train-sentence-transformers-softmax/), we can fine-tune sentence transformers using **N**atural **L**anguage **I**nference (NLI) datasets.

These datasets contain many sentence pairs, some that _imply_ each other, and others that _do not imply_ each other. As with the softmax loss article, we will use two of these datasets: the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) corpora.

Together, these two corpora contain roughly 943K sentence pairs. Each pair consists of a `premise` and `hypothesis` sentence, which are assigned a `label`:

- **0** — _entailment_, i.e. the `premise` implies the `hypothesis`.
- **1** — _neutral_, the `premise` and `hypothesis` could both be true, but they are not necessarily related.
- **2** — _contradiction_, the `premise` and `hypothesis` contradict each other.
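For illustration, an entailment pair from SNLI looks roughly like the row below (a hypothetical but representative example):

```python
# a representative entailment row from the merged NLI data (illustrative values)
row = {
    'premise': 'A person on a horse jumps over a broken down airplane.',
    'hypothesis': 'A person is outdoors, on a horse.',
    'label': 0  # entailment
}
```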

When fine-tuning with MNR loss, we will be dropping all rows with _neutral_ or _contradiction_ labels — keeping only the positive _entailment_ pairs.

On each step, we feed sentence A (the `premise`, known as the _anchor_) into BERT, followed by sentence B (the `hypothesis`; when the label is **0**, this is called the _positive_). Unlike with softmax loss, we do not use the `label` feature.

These training steps are performed in batches, meaning several anchor-positive pairs are processed at once.

The model is then optimized to produce similar embeddings between pairs while maintaining different embeddings for non-pairs. We will explain this in more depth soon.

### Data Preparation

Let’s look at the data preparation process. We first need to download and merge the two NLI datasets. We will use the `datasets` library from Hugging Face.

```python
import datasets

snli = datasets.load_dataset('snli', split='train')
mnli = datasets.load_dataset('glue', 'mnli', split='train')

snli = snli.cast(mnli.features)

dataset = datasets.concatenate_datasets([snli, mnli])

del snli, mnli
```

Because we are using MNR loss, we only want anchor-positive pairs. We can apply a filter to remove all other pairs (including erroneous `-1` labels).

```python
print(f"before: {len(dataset)} rows")  # before: 942854 rows
dataset = dataset.filter(
    lambda x: True if x['label'] == 0 else False
)
print(f"after: {len(dataset)} rows")  # after: 314315 rows
```

The dataset is now prepared differently depending on the training method we are using. We will continue preparation for the more involved PyTorch approach. If you’d rather just train a model and are less interested in the steps involved, feel free to skip ahead to the _Fast Fine-Tuning_ section.

For the PyTorch approach, we must tokenize our own data. To do that, we will be using a `BertTokenizer` from the `transformers` library and applying the `map` method on our `dataset`.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

dataset = dataset.map(
    lambda x: tokenizer(
        x['premise'], max_length=128, padding='max_length',
        truncation=True
    ), batched=True
)

dataset = dataset.rename_column('input_ids', 'anchor_ids')
dataset = dataset.rename_column('attention_mask', 'anchor_mask')

dataset
# Dataset({
#     features: ['anchor_mask', 'hypothesis', 'anchor_ids', 'label', 'premise', 'token_type_ids'],
#     num_rows: 314315
# })
```

We then tokenize the `hypothesis` sentences in the same way.

```python
dataset = dataset.map(
    lambda x: tokenizer(
        x['hypothesis'], max_length=128, padding='max_length',
        truncation=True
    ), batched=True
)

dataset = dataset.rename_column('input_ids', 'positive_ids')
dataset = dataset.rename_column('attention_mask', 'positive_mask')

dataset = dataset.remove_columns(['premise', 'hypothesis', 'label', 'token_type_ids'])

dataset
# Dataset({
#     features: ['anchor_ids', 'anchor_mask', 'positive_mask', 'positive_ids'],
#     num_rows: 314315
# })
```

After that, we’re ready to initialize our `DataLoader`, which will be used for loading batches of data into our model during training.

```python
import torch

dataset.set_format(type='torch', output_all_columns=True)

batch_size = 32

loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
```

And with that, our data is ready. Let’s move on to training.

### PyTorch Fine-Tuning

When training SBERT models, we don’t start from scratch. Instead, we begin with an already pretrained BERT — all we need to do is _fine-tune_ it for building sentence embeddings.

```python
from transformers import BertModel

# start from a pretrained bert-base-uncased model
model = BertModel.from_pretrained('bert-base-uncased')
```

Both MNR and softmax loss training use a _'siamese'_ BERT architecture during fine-tuning, meaning that on each step we process sentence A (our _anchor_) with BERT, followed by sentence B (our _positive_).

![Siamese-BERT network, the anchor and positive sentence pairs are processed separately. A mean pooling layer converts token embeddings into sentence embeddings. Sentence A is our anchor and sentence B the positive.](https://cdn.sanity.io/images/vr8gru94/production/f570df278a344cd53fca7f045cef4db9b7c81ac9-1920x1080.png)


Because these two sentences are processed _separately_, it creates a _siamese_-like network with two identical BERTs trained in parallel. In reality, there is only a single BERT being used twice in each step.

We can extend this further with _triplet_ networks. In the case of triplet networks for MNR, we would pass three sentences: an _anchor_, its _positive_, and its _negative_. However, we are _not_ using triplet networks here, so we have removed the contradiction rows from our dataset (rows where `label` is `2`), which would otherwise have served as the negatives.

![Triplet networks use the same logic but with an added sentence. For MNR loss this other sentence is the negative pair of the anchor.](https://cdn.sanity.io/images/vr8gru94/production/b6eb33679dc0961b6f6f5d7a58be466bbfd0f5de-1920x1080.png)


BERT outputs one 768-dimensional embedding per token (128 tokens at the padding length we chose, up to a maximum of 512). We convert these into _averaged_ sentence embeddings using _mean-pooling_. Using the siamese approach, we produce two sentence embeddings per step: one for the _anchor_, which we will call `a`, and another for the _positive_, called `p`.

```python
# define mean pooling function
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool
```

In the `mean_pool` function, we take the token-level embeddings and the sentence’s `attention_mask` tensor. We then resize the `attention_mask` to match the higher `768`-dimensionality of the token embeddings.

The resized mask `in_mask` is applied to the token embeddings to exclude padding tokens from the mean pooling operation. Mean-pooling takes the average activation across each dimension but excludes the padding values, which would otherwise drag the average down. This operation transforms our token-level embeddings (shape `128*768`) into sentence-level embeddings (shape `1*768`).

These steps are performed in _batches_, meaning we do this for many _(anchor, positive)_ pairs in parallel. That is important in our next few steps.
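To make the shape checks below concrete, here is a minimal sketch of how one batch from our `loader` could be turned into the pooled anchor and positive tensors `a` and `p` (device placement is omitted here and handled in the full training loop later):

```python
# a minimal sketch: pooled sentence embeddings for one batch
batch = next(iter(loader))

# token-level embeddings from BERT (the last hidden state)
a_tokens = model(batch['anchor_ids'], attention_mask=batch['anchor_mask'])[0]
p_tokens = model(batch['positive_ids'], attention_mask=batch['positive_mask'])[0]

# mean pooling gives one 768-d vector per sentence
a = mean_pool(a_tokens, batch['anchor_mask'])
p = mean_pool(p_tokens, batch['positive_mask'])
```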

```python
a.shape  # check shape of batched inputs (batch_size == 32)
# torch.Size([32, 768])

p.shape
# torch.Size([32, 768])
```

First, we calculate the cosine similarity between each anchor embedding (`a`) and _all_ of the positive embeddings in the same batch (`p`).

```python
# define cosine sim layer
cos_sim = torch.nn.CosineSimilarity()

scores = []
for a_i in a:
    scores.append(cos_sim(a_i.reshape(1, a_i.shape[0]), p))

scores = torch.stack(scores)
scores
# tensor([[0.7799, 0.3883, 0.7147,  ..., 0.7094, 0.7934, 0.6639],
#         [0.6685, 0.5236, 0.6153,  ..., 0.6807, 0.7095, 0.6229],
#         ...,
#         [0.7391, 0.4418, 0.7139,  ..., 0.8012, 0.9189, 0.6312]],
#        device='cuda:0', grad_fn=<StackBackward>)

scores.shape
# torch.Size([32, 32])
```

From here, we produce a vector of cosine similarity scores (of size `batch_size`) for each anchor embedding `a_i` _(or size_ _`2 * batch_size`_ _for triplets)_. Each anchor should share the highest score with its positive pair, `p_i`.

[Video](https://d33wubrfki0l68.cloudfront.net/bf222034e7fea505bffbdf3ae2d21a78137dfb60/e92d5/images/fine-tuning-sentence-transformers-mnr-loss-4.mp4)


To optimize for this, we use a set of increasing label values to mark where the highest score should be for each `a_i`, and categorical [cross-entropy loss](https://www.pinecone.io/learn/cross-entropy-loss/).
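Written out for a batch of $N$ pairs, this combination of scaled cosine similarity scores and categorical cross-entropy gives the objective we minimize, where $s$ is the scale factor (we use $s = 20$ in the training loop below):

$$
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(s \cdot \cos(a_i, p_i)\big)}{\sum_{j=1}^{N} \exp\big(s \cdot \cos(a_i, p_j)\big)}
$$

Each anchor's own positive plays the role of the "correct class", and every other positive in the batch acts as an in-batch negative.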

```python
labels = torch.tensor(range(len(scores)), dtype=torch.long, device=scores.device)
labels
# tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
#         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
#        device='cuda:0')

# define loss function
loss_func = torch.nn.CrossEntropyLoss()

loss_func(scores, labels)
# tensor(3.3966, device='cuda:0', grad_fn=<NllLossBackward>)
```

And that’s every component we need for fine-tuning with MNR loss. Let’s put that all together and set up a training loop. First, we move our model and layers to a CUDA-enabled GPU _if available_.

```python
# set device and move model there
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
print(f'moved to {device}')
# moved to cuda

# define layers to be used in multiple-negatives-ranking
cos_sim = torch.nn.CosineSimilarity()
loss_func = torch.nn.CrossEntropyLoss()
scale = 20.0  # we multiply similarity scores by this scale value
# move layers to device
cos_sim.to(device)
loss_func.to(device)
```

Then we set up the optimizer and schedule for training. We use an Adam optimizer with a linear warmup for 10% of the total number of steps.

```python
from transformers.optimization import get_linear_schedule_with_warmup

# initialize Adam optimizer
optim = torch.optim.Adam(model.parameters(), lr=2e-5)

# setup warmup for first ~10% of steps
total_steps = int(len(dataset) / batch_size)
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optim, num_warmup_steps=warmup_steps,
    num_training_steps=total_steps - warmup_steps
)
```

And now we define the training loop, using the same training process that we worked through before.

```python
from tqdm.auto import tqdm

epochs = 1  # 1 epoch should be enough, increase if wanted

for epoch in range(epochs):
    model.train()  # make sure model is in training mode
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # zero all gradients on each new step
        optim.zero_grad()
        # prepare batches and move all to the active device
        anchor_ids = batch['anchor_ids'].to(device)
        anchor_mask = batch['anchor_mask'].to(device)
        pos_ids = batch['positive_ids'].to(device)
        pos_mask = batch['positive_mask'].to(device)
        # extract token embeddings from BERT
        a = model(
            anchor_ids, attention_mask=anchor_mask
        )[0]  # all token embeddings
        p = model(
            pos_ids, attention_mask=pos_mask
        )[0]
        # get the mean pooled vectors
        a = mean_pool(a, anchor_mask)
        p = mean_pool(p, pos_mask)
        # calculate the cosine similarities
        scores = torch.stack([
            cos_sim(
                a_i.reshape(1, a_i.shape[0]), p
            ) for a_i in a])
        # get label(s) - we could define this before if confident of consistent batch sizes
        labels = torch.tensor(range(len(scores)), dtype=torch.long, device=scores.device)
        # and now calculate the loss
        loss = loss_func(scores * scale, labels)
        # using loss, calculate gradients and then optimize
        loss.backward()
        optim.step()
        # update learning rate scheduler
        scheduler.step()
        # update the tqdm progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
# Epoch 0: 100%|██████████| 9823/9823 [49:02<00:00,  3.34it/s, loss=0.00158]
```

With that, we’ve fine-tuned our BERT model using MNR loss. Now we save it to file.

```python
import os

model_path = './sbert_test_mnr'

if not os.path.exists(model_path):
    os.mkdir(model_path)

model.save_pretrained(model_path)
```

And this can now be loaded using either the `SentenceTransformer` or Hugging Face `from_pretrained` methods.
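A minimal sketch of both loading routes might look like this (assuming we also save the tokenizer to the same directory so that it is self-contained):

```python
from transformers import BertModel
from sentence_transformers import models, SentenceTransformer

# save the tokenizer alongside the fine-tuned weights (assumed for this sketch)
tokenizer.save_pretrained('./sbert_test_mnr')

# option 1: load with Hugging Face transformers
bert = BertModel.from_pretrained('./sbert_test_mnr')

# option 2: wrap the weights as a SentenceTransformer with the same mean pooling setup
transformer = models.Transformer('./sbert_test_mnr')
pooler = models.Pooling(
    transformer.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)
sbert = SentenceTransformer(modules=[transformer, pooler])

# encode a couple of sentences into 768-d embeddings
embeddings = sbert.encode(['a sentence to embed', 'another sentence'])
```

Before we move on to testing the model performance, let’s look at how we can replicate that fine-tuning logic using the _much simpler_ `sentence-transformers` library.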

## Fast Fine-Tuning

As we already mentioned, there is an easier way to fine-tune models using MNR loss. The `sentence-transformers` library allows us to use pretrained sentence transformers and comes with some handy training utilities.

We will start by preprocessing our data. This is the same as we did before for the first few steps.

```python
import datasets

snli = datasets.load_dataset('snli', split='train')
mnli = datasets.load_dataset('glue', 'mnli', split='train')

snli = snli.cast(mnli.features)

dataset = datasets.concatenate_datasets([snli, mnli])

del snli, mnli

print(f"before: {len(dataset)} rows")  # before: 942854 rows
dataset = dataset.filter(
    lambda x: True if x['label'] == 0 else False
)
print(f"after: {len(dataset)} rows")  # after: 314315 rows
```

Before, we tokenized our data and then loaded it into a PyTorch `DataLoader`. This time we follow a slightly different format: we _don’t_ tokenize; instead, we reformat the data into a list of `sentence-transformers` `InputExample` objects and use a slightly different data loader.

```python
from sentence_transformers import InputExample
from tqdm.auto import tqdm  # so we see a progress bar

train_samples = []
for row in tqdm(dataset):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']]
    ))

from sentence_transformers import datasets

batch_size = 32

loader = datasets.NoDuplicatesDataLoader(
    train_samples, batch_size=batch_size)
```

Our `InputExample` objects contain just our anchor and positive sentence pairs, which we then feed into the `NoDuplicatesDataLoader`. This data loader ensures that each batch is duplicate-free, which matters for MNR loss: every other positive in a batch is treated as a negative for a given anchor, so a duplicate pair would wrongly be scored as a negative.

Now we define the model. The `sentence-transformers` library allows us to build models using _modules_. We need just a transformer model (we will use `bert-base-uncased` again) and a mean pooling module.

```python
from sentence_transformers import models, SentenceTransformer

bert = models.Transformer('bert-base-uncased')
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[bert, pooler])

model
# SentenceTransformer(
#   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
#   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True, ...})
# )
```

We now have an initialized model. Before training, all that’s left is the loss function — MNR loss.

```python
from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)
```

And with that, we have our data loader, model, and loss function ready. All that’s left is to fine-tune the model! As before, we will train for a single epoch and warmup for the first 10% of our training steps.

```python
epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='./sbert_test_mnr2',
    show_progress_bar=False
)  # 'show_progress_bar=False' as it printed every step on a new line
```

And a couple of hours later, we have a new sentence transformer model trained using MNR loss. It goes without saying that using the `sentence-transformers` training utilities makes life _much easier_. To finish off the article, let’s look at the performance of our MNR loss SBERT next to other sentence transformers.

## Compare Sentence Transformers

We’re going to use a semantic textual similarity (STS) dataset to test the performance of _four models_: our MNR loss SBERT (trained once with PyTorch and once with `sentence-transformers`), the _original_ SBERT, and an MPNet model trained with MNR loss on a [1B+ sample dataset](https://huggingface.co/spaces/flax-sentence-embeddings/sentence-embeddings).

The first thing we need to do is download the STS dataset. Again we will use `datasets` from Hugging Face.

```python
import datasets

sts = datasets.load_dataset('glue', 'stsb', split='validation')

sts
# Dataset({
#     features: ['sentence1', 'sentence2', 'label', 'idx'],
#     num_rows: 1500
# })
```

STSb (or STS benchmark) contains sentence pairs in features `sentence1` and `sentence2`, each assigned a similarity score from _0 -> 5_.

Three samples from the validation set of STSb:

| sentence1 | sentence2 | label | idx |
| --- | --- | --- | --- |
| A man with a hard hat is dancing. | A man wearing a hard hat is dancing. | 5.0 | 0 |
| A man is riding a bike. | A woman is riding a horse. | 1.4 | 149 |
| A man is buttering a piece of bread. | A slow loris hanging on a cord. | 0.0 | 127 |

Because the similarity scores range from 0 -> 5, we need to normalize them to a range of 0 -> 1. We use `map` to do this.

```python
sts = sts.map(lambda x: {'label': x['label'] / 5.0})
```

We’re going to be using `sentence-transformers` evaluation utilities. We first need to reformat the STSb data using the `InputExample` class — passing the sentence features as `texts` and similarity scores to the `label` argument.

```python
from sentence_transformers import InputExample

samples = []
for sample in sts:
    samples.append(InputExample(
        texts=[sample['sentence1'], sample['sentence2']],
        label=sample['label']
    ))
```

To evaluate the models, we need to initialize the appropriate evaluator object. As we are evaluating continuous similarity scores, we use the `EmbeddingSimilarityEvaluator`.

```python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    samples, write_csv=False
)
```

And with that, we’re ready to begin evaluation. We load our model as a `SentenceTransformer` object and pass the model to our `evaluator`.

The evaluator outputs _Spearman's rank correlation_ between the cosine similarity scores calculated from the model's output embeddings and the similarity scores assigned in STSb. A perfect correlation between the two produces a value close to *+1*, and no correlation produces *0*.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('./sbert_test_mnr2')

evaluator(model)
# 0.8395419746815114
```
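Roughly speaking, the evaluator does something like the following under the hood (a simplified sketch using `scipy`; the actual evaluator also reports other distance metrics):

```python
import numpy as np
from scipy.stats import spearmanr

# embed both sides of every STSb pair
emb1 = model.encode([s.texts[0] for s in samples])
emb2 = model.encode([s.texts[1] for s in samples])

# cosine similarity for each pair
cos_scores = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)

# Spearman's rank correlation against the gold similarity scores
gold = [s.label for s in samples]
correlation, _ = spearmanr(cos_scores, gold)
```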

For the model fine-tuned with `sentence-transformers`, we get a correlation of _0.84_, meaning our model's similarity scores align well with those assigned in STSb. Let's compare that with other models.

| Model | Score |
| --- | --- |
| all_datasets_v3_mpnet-base | 0.89 |
| Custom SBERT with MNR (sentence-transformers) | 0.84 |
| Original SBERT bert-base-nli-mean-tokens | 0.81 |
| Custom SBERT with softmax (sentence-transformers) | 0.80 |
| Custom SBERT with MNR (PyTorch) | 0.79 |
| Custom SBERT with softmax (PyTorch) | 0.67 |
| bert-base-uncased | 0.61 |

The top two models are trained using MNR loss, followed by the original SBERT.

These results support the advice given by the authors of `sentence-transformers`, that models trained with MNR loss outperform those trained with softmax loss in building high-performing sentence embeddings [2].

Another key takeaway here is that, despite our best efforts, the models trained with the easy-to-use `sentence-transformers` utilities comfortably outperformed our more involved PyTorch implementations.

In short: fine-tune your models with MNR loss, and do it with the `sentence-transformers` library.

---

That’s it for this walkthrough and guide to fine-tuning sentence transformer models with multiple negatives ranking loss — the current best approach for building high-performance models.

We covered preprocessing the two most popular NLI datasets — the Stanford NLI and multi-genre NLI corpora — for fine-tuning with MNR loss. Then we delved into the details of this fine-tuning approach using PyTorch before taking advantage of the excellent training utilities provided by the `sentence-transformers` library.

Finally, we learned how to evaluate our sentence transformer models with the semantic textual similarity benchmark (STSb), identifying the highest-performing models.

## References

[1] N. Reimers, I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (2019), ACL

[2] N. Reimers, [Sentence Transformers NLI Training Readme](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/nli), GitHub