# Making the Most of Data: Domain Transfer with BERT

> When building language models, we can spend months optimizing training and model parameters, but it’s useless if we don’t have the correct data.

The success of our language models relies first and foremost on data. We previously covered a partial solution to this problem by applying the [Augmented SBERT training strategy to in-domain problems](https://www.pinecone.io/learn/series/nlp/data-augmentation/). That is, given a small dataset, we can artificially enlarge it to enhance our training data and improve model performance.

The in-domain approach assumes that our target use case aligns with that small initial dataset. But what if the only data we have _does not_ align? Maybe we have Quora question duplicate pairs, but we want to identify similar questions on StackOverflow.

Given this scenario, we must transfer information from the out-of-domain (or _source_) dataset to our target domain. Here, we will learn how to do exactly that. First, we will learn how to quickly assess which source datasets align best with our target domain. Then we will explain and work through the AugSBERT domain-transfer training strategy [2].

[Video](https://www.youtube.com/watch?v=a8jyue22SJM)


## Will it Work?

Before we even begin training our models, we can get a good approximation of whether the method will work with some simple _n-gram_ matching statistics [1].

We count how many n-grams two different domains share. If our _source_ domain shares minimal similarity with our _target_ domain, as measured by _n-gram_ matches, domain transfer is less likely to produce good results.

This behavior is reasonably straightforward to understand: the number of n-grams shared between our _source_ and _target_ domains indicates the linguistic and semantic overlap (or _gap_) between the two.

![Small n-gram overlap indicates a more significant gap between domains. More significant gaps require larger bridges (better models). The closer the two domains, the easier it is to bridge the gap.](https://cdn.sanity.io/images/vr8gru94/production/56bc6b16b86456803b27e60e642b393a05fe0050-1920x1080.png)


The greater the gap, the more difficult it is to bridge using our training strategy. Although models are becoming better at generalization, they're [still brittle when compared to our human-level ability](https://www.nature.com/articles/d41586-019-03013-5) to adapt knowledge across domains.

The _brittleness_ of language models means a small change can hamper performance. The more significant that change, the less likely our model will successfully translate its existing knowledge to the new domain.

We are similar. Although people are much more flexible and can apply pre-existing knowledge across domains incredibly well, we’re not perfect.

Given a book, we can tilt the pages at a slight five-degree angle, and most people will hardly notice the difference and continue reading. Turn the book upside-down, and many people will be unable to read. Others will begin to read slower. Our performance degrades with this small change.

If we are then given the same book in another language, most of us will have difficulty comprehending the book. It is still the same book, presented differently.

The knowledge transfer of models across different domains works in the same way: greater change results in lower performance.

### Calculating Domain Correlation

We will measure the _n-gram overlap_ between **five** domains, primarily from [Hugging Face Datasets](https://huggingface.co/datasets).

| Dataset | Download Script |
| --- | --- |
| STSb | `load_dataset('glue', 'stsb')` |
| Quora Question Pairs (QQP) | `load_dataset('glue', 'qqp')` |
| Microsoft Research Paraphrase Corpus (MRPC) | `load_dataset('glue', 'mrpc')` |
| Recognizing Textual Entailment (RTE) | `load_dataset('glue', 'rte')` |
| Medical Question Pairs (Med-QP) | see below |

[Link to Medical Question Pairs (Med-QP)](https://gist.github.com/jamescalam/2dbc9874b599dde95d8ddcdd018dfcf6)

To calculate the similarity, we perform three operations:

- Tokenize datasets

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('bert-base-uncased')

text = "the quick brown fox jumped over the lazy dog"

tokens = tokenizer.tokenize(text)
tokens
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

- Merge tokens into bi-grams (two-token pairs)

```python
ngrams = []
n = 2  # 2 for bigrams
for i in range(0, len(tokens), n):
    ngrams.append(' '.join(tokens[i:i+n]))
ngrams
# ['the quick', 'brown fox', 'jumped over', 'the lazy', 'dog']
```

- Calculate the Jaccard similarity between different n-grams.

```python
# create new bigrams to compare against
ngrams_2 = ['the little', 'brown fox', 'is very', 'slow']

def jaccard(x: list, y: list):
    # convert lists to sets
    x = set(x)
    y = set(y)
    # calculate overlap
    shared = x.intersection(y)
    total = x.union(y)
    return len(shared) / len(total)

jaccard(ngrams, ngrams_2)
# 0.125
```

_([Full script here](https://gist.github.com/jamescalam/15b48b1d9689e70ab9073e374ba3dc4a))_

After performing each of these steps and calculating the [Jaccard similarity](https://www.pinecone.io/learn/semantic-search/) between each dataset, we should get a _rough indication_ of how transferable models trained in one domain could be to another.
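
The pairwise comparison itself can be sketched with plain Python sets. The toy bigram sets below are illustrative stand-ins for the much larger sets extracted from the real datasets:

```python
# toy bigram sets standing in for the real extracted datasets
bigrams = {
    'stsb': {'a man', 'is playing', 'a guitar'},
    'qqp': {'how do', 'i learn', 'a guitar'},
    'medqp': {'is it', 'safe to', 'take ibuprofen'},
}

def jaccard(x: set, y: set) -> float:
    # shared n-grams divided by all unique n-grams
    return len(x & y) / len(x | y)

# build the pairwise similarity matrix between all datasets
sims = {
    (a, b): jaccard(bigrams[a], bigrams[b])
    for a in bigrams for b in bigrams
}
print(sims[('stsb', 'qqp')])    # one shared bigram out of five unique -> 0.2
print(sims[('stsb', 'medqp')])  # no overlap -> 0.0
```

With the real datasets, each set would hold every bigram from the dataset's sentence pairs, and the resulting matrix gives the scores visualized below.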

![Jaccard similarity scores between each of the five datasets.](https://cdn.sanity.io/images/vr8gru94/production/c223fe37a1052d3ab449bc928f397cfa3190f29a-1920x1080.png)


We can see that the _Med-QP_ dataset has the lowest similarity to the other datasets. The remainder are all reasonably similar to one another.

Other factors contribute to how well we can expect domain transfer to perform, such as the size of the source dataset and subsequent performance of the source cross encoder model within its own domain. We’ll take a look at these statistics soon.

## Implementing Domain Transfer

The AugSBERT training strategy for domain transfer follows a similar pattern to that explained in our [in-domain AugSBERT article](https://www.pinecone.io/learn/series/nlp/data-augmentation/), with the one exception that we train our cross-encoder in one domain and the bi-encoder (sentence transformer) in another.

At a high level, it looks like this:

![AugSBERT training strategy for cross-domain use.](https://cdn.sanity.io/images/vr8gru94/production/e0a7b3d2cc3bef6ce1016893828b4ca3760296f9-1920x820.png)


We start with a labeled dataset from our _source domain_ and an unlabeled dataset in our _target domain_. The source domain should be as similar as possible to our target domain.

The next step is to train the source domain cross-encoder. For this, we want to maximize cross-encoder performance, as the bi-encoder will essentially learn to replicate it. Better cross-encoder performance translates to better bi-encoder performance.

If the target dataset is very small (1-3K pairs), we may need to augment it, because bi-encoder models require more data to be trained to the same level as a cross-encoder model. A good target dataset should contain 10K or more pairs, although this can vary by use case.

We label the previously _unlabeled_ (and possibly _augmented_) target domain dataset with the trained cross-encoder.

The final step is to take the now labeled target domain data and use it to train the bi-encoder model.

That is all there is to it. We will add additional evaluation steps to confirm that the models are performing as expected, but otherwise, we’ll stick with the described process.

We already have our five datasets, and we will use each as both source and target data to see the difference in performance between domains.

When using a dataset for the _target domain_, we emulate a real-world use case (where we have no target data labels) by _not_ including existing labels and instead relying solely on the cross-encoder-generated labels.

### Cross Encoder Training

After downloading our labeled source data, we train the cross-encoder. To do this, we need to format the source data into `InputExample` objects, then load them into a PyTorch `DataLoader`.

```python
from sentence_transformers import InputExample
from torch.utils.data import DataLoader

data = []
# iterate through each row in the source dataset (ds, loaded with load_dataset)
for row in ds:
    # append InputExample object to the list
    data.append(InputExample(
        texts=[row['sentence1'], row['sentence2']],
        label=float(row['label'])
    ))

# initialize PyTorch DataLoader using data
source = DataLoader(
    data, shuffle=True, batch_size=16
)
```

It can be a good idea to take validation samples from either the source or target domains and create an `evaluator` that can be passed to the cross-encoder training function. With this, the script will output Pearson and Spearman correlation scores that we can use to assess model performance.

```python
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

dev_data = []
# iterate through each row again (this time the validation split)
for row in dev:
    # build up using InputExample objects
    dev_data.append(InputExample(
        texts=[row['sentence1'], row['sentence2']],
        label=float(row['label'])
    ))

# the dev data goes into an evaluator
evaluator = CECorrelationEvaluator.from_input_examples(
    dev_data
)
```

To train the cross-encoder model, we initialize a `CrossEncoder` and use its `fit` method. `fit` takes the source data dataloader, an (optional) evaluator, an `output_path` where we would like to save the trained model, and a few training parameters.

```python
from sentence_transformers import CrossEncoder

# initialize the cross-encoder
cross_encoder = CrossEncoder('bert-base-uncased', num_labels=1)

# setup the number of warmup steps, 0.2 == 20% warmup
num_epochs = 1
warmup = int(len(source) * num_epochs * 0.2)

cross_encoder.fit(
    train_dataloader=source,
    evaluator=evaluator,
    epochs=num_epochs,
    warmup_steps=warmup,
    optimizer_params={'lr': 5e-5},  # default 2e-5
    output_path=f'bert-{SOURCE}-cross-encoder'  # SOURCE = source dataset name
)
```

For the training parameters, it is a good idea to test various learning rates and warmup steps. A single epoch is usually enough to train the cross-encoder; anything beyond this is likely to cause overfitting. Overfitting is bad when the target data is _in-domain_, and it is even worse when the target data is out-of-domain.

For the five models being trained (plus a sixth trained on a restricted Quora-QP dataset containing 10K rather than the full ~364K training pairs), the following learning rates and percentages of warmup steps were used.

| Model | Learning Rate | Warmup | Evaluation (Spearman, Pearson) |
| --- | --- | --- | --- |
| `bert-mrpc-cross-encoder` | 5e-5 | 35% | (0.704, 0.661) |
| `bert-stsb-cross-encoder` | 2e-5 | 30% | (0.889, 0.887) |
| `bert-rte-cross-encoder` | 5e-5 | 30% | (0.383, 0.387) |
| `bert-qqp10k-cross-encoder` | 5e-5 | 20% | (0.688, 0.676) |
| `bert-qqp-cross-encoder` | 5e-5 | 20% | (0.823, 0.772) |
| `bert-medqp-cross-encoder` | 5e-5 | 40% | (0.737, 0.714) |

The Spearman and Pearson correlation values measure the correlation between the predicted and true labels for sentence pairs in the validation set. A value of 0.0 signifies no correlation, 0.5 a moderate correlation, and 0.8+ a strong correlation.
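
Both correlations can be computed directly with SciPy. The scores below are made up purely to illustrate the calculation:

```python
from scipy.stats import pearsonr, spearmanr

# hypothetical true labels and cross-encoder predictions
true = [0.0, 0.2, 0.5, 0.8, 1.0]
pred = [0.1, 0.3, 0.4, 0.7, 0.9]

pearson, _ = pearsonr(true, pred)
spearman, _ = spearmanr(true, pred)
# these predictions rank the pairs perfectly, so Spearman is exactly 1.0,
# while Pearson (which measures linear fit) falls slightly below 1.0
print(f'{spearman:.3f}, {pearson:.3f}')
```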

These results are fairly good; in particular, the `bert-stsb-cross-encoder` and _full_ `bert-qqp-cross-encoder` models return great performance. However, the `bert-rte-cross-encoder` model's performance is far from good.

The poor RTE performance is likely due in part to the small dataset size. However, as it is not significantly smaller than the other small datasets (Med-QP and MRPC in particular), we can assume that the RTE dataset is either (1) not as clean or (2) a more complex task.

| Dataset | Size |
| --- | --- |
| MRPC | 3,668 |
| STSb | 5,749 |
| RTE | 2,490 |
| Quora-QP | 363,846 |
| Med-QP | 2,753 |

We will find that this poor RTE performance doesn’t necessarily translate to poor performance in other domains. Indeed, _very good_ performance in the source domain can actually hinder performance in the target domain because the model must be able to _generalize_ well and not _specialize_ too much in a particular domain.

We will later be taking a pretrained BERT model, which already has a certain degree of performance in the target domains. Overtraining in the source domain can pull the pretrained model alignment away from the target domain, hindering performance.

A better measure of potential performance is to evaluate against a small (or big if possible) validation set in the target domain.

![Correlation scores between source cross-encoder models (x-axis) and target domain dev sets (y-axis). The bottom row indicates baseline performance using a pretrained `bert-base-uncased` model without fine-tuning. Scores below the baseline are marked in red, above it in cyan, and roughly equal in grey.](https://cdn.sanity.io/images/vr8gru94/production/bf6b9b3e1c96e2fafd407efd951149b6016d1a28-1920x1080.png)


These correlation values are a good indication of the performance we can expect from our bi-encoder model. Immediately, it is clear that, as expected from the earlier n-gram analysis, the Med-QP domain is not easily bridged.

At this point, we could consider dropping the low-performing source domains, although we will keep them to see how these low cross-encoder scores translate to bi-encoder performance.

### Labeling the Target Data

The next step is to create our labeled target dataset. We use the cross-encoder trained in the source domain to label the _unlabeled_ target data.

This is relatively straightforward. We take the unlabeled sentence pairs, transform them into a _list_ of sentence pairs, and feed them into the `cross_encoder.predict` method.

```python
# target data is from the training sets from prev snippets
# (but we ignore the label feature, otherwise there is nothing to predict)
pairs = list(zip(target['sentence1'], target['sentence2']))

scores = cross_encoder.predict(pairs)
```

The `predict` method returns a set of similarity scores, which we can append to the target data and use to train our bi-encoder.

```python
import pandas as pd

# store everything in a pandas DataFrame
target = pd.DataFrame({
    'sentence1': target['sentence1'],
    'sentence2': target['sentence2'],
    'label': scores.tolist()  # cross encoder predicted labels
})
# and save to file
target.to_csv('target_data.tsv', sep='\t', index=False)
```

### Training the Bi-Encoder

The final step in the training process is training the bi-encoder/sentence transformer itself. Everything we’ve done so far has been to label the target dataset.

Now that we have the dataset, we first need to reformat it using `InputExample` objects and a `DataLoader` as before.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

# create list of InputExamples
train = []
for i, row in target.iterrows():
    train.append(InputExample(
        texts=[row['sentence1'], row['sentence2']],
        label=float(row['label'])
    ))
# and place in PyTorch DataLoader
loader = DataLoader(
    train, shuffle=True, batch_size=BATCH_SIZE  # e.g. BATCH_SIZE = 16 as before
)
```

Then we initialize the bi-encoder. We will be using a pretrained `bert-base-uncased` model from [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) followed by a mean pooling layer to transform word-level embeddings to sentence embeddings.

```python
from sentence_transformers import models, SentenceTransformer

# initialize model
bert = models.Transformer('bert-base-uncased')
# and mean pooling layer
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)
# then place them together
model = SentenceTransformer(modules=[bert, pooler])
```

The labels output by our cross-encoder are continuous values in the range 0.0 to 1.0, which means we can use a loss function like `CosineSimilarityLoss`. Then we're ready to train our model as we have done before.

```python
from sentence_transformers import losses

# setup loss function
loss = losses.CosineSimilarityLoss(model=model)

# and training
epochs = 1
# warmup for first 30% of training steps (test diff values here)
warmup_steps = int(len(loader) * epochs * 0.3)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='bert-target'
)
```

### Evaluation and Augmentation

At this point, we can evaluate the bi-encoder model's performance on a validation set of the target data. We use the `EmbeddingSimilarityEvaluator` to measure how closely the predicted and true labels correlate ([script here](https://gist.github.com/jamescalam/64e38a2a8e84db61e5739f9fe41c12f2)).

![Bi-encoder correlation scores. Many reach very close to the equivalent cross-encoder performance (and some even exceed it). Yellow highlights indicate an improved performance after augmentation via random sampling.](https://cdn.sanity.io/images/vr8gru94/production/14de6e6c11c94c4171b23c11aa9dfe66fa7a7480-1920x1080.png)


The first bi-encoder results are reasonable, with most scoring higher than the BERT benchmark. Highlighted results show the original score (in the center) followed by scores _after_ augmenting target datasets with random sampling. Where data augmentation showed little-to-no improvement, scores were excluded.

One reason we might see improvement is quite simple. Bi-encoders require relatively large training sets. Our datasets are all tiny, except for QQP (which does produce a 72% correlation score in `bert-Smedqp-Tqqp`). Augmented datasets help us satisfy the data-hungry nature of bi-encoder training.

Fortunately, we already set up most of what we needed to _augment_ our target datasets. We have the cross-encoders for labeling, and all that is left is to generate new pairs.

As covered in our [in-domain AugSBERT article](https://www.pinecone.io/learn/series/nlp/data-augmentation/), we can generate new pairs with _random sampling_. All this means is that we create new sentence pairs by mixing-and-matching sentences from features A and B.
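
A minimal random-sampling sketch, using a hypothetical handful of target pairs in a pandas DataFrame like the one saved earlier:

```python
import random
from itertools import product

import pandas as pd

random.seed(0)

# hypothetical unlabeled target-domain pairs
target = pd.DataFrame({
    'sentence1': ['how do I sort a list?', 'what is a pointer?', 'how to read a file?'],
    'sentence2': ['sorting lists in python', 'pointers explained', 'reading files'],
})

# all mix-and-match combinations of the two features
existing = set(zip(target['sentence1'], target['sentence2']))
candidates = [
    pair for pair in product(target['sentence1'], target['sentence2'])
    if pair not in existing  # keep only genuinely new pairs
]
# randomly sample new pairs, ready for cross-encoder labeling
new_pairs = random.sample(candidates, k=5)
print(len(new_pairs))  # 5
```

With the real datasets, we would sample far more pairs, since the number of possible combinations grows quadratically with dataset size.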

After generating these new pairs, we score them using the relevant cross-encoder. And [like magic](https://gist.github.com/jamescalam/062673282c2a8da13e8084bb7a5bbb35), we have thousands of new samples to train our bi-encoders with.

With or without random sampling, we can see results that align with the performance of our cross-encoder models, which is precisely what we would expect. This similarity in results means that the knowledge from our cross-encoders is being distilled successfully into our _faster_ bi-encoder models.

That is it for the Augmented SBERT training strategy and its application to domain transfer. Effective domain transfer allows us to broaden the horizon of sentence transformer use across many more domains.

The most common blocker for new language tools that rely on BERT or other transformer models is a lack of data. We do not eliminate the problem entirely using this technique, but we can reduce it.

Given a new domain that is not _too far_ from the domain of existing datasets, we can now build better-performing sentence transformers. Sometimes in the range of just a few percentage point improvements, and at other times, we see much more significant gains.

Thanks to AugSBERT, we can now tackle a few of those previously inaccessible domains.

## References

[1] D. Shah, et al., [Adversarial Domain Adaption for Duplicate Question Detection](https://aclanthology.org/D18-1131/) (2018), EMNLP Proc.

[2] N. Thakur, et al., [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (2021), NAACL