# Tomayto, Tomahto, Transformer: Multilingual Sentence Transformers

We’ve learned about how [sentence transformers](https://www.pinecone.io/learn/series/nlp/sentence-embeddings/) can be used to create high-quality [vector representations](https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/) of text. We can then use these vectors to find similar vectors, which can be used for many applications such as semantic search or topic modeling.

These models are _very_ good at producing meaningful, information-dense vectors. But they don’t allow us to compare sentences across different languages.

Often this may not be a problem. However, the world is becoming increasingly interconnected, and many companies span across multiple borders and languages. Naturally, there is a need for sentence vectors that are language agnostic.

Unfortunately, very few textual similarity datasets span multiple languages, particularly for less common languages. And the standard training methods used for sentence transformers would require these types of datasets.

Different approaches need to be used. Fortunately, some techniques allow us to extend models to other languages using more easily obtained language translations.

In this article, we will cover how multilingual models work and are built. We’ll learn how to develop our own multilingual sentence transformers, the datasets to look for, and how to use high-performing pretrained multilingual models.

[Video](https://www.youtube.com/watch?v=NNS5pOpjvAQ)


## Multilingual Models

By using multilingual sentence transformers, we can map similar sentences from different languages to similar vectors in a shared vector space.

If we took the sentence `"I love plants"` and the Italian equivalent `"amo le piante"`, the ideal multilingual sentence transformer would view both of these as exactly the same.

![A multilingual model will map sentences from different languages into the same vector space. Similar sentences to similar vectors, creating ‘language-agnostic’ vectors (as highlighted).](https://cdn.sanity.io/images/vr8gru94/production/00473dc2f562de8a703c226dd05bcff08e38d7a3-1920x960.png)


The model should identify `"mi piacciono le piante"` (_I like plants_) as more similar to `"I love plants"` than `"ho un cane arancione"` (_I have an orange dog_).
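To make "more similar" concrete, here is a minimal sketch of ranking sentences by cosine similarity. The three-dimensional vectors are made up for illustration and are not real model output; a real sentence transformer would typically produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (NOT produced by any model), chosen so that the
# two plant sentences point in a similar direction.
toy_vectors = {
    "I love plants": [0.9, 0.1, 0.2],
    "mi piacciono le piante": [0.8, 0.2, 0.3],
    "ho un cane arancione": [0.1, 0.9, 0.4],
}

query = "I love plants"
# Rank the other sentences by similarity to the query, most similar first.
ranked = sorted(
    (s for s in toy_vectors if s != query),
    key=lambda s: cosine_similarity(toy_vectors[query], toy_vectors[s]),
    reverse=True,
)
print(ranked)
```

With these toy values, the Italian plant sentence ranks above the sentence about the dog, which is the behavior we want from a multilingual model.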

Why would we need a model like this? For any scenario where we would normally apply sentence transformers: identifying similar documents, finding plagiarism, topic modeling, and so on. The difference is that these can now be used across borders or extended to previously inaccessible populations.

The lack of suitable datasets means that many languages have limited access to language models. By starting with existing, high-performance monolingual models trained in high-resource languages (such as English), we can use multilingual training techniques to extend the performance of these models to other languages using significantly less data.

## Training approaches

Typical training methods for sentence transformer models use some sort of contrastive training function. Given a high similarity sentence pair, models are optimized to produce high similarity sentence vectors.

Training data for this is not hard to come by as long as you stick to common languages, mainly English. But it can be hard to find data like this in other languages.

Both approaches below rely _in part or in full_ on having translation pairs rather than similarity pairs, which are easier to find. There are _many_ materials in the world that have been translated, but far fewer that compare similar same-language sentences.

### Translation-based Bridge Tasks

Using a multi-task training setup, we train on two alternate datasets:

1. An English dataset containing (question, answer) or (anchor, positive) pairs, where anchor-positive means two high-similarity sentences.
2. _Parallel data_ containing cross-language pairs (English_sentence, German_sentence).

The idea here is that the model learns monolingual sentence-pair relationships via a (more common) source language dataset, then learns how to translate that knowledge into a multilingual scope using _parallel data_ [2].

This approach works, but we have chosen to focus on the _next_ multilingual training approach for a few reasons:

- The amount of training data required is high. The multilingual universal sentence encoder (mUSE) model was trained on over a billion sentence pairs [3].
- It uses a multi-task dual-encoder architecture. Optimizing two models in parallel is harder as both training tasks must be balanced (optimizing one is hard enough…).
- Results can be mediocre without the use of hard negatives [1]. Hard negatives are sentences that _seem similar_ (often on a related topic) but are irrelevant to, or contradict, the _anchor_ sentence. Because they’re _harder_ for a model to identify as dissimilar, training on them makes the model better.

Let’s move on to our preferred approach and the focus of the remainder of the article.

### Multilingual Knowledge Distillation

Another approach is to use **multilingual knowledge distillation** — a more recent method introduced by Nils Reimers and Iryna Gurevych in 2020 [1]. With this, we use two models during fine-tuning, the _teacher_ and _student_ models.

The teacher model is an already fine-tuned sentence transformer used for creating embeddings in a single language (most likely English). The student model is a transformer that has been _pretrained_ on a multilingual corpus.

---

_There are two stages to training a transformer model. Pretraining refers to the initial training of the core model using techniques such as masked-language modeling (MLM), producing a ‘language engine’. Fine-tuning comes after — where the core model is trained for a specific task like semantic similarity, Q&A, or classification._

_However, it is also common to refer to previously fine-tuned models as pretrained._

---

We then need a parallel dataset (translation pairs) containing translations of our sentences. These translation pairs are fed into the teacher and student models.

![Chart showing the flow of information from parallel pairs through the teacher and student models and the optimization performed using MSE loss. Adapted from [1].](https://cdn.sanity.io/images/vr8gru94/production/2d10acf40972b49d6da17a3bba2ca7a04387fb59-1920x640.png)


Let’s assume we have English-Italian pairs. The English sentence is fed into our teacher and student models, producing two English sentence vectors. Then we feed the Italian sentence into the student model. We calculate the mean squared error (MSE) loss between the teacher vector and each of the two student vectors. The student model is optimized using this loss.

The student model will learn to mimic the monolingual teacher model — but for multiple languages.
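The loss described above can be sketched in plain Python. This is an illustration under simple assumptions rather than the library's implementation: the function names are hypothetical, the vectors are toy values, and the two MSE terms are summed with equal weight.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Pull both student vectors toward the teacher's source-sentence vector.

    teacher_src: teacher embedding of the source (e.g. English) sentence
    student_src: student embedding of the same source sentence
    student_tgt: student embedding of the translation (e.g. Italian)
    """
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)


# Toy 3-dimensional vectors, for illustration only.
teacher_en = [1.0, 0.0, 2.0]
student_en = [1.0, 0.0, 2.0]  # already matches the teacher
student_it = [0.0, 0.0, 2.0]  # differs by 1.0 in the first dimension

loss = distillation_loss(teacher_en, student_en, student_it)
print(loss)
```

The loss only reaches zero when the student produces the teacher's vector for both the source sentence and its translation, which is what pushes translations toward the same point in vector space.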

Using multilingual knowledge distillation is an excellent way to extend language options using already trained models. It requires much less data than training from scratch, and the data it uses is widely available — translated pairs of sentences.

## Fine-tuning with Multilingual Sentence Transformers

The final question is, how do we build one of these models? We covered multilingual knowledge distillation _conceptually_, but translating concepts into code is never as straightforward as it seems.

Luckily for us, the `sentence-transformers` library makes this process _much_ easier. Let’s see how we can use the library to build our very own multilingual models.

### Data Preparation

As always, we start with data. We need a data source that contains multilingual pairs, split into our _source_ language and _target_ language(s).

Note that we wrote language(s) — we can fine-tune a model on _many_ languages. In fact, some of the multilingual models in [sentence-transformers](https://sbert.net/docs/pretrained_models.html#multi-lingual-models) support more than _50_ languages. All of these are trained with multilingual knowledge distillation.

In the paper from Reimers and Gurevych, one dataset uses translated subtitles from thousands of TED talks. These subtitles also cover a wide range of languages (as we will see). We can access a similar dataset using Hugging Face `datasets`.

```json
{
  "_key": "0de2a6a3f36f",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"source\": [\n    \"import datasets\\n\",\n    \"\\n\",\n    \"ted = datasets.load_dataset('ted_multi', split='train')\\n\",\n    \"ted\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stderr\",\n     \"text\": [\n      \"Reusing dataset ted_multi_translate\\n\"\n     ]\n    },\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['translations', 'talk_name'],\\n\",\n       \"    num_rows: 258098\\n\",\n       \"})\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 1\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

This dataset contains a list of language labels, the translated sentences, and the talk they came from. We only really care about the labels and sentences.

```json
{
  "_key": "a5b3ef446380",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"source\": [\n    \"ted[0]\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"{'translations': {'language': ['ar',\\n\",\n       \"   'bg',\\n\",\n       \"   'de',\\n\",\n       \"   '...\\n\",\n       \"   'zh-cn',\\n\",\n       \"   'zh-tw'],\\n\",\n       \"  'translation': ['من ضمن جميع المثبطات المقلقة التي نعاني منها اليوم نفكر في ',\\n\",\n       \"   'Наред с всички обезпокоителни дефицити',\\n\",\n       \"   'Unter den schwierigen Problemen',\\n\",\n       \"   '...\\n\",\n       \"   '当今我们与之斗争的所有不足中',\\n\",\n       \"   '在所有今日世人仍然必需去努力實現的種種令人憂心的缺點之中']},\\n\",\n       \" 'talk_name': 'jonas_gahr_store_in_defense_of_dialogue'}\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 2\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We need to transform this dataset into a friendlier format. The data we feed into training will consist of nothing more than pairs of _source_ sentences and their respective _translations_.

To create this format, we need to use the language labels to (1) identify the position of our _source_ sentence and (2) extract translations for the languages we want to fine-tune on. That will look something like this:

```json
{
  "_key": "2d58df10d5bf",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"source\": [\n    \"# get the index\\n\",\n    \"idx = ted[0]['translations']['language'].index('en')\\n\",\n    \"idx\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"4\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 3\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"source\": [\n    \"# use the index to get the corresponding translation\\n\",\n    \"source = ted[0]['translations']['translation'][idx]\\n\",\n    \"source\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"'Amongst all the troubling deficits we struggle with today — we think of financial and economic primarily — the ones that concern me most is the deficit of political dialogue — our ability to address modern conflicts as they are , to go to the source of what they &apos;re all about and to understand the key players and to deal with them .'\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 4\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"source\": [\n    \"# use that info to create all (source, translation) pairs\\n\",\n    \"pairs = []\\n\",\n    \"for i, translation in enumerate(ted[0]['translations']['translation']):\\n\",\n    \"    # we don't want to use the source language (English) as a translation\\n\",\n    \"    if i != idx:\\n\",\n    \"        pairs.append((source, translation))\\n\",\n    \"\\n\",\n    \"# let's see what we have\\n\",\n    \"pairs[0]\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"('Amongst all the troubling deficits we struggle with today — we think of financial 
and economic primarily — the ones that concern me most is the deficit of political dialogue — our ability to address modern conflicts as they are , to go to the source of what they &apos;re all about and to understand the key players and to deal with them .',\\n\",\n       \" 'من ضمن جميع المثبطات المقلقة التي نعاني منها اليوم نفكر في المقام الاول في الامور المالية والاقتصادية واكثر ما يهمني بشكل اكثر هو عجز الحوار السياسي — قدرتنا على فهم الصراعات الحديثة على ماهي عليه , بالذهاب الى اصلها الفعلي وعلى فهم اللاعبين الرئيسيين وعلى التعامل معهم')\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 5\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Here we returned _27_ pairs from a single row of data. We don’t _have_ to limit the languages we fine-tune on. Still, unless you plan on using and evaluating every possible language, it’s likely a good idea to restrict the range.

We will use English `en` as our source language. For target languages, we will use Italian `it`, Spanish `es`, Arabic `ar`, French `fr`, and German `de`. _These are ISO language codes, which you can find_ [here](http://www.mathguide.de/info/tools/languagecode.html).

Later we will be using a `ParallelSentencesDataset` class, which expects the two sentences in each pair to be separated by a tab character `\t`, with each language pair kept in its own dataset, so we build the data that way.

```json
{
  "_key": "49d2b243dd55",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"source\": [\n    \"from sentence_transformers import InputExample\\n\",\n    \"from tqdm.auto import tqdm  # so we see progress bar\\n\",\n    \"\\n\",\n    \"# initialize list of languages to keep\\n\",\n    \"lang_list = ['it', 'es', 'ar', 'fr', 'de']\\n\",\n    \"# create dict to store our pairs\\n\",\n    \"train_samples = {f'en-{lang}': [] for lang in lang_list}\\n\",\n    \"\\n\",\n    \"# now build our training samples list\\n\",\n    \"for row in tqdm(ted):\\n\",\n    \"    # get source (English)\\n\",\n    \"    idx = row['translations']['language'].index('en')\\n\",\n    \"    source = row['translations']['translation'][idx].strip()\\n\",\n    \"    # loop through translations\\n\",\n    \"    for i, lang in enumerate(row['translations']['language']):\\n\",\n    \"        # check if lang is in lang list\\n\",\n    \"        if lang in lang_list:\\n\",\n    \"            translation = row['translations']['translation'][i].strip()\\n\",\n    \"            train_samples[f'en-{lang}'].append(\\n\",\n    \"                source+'\\\\t'+translation\\n\",\n    \"            )\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stderr\",\n     \"text\": [\n      \"100%|██████████| 258098/258098 [00:22<00:00, 11477.42it/s]\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"en-it: 204503\\nen-es: 196026\\nen-ar: 214111\\nen-fr: 192304\\nen-de: 167888\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# how many pairs for each language?\\n\",\n    \"for lang_pair in train_samples.keys():\\n\",\n    \"    print(f'{lang_pair}: {len(train_samples[lang_pair])}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 9,\n   \"source\": [\n    \"source+'\\\\t'+translation\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"'( Applause )\\\\t( Applausi )'\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 9\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Hopefully, all TED talk subtitles end with `'( Applause )'`. With that, let’s save our training data to file ready for the `ParallelSentencesDataset` class to pick it up again later.

```json
{
  "_key": "d14f9f6d1ac9",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import gzip\\n\",\n    \"import os\\n\",\n    \"\\n\",\n    \"if not os.path.exists('./data'):\\n\",\n    \"    os.mkdir('./data')\\n\",\n    \"\\n\",\n    \"# save to file, sentence transformers reader will expect tsv.gz file\\n\",\n    \"for lang_pair in train_samples.keys():\\n\",\n    \"    with gzip.open(f'./data/ted-train-{lang_pair}.tsv.gz', 'wt', encoding='utf-8') as f:\\n\",\n    \"        f.write('\\\\n'.join(train_samples[lang_pair]))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
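As a quick sanity check of the file format, the sketch below writes a couple of tab-separated pairs to a gzipped TSV file and reads them back. It uses a temporary directory instead of `./data`, and the second pair is a made-up example rather than a row from the TED dataset.

```python
import gzip
import os
import tempfile

# One real-looking pair from above plus one hypothetical pair for illustration.
pairs = ["( Applause )\t( Applausi )", "Thank you .\tGrazie ."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ted-train-en-it.tsv.gz")

    # Write in text mode ('wt') so we can pass str rather than bytes.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(pairs))

    # Read back and split each line into (source, translation).
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = [line.rstrip("\n").split("\t") for line in f]

# Every line should contain exactly two columns.
print(all(len(row) == 2 for row in loaded))
```

A line that contains a stray tab inside a sentence would split into more than two columns, so a check like this can catch malformed rows before training.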

That’s it for data preparation. Now let’s move on to set everything up for fine-tuning.

### Set-up and Training

Before training, we need _four_ things:

- Our `teacher` model.
- The new `student` model.
- A loaded `DataLoader` to feed the _(source, translation)_ pairs into our model during training.
- The loss function.

Let’s start with our _teacher_ and _student_ models.

#### Model Selection

We already know we need a _teacher_ and a _student_, but how do we choose a _teacher_ and _student_? Well, the _teacher_ must be a competent model in producing sentence embeddings, just as we’d like our teachers to be competent in the topic they are teaching us.

The ideal student can take what the teacher teaches and extend that knowledge beyond the teacher’s capabilities. We want the same from our student model. That means that it must be capable of functioning with different languages.

[Video](https://d33wubrfki0l68.cloudfront.net/50c8187c5d8c96076d045688b9a224fd034d44f4/89a30/images/multilingual-transformers-3.mp4)


Not all models can do this, and of the models that can — some are better than others.

The first check for a capable student model is its tokenizer. Can the student’s tokenizer deal with a variety of languages?

BERT uses a WordPiece tokenizer. That means that it encodes either word-level or sub-word-level chunks of text. The vocabulary of a pretrained BERT tokenizer is already set and limited to (mostly) English tokens. If we begin introducing unrecognizable words/word pieces, the tokenizer will convert them into ‘unknown’ tokens or small character sequences.

When BERT sees the occasional unknown token, it’s not a problem. But if we feed many unknowns to BERT — it becomes unmanageable. If I gave you the sentence:

`I went to the [UNK] today to buy some milk.`

You could probably fill in the ‘unknown’ `[UNK]` with an accurate guess of ‘shop’ or ‘store’. What if I gave you:

`[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]`

Can you fill in the blanks? In this sentence, I said _“I went for a walk in the forest yesterday”_. If you guessed correctly, well done! If not, well, that’s to be expected.

BERT works in the same way. It can fill in the occasional blank, but too many, and the task is unsolvable. Let’s take a look at how BERT’s tokenizer copes with different languages.

```json
{
  "_key": "8729c2bd83ff",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 28,\n   \"source\": [\n    \"from transformers import BertTokenizer\\n\",\n    \"\\n\",\n    \"bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"source\": [\n    \"sentences = [\\n\",\n    \"    'we will include several languages',\\n\",\n    \"    '一些中文单词',\\n\",\n    \"    'το ελληνικό αλφάβητο είναι πολύ ωραίο',\\n\",\n    \"    'ჩვენ გვაქვს ქართული'\\n\",\n    \"]\\n\",\n    \"\\n\",\n    \"for text in sentences:\\n\",\n    \"    print(bert_tokenizer.tokenize(text))\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"['we', 'will', 'include', 'several', 'languages']\\n['一', '[UNK]', '中', '文', '[UNK]', '[UNK]']\\n['τ', '##ο', 'ε', '##λ', '##λ', '##η', '##ν', '##ι', '##κ', '##ο', 'α', '##λ', '##φ', '##α', '##β', '##η', '##τ', '##ο', 'ε', '##ι', '##ν', '##α', '##ι', 'π', '##ο', '##λ', '##υ', 'ω', '##ρ', '##α', '##ι', '##ο']\\n['[UNK]', '[UNK]', '[UNK]']\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

The tokenizer misses most of our Chinese text and all of the Georgian text. Greek is split into character-level tokens; because BERT accepts at most 512 tokens per input, this limits input sequences to roughly 512 characters. Additionally, character-level tokens carry limited meaning.
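To see why a fixed vocabulary produces these results, here is a simplified sketch of WordPiece-style greedy longest-match tokenization. It is not the Hugging Face implementation; the function name and the tiny vocabulary are assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces.

    If any part of the word cannot be matched, the whole word becomes unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary containing only a few English pieces.
toy_vocab = {"we", "will", "lang", "##uage", "##s"}

tokens = []
for word in "we will languages ქართული".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

Known words are split into pieces from the vocabulary, while the Georgian word has no matching pieces at all and collapses into a single unknown token, mirroring the tokenizer output above.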

A BERT tokenizer is therefore not ideal. There is another transformer model built for multilingual comprehension called XLM-RoBERTa (XLMR).

XLMR uses a _SentencePiece_-based tokenizer with a vocabulary of 250K tokens. This means XLMR already _knows_ many more words/characters than BERT. _SentencePiece_ also handles new languages much better thanks to language-agnostic preprocessing (it treats all sentences as sequences of Unicode characters) [4].

```json
{
  "_key": "f674cc4afcb0",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 30,\n   \"source\": [\n    \"from transformers import XLMRobertaTokenizer\\n\",\n    \"\\n\",\n    \"xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"source\": [\n    \"for text in sentences:\\n\",\n    \"    print(xlmr_tokenizer.tokenize(text))\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"['▁we', '▁will', '▁include', '▁several', '▁language', 's']\\n['▁', '一些', '中文', '单', '词']\\n['▁το', '▁ελληνικό', '▁αλ', 'φά', 'βη', 'το', '▁είναι', '▁πολύ', '▁ωραίο']\\n['▁ჩვენ', '▁გვაქვს', '▁ქართული']\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We can see straight away that our XLMR tokenizer handles these other languages _much_ better. Naturally, we’ll use XLMR as our student.

The student model will be initialized from Hugging Face’s `transformers`. It has _not_ been fine-tuned to produce sentence vectors, so we need to add a _mean pooling_ layer that converts the (up to) 512 token vectors into a single sentence vector.
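Conceptually, mean pooling averages the token vectors while skipping padding positions. The sketch below shows the idea in plain Python with toy two-dimensional vectors; the function name is hypothetical and this is not the library's own pooling code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [t / count for t in total]


# Two real tokens followed by one padding token (toy values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Ignoring masked positions matters because padding would otherwise drag the average toward zero for short sentences.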

To put these two components together, we will use the `Transformer` and `Pooling` modules from `sentence-transformers`.

```json
{
  "_key": "3c01c19c4431",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 32,\n   \"source\": [\n    \"from sentence_transformers import SentenceTransformer, models\\n\",\n    \"\\n\",\n    \"xlmr = models.Transformer('xlm-roberta-base')\\n\",\n    \"pooler = models.Pooling(\\n\",\n    \"    xlmr.get_word_embedding_dimension(),\\n\",\n    \"    pooling_mode_mean_tokens=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"student = SentenceTransformer(modules=[xlmr, pooler])\\n\",\n    \"student\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"SentenceTransformer(\\n\",\n       \"  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel \\n\",\n       \"  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\\n\",\n       \")\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 32\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

That’s our student. Our teacher must be an already fine-tuned monolingual sentence transformer model. We could try the `all-mpnet-base-v2` model:

```json
{
  "_key": "56db462e354c",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"SentenceTransformer(\\n\",\n       \"  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel \\n\",\n       \"  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\\n\",\n       \"  (2): Normalize()\\n\",\n       \")\"\n      ]\n     },\n     \"execution_count\": 31,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from sentence_transformers import SentenceTransformer\\n\",\n    \"\\n\",\n    \"teacher = SentenceTransformer('all-mpnet-base-v2')\\n\",\n    \"teacher\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

But here, there is a final _normalization_ layer. We need to avoid outputting normalized embeddings for our student to mimic. So we either remove that normalization layer or use a model without it. The `paraphrase` models do _not_ use normalization. We’ll use one of those.

```json
{
  "_key": "f46001721538",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"SentenceTransformer(\\n\",\n       \"  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel \\n\",\n       \"  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\\n\",\n       \")\"\n      ]\n     },\n     \"execution_count\": 31,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')\\n\",\n    \"teacher\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
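To make the effect of that final layer concrete, here is a minimal pure-Python sketch of L2 normalization. The helper `l2_normalize` and the two vectors are made up for illustration and are not part of `sentence-transformers`. It shows that normalization discards magnitude, so two vectors pointing in the same direction become indistinguishable.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, which is what a normalization layer does."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave the zero vector unchanged
    return [x / norm for x in vec]


# Two made-up embeddings that point the same way but differ in magnitude.
a = [3.0, 4.0]
b = [6.0, 8.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

print(a_unit)  # [0.6, 0.8]
print(b_unit)  # [0.6, 0.8]
```

A teacher without this layer, such as the `paraphrase` model loaded above, hands the student the raw pooled vectors instead.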

And with that, we’re ready to set everything up for fine-tuning.

#### Fine-Tuning

For fine-tuning, we now need to initialize our data loader and loss function. Starting with the data loader, we first need to initialize a `ParallelSentencesDataset` object.

```json
{
  "_key": "a627949f0b7b",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 33,\n   \"source\": [\n    \"from sentence_transformers import ParallelSentencesDataset\\n\",\n    \"\\n\",\n    \"data = ParallelSentencesDataset(student_model=student, teacher_model=teacher, batch_size=32, use_embedding_cache=True)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

And once we have initialized the dataset object, we load in our data.

```json
{
  "_key": "c5ca9f9d0c76",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"source\": [\n    \"max_sentences_per_language = 500000\\n\",\n    \"train_max_sentence_length = 250 # max num of characters per sentence\\n\",\n    \"\\n\",\n    \"train_files = [f for f in os.listdir('./data') if 'train' in f]\\n\",\n    \"for f in train_files:\\n\",\n    \"    print(f)\\n\",\n    \"    data.load_data('./data/'+f, max_sentences=max_sentences_per_language, max_sentence_length=train_max_sentence_length)\"\n   ],\n   \"cell_type\": \"code\",\n   \"metadata\": {},\n   \"execution_count\": 34,\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"ted-train-en-ar.tsv.gz\\n\",\n      \"ted-train-en-de.tsv.gz\\n\",\n      \"ted-train-en-es.tsv.gz\\n\",\n      \"ted-train-en-fr.tsv.gz\\n\",\n      \"ted-train-en-it.tsv.gz\\n\"\n     ]\n    }\n   ]\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
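The two limits passed to `load_data` can be pictured with a small stand-alone sketch. The helper `filter_parallel_lines` below is hypothetical and only approximates the kind of filtering those arguments suggest (a cap on the number of rows and a character limit per sentence). It is not the library's implementation, and the sample rows are invented.

```python
def filter_parallel_lines(lines, max_sentences, max_sentence_length):
    """Keep tab-separated sentence pairs whose sentences all fit the character limit."""
    kept = []
    for line in lines:
        sentences = line.strip().split("\t")
        if len(sentences) < 2:
            continue  # not a parallel pair
        if max(len(s) for s in sentences) > max_sentence_length:
            continue  # at least one sentence is too long
        kept.append(sentences)
        if len(kept) >= max_sentences:
            break  # reached the per-file cap
    return kept


# Invented rows in "source<TAB>translation" form.
rows = [
    "I love plants\tamo le piante",
    "x" * 300 + "\ttroppo lungo",
    "Good morning\tbuongiorno",
    "Thank you\tgrazie",
]

pairs = filter_parallel_lines(rows, max_sentences=2, max_sentence_length=250)
print(pairs)
```

Here the 300-character row is skipped and the last row is never reached because the cap of two pairs has already been hit.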

With our dataset ready, all we do is pass it to a PyTorch data loader.

```json
{
  "_key": "c7692dcaa63d",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 41,\n   \"source\": [\n    \"from torch.utils.data import DataLoader\\n\",\n    \"\\n\",\n    \"loader = DataLoader(data, shuffle=True, batch_size=32)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

The final thing we need for fine-tuning is our loss function. As we saw before, we will be calculating the MSE loss, which we initialize like so:

```json
{
  "_key": "16602a99d0ba",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 42,\n   \"source\": [\n    \"from sentence_transformers import losses\\n\",\n    \"\\n\",\n    \"loss = losses.MSELoss(model=student)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
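As a rough illustration of what this loss measures, the sketch below computes the mean squared error by hand. The function `mse` and the 4-dimensional vectors are made up for this example. The assumption, following the distillation setup described earlier, is that the student's embedding of the English sentence and of its translation are both compared against the teacher's embedding of the English sentence.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# Made-up embeddings, for illustration only.
teacher_en = [0.5, -1.0, 2.0, 0.0]  # stands in for teacher("I love plants")
student_en = [0.5, -1.0, 1.0, 0.0]  # stands in for student("I love plants")
student_it = [1.5, -1.0, 2.0, 1.0]  # stands in for student("amo le piante")

loss_en = mse(student_en, teacher_en)
loss_it = mse(student_it, teacher_en)

print(loss_en, loss_it)  # 0.25 0.5
```

Driving both values toward zero pulls the student's English and Italian vectors toward the same teacher vector.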

It’s that simple! Now we’re onto the fine-tuning itself. As usual with `sentence-transformers` we call the `.fit` method on our student model.

```json
{
  "_key": "d52b80d3a7e0",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 43,\n   \"source\": [\n    \"from sentence_transformers import evaluation\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"epochs = 1\\n\",\n    \"warmup_steps = int(len(loader) * epochs * 0.1)\\n\",\n    \"\\n\",\n    \"student.fit(\\n\",\n    \"    train_objectives=[(loader, loss)],\\n\",\n    \"    epochs=epochs,\\n\",\n    \"    warmup_steps=warmup_steps,\\n\",\n    \"    output_path='./xlmr-ted',\\n\",\n    \"    optimizer_params={'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False},\\n\",\n    \"    save_best_model=True,\\n\",\n    \"    show_progress_bar=False\\n\",\n    \")\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8-final\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```
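The `warmup_steps` arithmetic in the call above can be checked with a short sketch. The loader length of 15,625 batches is a hypothetical figure chosen for the example, and `linear_warmup_factor` assumes a simple linear ramp during warm-up. Whatever the scheduler does after warm-up is not modeled here.

```python
def warmup_steps_for(num_batches, epochs, fraction=0.1):
    """Number of warm-up steps as a fraction of all training steps."""
    return int(num_batches * epochs * fraction)


def linear_warmup_factor(step, warmup_steps):
    """Learning-rate multiplier that ramps linearly from 0 to 1 during warm-up."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


num_batches = 15625  # hypothetical value of len(loader)
epochs = 1
base_lr = 2e-5       # the 'lr' value passed in optimizer_params

warmup_steps = warmup_steps_for(num_batches, epochs)
print(warmup_steps)  # 1562

# Learning rate halfway through warm-up and once warm-up has finished.
lr_halfway = base_lr * linear_warmup_factor(781, warmup_steps)
lr_after = base_lr * linear_warmup_factor(warmup_steps, warmup_steps)
print(lr_halfway, lr_after)
```

With these numbers, the first 10% of steps run at a reduced learning rate before the full `2e-5` is reached.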

And we wait. Once fine-tuning is complete, we find the new model in the `./xlmr-ted` directory. The model can be loaded using the `SentenceTransformer` class as we would any other sentence transformer.

It would be helpful to understand how our model is performing, so let’s take a look at model evaluation.

### Evaluation

To evaluate our model, we need a multilingual textual similarity dataset. That is, a dataset containing multilingual sentence pairs and their respective similarity scores. A great one is the Semantic Textual Similarity benchmark (STSb) multilingual dataset.

We can find this dataset on Hugging Face `datasets`, named `stsb_multi_mt`. It includes _a lot_ of different languages, but we will stick to evaluating two, English and Italian. First, we download the test split for both.

```json
{
  "_key": "c11ab6412859",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stderr\",\n     \"text\": [\n      \"Reusing dataset stsb_multi_mt\\n\"\n     ]\n    },\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['sentence1', 'sentence2', 'similarity_score'],\\n\",\n       \"    num_rows: 1379\\n\",\n       \"})\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 1\n    }\n   ],\n   \"source\": [\n    \"import datasets\\n\",\n    \"\\n\",\n    \"en = datasets.load_dataset('stsb_multi_mt', 'en', split='test')\\n\",\n    \"en\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stderr\",\n     \"text\": [\n      \"Reusing dataset stsb_multi_mt\\n\"\n     ]\n    },\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['sentence1', 'sentence2', 'similarity_score'],\\n\",\n       \"    num_rows: 1379\\n\",\n       \"})\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 2\n    }\n   ],\n   \"source\": [\n    \"it = datasets.load_dataset('stsb_multi_mt', 'it', split='test')\\n\",\n    \"it\"\n   
]\n  }\n ]\n}"
}
```

Each row of one language set aligns with the same row in the other language sets, meaning _sentence1_ in row _0_ of the Italian dataset is the translation of _sentence1_ in row _0_ of the English dataset.

```json
{
  "_key": "6d481a2d2f54",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"{'sentence1': 'A girl is styling her hair.',\\n\",\n       \" 'sentence2': 'A girl is brushing her hair.',\\n\",\n       \" 'similarity_score': 2.5}\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 3\n    }\n   ],\n   \"source\": [\n    \"en[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"{'sentence1': 'Una ragazza si acconcia i capelli.',\\n\",\n       \" 'sentence2': 'Una ragazza si sta spazzolando i capelli.',\\n\",\n       \" 'similarity_score': 2.5}\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 4\n    }\n   ],\n   \"source\": [\n    \"it[0]\"\n   ]\n  }\n ]\n}"
}
```

Here the Italian dataset _sentence1_ means _‘A girl is styling her hair’_. This alignment also applies to _sentence2_ and the _similarity_score_.

One thing we do need to change in this dataset is the _similarity_score_. When we calculate the _positive_ cosine similarity between sentence vectors, we expect a value from zero (no similarity) to one (exact match). The _similarity_score_ ranges from zero to five, so we must normalize it by dividing by five to bring it into the same range.

```json
{
  "_key": "02247a5248d1",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"{'sentence1': 'A girl is styling her hair.',\\n\",\n       \" 'sentence2': 'A girl is brushing her hair.',\\n\",\n       \" 'similarity_score': 0.5}\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 5\n    }\n   ],\n   \"source\": [\n    \"en = en.map(lambda x: {'similarity_score': x['similarity_score'] / 5.0})\\n\",\n    \"it = it.map(lambda x: {'similarity_score': x['similarity_score'] / 5.0})\\n\",\n    \"\\n\",\n    \"en[0]\"\n   ]\n  }\n ]\n}"
}
```

Before feeding our data into a similarity evaluator, we need to convert each row into an `InputExample`. While we do this, we will also build a cross-lingual English-Italian evaluation set by pairing each English _sentence1_ with the Italian _sentence2_ from the same row.

```json
{
  "_key": "e1a3f07d59e1",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from sentence_transformers import InputExample\\n\",\n    \"\\n\",\n    \"en_samples = []\\n\",\n    \"it_samples = []\\n\",\n    \"en_it_samples = []\\n\",\n    \"\\n\",\n    \"for i in range(len(en)):\\n\",\n    \"    en_samples.append(InputExample(\\n\",\n    \"        texts=[en[i]['sentence1'], en[i]['sentence2']],\\n\",\n    \"        label=en[i]['similarity_score']\\n\",\n    \"    ))\\n\",\n    \"    it_samples.append(InputExample(\\n\",\n    \"        texts=[it[i]['sentence1'], it[i]['sentence2']],\\n\",\n    \"        label=it[i]['similarity_score']\\n\",\n    \"    ))\\n\",\n    \"    en_it_samples.append(InputExample(\\n\",\n    \"        texts=[en[i]['sentence1'], it[i]['sentence2']],\\n\",\n    \"        label=en[i]['similarity_score']\\n\",\n    \"    ))\"\n   ]\n  }\n ]\n}"
}
```

We can use an `EmbeddingSimilarityEvaluator` class to evaluate the performance of our model. First, we need to initialize one of these evaluators for each of our sets.

```json
{
  "_key": "769d401fc9f9",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator\\n\",\n    \"\\n\",\n    \"en_eval = EmbeddingSimilarityEvaluator.from_input_examples(\\n\",\n    \"    en_samples, write_csv=False\\n\",\n    \")\\n\",\n    \"it_eval = EmbeddingSimilarityEvaluator.from_input_examples(\\n\",\n    \"    it_samples, write_csv=False\\n\",\n    \")\\n\",\n    \"en_it_eval = EmbeddingSimilarityEvaluator.from_input_examples(\\n\",\n    \"    en_it_samples, write_csv=False\\n\",\n    \")\"\n   ]\n  }\n ]\n}"
}
```
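To give a feel for the kind of score such an evaluator reports, the sketch below computes a Spearman rank correlation between gold labels and predicted similarities in pure Python. This assumes the evaluator's score is a rank correlation between the model's cosine similarities and the labels. The exact metric can depend on the library version, so treat this as an illustration rather than the library's code. All numbers are invented.

```python
def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gold = [0.5, 0.9, 0.1, 0.7]            # invented normalized labels
same_order = [0.62, 0.95, 0.30, 0.71]  # invented similarities, same ordering
one_swap = [0.62, 0.95, 0.30, 0.40]    # invented similarities, one pair out of order

score_same = spearman(gold, same_order)
score_swap = spearman(gold, one_swap)
print(round(score_same, 3), round(score_swap, 3))  # 1.0 0.8
```

Because only the ordering matters, a model does not need to reproduce the labels exactly to score well. It needs to rank more similar pairs above less similar ones.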

And with that, we just pass our student model through each evaluator to return its performance.

```json
{
  "_key": "b3dce6b631c3",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"0.816026950741276\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 15\n    }\n   ],\n   \"source\": [\n    \"from sentence_transformers import SentenceTransformer\\n\",\n    \"\\n\",\n    \"model = SentenceTransformer('./xlmr-ted')\\n\",\n    \"\\n\",\n    \"en_eval(model)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"0.7425311301081923\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 16\n    }\n   ],\n   \"source\": [\n    \"it_eval(model)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"0.7102280152242499\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 17\n    }\n   ],\n   \"source\": [\n    \"en_it_eval(model)\"\n   ]\n  }\n ]\n}"
}
```

That looks pretty good. Let’s see how it compares to our untrained student.

```json
{
  "_key": "cbd7f1213379",
  "_type": "colabBlock",
  "jsonContent": "{\n \"metadata\": {\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8-final\"\n  },\n  \"orig_nbformat\": 2,\n  \"kernelspec\": {\n   \"name\": \"search\",\n   \"display_name\": \"search\",\n   \"language\": \"python\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2,\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from sentence_transformers import models\\n\",\n    \"\\n\",\n    \"xlmr = models.Transformer('xlm-roberta-base')\\n\",\n    \"pooler = models.Pooling(\\n\",\n    \"    xlmr.get_word_embedding_dimension(),\\n\",\n    \"    pooling_mode_mean_tokens=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"student = SentenceTransformer(modules=[xlmr, pooler])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"0.4752794215862243\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 19\n    }\n   ],\n   \"source\": [\n    \"en_eval(student)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"0.49627607237070986\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 20\n    }\n   ],\n   \"source\": [\n    \"it_eval(student)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n      
 \"0.22941283783717123\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 21\n    }\n   ],\n   \"source\": [\n    \"en_it_eval(student)\"\n   ]\n  }\n ]\n}"
}
```

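Some really great results. Fine-tuning lifted the English score from roughly 0.48 to 0.82, the Italian score from 0.50 to 0.74, and the cross-lingual English-Italian score from 0.23 to 0.71. We can now take the new model and use it with English `en`, Italian `it`, Spanish `es`, Arabic `ar`, French `fr`, and German `de`.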

### Sentence Transformer Models

Fortunately, we rarely need to fine-tune our own model. We can load many high-performing multilingual models as quickly as we initialized our teacher model earlier.

We can find a list of these multilingual models on the [Pretrained Models](https://sbert.net/docs/pretrained_models.html#multi-lingual-models) page of the `sentence-transformers` docs, a few of which support more than 50 languages.

To initialize one of these, all we need is:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
```

And that’s it. Encode your sentences with `model.encode`, and you’re good to go.
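Once sentences are encoded, comparing them across languages comes down to a similarity measure between vectors. The sketch below is a plain-Python cosine similarity. The three short vectors are placeholders invented for the example and merely stand in for what `model.encode` would return (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder vectors, not real model output.
en_vec = [1.0, 2.0, 2.0]      # stands in for model.encode("I love plants")
it_vec = [2.0, 4.0, 4.0]      # stands in for model.encode("amo le piante")
other_vec = [2.0, -2.0, 1.0]  # stands in for an unrelated sentence

sim_translation = cosine_similarity(en_vec, it_vec)
sim_unrelated = cosine_similarity(en_vec, other_vec)
print(sim_translation, sim_unrelated)  # 1.0 0.0
```

With a well-trained multilingual model, a sentence and its translation should score close to the first case, and unrelated sentences much lower.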

---

That’s all for this article on multilingual sentence transformers. We’ve taken a look at the two most common approaches to training multilingual sentence transformers: multi-task translation-based bridging and multilingual knowledge distillation.

From there, we dived into the fine-tuning process of a multilingual model using multilingual knowledge distillation, covering the required data, loss functions, fine-tuning, and evaluation.

We’ve also looked at how to use the existing pretrained multilingual sentence transformers.

## References

[1] N. Reimers, I. Gurevych, [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813#) (2020), EMNLP

[2] M. Chidambaram, et al., [Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model](https://arxiv.org/abs/1810.12836) (2019), RepL4NLP

[3] Y. Yang, et al., [Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307) (2020), ACL

[4] Google, [SentencePiece Repo](https://github.com/google/sentencepiece), GitHub