# Sentence Transformers: Meanings in Disguise

**Once you learn about and generate sentence embeddings, combine them with the** [Pinecone vector database](https://www.pinecone.io/) **to easily build applications like semantic search, deduplication, and multi-modal search.** [Try it now for free.](https://app.pinecone.io/)

Transformers have wholly rebuilt the landscape of natural language processing (NLP). Before transformers, we had _okay_ translation and language classification thanks to recurrent neural nets (RNNs) — their language comprehension was limited and led to many minor mistakes, and coherence over larger chunks of text was practically impossible.

Since the introduction of the first transformer model in the 2017 paper _‘Attention is all you need’_ [1], NLP has moved from RNNs to models like BERT and GPT. These new models can answer questions, write articles _(maybe GPT-3 wrote this)_, enable incredibly intuitive semantic search — and much more.

The funny thing is, for many tasks, the latter parts of these models are the same as those in RNNs — often a couple of feedforward NNs that output model predictions.

It’s the _input_ to these layers that changed. The [dense embeddings](https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/) created by transformer models are so much richer in information that we get massive performance benefits despite using the same final outward layers.

These increasingly rich sentence embeddings can be used to quickly compare sentence similarity for various use cases, such as:

- **Semantic textual similarity (STS)** — comparison of sentence pairs. We may want to identify patterns in datasets, but this is most often used for benchmarking.
- **Semantic search** — information retrieval (IR) using semantic meaning. Given a set of sentences, we can search using a _‘query’_ sentence and identify the most similar records. Enables search to be performed on concepts (rather than specific words).
- **Clustering** — we can cluster our sentences, useful for topic modeling.

In this article, we will explore how these embeddings have been adapted and applied to a range of semantic similarity applications by using a new breed of transformers called _‘sentence transformers’_.

[Video](https://www.youtube.com/watch?v=WS1uVMGhlWQ)


## Some “Context”

Before we dive into sentence transformers, it might help to piece together why transformer embeddings are so much richer — and where the difference lies between a vanilla _transformer_ and a _sentence transformer_.

Transformers are indirect descendants of the previous RNN models. These old recurrent models were typically built from many recurrent _units_ like [LSTMs or GRUs](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21).

In _machine translation_, we would find [encoder-decoder networks](https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/). The first model _encodes_ the original language into a _context vector_, and a second model _decodes_ this into the target language.




![Encoder-decoder architecture with the single context vector shared between the two models, this acts as an information bottleneck as all information must be passed through this point.](https://cdn.sanity.io/images/vr8gru94/production/ece1bcd8e6f31d36805d56f571f10d327a5c5cf1-1920x680.png)


The problem here is that we create an _information bottleneck_ between the two models. We’re creating a massive amount of information over multiple time steps and trying to squeeze it all through a single connection. This limits the encoder-decoder performance because much of the information produced by the encoder is lost before reaching the decoder.

The _attention mechanism_ provided a solution to the bottleneck issue. It offered another route for information to pass through. Still, it didn’t overwhelm the process because it focused _attention_ only on the most relevant information.

By passing a _context vector_ from each timestep into the attention mechanism (producing _annotation_ vectors), the information bottleneck is removed, and there is better information retention across longer sequences.

![Encoder-decoder with the attention mechanism. The attention mechanism considered all encoder output activations and each timestep’s activation in the decoder, which modifies the decoder outputs.](https://cdn.sanity.io/images/vr8gru94/production/bde26e06acb7b1f794e3e6ac5432cbdda7785238-1920x780.png)


During decoding, the model decodes one word/timestep at a time. An alignment (e.g., similarity) between the word and all encoder annotations is calculated for each step.

Higher alignment resulted in a greater weighting of that encoder annotation on the decoder step’s output. In effect, the mechanism calculated which encoder words to pay _attention_ to.



![Attention between an English-French encoder and decoder, source [2].](https://cdn.sanity.io/images/vr8gru94/production/ab5024ccb25d38298b19f6da73eed1f5b1f9579e-1400x871.png)


The best-performing RNN encoder-decoders all used this attention mechanism.

### Attention is All You Need

In 2017, a paper titled _Attention Is All You Need_ was published. This marked a turning point in NLP. The authors demonstrated that we could remove the RNN networks and get superior performance using _just_ the attention mechanism — with a few changes.

This new attention-based model was named a _‘transformer’_. Since then, the NLP ecosystem has entirely shifted from RNNs to transformers thanks to their vastly superior performance and incredible capability for generalization.

The first transformer removed the need for RNNs through the use of _three_ key components:

- Positional Encoding
- Self-attention
- Multi-head attention

**Positional encoding** replaced the key advantage of RNNs in NLP — the ability to consider the order of a sequence (they were _recurrent_). It worked by adding a set of varying sine wave activations to each input embedding based on position.
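This sinusoidal scheme can be sketched in a few lines of NumPy. This is a simplified illustration of the encoding described in the paper, not the exact implementation of any particular library:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'.
    Even dimensions use sine, odd dimensions use cosine, with wavelengths
    forming a geometric progression from 2*pi to 10000*2*pi."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates          # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=128, d_model=768)
print(pe.shape)  # (128, 768)
```

Each position produces a unique pattern of activations, which the model adds to the input embeddings so that word order survives the otherwise order-agnostic attention layers.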

**Self-attention** is where the attention mechanism is applied between a word and all of the other words in its own context (sentence/paragraph). This is different from vanilla attention which specifically focused on attention between encoders and decoders.

**Multi-head attention** can be seen as several _parallel_ attention mechanisms working together. Using several attention _heads_ allowed the representation of several sets of relationships (rather than a single set).
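A minimal NumPy sketch of scaled dot-product self-attention with several heads follows. Random weight matrices stand in for the learned projections, and real implementations also include an output projection and masking:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # alignment between every token pair
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v                 # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 64, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))

# Multi-head: run several attention "heads" in parallel and concatenate.
heads = []
for _ in range(n_heads):
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(x, w_q, w_k, w_v))
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (6, 64)
```

Because each head has its own projections, each can learn a different set of token-to-token relationships before the results are concatenated back to the model dimension.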

### Pretrained Models

The new transformer models generalized much better than previous RNNs, which were often built specifically for each use-case.

With transformer models, it is possible to use the same _‘core’_ of a model and simply swap the last few layers for different use cases (without retraining the _core_).

This new property resulted in the rise of _pretrained_ models for NLP. Pretrained transformer models are trained on vast amounts of training data — often at high costs by the likes of Google or OpenAI, then released for the public to use for free.

One of the most widely used of these pretrained models is BERT, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers by Google AI.

BERT spawned a whole host of further models and derivations such as distilBERT, RoBERTa, and ALBERT, covering tasks such as classification, Q&A, POS-tagging, and more.

### BERT for Sentence Similarity

So far, so good, but these transformer models had one issue when building sentence vectors: Transformers work using word or _token_-level embeddings, _not_ sentence-level embeddings.

Before sentence transformers, the approach to calculating _accurate_ sentence similarity with BERT was to use a cross-encoder structure. This meant that we would pass two sentences to BERT, add a classification head to the top of BERT — and use this to output a similarity score.

The BERT cross-encoder architecture consists of a BERT model which consumes sentences A and B. Both are processed in the same sequence, separated by a `[SEP]` token. All of this is followed by a feedforward NN classifier that outputs a similarity score.

![The BERT cross-encoder architecture consists of a BERT model which consumes sentences A and B. Both are processed in the same sequence, separated by a [SEP] token. All of this is followed by a feedforward NN classifier that outputs a similarity score.](https://cdn.sanity.io/images/vr8gru94/production/9a89f1b7dddd4c78da8b9ba0311c2ffd1ff18ffe-1920x1080.png)


The cross-encoder network does produce very accurate similarity scores (better than SBERT), but it’s _not scalable_. If we wanted to perform a similarity search through a small 100K sentence dataset, we would need to complete the cross-encoder inference computation 100K times.

To cluster sentences, we would need to compare all sentences in our 100K dataset, resulting in just under five billion comparisons — this is simply not realistic.
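The comparison count is the number of unique pairs among n sentences, n(n-1)/2:

```python
# Number of unique sentence pairs among n sentences.
n = 100_000
pairs = n * (n - 1) // 2
print(pairs)  # 4999950000 — just under five billion
```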

Ideally, we need to pre-compute sentence vectors that can be stored and then used whenever required. If these vector representations are good, all we need to do is calculate the cosine similarity between each.
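Assuming we already have such vectors, the comparison step reduces to a cosine similarity between a query vector and every stored vector. Here is a hypothetical NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def cosine_similarity(query, stored):
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    s = stored / np.linalg.norm(stored, axis=-1, keepdims=True)
    return s @ q

# Random stand-ins for 1K precomputed sentence embeddings and a query.
rng = np.random.default_rng(42)
stored = rng.normal(size=(1000, 768))
query = rng.normal(size=768)

scores = cosine_similarity(query, stored)
best = int(np.argmax(scores))  # index of the most similar stored sentence
print(scores.shape)  # (1000,)
```

The expensive model inference happens once per sentence at indexing time; at query time we only pay for one encoding plus cheap vector math.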

With the original BERT (and other transformers), we can build a sentence embedding by averaging the values across all token embeddings output by BERT (if we input 512 tokens, we output 512 embeddings). Alternatively, we can use the output of the first `[CLS]` token (a BERT-specific token whose output embedding is used in classification tasks).
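Both approaches can be sketched with NumPy on a hypothetical (512, 768) token-embedding matrix. Note that mean pooling typically masks out padding tokens before averaging:

```python
import numpy as np

# Hypothetical BERT output: one 768-d embedding per input token.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(512, 768))
attention_mask = np.zeros(512)
attention_mask[:10] = 1  # only the first 10 tokens are real, the rest padding

# Mean pooling: average over real (non-padding) tokens only.
mask = attention_mask[:, None]
mean_pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()

# [CLS] pooling: take the first token's output embedding.
cls_pooled = token_embeddings[0]

print(mean_pooled.shape, cls_pooled.shape)  # (768,) (768,)
```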

Using one of these two approaches gives us our sentence embeddings that can be stored and compared much faster, shifting search times from 65 hours to around 5 seconds (see below). However, the accuracy is not good, and is worse than using averaged GloVe embeddings (which were developed in 2014).

**The solution** to this lack of an accurate model _with_ reasonable latency was designed by Nils Reimers and Iryna Gurevych in 2019 with the introduction of sentence-BERT (SBERT) and the `sentence-transformers` library.

SBERT outperformed the previous state-of-the-art (SOTA) models for all common semantic textual similarity (STS) tasks — more on these later — except a single dataset (SICK-R).

Thankfully for scalability, SBERT produces sentence embeddings — so we do _not_ need to perform a whole inference computation for every sentence-pair comparison.

Reimers and Gurevych demonstrated the dramatic speed increase in 2019. Finding the most similar sentence pair from 10K sentences took 65 hours with BERT. With SBERT, embeddings are created in ~5 seconds and compared with cosine similarity in ~0.01 seconds.

Since the SBERT paper, many more sentence transformer models have been built using similar concepts that went into training the original SBERT. They’re all trained on many similar and dissimilar sentence pairs.

Using a loss function such as softmax loss, multiple negatives ranking loss, or MSE margin loss, these models are optimized to produce similar embeddings for similar sentences, and dissimilar embeddings otherwise.
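As one example, multiple negatives ranking loss treats each anchor’s own positive as the correct “class” among all positives in the batch. This is a simplified NumPy sketch, with a fixed similarity scale standing in for the configurable one used in real (PyTorch-based) implementations:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Multiple negatives ranking loss: for each anchor, its own positive
    should score higher than every other positive in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) scaled cosine scores
    # Cross-entropy with the diagonal as the correct "class" for each row.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
# Positives nearly identical to their anchors -> loss should be near zero.
loss = mnr_loss(anchors, anchors + 0.01 * rng.normal(size=(8, 768)))
print(loss)
```

Minimizing this loss pushes each sentence’s embedding toward its paired sentence and away from the other sentences in the batch.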

Now you have some context behind sentence transformers, where they come from, and why they’re needed. Let’s dive into how they work.

_[3] The SBERT paper covers many of the statements, techniques, and numbers from this section._

## Sentence Transformers

We explained the cross-encoder architecture for sentence similarity with BERT. SBERT is similar but drops the final classification head, and processes one sentence at a time. SBERT then uses mean pooling on the final output layer to produce a sentence embedding.

Unlike BERT, SBERT is fine-tuned on sentence pairs using a _siamese_ architecture. We can think of this as having two identical BERTs in parallel that share the exact same network weights.

![An SBERT model applied to a sentence pair sentence A and sentence B. Note that the BERT model outputs token embeddings (consisting of 512 768-dimensional vectors). We then compress that data into a single 768-dimensional sentence vector using a pooling function.](https://cdn.sanity.io/images/vr8gru94/production/2425dc0efd3f73a0bf57b3bf85a091c78619ec2c-1920x1110.png)


In reality, we are using a single BERT model. However, because we process sentence A followed by sentence B as _pairs_ during training, it is easier to think of this as two models with tied weights.

### Siamese BERT Pre-Training

There are different approaches to training sentence transformers. We will describe the original process featured most prominently in the original SBERT paper, which optimizes on _softmax-loss_. Note that this is a high-level explanation; we will save the in-depth walkthrough for another article.

The softmax-loss approach used the _‘siamese’_ architecture fine-tuned on the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) corpora.

SNLI contains 570K sentence pairs, and MNLI contains 430K. The pairs in both corpora include a `premise` and a `hypothesis`. Each pair is assigned one of three labels:

- **0** — _entailment_, e.g. the `premise` suggests the `hypothesis`.
- **1** — _neutral_, the `premise` and `hypothesis` could both be true, but they are not necessarily related.
- **2** — _contradiction_, the `premise` and `hypothesis` contradict each other.

Given this data, we feed sentence A (let’s say the `premise`) into siamese BERT A and sentence B (`hypothesis`) into siamese BERT B.

The siamese BERT outputs our pooled sentence embeddings. The SBERT paper tested _three_ different pooling methods: _mean_, _max_, and _[CLS]_-pooling. _Mean_-pooling performed best on both the NLI and STSb datasets.

There are now two sentence embeddings. We will call embeddings A `u` and embeddings B `v`. The next step is to concatenate `u` and `v`. Again, several concatenation approaches were tested, but the highest performing was a `(u, v, |u-v|)` operation:

![We concatenate the embeddings u, v, and |u - v|.](https://cdn.sanity.io/images/vr8gru94/production/c78a83baccb40c331a92ddb25d8a1e4c97e397ed-1920x840.png)


`|u-v|` is calculated to give us the element-wise difference between the two vectors. Alongside the original two embeddings (`u` and `v`), these are all fed into a feedforward neural net (FFNN) that has _three_ outputs.

These three outputs align to our NLI similarity labels **0**, **1**, and **2**. We need to calculate the softmax from our FFNN, which is done within the [cross-entropy loss function](https://www.pinecone.io/learn/cross-entropy-loss/). The softmax and labels are used to optimize on this _‘softmax-loss’_.
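The concatenation and classification head can be sketched as follows, with random weights standing in for the learned FFNN parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=768)  # pooled embedding for sentence A
v = rng.normal(size=768)  # pooled embedding for sentence B

# Concatenate (u, v, |u - v|) into a single 2304-d feature vector.
features = np.concatenate([u, v, np.abs(u - v)])

# A single linear layer mapping features to the three NLI labels,
# followed by a softmax (random stand-in weights, not trained values).
W = rng.normal(size=(3, features.shape[0])) * 0.01
logits = W @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(features.shape, probs.shape)  # (2304,) (3,)
```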

![The operations performed during training on two sentence embeddings, u and v. Note that softmax-loss refers to cross-entropy loss (which applies a softmax function by default).](https://cdn.sanity.io/images/vr8gru94/production/a7bc429139dfb58998cee4fe84341ef5b66f2019-1920x990.png)


The operations performed during training on the two sentence embeddings, `u` and `v`. Note that _softmax-loss_ refers to cross-entropy loss (which applies a softmax function by default).

This results in our pooled sentence embeddings for similar sentences (label **0**) becoming _more similar_, and embeddings for dissimilar sentences (label **2**) becoming _less similar_.

Remember, we are using a _siamese_ BERT, **not** _dual_ BERTs. That is, we don’t use two independent BERT models but a single BERT that processes sentence A followed by sentence B.

This means that when we optimize the model weights, they are pushed in a direction that allows the model to output more similar vectors where we see an _entailment_ label and more dissimilar vectors where we see a _contradiction_ label.

_We are working on a step-by-step guide to training a siamese BERT model with the SNLI and MNLI corpora described above using both the softmax-loss and multiple-negatives-ranking-loss approaches. You can get an email as soon as we release the article by_ [clicking here](https://www.pinecone.io/learn/) _(the form is at the bottom of the page)._

The fact that this training approach works is not particularly intuitive and indeed has been described by Reimers as _coincidentally_ producing good sentence embeddings [5].

Since the original paper, further work has been done in this area. Many more models such as the [latest MPNet and RoBERTa models trained on 1B+ samples](https://huggingface.co/spaces/flax-sentence-embeddings/sentence-embeddings) (producing much better performance) have been built. We will be exploring some of these in future articles, and the superior training approaches they use.

For now, let’s look at how we can initialize and use some of these sentence-transformer models.

### Getting Started with Sentence Transformers

The fastest and easiest way to begin working with sentence transformers is through the `sentence-transformers` library created by the creators of SBERT. We can install it with `pip`.

```bash
pip install sentence-transformers
```

We will start with the original SBERT model `bert-base-nli-mean-tokens`. First, we download and initialize the model.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

model
```

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

The output we can see here is the `SentenceTransformer` object, which contains _two_ components:

- The **transformer** itself, here we can see the max sequence length of `128` tokens and whether to lowercase any input (in this case, the model does _not_). We can also see the model class, `BertModel`.
- The **pooling** operation, here we can see that we are producing a `768`-dimensional sentence embedding. We are doing this using the _mean pooling_ method.

Once we have the model, building sentence embeddings is quickly done using the `encode` method.

```python
sentences = [
    "the fifty mannequin heads floating in the pool kind of freaked them out",
    "she swore she just saw her sushi move",
    "he embraced his new life as an eggplant",
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "the dental specialist recommended an immediate stop to flossing with construction materials"
]

embeddings = model.encode(sentences)

embeddings.shape
```

```
(5, 768)
```

We now have sentence embeddings that we can use to quickly compare sentence similarity for the use cases introduced at the start of the article: STS, semantic search, and clustering.

We can put together a fast STS example using nothing more than a cosine similarity function and NumPy.

```python
import numpy as np
from sentence_transformers.util import cos_sim

sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
    sim[i:,i] = cos_sim(embeddings[i], embeddings[i:])

sim
```

```
array([[1.00000024, 0.        , 0.        , 0.        , 0.        ],
       [0.40914285, 1.        , 0.        , 0.        , 0.        ],
       [0.10909   , 0.4454796 , 1.        , 0.        , 0.        ],
       [0.50074852, 0.30693918, 0.20791623, 0.99999958, 0.        ],
       [0.29936209, 0.38607228, 0.28499269, 0.63849503, 1.0000006 ]])
```

![Heatmap showing cosine similarity values between all sentence-pairs.](https://cdn.sanity.io/images/vr8gru94/production/5c203c1b21b8099b34dbcdbccae13af57f038f98-1920x1080.png)


Here we have calculated the cosine similarity between every combination of our five sentence embeddings, which are:

| Index | Sentence |
| --- | --- |
| 0 | the fifty mannequin heads floating in the pool kind of freaked them out |
| 1 | she swore she just saw her sushi move |
| 2 | he embraced his new life as an eggplant |
| 3 | my dentist tells me that chewing bricks is very bad for your teeth |
| 4 | the dental specialist recommended an immediate stop to flossing with construction materials |


We can see the highest similarity score in the bottom-right corner with `0.64`. As we would hope, this is for sentences `4` and `3`, which both describe poor dental practices using construction materials.
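To pick that pair out programmatically, we can mask out the diagonal (self-similarity) and take the argmax of the remaining lower-triangular values. The matrix values here are rounded stand-ins for the ones computed earlier:

```python
import numpy as np

# Lower-triangular similarity matrix (illustrative, rounded values).
sim = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.41, 1.00, 0.00, 0.00, 0.00],
    [0.11, 0.45, 1.00, 0.00, 0.00],
    [0.50, 0.31, 0.21, 1.00, 0.00],
    [0.30, 0.39, 0.28, 0.64, 1.00],
])

# Keep only values strictly below the diagonal, then find the best pair.
masked = np.tril(sim, k=-1)
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(i, j, masked[i, j])  # 4 3 0.64
```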

## Other sentence-transformers

Although we returned good results from the SBERT model, many more sentence transformer models have since been built, many of which are available in the `sentence-transformers` library.

These newer models can significantly outperform the original SBERT. In fact, SBERT is no longer listed as an available model on the [SBERT.net models page](https://www.sbert.net/docs/pretrained_models.html).

| Model | Avg. Performance | Speed (sentences/sec) | Size (MB) |
| --- | --- | --- | --- |
| all-mpnet-base-v2 | 63.30 | 2800 | 418 |
| all-roberta-large-v1 | 53.05 | 800 | 1355 |
| all-MiniLM-L12-v1 | 59.80 | 7500 | 118 |

We will cover some of these later models in more detail in future articles. For now, let’s compare one of the highest performers and run through our STS task.

```python
# !pip install sentence-transformers

from sentence_transformers import SentenceTransformer

mpnet = SentenceTransformer('all-mpnet-base-v2')

mpnet
```

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```

Here we have the `SentenceTransformer` model for `all-mpnet-base-v2`. The components are very similar to the `bert-base-nli-mean-tokens` model, with some small differences:

- `max_seq_length` has increased from `128` to `384`, meaning we can process sequences _three_ times longer than we could with SBERT.
- The base model is now `MPNetModel` [4] not `BertModel`.
- There is an additional normalization layer applied to sentence embeddings.
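One practical consequence of that normalization layer: with unit-length embeddings, cosine similarity reduces to a plain dot product. A small NumPy check with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)

# Apply the same unit-length normalization the model's final layer performs.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a_n @ b_n
print(np.isclose(cosine, dot))  # True
```

This matters at scale: dot products are cheaper than full cosine similarity, and many vector indexes exploit it.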

Let’s compare the STS results of `all-mpnet-base-v2` against SBERT.

```python
embeddings = mpnet.encode(sentences)

sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
    sim[i:,i] = cos_sim(embeddings[i], embeddings[i:])

sim
```

```
array([[ 1.00000048,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.26406282,  1.00000012,  0.        ,  0.        ,  0.        ],
       [ 0.16503485,  0.16126671,  1.00000036,  0.        ,  0.        ],
       [ 0.04334451,  0.04615867,  0.0567013 ,  1.00000036,  0.        ],
       [ 0.05398509,  0.06101188, -0.01122264,  0.51847214,  0.99999952]])
```

![Heatmaps for both SBERT and the MPNet sentence transformer.](https://cdn.sanity.io/images/vr8gru94/production/e55da137c5f6a09518783dfb62a89a9c39e1293c-1920x1080.png)


The richer semantic representation of later models is apparent. Although SBERT correctly identifies `4` and `3` as the most similar pair, it also assigns reasonably high similarity to other sentence pairs.

On the other hand, the MPNet model makes a _very_ clear distinction between similar and dissimilar pairs, with most pairs scoring less than 0.1 and the `4`-`3` pair scored at _0.52_.

By increasing the separation between dissimilar and similar pairs, we’re:

1. Making it easier to automatically identify relevant pairs.
2. Pushing predictions closer to the _0_ and _1_ target scores for _dissimilar_ and _similar_ pairs used during training. This is something we will see more of in our future articles on fine-tuning these models.

---

That’s it for this article introducing sentence embeddings and the current SOTA sentence transformer models for building these incredibly useful embeddings.

Sentence embeddings, although only recently popularized, are the product of a long line of innovations. We described some of the mechanics applied to create the first sentence transformer, SBERT.

We also demonstrated that despite SBERT’s very recent introduction in 2019, other sentence transformers already outperform the model. Fortunately for us, it’s easy to switch out SBERT for one of these newer models with the `sentence-transformers` library.

In future articles, we will dive deeper into some of these newer models and how to train our own sentence transformers.


## References

[1] A. Vaswani, et al., [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (2017), NeurIPS

[2] D. Bahdanau, et al., [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) (2015), ICLR

[3] N. Reimers, I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (2019), ACL

[4] [MPNet Model](https://huggingface.co/transformers/model_doc/mpnet.html), Hugging Face Docs

[5] N. Reimers, [Natural Language Inference](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/README.md), sentence-transformers on GitHub

