# OpenAI's Text Embeddings v3

> OpenAI's text-embedding-3-large and text-embedding-3-small are the latest state-of-the-art models for embeddings, a critical component of Retrieval Augmented Generation (RAG) and the AI ecosystem.

James Briggs · 2024-01-25

In December 2022, in the middle of ChatGPT's unprecedented success, OpenAI released another lesser-noticed yet world-changing AI model.

That model was creatively named `text-embedding-ada-002`. At the time, Ada 002 leapfrogged all other state-of-the-art (SotA) embedding models, including OpenAI's own previous record-setter, `text-search-davinci-001`.

Since then, OpenAI has remained surprisingly quiet on the embedding model front, despite the massive widespread adoption of embedding-dependent AI pipelines like [Retrieval Augmented Generation (RAG)](https://www.pinecone.io/learn/retrieval-augmented-generation/).

That lack of movement from OpenAI didn't matter much regarding adoption. Ada 002 is _still_ the most broadly adopted text embedding model. However, Ada 002 is about to be dethroned.

OpenAI is dethroning its own model. Again, they came up with very creative model names: `text-embedding-3-small` and `text-embedding-3-large`.

_First look video walkthrough:_

[Video](https://www.youtube.com/watch?v=cUyw5eG-VtM)


---

## At a Glance

These models are better, and we have the option of the latency- and storage-optimized `text-embedding-3-small` _or_ the higher-accuracy `text-embedding-3-large`.

| Model | Dimensions | Max Tokens | Knowledge Cutoff | MIRACL avg | MTEB avg |
| --- | --- | --- | --- | --- | --- |
| `text-embedding-ada-002` | 1536 | 8191 | Sep 2021 | 31.4 | 61.0 |
| `text-embedding-3-small` | 1536 | 8191 | Sep 2021 | 44.0 | 62.3 |
| `text-embedding-3-large` | 3072 | 8191 | Sep 2021 | **54.9** | **64.6** |


Key takeaways here are the pretty _huge_ performance gains for multilingual embeddings — measured by the leap from **31.4%** to **54.9%** on the **MIRACL** benchmark. For English-language performance, we look at **MTEB** and see a smaller but still significant increase from **61%** to **64.6%**.

It's worth noting that the max tokens and knowledge cutoff have _not_ changed. That lack of new knowledge represents a minor drawback for use cases performing retrieval in domains requiring up-to-date knowledge.

We also get a higher embedding dimensionality with the new v3 large model (3072 versus 1536 for Ada 002), resulting in higher storage costs on top of higher embedding costs.

Now, there is some nuance to the dimensionality of these models. By _default_, these models use the dimensionality noted above. However, it turns out that they still perform well even if we cut those vectors down.

For v3 small, we can keep just the first 512 dimensions. For v3 large, we can trim the vectors down to a _tiny_ 256 dimensions or a more midsized 1024 dimensions.

[Click here](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/openai-embed-v3/openai-embed-v3.ipynb) to try out the new OpenAI embedding models and see how they compare to Ada 002.

---

## What's so Special About These Models?

After further testing, the most exciting feature (for us) is that the 256-dimensional version of `text-embedding-3-large` can outperform the 1536-dimensional Ada 002. That is a 6x reduction in vector size.

OpenAI confirmed (after some prodding) that they achieved this via **M**atryoshka **R**epresentation **L**earning (MRL) [1].

MRL encodes information at different embedding dimensionalities. As per the paper, this enables up to 14x smaller embedding sizes with negligible degradation in accuracy.
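The core idea can be sketched as follows — a toy illustration, not OpenAI's actual training setup: a single embedding is optimized so that every nested prefix works on its own, by averaging a similarity loss computed at several prefix lengths (the `mrl_loss` helper and the prefix sizes here are illustrative assumptions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mrl_loss(query, doc, prefix_sizes=(256, 512, 1024, 3072)):
    """Average (1 - cosine similarity) over all nested prefixes.

    Training against this kind of objective forces the front of the
    vector to carry usable information on its own — the "Matryoshka"
    nesting that lets us truncate embeddings later.
    """
    losses = [1.0 - cosine(query[:m], doc[:m]) for m in prefix_sizes]
    return sum(losses) / len(losses)

rng = np.random.default_rng(0)
q = rng.normal(size=3072)
d = q + 0.1 * rng.normal(size=3072)  # a "relevant" doc: a nearby vector

loss = mrl_loss(q, d)  # small, since every prefix of d stays close to q
```

Because the loss is evaluated at every prefix length, the model cannot push all the useful signal into the tail of the vector, which is why the 256-dimension prefix of v3 large remains competitive on its own.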

---

## References

[1] A. Kusupati, et al., [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (2022), NeurIPS 2022