# Product Quantization: Compressing high-dimensional vectors by 97%

> Product quantization (PQ) is a popular method for dramatically compressing high-dimensional vectors to use 97% less memory, and for making nearest-neighbor search speeds 5.5x faster in our tests.

![](https://cdn.sanity.io/images/vr8gru94/production/0425edbee092b14b4a6a32cfb50e99a1971399ba-1920x1080.gif)


---

[Pinecone](https://www.pinecone.io/) **lets you add vector search to applications without knowing anything about algorithm optimizations, and [it’s free to try](https://app.pinecone.io/). However, we know you like seeing how things work, so enjoy learning about memory-efficient search with product quantization!**

---

Vector similarity search can require huge amounts of memory. Indexes containing 1M dense vectors (a small dataset in today’s world) will often require several GBs of memory to store.

The problem of excessive memory usage is exacerbated by high-dimensional data, and with ever-increasing dataset sizes, this can _very_ quickly become unmanageable.

Product quantization (PQ) is a popular method for dramatically compressing high-dimensional vectors to use 97% less memory, and for making nearest-neighbor search speeds 5.5x faster in our tests.

A composite IVF+PQ index speeds up the search by another 16.5x without affecting accuracy, for a whopping total speed increase of 92x compared to non-quantized indexes.

In this article, we will cover all you need to know about PQ: How it works, pros and cons, implementation in Faiss, composite IVFPQ indexes, and how to achieve the speed and memory optimizations mentioned above.

[Video](https://www.youtube.com/watch?v=t9mRf2S5vDI)


---

## What is Quantization?

Quantization is a generic term that refers to the compression of data into a smaller space. I know that might not make much sense — let me explain.

First, let’s talk about dimensionality reduction — which is _not_ the same as quantization.

Let’s say we have a high-dimensional vector with a dimensionality of `128`. Its values are 32-bit floats in the range of _0.0 -> 157.0_ (our scope `S`). Through dimensionality reduction, we aim to produce another, lower-dimensionality vector.

![Dimensionality reduction reduces the dimensionality D of vectors, but not the scope S.](https://cdn.sanity.io/images/vr8gru94/production/178d6efe3bfeae550d18bcd8bd3352817717674c-1920x850.png)


On the other hand, we have quantization. Quantization does not care about dimensionality `D`. Instead, it targets the potential scope of values. Rather than reducing `D`, we reduce `S`.

![Quantization reduces the scope S of possible vectors. Note that with pre-quantization the scope is typically infinite.](https://cdn.sanity.io/images/vr8gru94/production/ca04d5d84cd168d0c6a9dba146851ebde6612ead-1920x1080.png)


There are many ways of doing this. For example, we have _clustering_. When we cluster a set of vectors, we replace the larger scope of potential values (all possible vectors) with a smaller, _discrete and symbolic_ set of centroids.

And this is really how we can define a quantization operation: the transformation of a vector into a space with a finite number of possible values, where those values are _symbolic_ representations of the original vector.

Just to make it very clear, these symbolic representations vary in form. They can be centroids as is the case for PQ, or [binary codes like those produced by LSH](https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing-random-projection/).
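To make the idea concrete, here is a minimal sketch of quantization via clustering, using hypothetical 2D vectors and hand-picked centroids (none of these values come from the article):

```python
import numpy as np

# a hypothetical codebook: three centroids forming our finite set of symbols
centroids = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])

def quantize(v):
    # replace a vector with the ID of its nearest centroid
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))

vectors = np.array([[1.2, 0.8], [4.9, 5.3], [8.7, 1.1]])
codes = [quantize(v) for v in vectors]
print(codes)  # [0, 1, 2]: each vector reduced to one symbolic ID
```

Each vector is now represented by a single integer that points back into the codebook, which is exactly the symbolic representation described above.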

### Why Product Quantization?

Quantization is primarily used to reduce the memory footprint of indexes — an important task when comparing large arrays of vectors as they must all be loaded in memory to be compared.

PQ is not the only quantization method that does this, but other methods do not manage to reduce memory size as effectively as PQ. We can calculate memory usage and quantization operation complexity for PQ and other methods like so:

`k-means: kD`

`PQ: m × k* × D* = k^(1/m) × D`

We know that `D` represents the dimensionality of our input vectors, but `k` and `m` may be new. `k` represents the _total_ number of centroids (or _codes_) that will be used to represent our vectors. And `m` represents the number of _subvectors_ that we will split our vectors into (more on that later).

_(A ‘code’ refers to the quantized representation of our vectors)_

![Memory usage (and complexity) vs dimensionality using k=2048 and m=8.](https://cdn.sanity.io/images/vr8gru94/production/358bd5e61d3324c4484e35c402b4aa252e1e9eb9-1920x1080.png)


The problem here is that for good results, the recommended `k` value is `2048` (2¹¹) or more[1]. Given a vector dimensionality of `D`, clustering without PQ leaves us with _very high_ memory requirements and complexity:

| Operation | Memory and complexity |
| --- | --- |
| k-means | `kD = 2048 × 128 = 262144` |
| PQ | `m × k* × D* = k^(1/m) × D = 2048^(1/8) × 128 ≈ 332` |

Given an m value of `8`, the equivalent memory usage and _assignment_ complexity for PQ is significantly lower — thanks to the _chunking_ of vectors into subvectors and the subquantization process being applied to those smaller dimensionalities `k*` and `D*`, equal to `k/m` and `D/m` respectively.
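The table’s numbers can be reproduced with a quick calculation. Note this sketch follows the formula above, where the per-subspace centroid count works out to `k^(1/m)`:

```python
# reproduce the memory/complexity comparison from the table above
D = 128   # vector dimensionality
k = 2048  # total number of centroids (2**11)
m = 8     # number of subvectors

kmeans_cost = k * D                     # plain k-means: k centroids of dimension D
pq_cost = m * (k ** (1 / m)) * (D / m)  # PQ: m subquantizers over D/m dims each
print(kmeans_cost, round(pq_cost))      # 262144 332
```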

A second important factor is quantizer training. Quantizers require datasets that are several times larger than k for effective training, that is _without product subquantization_.

Using subquantizers, we only need several multiples of `k*` (which is `k/m`). This can still be a large number, but it is significantly smaller than before.

## How Product Quantization Works

Let’s work through the logic of PQ. We would usually have many vectors (all of equal length) — but for the sake of simplicity, we will use a single vector in our examples.

In short, PQ is the process of:

- Taking a big, high-dimensional vector,
- Splitting it into equally sized chunks — our subvectors,
- Assigning each of these subvectors to its nearest _centroid_ (also called reproduction/reconstruction values),
- Replacing these centroid values with unique IDs — each ID represents a centroid

[Video](https://d33wubrfki0l68.cloudfront.net/af00b6764682bea50979e2285c0380f99e06466e/48940/images/product-quantization-6.mp4)


At the end of the process, we’ve reduced our high-dimensional vector — which requires _a lot of memory_ — to a tiny vector of IDs that requires _very little memory_.

Our vector is of length `D = 12`. We start by splitting this vector into `m` subvectors like so:

![Here we are splitting our high-dimensional vector x into several subvectors u_j.](https://cdn.sanity.io/images/vr8gru94/production/76b14448d700a0d7c3e52abce3cba6464a13237f-1920x1020.png)


```json
{
  "_key": "528071c46d97",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"source\": [\n    \"x = [1, 8, 3, 9, 1, 2, 9, 4, 5, 4, 6, 2]\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"source\": [\n    \"m = 4\\n\",\n    \"D = len(x)\\n\",\n    \"# ensure D is divisable by m\\n\",\n    \"assert D % m == 0\\n\",\n    \"# length of each subvector will be D / m (D* in notation)\\n\",\n    \"D_ = int(D / m)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"source\": [\n    \"# now create the subvectors\\n\",\n    \"u = [x[row:row+D_] for row in range(0, D, D_)]\\n\",\n    \"u\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"[[1, 8, 3], [9, 1, 2], [9, 4, 5], [4, 6, 2]]\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 3\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We can refer to each subvector by its position `j`.

For the next part, think about clustering. As a random example, given a large number of vectors we can say _“I want three clusters”_ — and we then optimize these cluster centroids to split our vectors into _three_ categories based on each vector’s nearest centroid.

For PQ we do the same thing with one minor difference. Each subvector space (subspace) is assigned its own set of clusters — and so what we produce is a set of clustering algorithms across multiple subspaces.

```json
{
  "_key": "7e53e08931ef",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"source\": [\n    \"k = 2**5\\n\",\n    \"assert k % m == 0\\n\",\n    \"k_ = int(k/m)\\n\",\n    \"print(f\\\"{k=}, {k_=}\\\")\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"k=32, k_=8\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"source\": [\n    \"from random import randint\\n\",\n    \"\\n\",\n    \"c = []  # our overall list of reproduction values\\n\",\n    \"for j in range(m):\\n\",\n    \"    # each j represents a subvector (and therefore subquantizer) position\\n\",\n    \"    c_j = []\\n\",\n    \"    for i in range(k_):\\n\",\n    \"        # each i represents a cluster/reproduction value position *inside* each subspace j\\n\",\n    \"        c_ji = [randint(0, 9) for _ in range(D_)]\\n\",\n    \"        c_j.append(c_ji)  # add cluster centroid to subspace list\\n\",\n    \"    # add subspace list of centroids to overall list\\n\",\n    \"    c.append(c_j)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Each of our subvectors will be assigned to one of these centroids. In PQ terminology these centroids are called _reproduction values_ and are represented by `cⱼ,ᵢ` where `j` is our subvector identifier, and `i` identifies the chosen centroid _(there are_ _`k*`_ _centroids for each subvector space j)_.

![Our subvectors are replaced with a specific centroid vector — which can then be replaced with a unique ID specific to that centroid vector.](https://cdn.sanity.io/images/vr8gru94/production/05bc75a87d7a07b8d844d13a1b954458e71431d2-1920x1080.png)


```json
{
  "_key": "8ba9bb791a44",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"source\": [\n    \"def euclidean(v, u):\\n\",\n    \"    distance = sum((x - y) ** 2 for x, y in zip(v, u)) ** .5\\n\",\n    \"    return distance\\n\",\n    \"\\n\",\n    \"def nearest(c_j, u_j):\\n\",\n    \"    distance = 9e9\\n\",\n    \"    for i in range(k_):\\n\",\n    \"        new_dist = euclidean(c_j[i], u_j)\\n\",\n    \"        if new_dist < distance:\\n\",\n    \"            nearest_idx = i\\n\",\n    \"            distance = new_dist\\n\",\n    \"    return nearest_idx\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"source\": [\n    \"ids = []\\n\",\n    \"for j in range(m):\\n\",\n    \"    i = nearest(c[j], u[j])\\n\",\n    \"    ids.append(i)\\n\",\n    \"ids\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"[1, 1, 2, 1]\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 7\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

When we process a vector with PQ, it is split into our subvectors, those subvectors are then processed and assigned to their nearest (sub)cluster centroids (reproduction values).

Rather than storing our quantized vector as the full `D*`-dimensional centroids, we replace each centroid with its ID. Every centroid `cⱼ,ᵢ` has its own ID, which can later be used to map those IDs back to the full centroids via our codebook `c`.

```json
{
  "_key": "487c7664c184",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"source\": [\n    \"q = []\\n\",\n    \"for j in range(m):\\n\",\n    \"    c_ji = c[j][ids[j]]\\n\",\n    \"    q.extend(c_ji)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"source\": [\n    \"q\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"[1, 7, 7, 6, 3, 6, 7, 4, 4, 3, 9, 6]\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 9\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

With that, we have compressed a 12-dimensional vector into a 4-dimensional vector of IDs. We have used a small dimensionality here for the sake of simplicity, so the benefits of the technique may not be immediately clear.

Let’s switch from our original 12-dimensional vector of 8-bit integers to a more realistic 128-dimensional vector of 32-bit floats (as we will be using throughout the next section). A good balance of performance is found by compressing it to just _eight_ 8-bit integer IDs.

`Original: 128 × 32 = 4096 bits`

`Quantized: 8 × 8 = 64 bits`

That’s a big difference — 64x!
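That 64x figure comes from simple arithmetic over the bit counts (a quick sanity check, not part of the original code):

```python
D, float_bits = 128, 32  # original: 128 dims of 32-bit floats
m, nbits = 8, 8          # quantized: 8 subvector IDs of 8 bits each

original_bits = D * float_bits   # 4096
quantized_bits = m * nbits       # 64
print(original_bits // quantized_bits)  # 64x compression
```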

---

## PQ Implementation in Faiss

So far we’ve worked through the logic behind a simple, readable implementation of PQ in Python. Realistically we wouldn’t use this because it is not optimized and we already have excellent implementations elsewhere. Instead, we would use a library like [Faiss](https://www.pinecone.io/learn/series/faiss/) — or a production-ready service like [Pinecone](https://www.pinecone.io/).

We’ll take a look at how we can build a PQ index in Faiss, and we’ll even take a look at combining PQ with an Inverted File (IVF) step to improve search speed.

[Video](https://www.youtube.com/watch?v=BMYBwbkbVec)


Before we start, we need to get data. We will be using the Sift1M dataset. It can be downloaded and opened using this script:

```json
{
  "_key": "a34699508d88",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import shutil\\n\",\n    \"import urllib.request as request\\n\",\n    \"from contextlib import closing\\n\",\n    \"\\n\",\n    \"# first we download the Sift1M dataset\\n\",\n    \"with closing(request.urlopen('ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz')) as r:\\n\",\n    \"    with open('sift.tar.gz', 'wb') as f:\\n\",\n    \"        shutil.copyfileobj(r, f)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tarfile\\n\",\n    \"\\n\",\n    \"# the download leaves us with a tar.gz file, we unzip it\\n\",\n    \"tar = tarfile.open('sift.tar.gz', \\\"r:gz\\\")\\n\",\n    \"tar.extractall()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"# now define a function to read the fvecs file format of Sift1M dataset\\n\",\n    \"def read_fvecs(fp):\\n\",\n    \"    a = np.fromfile(fp, dtype='int32')\\n\",\n    \"    d = a[0]\\n\",\n    \"    return a.reshape(-1, d + 1)[:, 1:].copy().view('float32')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# data we will search through\\n\",\n    \"xb = read_fvecs('./sift/sift_base.fvecs')  # 1M samples\\n\",\n    \"# also get some query vectors to search with\\n\",\n    \"xq = read_fvecs('./sift/sift_query.fvecs')\\n\",\n    \"# take just one query (there are many in sift_learn.fvecs)\\n\",\n    \"xq = xq[0].reshape(1, xq.shape[1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"(1, 128)\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"xq.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"(1000000, 128)\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"xb.shape\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.9.5 64-bit\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.8\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Now let’s move onto our first index: IndexPQ.

### IndexPQ

Our first index is a pure PQ implementation using IndexPQ. To initialize the index we need to define three parameters.

```json
{
  "_key": "673a219f3956",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"source\": [\n    \"import faiss\\n\",\n    \"\\n\",\n    \"D = xb.shape[1]\\n\",\n    \"m = 8\\n\",\n    \"assert D % m == 0\\n\",\n    \"nbits = 8  # number of bits per subquantizer, k* = 2**nbits\\n\",\n    \"index = faiss.IndexPQ(D, m, nbits)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We have our vector dimensionality `D` and the number of subvectors `m` that we’d like to split our full vectors into (we must `assert` that `D` is divisible by `m`).

Finally, we include the `nbits` parameter. This defines the number of bits that each subquantizer can use, which translates into the number of centroids assigned to each subspace as `k_ = 2**nbits`. An `nbits` of 11 leaves us with `2048` centroids per subspace.
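The relationship between `nbits`, the centroids per subspace, and the size of each quantized vector can be sketched as follows (the `nbits` values here are just examples):

```python
m = 8  # subvectors per vector
for nbits in (8, 10, 11):
    k_ = 2 ** nbits        # centroids per subspace (k*)
    code_bits = m * nbits  # bits needed to store one quantized vector
    print(f"nbits={nbits}: k*={k_}, code size={code_bits} bits")
```

Larger `nbits` values give finer-grained codebooks (and better recall) at the cost of bigger codes and slower training.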

Because PQ relies on clustering, we must train our index on our `xb` dataset.

```json
{
  "_key": "45910a266672",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"source\": [\n    \"index.is_trained\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"False\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 3\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"source\": [\n    \"index.train(xb)  # PQ training can take some time when using large nbits\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

And with that, we can go ahead and add our vectors to the index and `search`.

```json
{
  "_key": "be1325272b73",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"source\": [\n    \"dist, I = index.search(xq, k)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"source\": [\n    \"%%timeit\\n\",\n    \"index.search(xq, k)\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"1.49 ms ± 49.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

From our search, we will return the top `k` closest matches (not the same `k` used in earlier notation), with distances returned in `dist` and indices in `I`.

We can compare the _recall_ performance of our `IndexPQ` against that of a flat index — which has ‘perfect’ recall (thanks to not compressing vectors and performing an exhaustive search).

```json
{
  "_key": "ef93d99b01da",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"source\": [\n    \"l2_index = faiss.IndexFlatL2(D)\\n\",\n    \"l2_index.add(xb)\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"source\": [\n    \"%%time\\n\",\n    \"l2_dist, l2_I = l2_index.search(xq, k)\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"CPU times: user 46.1 ms, sys: 15.1 ms, total: 61.2 ms\\n\",\n      \"Wall time: 15 ms\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"source\": [\n    \"sum([1 for i in I[0] if i in l2_I])\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"50\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 10\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We’re getting 50% recall, which is reasonable _if_ we are happy to sacrifice perfect results for the reduced memory usage of PQ. Search time is also reduced to just 18% of the flat search time, something we can improve _even further_ using IVF later.

Lower recall rates are a major drawback of PQ. This can be counteracted _somewhat_ by using larger `nbits` values, at the cost of slower search times and _very_ slow index construction. However, very high recall is out of reach for both PQ and IVFPQ indexes; if higher recall is required, another index should be considered.

How does IndexPQ compare to our flat index in terms of memory usage?

```json
{
  "_key": "d124697d35a7",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"def get_memory(index):\\n\",\n    \"    # write index to file\\n\",\n    \"    faiss.write_index(index, './temp.index')\\n\",\n    \"    # get file size\\n\",\n    \"    file_size = os.path.getsize('./temp.index')\\n\",\n    \"    # delete saved index\\n\",\n    \"    os.remove('./temp.index')\\n\",\n    \"    return file_size\"\n   ],\n   \"outputs\": [],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"source\": [\n    \"# first we check our Flat index, this should be larger\\n\",\n    \"get_memory(l2_index)\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"256000045\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 12\n    }\n   ],\n   \"metadata\": {}\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"source\": [\n    \"# now our PQ index\\n\",\n    \"get_memory(index)\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"execute_result\",\n     \"data\": {\n      \"text/plain\": [\n       \"4131158\"\n      ]\n     },\n     \"metadata\": {},\n     \"execution_count\": 13\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Memory usage using IndexPQ is, put simply, fantastic: a memory reduction of 98.4%. We can translate some of these performance benefits into faster search speeds too by using an IVF+PQ index.

### IndexIVFPQ

To speed up our search time we can add another step, using an IVF index, which will act as the initial broad stroke in reducing the scope of vectors in our search.

After this, we continue our PQ search as we did before — but with a significantly reduced number of vectors. Thanks to minimizing our search scope, we should find we get vastly improved search speeds.

Let’s see how that works. First, we initialize our IVF+PQ index like so:

```json
{
  "_key": "74b43e22205d",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"source\": [\n    \"vecs = faiss.IndexFlatL2(D)\\n\",\n    \"\\n\",\n    \"nlist = 2048  # how many Voronoi cells (must be >= k* which is 2**nbits)\\n\",\n    \"nbits = 8  # when using IVF+PQ, higher nbits values are not supported\\n\",\n    \"index = faiss.IndexIVFPQ(vecs, D, nlist, m, nbits)\\n\",\n    \"print(f\\\"{2**nbits=}\\\")  # our value for nlist\"\n   ],\n   \"outputs\": [\n    {\n     \"output_type\": \"stream\",\n     \"name\": \"stdout\",\n     \"text\": [\n      \"2**nbits=256\\n\"\n     ]\n    }\n   ],\n   \"metadata\": {}\n  }\n ],\n \"metadata\": {\n  \"orig_nbformat\": 4,\n  \"language_info\": {\n   \"name\": \"python\",\n   \"version\": \"3.8.8\",\n   \"mimetype\": \"text/x-python\",\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"pygments_lexer\": \"ipython3\",\n   \"nbconvert_exporter\": \"python\",\n   \"file_extension\": \".py\"\n  },\n  \"kernelspec\": {\n   \"name\": \"python3\",\n   \"display_name\": \"Python 3.8.8 64-bit ('ml': conda)\"\n  },\n  \"interpreter\": {\n   \"hash\": \"a683edd788238e5c64f9fa2e4bdd4387776bc5c6f4f0a84da0685f9a25e421d6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

We have a new parameter here, `nlist` defines how many Voronoi cells we use to cluster our _already_ quantized PQ vectors ([learn more about IndexIVF here](https://www.pinecone.io/learn/series/faiss/vector-indexes/)). You may be asking, what on earth is a Voronoi cell — what does any of this even mean? Let’s visualize some 2D ‘PQ vectors’:

![2D chart showing our reconstructed ‘PQ’ vectors. However, in reality, we would never use PQ for 2D vectors as there is simply not enough dimensionality for us to split into subvectors and subquantization.](https://cdn.sanity.io/images/vr8gru94/production/9e0dbf3ff9d987e13fa99601f4fd0dbb88d78e68-1920x1080.png)


Let’s add some Voronoi cells:

![2D chart showing our quantized ‘PQ’ vectors that have now been assigned to different Voronoi cells via IVF.](https://cdn.sanity.io/images/vr8gru94/production/cac8f97499227e918bca966c57b8c00591a81f3f-1920x1080.png)


At a high level, they’re simply a set of partitions. Similar vectors are assigned to the same partition (or _cell_), and when it comes to search, we introduce our query vector `xq` and restrict our search to the nearest cell:

![IVF allows us to restrict our search to only vectors that have been assigned nearby cells. The magenta point is our query vector xq.](https://cdn.sanity.io/images/vr8gru94/production/5e002882007c9546c5a96cb79ac300bf3226ff7a-1920x1080.png)

Now let’s go ahead with our train and search, and see how our search speed and recall are doing.


```python
index.train(xb)
index.add(xb)

dist, I = index.search(xq, k)

%%timeit
index.search(xq, k)
# 86.3 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

sum([1 for i in I[0] if i in l2_I])
# 34
```

We get a lightning-fast search time of 86.3μs, but recall has dropped significantly compared to our IndexPQ (from 50% to 34%). Given equivalent parameters, both IndexPQ and IndexIVFPQ _should_ be able to attain equal recall performance.
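The recall figures here come from intersecting the approximate results with the IDs returned by the exhaustive IndexFlatL2 search (`l2_I` in the notebook). As a minimal sketch, that calculation is just a set intersection (`recall_at_k` is a hypothetical helper name, not a Faiss function):

```python
def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact top-k neighbors recovered by the approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Recovering 2 of 4 exact neighbors gives 50% recall:
print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6]))  # 0.5
```

With `k = 100`, recovering 34 of the exact IDs is what gives the 34% above.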

The secret to improving recall in this case is increasing the `nprobe` parameter, which sets _how many_ of the nearest Voronoi cells to include in our search scope.

[Video](https://d33wubrfki0l68.cloudfront.net/55d7c99f61b9d00f795f15d30340af276758f139/9d76b/images/product-quantization-11-ivf-nprobe.mp4)


At one extreme, we can include all cells by setting `nprobe` to our `nlist` value; this returns the maximum possible recall.

Of course, we don’t want to include all cells. That would make our IVF index pointless, as we would no longer be restricting our search scope (making it equivalent to a flat index). Instead, we find the lowest `nprobe` value that still achieves this maximum recall.

```python
index.nprobe = 2048  # this searches all Voronoi cells
dist, I = index.search(xq, k)
sum([1 for i in I[0] if i in l2_I])
# 52

index.nprobe = 2
dist, I = index.search(xq, k)
sum([1 for i in I[0] if i in l2_I])
# 39

index.nprobe = 48
dist, I = index.search(xq, k)
sum([1 for i in I[0] if i in l2_I])
# 52
```

With an `nprobe` of 48, we achieve the best possible recall score of 52% (as demonstrated with `nprobe == 2048`) while minimizing our search scope, and therefore maximizing search speed.

By adding our IVF step, we’ve dramatically reduced the search time from 1.49ms for IndexPQ to 0.09ms for IndexIVFPQ. And thanks to our PQ vectors, we’ve paired that with minuscule memory usage, 96% lower than the Flat index.

All in all, IndexIVFPQ gave us a _huge reduction_ in memory usage — albeit slightly larger than IndexPQ at 9.2MB vs 6.5MB — and lightning-fast search speeds, all while maintaining a reasonable recall of around 50%.
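A back-of-the-envelope calculation shows where these savings come from. The parameters below are assumptions chosen for illustration: 500K 128-dimensional float32 vectors (which reproduces the 256MB flat figure) and `m = 8` subquantizers at `nbits = 8`. Codebooks and other index bookkeeping are ignored, which is why the real indexes report a few MB more than the raw code size:

```python
N, D = 500_000, 128   # assumed dataset size/dimensionality: gives the 256MB flat figure
m, nbits = 8, 8       # assumed PQ parameters: 8 subvectors, 1 byte (8 bits) each

flat_bytes = N * D * 4         # float32 = 4 bytes per dimension
pq_bytes = N * m * nbits // 8  # each vector becomes m codes of nbits bits

print(flat_bytes / 1e6)  # 256.0 (MB)
print(pq_bytes / 1e6)    # 4.0 (MB)
```

The gap between this 4MB of raw codes and the measured 6.5MB (IndexPQ) or 9.2MB (IndexIVFPQ) is the codebooks plus per-index overheads such as the IVF lists.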

---

That’s it for this article! We’ve covered the intuition behind product quantization (PQ), and how it manages to compress our index and enable incredibly efficient memory usage.

We put together the Faiss IndexPQ implementation and tested search times, recall, and memory usage — then optimized the index even further by pairing it with an IVF index using IndexIVFPQ.

|  | FlatL2 | PQ | IVFPQ |
| --- | --- | --- | --- |
| Recall (%) | 100 | 50 | 52 |
| Speed (ms) | 8.26 | 1.49 | 0.09 |
| Memory (MB) | 256 | 6.5 | 9.2 |

The results of our tests show impressive memory compression and search speeds, with reasonable recall scores.

If you’re interested in learning more about the variety of indexes available in search, including more detail on the IVF index, read our [article about the best indexes for similarity search](https://www.pinecone.io/learn/series/faiss/vector-indexes/).

## References

- [1] H Jégou, et al., [Product quantization for nearest neighbor search](https://www.researchgate.net/publication/47815472_Product_Quantization_for_Nearest_Neighbor_Search) (2010)
- [Jupyter Notebooks](https://github.com/pinecone-io/examples/tree/master/learn/search/faiss-ebook/product-quantization)