Facebook AI Similarity Search (Faiss) is one of the most popular implementations of efficient similarity search, but what is it — and how can we use it?
What is it that makes Faiss special? How do we make the best use of this incredible tool?
Note: Pinecone lets you implement vector search into your applications with just a few API calls, without knowing anything about Faiss. However, you like seeing how things work, so enjoy the guide!
Fortunately, it’s a brilliantly simple process to get started with. And in this article, we’ll explore some of the options Faiss provides, how they work, and — most importantly — how Faiss can make our search faster.
Check out the video walkthrough here:
What is Faiss?
Before we get started with any code, many of you will be asking — what is Faiss?
Faiss is a library — developed by Facebook AI — that enables efficient similarity search.
So, given a set of vectors, we can index them using Faiss — then using another vector (the query vector), we search for the most similar vectors within the index.
Now, Faiss not only allows us to build an index and search — but it also speeds up search times to ludicrous performance levels — something we will explore throughout this article.
Building Some Vectors
The first thing we need is data. We’ll be concatenating several datasets from this semantic test similarity hub repo. We will download each dataset and extract the relevant text columns into a single list.
Next, we remove any duplicates, leaving us with 14.5K unique sentences. Finally, we build our dense vector representations of each sentence using the sentence-BERT library.
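As a rough sketch of these steps (the model name is an assumption, and sentences is the combined list built above):

```python
from sentence_transformers import SentenceTransformer

# remove duplicates from the combined list of sentences
sentences = list(set(sentences))

# encode each sentence into a 768-dimensional dense vector
# (model choice is an assumption; any sentence-BERT model works)
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)

print(sentence_embeddings.shape)  # (~14.5K, 768)
```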
Now, building these sentence embeddings can take some time — so feel free to download them directly from here (you can use this script to load them into Python).
Plain and Simple
We’ll start simple. First, we need to set up Faiss. Now, if you’re on Linux — you’re in luck — Faiss comes with built-in GPU optimization for any CUDA-enabled Linux machine.
MacOS or Windows? Well, we’re less lucky.
(Don’t worry, it’s still ludicrously fast)
So, CUDA-enabled Linux users, type conda install -c pytorch faiss-gpu. Everyone else, conda install -c pytorch faiss-cpu. If you don’t want to use conda, there are alternative installation instructions here.
Once we have Faiss installed, we can open Python and build our first, plain and simple index with IndexFlatL2.

IndexFlatL2 measures the L2 (or Euclidean) distance between our query vector and all of the vectors loaded into the index. It’s simple, very accurate, but not too fast.
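Concretely, IndexFlatL2 computes the squared Euclidean distance (the square root is omitted, since it does not change the ranking):

$$d(x_q, y) = \sum_{i=1}^{d} (x_{q,i} - y_i)^2$$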
L2 distance calculation between a query vector xq and our indexed vectors (shown as y)
In Python, we would initialize our IndexFlatL2 index with our vector dimensionality (768 — the output size of our sentence embeddings) like so:
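```python
import faiss

# sentence_embeddings comes from the data preparation step above
d = sentence_embeddings.shape[1]  # 768, our vector dimensionality
index = faiss.IndexFlatL2(d)
```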
Often, we’ll be using indexes that require us to train them before loading in our data. We can check whether an index needs to be trained using the is_trained attribute. IndexFlatL2 is not an index that requires training, so checking it should return True.
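```python
index.is_trained  # True: flat indexes need no training step
```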
Once ready, we load our embeddings and query like so:
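```python
index.add(sentence_embeddings)  # load our vectors into the index

k = 4  # how many nearest neighbours to return (an illustrative choice)
# the query sentence below is an assumption, for illustration only
xq = model.encode(["Someone sprints with a football"])
D, I = index.search(xq, k)  # returns distances D and vector IDs I
print(I)
```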
Which returns the top k vectors closest to our query vector xq, including vector 5747. Clearly, these are all great matches — all including either people running with a football or in the context of a football match.
Now, if we’d rather extract the numerical vectors from Faiss, we can do that too.
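For example, using index.reconstruct with the ID 5747 from our results:

```python
# rebuild the stored vector for a given ID
vector_5747 = index.reconstruct(5747)
print(vector_5747.shape)  # (768,)
```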
However, using the IndexFlatL2 index alone is computationally expensive; it doesn’t scale well.
When using this index, we are performing an exhaustive search — meaning we compare our query vector xq to every other vector in our index. In our case, that is 14.5K L2-distance calculations for every search.
Imagine how this scales for datasets containing 1M, 1B, or even more vectors — and what happens when we include several query vectors per search?
Milliseconds taken to return a result (y-axis) / number of vectors in the index (x-axis) — relying solely on IndexFlatL2 quickly becomes slow
Our index quickly becomes too slow to be useful, so we need to do something different.
Partitioning The Index
Faiss allows us to add multiple steps that can optimize our search using many different methods. A popular approach is to partition the index into Voronoi cells.
We can imagine our vectors as each being contained within a Voronoi cell — when we introduce a new query vector, we first measure its distance to each cell centroid, then restrict our search scope to that centroid’s cell.
Using this method, we would take a query vector xq, identify the cell it belongs to, and then use our IndexFlatL2 (or another metric) to search between the query vector and all other vectors belonging to that specific cell.
So, we are reducing the scope of our search, producing an approximate answer, rather than exact (as produced through exhaustive search).
To implement this, we first initialize our index using IndexFlatL2 — but this time, we are using the L2 index as a quantizer step — which we feed into the partitioning IndexIVFFlat index.
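A sketch of this setup (the nlist value here is an assumption):

```python
nlist = 50  # how many Voronoi cells to partition the index into (an assumption)

quantizer = faiss.IndexFlatL2(d)  # the flat L2 index becomes the quantizer step
index = faiss.IndexIVFFlat(quantizer, d, nlist)
```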
Here we’ve added a new parameter, nlist. We use nlist to specify how many partitions (Voronoi cells) we’d like our index to have.
Now, when we built the previous IndexFlatL2-only index, we didn’t need to train it, as no grouping or transformations were required. Because we added clustering with IndexIVFFlat, this is no longer the case.
So, what we do now is train our index on our data — which we must do before adding any data to the index.
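```python
index.is_trained  # False: the IVF index is untrained so far
index.train(sentence_embeddings)  # learn the nlist cell centroids from our data
index.is_trained  # True
```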
Now that our index is trained, we add our data just as we did before.
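```python
index.add(sentence_embeddings)
index.ntotal  # ~14.5K vectors now indexed
```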
Let’s search again using the same indexed sentence embeddings and the same query vector xq:
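```python
D, I = index.search(xq, k)  # by default, only the single nearest cell is searched
print(I)
```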
The search time has clearly decreased. In this case, we find no difference between the results returned by our exhaustive search and this approximate search — but often this will not be the case.
If approximate search with IndexIVFFlat returns suboptimal results, we can improve accuracy by increasing the search scope. We do this by increasing the nprobe attribute value — which defines how many nearby cells to search.
Searching the single closest cell when nprobe == 1 (left), and searching the eight closest cells when nprobe == 8 (right)
We can implement this change easily.
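Setting nprobe to eight, as in the figure above:

```python
index.nprobe = 8  # search the eight nearest cells instead of one
D, I = index.search(xq, k)
print(I)
```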
Now, because we’re searching a larger scope by increasing the nprobe value, we will see the search time increase too.
Query time / number of vectors for the IVFFlat index with different nprobe values — 1, 5, 10, and 20
Although, even with the larger nprobe value, we still see much faster responses than we returned with our original IndexFlatL2-only index.
If we go ahead and attempt to use index.reconstruct(<vector_idx>) again, we will return a RuntimeError, as there is no direct mapping between the original vectors and their index positions due to the addition of the IVF step.
So, if we’d like to reconstruct the vectors, we must first create these direct mappings using index.make_direct_map().
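```python
index.make_direct_map()  # map vector IDs back to their stored positions
```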
And from there we are able to reconstruct our vectors just as we did before.
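```python
index.reconstruct(5747)[:10]  # works again now that the direct map exists
```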
We have one more key optimization to cover. All of our indexes so far have stored our vectors as full (e.g., Flat) vectors. Now, in very large datasets this can quickly become a problem.
Fortunately, Faiss comes with the ability to compress our vectors using Product Quantization (PQ).
But, what is PQ? Well, we can view it as an additional approximation step with a similar outcome to our use of IVF. Where IVF allowed us to approximate by reducing the scope of our search, PQ approximates the distance/similarity calculation instead.
PQ achieves this approximated similarity operation by compressing the vectors themselves, which consists of three steps.
Three steps of product quantization
- We split the original vector into several subvectors.
- For each set of subvectors, we perform a clustering operation — creating multiple centroids for each subvector set.
- In our vector of subvectors, we replace each subvector with the ID of its nearest set-specific centroid.
To implement all of this, we use the IndexIVFPQ index — we’ll also need to train the index before adding our embeddings.
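A sketch with illustrative parameters (the m and bits values are assumptions; m must divide the vector dimensionality d):

```python
m = 8     # number of subvectors to split each vector into (768 / 8 = 96 dims each)
bits = 8  # bits per centroid ID, giving 2**8 = 256 centroids per subvector set

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)

index.train(sentence_embeddings)  # learns the IVF centroids and the PQ codebooks
index.add(sentence_embeddings)
```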
And now we’re ready to begin searching using our new index.
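Searching with the same xq as before (the nprobe value here is an assumption):

```python
index.nprobe = 10  # tune to taste, trading speed for accuracy
D, I = index.search(xq, k)
print(I)
```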
Speed or Accuracy?
By adding PQ we’ve reduced our IVF search time from ~7.5ms to ~5ms — a small difference on a dataset of this size, but one that quickly becomes significant when scaled up.
However, we should also take note of the slightly different results being returned. Beforehand, with our exhaustive L2 search, we were returning results including vector 5747. Now, we see a slightly different order of results — and two different IDs appearing among them.
Both of our speed optimization operations, IVF and PQ, come at the cost of accuracy. Now, if we print out these results we will still find that each item is relevant:
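```python
# look up the original text behind each returned ID
# (assumes the sentences list built during data preparation)
for i in I[0]:
    print(sentences[i])
```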
So, although we might not get the perfect result, we still get close — and thanks to the approximations, we get a much faster response.
Query time / number of vectors for our three indexes
And, as shown in the graph above, the difference in query times becomes increasingly relevant as our index size increases.
That’s it for this article! We’ve covered the essentials to getting started with building high-performance indexes for search in Faiss.
Clearly, a lot can be done using the indexes we’ve covered (IndexFlatL2, IndexIVFFlat, and IndexIVFPQ), and each has many parameters that can be fine-tuned to our specific accuracy/speed requirements. And as shown, we can produce some truly impressive results, at lightning-fast speeds, very easily thanks to Faiss.
Want to run Faiss in production? Pinecone provides vector similarity search that’s production-ready, scalable, and fully managed.