Random Projection for Locality Sensitive Hashing
Locality sensitive hashing (LSH) is a widely popular technique used in approximate similarity search. The solution to efficient similarity search is a profitable one — it is at the core of several billion (and even trillion) dollar companies.
The problem with similarity search is scale. Many companies deal with millions-to-billions of data points every single day. Given a billion data points, is it feasible to compare all of them with every search?
Further, many companies are not performing single searches — Google deals with more than 3.8 million searches every minute.
Billions of data points combined with high-frequency searches are problematic — and we haven’t considered the dimensionality nor the similarity function itself. Clearly, an exhaustive search across all data points is unrealistic for larger datasets.
The solution to searching impossibly huge datasets? Approximate search. Rather than exhaustively comparing every pair, we approximate — restricting the search scope only to high probability matches.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) is one of the most popular approximate nearest neighbors search (ANNS) methods.
At its core, it is a hashing function that allows us to group similar items into the same hash buckets. So, given an impossibly huge dataset — we run all of our items through the hashing function, sorting items into buckets.
Unlike most hashing functions, which aim to minimize hashing collisions — LSH algorithms aim to maximize hashing collisions.
Two hashing functions, the top (blue) minimizes hashing collisions. The bottom (magenta) maximizes hashing collisions — LSH aims to maximize collisions between similar items.
This result for LSH is that similar vectors produce the same hash value and are bucketed together. In contrast, dissimilar vectors will hopefully not produce the same hash value — being placed in different buckets.
Search With LSH
Performing search with LSH consists of three steps:
Index all of our vectors into their hashed vectors.
Introduce our query vector (search term). It is hashed using the same LSH function.
Compare our hashed query vector to all other hash buckets via Hamming distance — identifying the nearest.
At a very high level, that is the process of the LSH methodology we will be covering. However, we will explain all of this in much more detail further in the article.
Effects of Approximation
Before diving into the detail of LSH, we should note that grouping vectors into lower resolution hashed vectors also means that our search is not exhaustive (e.g., comparing every vector), so we expect a lower search quality.
We compress potentially huge dense vectors into highly compressed, grouped binary vectors.
But, the reason for accepting this lower search quality is the potential for significantly faster search speeds.
We’ve been purposely lax on the details here because there are several versions of LSH — each utilizing different hash building and distance/similarity metrics. The two most popular approaches are:
Shingling, MinHashing, and banded LSH (traditional approach)
Random hyperplanes with dot-product and Hamming distance
This article will focus on the random hyperplanes method , which is more commonly used and implemented in various popular libraries such as Faiss.
The random hyperplanes (also called random projection) approach is deceptively simple — although it can be hard to find details on the method.
Let’s learn through an example — we will be using the Sift1M dataset throughout our examples, which you can download using this script.
Now, given a single query vector
xq we want to identify the top
k nearest neighbors from the
Here we are returning the three nearest neighbors to our query vector xq.
Using the random projection method, we will reduce our highly-dimensional vectors into low-dimensionality binary vectors. Once we have these binary vectors, we can measure the distance between them using Hamming distance.
Let’s work through that in a little more detail.
The hyperplanes in this method are used to split our datapoints and assign a value of 0 for those data points that appear on the negative side of our hyperplane — and a value of 1 for those that appear on the positive side.
We assign a value of 1 to vectors on the +ve side of our hyperplane and a value of 0 to vectors on the -ve side of the hyperplane.
To identify which side of the hyperplane our data point is located, all we need is the normal vector of the plane — e.g., a vector perpendicular to the plane. We feed this normal vector (alongside our datapoint vector) into a dot product function.
If two vectors share the same direction, the resultant dot product is positive. If they do not share the same direction, it is negative.
Where our hyperplane normal vector produces a +ve dot-product with another vector, we can view that vector as being in front of the hyperplane. The reverse is true for vectors that produce a -ve dot-product.
In the unlikely case of both vectors being perfectly perpendicular (sitting on the hyperplane edge), the dot product is 0 — we will group this in with our negative direction vectors.
A single binary value doesn’t tell us much about the similarity of our vectors, but when we begin adding more hyperplanes — the amount of encoded information rapidly increases.
We add more hyperplanes to increase the amount of positional information stored in our binary vectors.
By projecting our vectors into a lower-dimensional space using these hyperplanes, we produce our new hashed vectors.
In the image above, we have used two hyperplanes, and realistically we will need many more — a property that we define using the
nbits parameter. We will discuss
nbits in more detail later, but for now, we will use four hyperplanes by setting
nbits = 4.
Now, let’s create the normal vectors of our hyperplanes in Python.
np.random.rand we create a set of random values in the range 0 → 1. We then add
-.5 to center our array values around the origin (0, 0). Visualizing these vectors, we see:
The normal vectors that define the positions of our hyperplanes, which are all centered around the origin (0, 0).
Now let’s add three vectors — a, b, and c — and work through building our hash values using our four normal vectors and their hyperplanes.
Visualizing this again, we have our three vectors a, b, and c — alongside our four hyperplanes (perpendicular to their respective normal vectors). Taking the +ve and -ve dot-product values for each gives us:
A zero shows that the vector is behind the plane (-ve dot product), and a one shows that the vector is in front of the plane (+ve dot product). We combine these to create our binary vectors.
Which produces our hashed vectors. Now, LSH uses these values to create buckets — which will contain some reference back to our vectors (Eg their IDs). Note that we do not store the original vectors in the buckets — which would significantly increase the size of our LSH index.
As we will see with implementations such as Faiss — the position/order that we added the vector is usually stored. We will use this same approach in our examples.
Now that we have bucketed our three vectors, let’s consider the change in the complexity of our search. Let’s say we introduce a query vector that gets hashed as
With this vector, we compare it to every bucket in our LSH index — which in this case is only two values —
0110. We then use Hamming distance to find the closest match, and this happens to be
Hamming distance, there are four mismatches between the first two vectors — resulting in a Hamming distance of four. The next two contain just one mismatch, giving a Hamming distance of one.
We have taken a linear complexity function, which required us to compute the distance between our query vector and all of the previously indexed vectors — to a sub-linear complexity — as we no longer need to compute the distance for every vector — because they’re grouped into buckets.
Now, at the same time, vectors 1 and 2 are both equal to
0110. So we cannot possibly find which of those is closest to our query vector. This means there is a degree of search quality being lost — however, this is simply the cost of performing an approximate search. We trade quality for speed.
Balancing Quality vs. Speed
As is often the case in similarity search, a good LSH index requires balancing search quality vs. search speed.
We saw in our mini-example that our vectors were not easy to differentiate, because out of a total of three vectors — random projection had hashed two of them to the same binary vector.
Now imagine we scale this same ratio to a dataset containing one million vectors. We introduce our query vector
xq — hash it and calculate the hamming distance between that (
0111) and our two buckets (
Wow, a million sample dataset searched with just two distance calculations? That’s fast.
Unfortunately, we return around 700K samples, all with the same binary vector value of
0110. Fast yes, accurate — not at all.
Now, in reality, using an
nbits value of
4 would produce 16 buckets:
We stuck with the two buckets of
0110 solely for dramatic effect. But even with 16 buckets — 1M vectors split into just 16 buckets still produces massively imprecise buckets.
In reality, we use many more hyperplanes — more hyperplanes mean higher resolution binary vectors — producing much more precise representations of our vectors.
We control this resolution through hyperplanes with the nbits value. A higher
nbits value improves search quality by increasing the resolution of hashed vectors.
Increasing the nbits parameter increases the number of hyperplanes used to build the binary vector representations.
Adding more possible combinations of hash values increases the potential number of buckets , increasing the number of comparisons and, therefore, search time.
It’s worth noting that not all buckets will necessarily be used — particularly with higher
nbits values. We will see through our Faiss implementation that a
nbits value of
128 or more is completely valid and still faster than using a flat index.
There is also this notebook that covers a simple implementation of LSH in Python.
LSH in Faiss
We have discussed Faiss before, but let’s briefly recap. Faiss — or Facebook AI Similarity Search — is an open-source framework built for enabling similarity search.
Faiss has many super-efficient implementations of different indexes that we can use in similarity search. That long list of indexes includes
IndexLSH — an easy-to-use implementation of everything we have covered so far.
We initialize our LSH index and add our Sift1M dataset
wb like so:
Once our index is ready, we can begin searching using
index.search(xq, k) — where
xq is our one-or-more query vectors, and
k is the number of nearest matches we’d like to return.
search method returns two arrays. Index positions
I (e.g., row numbers from
wb) of our
k best matches. And the distances
D between those best matches and our query vector
Because we have those index positions in
I, we can retrieve the original vectors from our array
And from those original vectors, we see if our LSH index is returning relevant results. We do this by measuring the cosine similarity between our query vector
xq0 and the top
There are vectors in this index that should return similarity scores of around 0.8. We’re returning vectors with similarity scores of just 0.2 — why do we see such poor performance?
Diagnosing Performance Issues
We know that the
nbits value controls the number of potential buckets in our index. When initializing this index, we set
nbits == 4 — so all of our vectors must be stored across 4-digit, low-resolution buckets.
If we try to cram 1M vectors into just 16 hash buckets, each of those buckets very likely contain 10–100K+ vectors.
So, when we hash our search query, it matches one of these 16 buckets perfectly — but the index cannot differentiate between the huge number of vectors crammed into that single bucket — they all have the same hashed vector!
We can confirm this by checking our distances
We return a perfect distance score of zero for every item — but why? Well, we know that the Hamming distance can only be zero for perfect matches — meaning all of these hashed vectors must be identical.
If all of these vectors return a perfect match, they must all have the same hash value. Therefore, our index cannot differentiate between them — they all share the same position as far as our LSH index is concerned.
Now, if we were to increase
k until we return a non-zero distance value, we should be able to infer the number of vectors that have been bucketed with this same hash code. Let’s go ahead and try it.
A single bucket containing
172_039 vectors. That means that we are choosing our top
k values at random from those 172K vectors. Clearly, we need to reduce our bucket size.
With 1M samples, which
nbits value gives us enough buckets for a more sparse distribution of vectors? It’s not possible to calculate the exact distribution, but we can take an average:
nbits value of 16 we’re still getting an average of
15.25 vectors within each bucket — which seems better than it is. We must consider that some buckets will be significantly larger than others, as different regions will contain more vectors than others.
nbits values of
32 may be our tipping point towards genuinely effective bucket sizes. Let’s find the
mean cosine similarity for each of these values.
It looks like our estimate is correct — the overall similarity for our top 100 vectors experiences a sudden with each nbits increment before leveling off at the
nbits == 24 point. But what if we run the process with even larger
As we increase vector resolution with nbits, our results will become more precise — here, we can see that a larger nbits value results in higher cosine similarity in our results.
Here, the results are apparent. A fast increase in similarity as we declutter the LSH buckets — followed by a slower increase in similarity.
The latter, slower increase in similarity is thanks to the increasing resolution of our hashed vectors. Our buckets are already very sparse — we have many more potential buckets than we have vectors , so we can find little to no performance increase there.
But — we are increasing the resolution and, therefore, the precision of those buckets, so we pull the additional performance from here.
Extracting The Binary Vectors
While investigating our index and the distribution of vectors across our buckets above, we inferred the problem of our bucket sizes. This is useful as we’re using what we have already learned about LSH and can see the effects of the indexes' properties.
However, we can take a more direct approach. Faiss allows us to indirectly view our buckets by extracting the binary vectors representations of
Let’s revert to our
nbits value of 4 and see what is stored in our LSH index.
From this, we can visualize the distribution of vectors across those 16 buckets — showing the most used buckets and a few that are empty.
Distribution of vectors in different buckets when nbits == 4.
This and our previous logic are all we need to diagnose the aforementioned bucketing issues.
Where to Use LSH
While LSH can be a swift index, it is less accurate than a Flat index. Using increasingly larger portions of our Sift1M dataset, the best recall score was achieved using an
nbits value of
768 (better recall is possible at excessive search times).
Recall against the number of indexed vectors. The recall is measured as the % of matches with exhaustive search results (using IndexFlatL2).
Although it’s worth noting that using an
nbits value of
768 only returns marginally faster results than if using a flat index.
Search time as a factor of search time for IndexFlatL2 at different index sizes and using various nbits values.
More realistic recall rates — while maintaining a reasonable speed increase — are closer to 40%.
However, varying dataset sizes and dimensionality can make a huge difference. An increase in dimensionality means a higher
nbits value must be used to maintain accuracy, but this can still enable faster search speeds. It is simply a case of finding the right balance for each use-case and dataset.
Now, of course, there are many options out there for vector similarity search. Flat indexes and LSH are just two of many options — and choosing the right index is a mix of experimentation and know-how.
As always, similarity search is a balancing act between different indexes and parameters to find the best solutions for our use-cases.
We’ve covered a lot on LSH in this article. Hopefully, this has helped clear up any confusion on one of the biggest algorithms in the world of search.
LSH is a complex topic, with many different approaches — and even more implementations available across several libraries.