# Random Projection for Locality Sensitive Hashing

Locality sensitive hashing (LSH) is a widely popular technique used in *approximate* similarity search. The solution to efficient similarity search is a profitable one — it is at the core of several billion (and even trillion) dollar companies.

The problem with similarity search is *scale*. Many companies deal with millions-to-billions of data points every single day. Given a billion data points, is it feasible to compare all of them with every search?

Further, many companies are not performing single searches — Google deals with more than 3.8 million searches every *minute*^{[1]}.

Billions of data points combined with high-frequency searches are problematic — and we haven’t considered the dimensionality nor the similarity function itself. Clearly, an exhaustive search across all data points is unrealistic for larger datasets.

The solution to searching impossibly huge datasets? *Approximate search.* Rather than *exhaustively* comparing every pair, we *approximate* — restricting the search scope only to high probability matches.

## Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is one of the most popular approximate nearest neighbors search (ANNS) methods.

At its core, it is a hashing function that allows us to group similar items into the same hash buckets. So, given an impossibly huge dataset — we run all of our items through the hashing function, sorting items into buckets.

Unlike most hashing functions, which aim to *minimize* hashing collisions — LSH algorithms aim to *maximize* hashing collisions.

Two hashing functions, the top (blue) **min**imizes hashing collisions. The bottom (magenta) maximizes hashing collisions — LSH aims to **max**imize collisions between similar items.

This result for LSH is that similar vectors produce the same hash value and are bucketed together. In contrast, dissimilar vectors will *hopefully* not produce the same hash value — being placed in different buckets.

### Search With LSH

Performing search with LSH consists of three steps:

Index all of our vectors into their hashed vectors.

Introduce our query vector (search term). It is hashed using the same LSH function.

Compare our hashed query vector to all other hash buckets via Hamming distance — identifying the nearest.

At a very high level, that is the process of the LSH methodology we will be covering. However, we will explain all of this in much more detail further in the article.

### Effects of Approximation

Before diving into the detail of LSH, we should note that grouping vectors into lower resolution *hashed* vectors also means that our search is not exhaustive (e.g., comparing *every* vector), so we expect a lower search quality.

*We compress potentially huge dense vectors into highly compressed, grouped binary vectors.*

But, the reason for accepting this lower search quality is the potential for *significantly* faster search speeds.

### Which Method?

We’ve been purposely lax on the details here because there are several versions of LSH — each utilizing different hash building and distance/similarity metrics. The two most popular approaches are:

Shingling, MinHashing, and banded LSH (traditional approach)

Random hyperplanes with dot-product and Hamming distance

This article will focus on the random hyperplanes method , which is more commonly used and implemented in various popular libraries such as Faiss.

## Random Hyperplanes

The random hyperplanes (also called random *projection*) approach is deceptively simple — although it can be hard to find details on the method.

Let’s learn through an example — we will be using the Sift1M dataset throughout our examples, which you can download using this script.

Now, given a single query vector `xq`

we want to identify the top `k`

nearest neighbors from the `xb`

array.

Here we are returning the **three** nearest neighbors to our query vector **xq**.

Using the random projection method, we will reduce our highly-dimensional vectors into low-dimensionality binary vectors. Once we have these binary vectors, we can measure the distance between them using Hamming distance.

Let’s work through that in a little more detail.

### Creating Hyperplanes

The hyperplanes in this method are used to split our datapoints and assign a value of *0* for those data points that appear on the negative side of our hyperplane — and a value of *1* for those that appear on the positive side.

We assign a value of **1** to vectors on the +ve side of our hyperplane and a value of **0** to vectors on the -ve side of the hyperplane.

To identify which side of the hyperplane our data point is located, all we need is the normal vector of the plane — e.g., a vector perpendicular to the plane. We feed this normal vector (alongside our datapoint vector) into a dot product function.

If two vectors share the same direction, the resultant dot product is positive. If they do *not* share the same direction, it is negative.

Where our hyperplane normal vector produces a +ve dot-product with another vector, we can view that vector as being in front of the hyperplane. The reverse is true for vectors that produce a -ve dot-product.

*In the unlikely case of both vectors being perfectly perpendicular (sitting on the hyperplane edge), the dot product is 0 — we will group this in with our negative direction vectors.*

A single binary value doesn’t tell us much about the similarity of our vectors, but when we begin adding *more* hyperplanes — the amount of encoded information rapidly increases.

We add more hyperplanes to increase the amount of positional information stored in our binary vectors.

By projecting our vectors into a lower-dimensional space using these hyperplanes, we produce our new *hashed* vectors.

In the image above, we have used two hyperplanes, and realistically we will need many more — a property that we define using the `nbits`

parameter. We will discuss `nbits`

in more detail later, but for now, we will use **four** hyperplanes by setting `nbits = 4`

.

Now, let’s create the normal vectors of our hyperplanes in Python.

```
nbits = 4 # number of hyperplanes and binary vals to produce
d = 2 # vector dimensions
```

```
import numpy as np
# create a set of 4 hyperplanes, with 2 dimensions
plane_norms = np.random.rand(nbits, d) - .5
plane_norms
```

```
array([[-0.26623211, 0.34055181],
```

```
[ 0.3388499 , -0.33368453],
```

```
[ 0.34768572, -0.37184437],
```

` [-0.11170635, -0.0242341 ]])`

Through `np.random.rand`

we create a set of random values in the range *0* → *1*. We then add `-.5`

to center our array values around the origin *(0, 0)*. Visualizing these vectors, we see:

The normal vectors that define the positions of our hyperplanes, which are all centered around the origin (0, 0).

### Hashing Vectors

Now let’s add three vectors — **a**, **b**, and **c** — and work through building our hash values using our four *normal vectors and their hyperplanes*.

```
a = np.asarray([1, 2])
b = np.asarray([2, 1])
c = np.asarray([3, 1])
```

```
# calculate the dot product for each of these
a_dot = np.dot(a, plane_norms.T)
b_dot = np.dot(b, plane_norms.T)
c_dot = np.dot(c, plane_norms.T)
a_dot
```

`array([ 0.41487151, -0.32851916, -0.39600301, -0.16017455])`

```
# we know that a positive dot product == +ve side of hyperplane
# and negative dot product == -ve side of hyperplane
a_dot = a_dot > 0
b_dot = b_dot > 0
c_dot = c_dot > 0
a_dot
```

`array([ True, False, False, False])`

```
# convert our boolean arrays to int arrays to make bucketing
# easier (although is okay to use boolean for Hamming distance)
a_dot = a_dot.astype(int)
b_dot = b_dot.astype(int)
c_dot = c_dot.astype(int)
a_dot
```

`array([1, 0, 0, 0])`

`b_dot`

`array([0, 1, 1, 0])`

`c_dot`

`array([0, 1, 1, 0])`

Visualizing this again, we have our three vectors **a**, **b**, and **c** — alongside our four hyperplanes (perpendicular to their respective normal vectors). Taking the +ve and -ve dot-product values for each gives us:

A **zero** shows that the vector is behind the plane (-ve dot product), and a **one** shows that the vector is in front of the plane (+ve dot product). We combine these to create our binary vectors.

Which produces our hashed vectors. Now, LSH uses these values to create buckets — which will contain some reference back to our vectors (Eg their IDs). Note that we do not store the original vectors in the buckets — which would significantly increase the size of our LSH index.

As we will see with implementations such as Faiss — the position/order that we added the vector is usually stored. We will use this same approach in our examples.

```
vectors = [a_dot, b_dot, c_dot]
buckets = {}
i = 0
for i in range(len(vectors)):
# convert from array to string
hash_str = ''.join(vectors[i].astype(str))
# create bucket if it doesn't exist
if hash_str not in buckets.keys():
buckets[hash_str] = []
# add vector position to bucket
buckets[hash_str].append(i)
print(buckets)
```

```
{'1000': [0], '0110': [1, 2]}
```

Now that we have bucketed our three vectors, let’s consider the change in the complexity of our search. Let’s say we introduce a query vector that gets hashed as `0111`

.

With this vector, we compare it to every bucket in our LSH index — which in this case is only two values — `1000`

and `0110`

. We then use Hamming distance to find the closest match, and this happens to be `0110`

.

Hamming distance, there are **four** mismatches between the first two vectors — resulting in a Hamming distance of four. The next two contain just **one** mismatch, giving a Hamming distance of one.

We have taken a linear complexity function, which required us to compute the distance between our query vector and all of the previously indexed vectors — to a sub-linear complexity — as we no longer need to compute the distance for *every* vector — because they’re grouped into buckets.

Now, at the same time, vectors **1** and **2** are both equal to `0110`

. So we cannot possibly find which of those is closest to our query vector. This means there is a degree of search quality being lost — however, this is simply the cost of performing an approximate search. *We trade quality for speed.*

## Balancing Quality vs. Speed

As is often the case in similarity search, a good LSH index requires balancing search quality vs. search speed.

We saw in our mini-example that our vectors were not easy to differentiate, because out of a total of three vectors — random projection had hashed two of them to the same binary vector.

Now imagine we scale this same ratio to a dataset containing one million vectors. We introduce our query vector `xq`

— hash it and calculate the hamming distance between that (`0111`

) and our *two* buckets (`1000`

and `0110`

).

Wow, a million sample dataset searched with just two distance calculations? That’s *fast*.

Unfortunately, we return around 700K samples, all with the same binary vector value of `0110`

. Fast yes, accurate — *not at all*.

Now, in reality, using an `nbits`

value of `4`

would produce **16** buckets:

```
nbits = 4
# this returns number of binary combos for our nbits val
1 << nbits
```

`16`

```
# we can print every possible bucket given our nbits value like so
for i in range(1 << nbits):
# get binary representation of integer
b = bin(i)[2:]
# pad zeros to start of binary representtion
b = '0' * (nbits - len(b)) + b
print(b, end=' | ')
```

`0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | 1000 | 1001 | 1010 | 1011 | 1100 | 1101 | 1110 | 1111 | `

We stuck with the two buckets of `1000`

and `0110`

solely for dramatic effect. But even with *16* buckets — 1M vectors split into just 16 buckets still produces massively imprecise buckets.

In reality, we use many more hyperplanes — more hyperplanes mean higher resolution binary vectors — producing much more precise representations of our vectors.

We control this resolution through hyperplanes with the nbits value. A higher `nbits`

value improves search quality by increasing the resolution of hashed vectors.

Increasing the **nbits** parameter increases the number of hyperplanes used to build the binary vector representations.

Adding more possible combinations of hash values increases the potential number of buckets , increasing the number of comparisons and, therefore, *search time*.

```
for nbits in [2, 4, 8, 16]:
print(f"nbits: {nbits}, buckets: {1 << nbits}")
```

```
nbits: 2, buckets: 4
```

```
nbits: 4, buckets: 16
```

```
nbits: 8, buckets: 256
```

```
nbits: 16, buckets: 65536
```

It’s worth noting that not *all* buckets will necessarily be used — particularly with higher `nbits`

values. We will see through our Faiss implementation that a `nbits`

value of `128`

or more is completely valid and still faster than using a flat index.

There is also this notebook that covers a simple implementation of LSH in Python.

## LSH in Faiss

We have discussed Faiss before, but let’s briefly recap. Faiss — or Facebook AI Similarity Search — is an open-source framework built for enabling similarity search.

Faiss has many super-efficient implementations of different indexes that we can use in similarity search. That long list of indexes includes `IndexLSH`

— an easy-to-use implementation of everything we have covered so far.

We initialize our LSH index and add our Sift1M dataset `wb`

like so:

```
import faiss
d = wb.shape[1]
nbits = 4
# initialize the index using our vectors dimensionality (128) and nbits
index = faiss.IndexLSH(d, nbits)
# then add the data
index.add(wb)
```

Once our index is ready, we can begin searching using `index.search(xq, k)`

— where `xq`

is our one-or-more query vectors, and `k`

is the number of nearest matches we’d like to return.

```
xq0 = xq[0].reshape(1, d)
# we use the search method to find the k nearest vectors
D, I = index.search(xq0, k=10)
# the indexes of these vectors are returned to I
I
```

`array([[26, 47, 43, 2, 6, 70, 73, 74, 25, 0]])`

The `search`

method returns two arrays. Index positions `I`

(e.g., row numbers from `wb`

) of our `k`

best matches. And the distances `D`

between those best matches and our query vector `xq0`

.

### Measuring Performance

Because we have those index positions in `I`

, we can retrieve the *original vectors* from our array `wb`

.

```
# we can retrieve the original vectors from wb using I
wb[I[0]]
```

```
array([[31., 15., 47., ..., 0., 1., 19.],
```

```
[ 0., 0., 0., ..., 0., 30., 72.],
```

```
[22., 15., 43., ..., 0., 0., 12.],
```

```
...,
```

```
[ 6., 15., 9., ..., 12., 17., 21.],
```

```
[33., 22., 8., ..., 0., 11., 69.],
```

` [ 0., 16., 35., ..., 25., 23., 1.]], dtype=float32)`

```
# and calculate the cosine similarity between these and xq0
cosine_similarity(wb[I[0]], xq0)
```

```
array([[0.7344476 ],
```

```
[0.6316513 ],
```

```
[0.6995599 ],
```

```
[0.20448919],
```

```
[0.3054034 ],
```

```
[0.25432232],
```

```
[0.30497947],
```

```
[0.341374 ],
```

```
[0.6914262 ],
```

` [0.26704744]], dtype=float32)`

And from those original vectors, we see if our LSH index is returning relevant results. We do this by measuring the *cosine similarity* between our query vector `xq0`

and the top `k`

matches.

There are vectors in this index that should return similarity scores of around 0.8. We’re returning vectors with similarity scores of just 0.2 — why do we see such poor performance?

### Diagnosing Performance Issues

We know that the `nbits`

value controls the number of *potential* buckets in our index. When initializing this index, we set `nbits == 4`

— so all of our vectors must be stored across 4-digit, low-resolution buckets.

If we try to cram 1M vectors into just 16 hash buckets, each of those buckets very likely contain 10–100K+ vectors.

So, when we hash our search query, it matches one of these 16 buckets perfectly — but the index cannot differentiate between the huge number of vectors crammed into that single bucket — they all have the same hashed vector!

We can confirm this by checking our distances `D`

:

`D`

`array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)`

We return a perfect distance score of *zero* for every item — but why? Well, we know that the Hamming distance can only be *zero* for *perfect* matches — meaning all of these hashed vectors must be identical.

If all of these vectors return a perfect match, they must all have the same hash value. Therefore, our index cannot differentiate between them — they all share the same position as far as our LSH index is concerned.

Now, if we were to increase `k`

until we return a non-zero distance value, we should be able to infer the number of vectors that have been bucketed with this same hash code. Let’s go ahead and try it.

```
k = 100
xq0 = xq[0].reshape(1, d)
while True:
D, I = index.search(xq0, k=k)
if D.any() != 0:
print(k)
break
k += 100
```

```
172100
```

`D # we will see both 0s, and 1s`

`array([[0., 0., 0., ..., 1., 1., 1.]], dtype=float32)`

`D[:, 172_039:172_041] # we see the hash code switch at position 172_039`

`array([[0., 1.]], dtype=float32)`

A single bucket containing `172_039`

vectors. That means that we are choosing our top `k`

values at random from those 172K vectors. Clearly, we need to reduce our bucket size.

With 1M samples, which `nbits`

value gives us enough buckets for a more sparse distribution of vectors? It’s not possible to calculate the exact distribution, but we can take an average:

```
for nbits in [2, 4, 8, 16, 24, 32]:
buckets = 1 << nbits
print(f"nbits == {nbits}")
print(f"{wb.shape[0]} / {buckets} = {wb.shape[0]/buckets}")
```

```
nbits == 2
```

```
1000000 / 4 = 250000.0
```

```
nbits == 4
```

```
1000000 / 16 = 62500.0
```

```
nbits == 8
```

```
1000000 / 256 = 3906.25
```

```
nbits == 16
```

```
1000000 / 65536 = 15.2587890625
```

```
nbits == 24
```

```
1000000 / 16777216 = 0.059604644775390625
```

```
nbits == 32
```

```
1000000 / 4294967296 = 0.00023283064365386963
```

With an `nbits`

value of 16 we’re still getting an average of `15.25`

vectors within each bucket — which seems better than it is. We must consider that some buckets will be significantly larger than others, as different regions will contain more vectors than others.

Realistically, the `nbits`

values of `24`

and `32`

may be our tipping point towards genuinely effective bucket sizes. Let’s find the `mean`

cosine similarity for each of these values.

```
xq0 = xq[0].reshape(1, d)
k = 100
for nbits in [2, 4, 8, 16, 24, 32]:
index = faiss.IndexLSH(d, nbits)
index.add(wb)
D, I = index.search(xq0, k=k)
cos = cosine_similarity(wb[I[0]], xq0)
print(np.mean(cos))
```

```
0.5424448
```

```
0.560827
```

```
0.6372647
```

```
0.6676912
```

```
0.7162514
```

```
0.7048228
```

It looks like our estimate is correct — the overall similarity for our top 100 vectors experiences a sudden with each nbits increment before leveling off at the `nbits == 24`

point. But what if we run the process with even larger `nbits`

values?

As we increase vector resolution with **nbits**, our results will become more precise — here, we can see that a larger **nbits** value results in higher cosine similarity in our results.

Here, the results are apparent. A fast increase in similarity as we declutter the LSH buckets — followed by a slower increase in similarity.

The latter, slower increase in similarity is thanks to the increasing resolution of our hashed vectors. Our buckets are already very sparse — we have many more *potential* buckets than we have vectors , so we can find little to no performance increase there.

*But* — we are increasing the resolution and, *therefore, the precision* of those buckets, so we pull the additional performance from here.

### Extracting The Binary Vectors

While investigating our index and the distribution of vectors across our buckets above, we inferred the problem of our bucket sizes. This is useful as we’re using what we have already learned about LSH and can see the effects of the indexes' properties.

However, we can take a more direct approach. Faiss allows us to *indirectly* view our buckets by extracting the binary vectors representations of `wb`

.

Let’s revert to our `nbits`

value of **4** and see what is stored in our LSH index.

```
# extract index binary codes (represented as int)
arr = faiss.vector_to_array(index.codes)
arr
```

`array([ 5, 12, 5, ..., 15, 13, 12], dtype=uint8)`

```
# we see that there are 1M of these values, 1 for each vector
arr.shape
```

`(1000000,)`

```
# now translate them into the binary vector format
(((arr[:, None] & (1 << np.arange(nbits)))) > 0).astype(int)
```

```
array([[1, 0, 1, 0],
```

```
[0, 0, 1, 1],
```

```
[1, 0, 1, 0],
```

```
...,
```

```
[1, 1, 1, 1],
```

```
[1, 0, 1, 1],
```

` [0, 0, 1, 1]])`

From this, we can visualize the distribution of vectors across those 16 buckets — showing the most used buckets and a few that are empty.

Distribution of vectors in different buckets when **nbits == 4**.

This and our previous logic are all we need to diagnose the aforementioned bucketing issues.

## Where to Use LSH

While LSH can be a swift index, it is less accurate than a Flat index. Using increasingly larger portions of our Sift1M dataset, the best recall score was achieved using an `nbits`

value of `768`

(better recall is possible at excessive search times).

Recall against the number of indexed vectors. The recall is measured as the % of matches with exhaustive search results (using **IndexFlatL2**).

Although it’s worth noting that using an `nbits`

value of `768`

only returns marginally faster results than if using a flat index.

Search time as a factor of search time for IndexFlatL2 at different index sizes and using various **nbits** values.

More realistic recall rates — while maintaining a reasonable speed increase — are closer to 40%.

However, varying dataset sizes and dimensionality can make a huge difference. An increase in dimensionality means a higher `nbits`

value must be used to maintain accuracy, but this can still enable faster search speeds. It is simply a case of finding the right balance for each use-case and dataset.

Now, of course, there are many options out there for vector similarity search. Flat indexes and LSH are just two of many options — and choosing the right index is a mix of experimentation and know-how.

As always, similarity search is a balancing act between different indexes and parameters to find the best solutions for our use-cases.

We’ve covered *a lot* on LSH in this article. Hopefully, this has helped clear up any confusion on one of the biggest algorithms in the world of search.

LSH is a complex topic, with many different approaches — and even more implementations available across several libraries.

Beyond LSH, we have many more algorithms suited for efficient similarity search, such as HNSW, IVF, and PQ. You can learn more in our overview of vector search indexes.

## References

- Jupyter Notebooks
- [1] Google Searches, Skai.io Blog

Next Chapter:

Product Quantization

## Comments