# Bag of Visual Words

In computer vision, bag of visual words (BoVW) is one of the pre-deep learning methods used for building image embeddings. We can use BoVW for content-based image retrieval, object detection, and image classification.

At a high level, comparing images with the bag of visual words approach consists of five steps:

1. Extract visual features,
2. Create visual words,
3. Build sparse frequency vectors with these visual words,
4. Adjust frequency vectors for relevant with tf-idf,
5. Compare vectors with similarity or distance metrics.

We will start by walking through the theory of how all of this works. In the second half of this article we will look at how to implement all of this in Python.

## How Bag of Visual Words Works

### Visual Features

The model derives from bag of words in natural language processing (NLP), where a chunk of text is split into words or sub-words and those components are collated into an unordered list, the so-called “bag of words” (BoW).

Similarly, in bag of visual words the images are represented by patches, and their unique patterns (or visual features) are extracted from the image.

However, despite the similarity, these visual features are not visual words just yet; we must perform a few more steps. For now, let’s focus on understanding what these visual features are.

Visual features consist of two items:

• Keypoints are points in an image, which do not change if the image is rotated, expanded, or scaled and

• Descriptors are vector representations of an image patch found at a given keypoint.

These visual features can be detected and extracted using a feature detector, such as SIFT (Scale Invariant Feature Transform), ORB (Oriented FAST and Rotated BRIEF), or SURF (Speeded Up Robust Features).

The most common is SIFT as it is invariant to scale, rotation, translation, illumination, and blur. SIFT converts each image patch into a $128$-dimensional vector (i.e., the descriptor of the visual feature).

A single image will be represented by many SIFT vectors. The order of these vectors is not important, only their presence within an image.

### Codebooks and Visual Words

After extracting visual features we build a codebook, also called a dictionary or vocabulary. This codebook acts as a repository of all existing visual words (similar to an actual dictionary, like the Oxford English Dictionary).

We use this codebook as a way to translate a potentially infinite variation of visual features into a predefined set of visual words.

How? The idea is to group similar visual features into clusters. Each cluster is assigned a central point which represents the visual word translation (or mapping) for that group of visual features. The standard approach for grouping visual features into visual words is k-means clustering.

K-means divides the data into $k$ clusters, where $k$ is chosen by us. Once the data is grouped, k-means calculates the mean for each cluster, i.e., a central point between all of the vectors in a group. That central point is a centroid (i.e., a visual word).

After finding the centroids, k-means iterates through each data point (visual feature) and checks which centroid (visual word) is nearest. If the nearest centroid has changed, the data point switches grouping, being assigned to the new nearest centroid.

This process is repeated over a given number of iterations or until the centroid positions have stabilized.

With that in mind, how do we choose the number of centroids, $k$?

It is more of an art than a science, but there are a few things to consider. Primarily, how many visual words can cover the various relevant visual features in the dataset.

That’s not an easy thing to figure out, and it’s always going to require some guesswork. However, we can think of it using the language equivalent, bag of words.

If our language dataset covered several documents about a specific topic in a single language, we would find fewer unique words than if we had thousands of documents, spanning several languages about a range of topics.

The same is true for images; dogs and/or animals could be a topic, and buildings could be another topic. As for the equivalent of different languages, this is not a perfect metaphor but we could think of different photography styles, drawings, or cartoons. All of these added layers of complexity increase the number of visual words needed to accurately represent the dataset.

Here, we could start with choosing a smaller $k$ value (e.g., $100$ or $150$) and re-run the code multiple times changing $k$ until convergence and/or our model seems to be identifying images well.

If we choose $k=150$, k-means will generate $150$ centroids and, therefore, $150$ visual words.

When we perform the mapping from new visual feature vectors to the nearest centroid (i.e., visual word), we categorize visual features into a more limited set of visual words. This process of reducing the number of possible unique vectors is called vector quantization.

Using a limited set of visual words allows us to compress our image descriptions into a set of visual word IDs. And, more importantly, it helps us represent similar features across images using a shared set of visual words.

That means that the visual words shared by two images of churches may be quite large, meaning they’re similar. However, an image of a church and an image of a dog will share far fewer visual words, meaning they’re dissimilar.

After those steps, our images will be represented by a varying number of visual words. From here we move on to the next step of transforming these visual words into image-level frequency vectors.

### Frequency Vectors

We can count the frequency of these visual words and visualize them with histograms.

The x-axis of the histogram is the codebook, while the y-axis is the frequency of each visual word (in the codebook) for that image.

If we consider $2$ images, we can represent the image histograms as follows:

To create these representations, we have converted each image into a sparse vector where each value in the vector represents an item in the codebook (i.e., the x-axis in the histograms). Most of the values in each vector will be zero because most images will only contain a small percentage of total number of visual words, which is why we refer to them as sparse vectors.

As for the non-zero values in our sparse vector, they are calculated in the same way that we calculated our histogram bar heights. They are equal to the frequency of a particular visual word in an image.

This works, but it’s a crude way to create these sparse vector representations. Because many visual words are actually not that important, we add one more step.

### Tf-idf

In language there are some words that are more important than others in that they give us more information. If we used the sentence “the history of Rome” to search through a set of articles, the words “the” and “of” should not be given the same importance as “history” or “Rome”.

These less important words are often very common. If we only consider the frequency of words shared with our “the history of Rome” query, the article with the most “the"s could be scored highest.

This problem is also found in images. A visual word extracted from a patch of sky in an image is unlikely to tell us whether this image is of a church or a dog. Some visual words are more relevant than others.

In the example above, we would expect a visual word representing the sky 1 to be less relevant than a visual word representing the cross on top of the bell tower 2.

That is why it is important to adjust the values of our sparse vector to give more weight to more relevant visual words and less weight to less relevant visual words.

To do that, we can use the tf-idf (term-frequency inverse document frequency) formula, which is calculated as follows:

$$tf\textrm{–}idf_{t,d} = tf_{t,d} * idf_t = tf_{t,d} * log\frac{N}{df_t}$$

Where:

• $tf_{t,d}$ is the term frequency of the visual word $t$ in the image $d$ (the number of times $t$ occurs in $d$),
• $N$ is the total number of images,
• $df_t$ number of images containing visual word $t$,
• $log\frac{N}{df_t}$ measures how common the visual word $t$ is across all images in the database. This is low if the visual word $d$ occurs many times in the image, high otherwise.

After tf-idf, we can visualize the vectors via our histogram again, which will better reflect the image’s features.

Before we were giving the same importance to image’s patches in an image; now they’re adjusted based on relevance and then normalized.

We’ve now trained our codebook and learned how to process each vector for better relevance and normalization. When wanting to embed new images with this pipeline, we repeat the process but avoid retraining the codebook. Meaning we:

1. Extract the visual features,
2. Transform them into visual words using the existing codebook,
3. Use these visual words to create a sparse frequency vector,
4. Adjust the frequency vector based on relevance with tf-idf, giving us our final sparse vector representations.

After that, we’re ready to compare these sparse vectors to find similar or dissimilar images.

### Measuring Similarity

There are several metrics we can use to calculate similarity or distance between two vectors. The most common are:

1. Cosine similarity,

2. Euclidean distance, and

3. Dot product similarity.

We will use cosine similarity which measures the angle between vectors. Vectors pointing in a similar direction have a lower angular separation and therefore higher cosine similarity.

Cosine similarity is calculated as:

$$cossim(A,B)= cos(\theta)=\frac{A \cdot B}{||A|| \space ||B||}$$

Cosine similarity generally gives a value ranging $[-1,1]$. However, if we think about the frequency of visual words, we cannot consider them as negative. Therefore, the angle between two term frequency vectors cannot be greater than $90°$, and cosine similarity ranges between $[0,1]$.

It equals $1$ if the vectors are pointing in the same direction (the angle equals $0$) and $0$ if vectors are perpendicular.

If we consider three different images and we build a matrix based on cosine similarity:

We can see that cosine similarity is $1$ when the image is exactly the same (i.e., in the main diagonal). The cosine similarity approaches $0$ as the images have less in common.

Let’s now move on to implementing bag of visual words with Python.

## Implementing Bag of Visual Words

The next section will work through the implementation of everything we’ve just learned in Python. If you’d like to follow along, use this Colab notebook.

### Imagenette Dataset Preprocessing

First, we want to import a dataset of images to train the model.

Feel free to use any images you like. However, if you’d like to follow along with the same dataset, we will use the frgfm/imagenette dataset from HuggingFace Datasets.

The dataset contains 9469 images, covering a range of images with dogs, radios, fishing, cities, etc. The image feature contains the images themselves stored as PIL object, meaning we can view them in a notebook like so:

To process these images we need to transform them from PIL objects to numpy arrays.

The dataset mostly consists of color images containing three color channels (red, green, and blue), but some are also grayscale containing just a single channel (brightness). To optimize processing time and keep everything as simple as possible, we will transform color images to grayscale.

The arrays in bw_images are what we will be using to create our visual features, visual words, frequency vectors, and tf-idf vectors.

### Visual Features

With our dataset prepared we’re ready to move on to extracting visual features (both keypoints and descriptors). As mentioned earlier, we will use the SIFT feature detection algorithm.

It’s worth noting that if an image doesn’t have any noticeable features (e.g., it is a flat image without any edges, gradients, etc.), extraction with SIFT can return None. We don’t have that problem with this dataset, but it’s something to watch out for with others.

Now that we have extracted the visual features, we can visualize them with matplotlib.

The centre of each circle is the keypoint location, and the lines from the centre of each circle represent keypoint orientation. The size of each circle is the scale at which the features were detected.

With our visual features ready, we can move onto the next step of creating visual words.

### Visual Words and the Codebook

Earlier we described the “codebook”. The codebook acts as a vocabulary where we store all of our visual words. To create the codebook we use k-means clustering to quantize our visual features into a smaller set of visual words.

Our full set of visual features is big, and training k-means with the full set will take some time. So, to avoid that and also emulate a real-world scenario where we are unlikely to train on all images that we’ll ever process, we will use a smaller sample of 1000 images.

Our descriptors_sample contains a single array for each image, and each array can contain a varying number of SIFT feature vectors. When training k-means, we only care about the feature vectors, we don’t care about which image they’re coming from. So, we need to flatten descriptors_sample into a single array containing all descriptors.

From this, we get all_descriptors, a single array containing all feature vectors from our sample. There are ~1.3M of these.

We now want to group similar visual features (descriptors) using k-means. After a few tests, we chose $k=200$ for our model.

After k-means, all images will have been reduced to visual words, and the full set of these visual words become our codebook.

Once built, the codebook does not change. No matter how many more visual features we process, no more visual words are added as we will use it solely as a mapping between new visual features and the existing visual words.

It can be difficult to find the optimal size of our codebook - if too small, visual words could be unrepresentative of all image regions, and if too large, there could be too many visual words with little to no of them being shared between images (making comparisons very hard or impossible).

With our codebook complete, we can use it to transform the full dataset of visual features into visual words.

We can see here that image 0 contains 397 visual words; the first five of those are represented by [84, 22, 45, 172, 172], which are the index values of the visual word vector found in the codebook. This visual word vector shares the same dimensionality as our SIFT feature vectors because it represents a cluster centroid from those feature vectors.

### Sparse Frequency Vectors

After building our codebook and creating our image representations with visual words, we can move on to building sparse vector representations from these visual words.

We do this to compress the many visual word vectors representing our images into a single vector of set dimensionality. By doing this, we are able to directly compare our image representations using metrics like cosine similarity and Euclidean distance.

To create these frequency vectors, we look at how many times each visual word is found in an image. There are only 200 unique visual words (the length of our codebook), so each of these frequency vectors will have dimensionality 200, where each value becomes a count for a specific visual word.

After creating the frequency vectors, we’re left with a single vector representation for each image. We can see an example of the frequency vector for image 0 below:

### Tf-idf

Our frequency vector can already be used for comparing images using our similarity and distance metrics. However, it is not ideal as it does not consider the different levels of relevance of each visual word. So, we must use tf-idf to adjust the frequency vector to consider relevance. $$tf\textrm{-}idf_{t,d} = tf_{t,d} * idf_t = tf_{t,d} * log\frac{N}{df_t}$$ We first calculate $N$ and $df_t$, both of which are shared across the entire dataset as the image $d$ is not considered by either parameter. Naturally, $idf_t$ also produces a single vector shared by the full dataset.

With $idf_t$ calculated, we just need to multiply it by each $tf_{t,d}$ vector to get our $tf \textrm{-} idf_{t,d}$ vectors. Fortunately, we already have the $tf_{t,d}$ vectors, as they are our frequency vectors.

We now have 9469 200-dimensional sparse vector representations of our images.

These sparse vectors have been built in such a way that images that share many similar visual features should share similar sparse vectors. We can use cosine similarity to compare these images and identify similar images.

We will start by searching with image 1200:

![image-20220801174017452](/Users/jamesbriggs/Library/Application Support/typora-user-images/image-20220801174017452.png)

The top image is of course the same image; as they are exact matches we can see the expected cosine similarity score of 1.0. Following this, we have two highly similar results. Interestingly, the fourth image seems to have been pulled through due to similarity in background foliage.

Here are a few more sets of results, showing the varied performance of the approach with different items.

Here we get a good first result, followed by irrelevant images and a final good result in fifth position, a 50% success rate.

Bag of visual words seems to work well with golf balls, identifying 3/4 of relevant images.

If you’re interested in seeing more results, check out the Colab notebook here.

That’s it for this article on bag of visual words, one of the most successful methods for image classification and retrieval without the use of neural networks or other deep learning methods.

One great aspect of this approach is that it is fairly reliable and interpretable. There is no black box of AI here, so when applied to a lot of data we will rarely get too many surprising results.

We know that images with similar edges, textures, and colors are likely to be identified as similar; the features being identified are set by the SIFT (or other) algorithms.

All of this makes bag of visual words a good option for image retrieval or classification, where we need to focus on the features that we know algorithms like SIFT can deal with, i.e. we’re focused on finding similar object edges (that are resistant to scale, noise, and illumination changes). Finally, these well-defined algorithms give us a huge advantage when interpretability is important.

## Resources

Code Notebook, GitHub Examples Repo