Pretrained models dominate the world of machine learning. Very few ML projects begin by training a new model from scratch. Instead, most start by taking an off-the-shelf model like ResNet or BERT and fine-tuning it for a new domain, or by reusing an existing in-house model for the same purpose.
The ecosystem of pretrained models, both external and in-house, has allowed us to push the limits of what is possible. This doesn’t mean, however, that there are no challenges.
Fortunately, we can tackle some of these problems across many different pretrained models, because they often share similar points of failure. One of those is the excessive compute and data needed to fine-tune a pretrained model for classification.
In a common scenario, a model contains a linear layer that outputs a classification. Preceding this linear layer can be anything from a small neural network to a billion-parameter language model. In either case, it’s the classification layer that produces the final prediction.
That means we can almost ignore the preceding model layers and focus on the classification layer alone. This classification layer can become a single point of failure (or success) for accurate predictions.
The classification layer alone can be fine-tuned, and it often is. A common approach for fine-tuning this layer may look like this:
- Collect a dataset that focuses on enabling the model to adapt to a new domain or handle data drift,
- Slog through this dataset, labeling each record with its class, and
- Once the records have all been labeled, fine-tune the classifier.
This approach works, but it isn’t efficient. There is a better way…
We need to focus fine-tuning efforts on essential samples that would have the greatest impact on the performance of the classifier. Otherwise, we waste time and compute by annotating and fine-tuning on samples that make little-to-no difference to model performance.
The question becomes: How do you determine which samples are essential? That’s where vector search comes in. You can use vector search to identify and focus on the essential records that really make a difference in model performance. This will save valuable time and compute by skipping all non-essential records when fine-tuning the model.
All code covering the content of this article can be found here.
Training with Vector Search
Vector search will play a key role in optimizing our training steps. First, let’s understand where vector search fits into all of this.
Many state-of-the-art (SOTA) models are available for use as pretrained models. That includes models like Google’s BERT and T5, and OpenAI’s CLIP. These models use millions, even billions, of parameters and perform many complex operations. Yet, when applied to classification, these models rely on simple linear or feedforward network layers to make the final prediction.
The reason for this is that these models are not trained to make class predictions; they’re trained to produce vector embeddings.
Vectors created by these models are full of helpful information and belong to a learned structure in a high-dimensional vector space. That helpful information is abstracted beyond human comprehension, but the effect is that similar items are located close to one another in vector space, whereas dissimilar items are not.
The result is that each of these models creates a “map” of information. Using this map, they can consume data, like images and text, and output a meaningful vector representation of said data.
In these maps, we will find that sentences, images, or whatever form of data you’re working with belongs to a specific region based on the data’s characteristics.
Pretrained models are very good at producing accurate maps of information. Because of that, all we need to translate these into accurate class predictions is a simple layer that learns to identify the different regions in this map.
Linear Classifiers
A typical architecture for classification consists of a pretrained model followed by a linear layer. A binary linear classifier (that predicts one of two labels) works by taking the dot product between an input vector $X$ and its own internal weights $W$. Based on a threshold, the output of this operation will be categorized as one of two classes.
The dot product of two vectors returns a positive score if they share a similar direction, $0$ if they are orthogonal, and a negative score if they have opposite directions.
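As a quick illustration (a minimal NumPy sketch, not the article’s training code), a binary linear classifier reduces to a dot product followed by a threshold; the threshold of 0 here is an assumption for illustration:

import numpy as np

def predict(X, W, threshold=0.0):
    # one dot product score per input vector; classify by sign
    scores = X @ W
    return np.where(scores > threshold, 1, -1)

# three 2-d "embeddings" and a weight vector
X = np.array([[0.99, 0.1], [0.1, 0.99], [-0.99, 0.1]])
W = np.array([1.0, 0.0])
predict(X, W)  # -> array([ 1,  1, -1])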
There is one key problem with dot product similarity: it considers both direction and magnitude. Magnitude is troublesome because vectors with greater magnitudes often overpower more similar, lower-magnitude vectors. To avoid this, we normalize the vectors output by our pretrained models.
The result is that a linear classifier must learn to align its internal weights $W$ with the vectors $X$ labeled as $+1$ and push its internal weights away from vectors labeled as $-1$.
Fine-tuning the classifier like this works, but there are some unnecessary limitations. First, imagine we return only irrelevant samples for a training batch. They will all be marked as $-1$. The classifier knows to move away from these vectors, but it cannot know which direction to move towards. In high-dimensional spaces, this is problematic and will cause the classifier to move at random.
Second, relevance is rarely all-or-nothing. To the query “dogs in the snow”, “a dog” is more relevant than “a truck”, but less relevant than “a dog in the snow”.
What we need is a gradient of relevance: a continuous range from $-1$ to $+1$. This solves the first problem, as the range of scores tells the classifier the best direction to move in, and the second, as we can now assign relevance scores with more precision.
All of this allows a linear classifier to learn where to place itself within the vector space produced by the model layers preceding it.
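To make this concrete, here is a minimal PyTorch sketch of one fine-tuning step over such continuous scores. The 512-dimensional input size, the bias-free layer, and the MSE loss are assumptions for illustration, not necessarily the article’s exact setup:

import torch

# a single linear layer acting as the classification head
classifier = torch.nn.Linear(512, 1, bias=False)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

def train_step(X, y):
    # X: (batch, 512) normalized embeddings; y: (batch,) scores in [-1, +1]
    optimizer.zero_grad()
    preds = classifier(X).squeeze(-1)
    loss = loss_fn(preds, y)
    loss.backward()
    optimizer.step()
    return loss.item()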
That describes the fine-tuning process, but we cannot do this across our entire dataset. It would take too much time annotating everything. To do this efficiently, we must capitalize on the idea of identifying relevant vs. irrelevant vectors within proximity of the model’s learned weights.
By identifying the vectors with the highest proximity to the classifier’s learned boundaries, we are able to skip irrelevant samples that make little-to-no impact on classifier performance. Instead, we hone in on the critical region of vectors near the target vector space.
Training Efficiently with Vector Search
During training, we need to feed vectors generated by the preceding layers into our linear classifier. Those vectors also need to be labeled. But, if our classifier is already tuned to understand the vector space generated by the previous layers, most training data is unlikely to be helpful.
We need to focus our fine-tuning efforts on records that are similar enough to our target class to confuse our model. For an already trained classifier, these are the false positives and false negatives predicted by the classifier.
However, we don’t usually have a list of false positives and false negatives. But we do know that the solvable errors will be present near the classifier’s decision boundary: the boundary that separates positive predictions from negative predictions.
Due to the proximity of these samples, it is harder for the classifier to find the exact boundary that best identifies true positives vs. true negatives.
Vector search allows us to retrieve the samples closest to the model weights $W$. We can then label the returned samples and use them to train our model. The model optimizes its internal weights; we extract them again, search, and repeat.
We focus annotation and training on essential samples by retrieving the most similar vectors. Doing this avoids wasting time and compute on samples that make little to no difference to our model performance.
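Put together, the procedure looks roughly like this. It is an illustrative sketch only: label_samples, train_step, and get_classifier_weights are hypothetical stand-ins for your annotation tool, training step, and weight-extraction code:

W = initial_query_vector  # e.g. existing classifier weights

for _ in range(num_iterations):
    # retrieve the indexed vectors closest to the current weights
    matches = vector_index.query(W, top_k=10)
    X = [m.vector for m in matches]
    y = label_samples(matches)    # human-assigned scores in [-1, +1]
    train_step(X, y)              # fine-tune the linear classifier on X, y
    W = get_classifier_weights()  # extract the updated weights and repeat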
Putting it All Together
Now let’s combine all this to fine-tune a linear classifier with vector search.
There are two parts to our training process:
- Indexing our data: Here we must embed everything as vectors using the “preceding” model layers (BERT, ResNet, CLIP, etc.).
- Fine-tuning the classifier: We will query using model weights $W$, return the most similar (or high scoring) records, annotate, and fine-tune the model.
If you already have an indexed dataset, you can skip ahead to the Fine-tuning section. If not, we’ll work through the indexing steps next.
Indexing
Given a dataset of images (or other formats), we first need to process everything through the preceding model layers to generate a list of vectors to be indexed. These vectors will later be used as the training data for the model.
The terms vectors, embeddings, and vector embeddings will be used interchangeably. When specifying embeddings produced by a specific medium (such as images or text), we will refer to them as “image embeddings” or “text embeddings”.
For our example, we will use CLIP, a model capable of comparing both text and images. OpenAI’s CLIP has been trained to match natural language prompts to images. It does this by encoding matching pairs as closely as possible in a shared vector space.
Initialization of Dataset and CLIP
We need an image dataset and CLIP (swap these for your dataset and model where relevant). We will use the frgfm/imagenette dataset found on Hugging Face Datasets.
!pip install datasets

from datasets import load_dataset

imagenet = load_dataset(
    'frgfm/imagenette',
    'full_size',
    split='train',
    ignore_verifications=False  # set to True if you see a splits error
)
imagenet
Dataset({
    features: ['image', 'label'],
    num_rows: 9469
})
In the “image” feature of the dataset, we have ~9.4K images of various sizes stored as PIL objects. Inside a Jupyter notebook, we can view them like so:
imagenet[0]['image']
We embed these images using CLIP, which we initialize via the Hugging Face Transformers library.
# !pip install transformers torch
from transformers import CLIPProcessor, CLIPModel
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)
We can embed an image and transform it into a flat Python list (ready for indexing) like so:
image = processor(
    text=None,
    images=imagenet[0]['image'],
    return_tensors='pt',
    padding=True
)['pixel_values'].to(device)
out = model.get_image_features(pixel_values=image)
out.shape
torch.Size([1, 512])
out = out.squeeze(0)
out.shape
torch.Size([512])
emb = out.cpu().detach().numpy()
emb.shape
(512,)
Normalization is Important
The linear classifier we train later uses the dot product to calculate predictions. That means we must also use the dot product to measure the similarity between image embeddings during the vector search. Given two similar images of dogs and an image of a radio, we would expect the two dog images to return a higher score.
We would expect two nearby embeddings like a and b to return a higher similarity score with each other than with c. Yet, when we calculate the dot product between these embeddings, the magnitude of c produces higher scores.
import numpy as np
a = np.array([0.3, 1])
b = np.array([0.4, 1.3])
c = np.array([10, 1])
a @ b # these are the most similar
1.42
a @ c # c is the radio...
4.0
b @ c # the magnitude of c gives it a high dot product
5.3
Dot product is heavily influenced by vector magnitude. This means two very similar vectors with low magnitudes can score lower against each other than against a dissimilar vector with greater magnitude.
We solve this problem by normalizing all of our vectors beforehand. By doing this, we “flatten” the magnitude across vectors, leaving just the angular difference between them.
Normalization “flattens” the magnitude of our vectors.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)
c = c / np.linalg.norm(c)
a, b, c
(array([0.28734789, 0.95782629]),
array([0.29408585, 0.95577901]),
array([0.99503719, 0.09950372]))
a @ b # these now have a high dot product
0.9999752042549461
a @ c # c is not so similar anymore
0.38122911022229045
b @ c # also here
0.38772992263784645
After normalizing our embedding with emb = emb / np.linalg.norm(emb), we can move on to indexing it in our vector database.
Vector Database and Indexing
Here we will use the Pinecone vector database. All we need is a free API key and environment variable, which can be found here. To install the Pinecone Python client, we use pip install pinecone-client. Finally, we import and initialize the connection.
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
# (default env is 'us-east1-gcp')
After connecting to Pinecone, we create a new index where we will store our vectors.
index_name = "imagenet-query-trainer-clip"
pinecone.create_index(
index_name,
dimension=emb.shape[0],
metric="dotproduct",
metadata_config={"indexed": ["seen"]}
)
# connect to the index
index = pinecone.Index(index_name)
We specify four parameters for our index:

- index_name: the name of our vector index; it can be anything.
- dimension: the dimensionality of our vector embeddings. This must match the vector dimensionality output by CLIP, and all future vectors must share it. Our vectors have 512 dimensions.
- metric: the similarity metric we will use. Pinecone accepts "euclidean", "cosine", and "dotproduct". As discussed, we will be using "dotproduct".
- metadata_config: Pinecone has both indexed and non-indexed metadata. Indexed metadata can be used in metadata filtering, and we need this for “exploring” the image dataset. So, we index a single field called "seen".
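The new index starts out empty, so we first upsert our single normalized embedding emb, together with its "seen" metadata field. A minimal version looks like this (the ID "0" is an arbitrary choice):

index.upsert([("0", emb.tolist(), {"seen": 0})])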
With this, we have indexed a single vector (emb) in our Pinecone index. We can check this by running index.describe_index_stats(), which will return:
{'dimension': 512,
'index_fullness': 0.0,
'namespaces': {'': {'vector_count': 1}},
'totalVectorCount': 1.0}
Those are all the steps we need to embed and index an image. Let’s apply these steps to the remainder of the dataset.
Index Everything
There’s little we can do with a single vector, so we will repeat the previous steps on the rest of our dataset. We place the previous logic into a loop, iterate once over the dataset, and we’re done.
from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(imagenet), batch_size)):
    # select the batch start and end
    i_end = min(i + batch_size, len(imagenet))
    # some images are grayscale (mode == 'L'); we only keep 'RGB' images
    batch = imagenet[i:i_end]['image']
    keep = [j for j, img in enumerate(batch) if img.mode == 'RGB']
    images = [batch[j] for j in keep]
    if not images:
        continue  # skip batches that contain no RGB images
    # process images and extract pytorch tensor pixel values
    pixels = processor(
        text=None,
        images=images,
        return_tensors='pt',
        padding=True
    )['pixel_values'].to(device)
    # feed tensors to model and extract image features
    out = model.get_image_features(pixel_values=pixels)
    embeds = out.cpu().detach().numpy()
    # normalize each embedding (row) to unit length and convert to list
    embeds = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    embeds = embeds.tolist()
    # create ID values aligned with the original dataset positions
    ids = [str(i + j) for j in keep]
    # prep metadata
    meta = [{'seen': 0} for _ in ids]
    # zip all data together and upsert
    to_upsert = list(zip(ids, embeds, meta))
    index.upsert(to_upsert)
There’s a lot of code here, but it’s nothing more than a compact version of the previous steps. We can check the number of records added using the describe_index_stats method:
{'dimension': 512,
'index_fullness': 0.0,
'namespaces': {'': {'vector_count': 9296}},
'totalVectorCount': 9296.0}
We have slightly fewer records here because we drop grayscale images inside the upsert loop.
Fine-Tuning
With everything indexed, we’re ready to take our classifier model and optimize it on the most relevant samples in our dataset. You can follow along live using this Colab notebook.
You may or may not have a classifier already trained. If you do have a classifier, you can skip ahead a few paragraphs to the Classifier section.
If you do not have a classifier, we can begin by setting the model weights $W$ equal to the vector produced by a relevant query. This is where the text-to-image capabilities of CLIP come into use. Given a natural language prompt like “dogs in the snow”, we can use CLIP to embed this into the same vector space as our image embeddings.
from transformers import CLIPTokenizerFast, CLIPModel
model_id = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)
prompt = "dogs in the snow"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.get_text_features(**inputs)
out.shape
torch.Size([1, 512])
out = out.squeeze(0)
out.shape
torch.Size([512])
xq = out.cpu().detach().numpy().tolist()
len(xq)
512
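Because all of the indexed vectors are normalized, scaling xq does not change the dot-product ranking of the search results. If you also want the query on the same unit scale as the image embeddings before using it as initial classifier weights, you can optionally normalize it too:

xq = (np.array(xq) / np.linalg.norm(xq)).tolist()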
We will set our initial model weights equal to xq, but first, let’s retrieve the first batch of training samples.
As with the image embeddings, we need to transform the CLIP output into a flat list for querying with Pinecone, retrieving the image idx and vector values:
xc = index.query(xq, top_k=10, include_values=True)
# get the index values
idx = [int(match['id']) for match in xc['matches']]
# get the vectors
values = [match['values'] for match in xc['matches']]
The “dogs in the snow” query is mostly accurate, with two exceptions showing dogs on white, non-snow backgrounds.
These images and their embeddings act as the training data for our classifier. The embeddings themselves become the inputs X. We allow the user to create the labels y by entering a score from -1 to +1. All of this is performed by a function called score_images; the code for this can be found here.
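For illustration, a bare-bones version of score_images might look like the following. This is a hypothetical sketch, not the linked implementation; it simply displays each image and asks for a manual score:

from IPython.display import display

def score_images(idx):
    # show each retrieved image and collect a human-assigned score
    scores = []
    for i in idx:
        display(imagenet[i]['image'])  # render the PIL image in the notebook
        scores.append(float(input(f"Score for image {i} (-1 to +1): ")))
    return scores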
scores = score_images(idx)