What are Vector Embeddings?

In machine learning, we often represent real-world objects and concepts as a set of continuous numbers, also known as a vector embedding.

This representation makes it possible to translate semantic similarity as perceived by humans to proximity in a vector space. In other words, when we represent real-world objects and concepts such as images, audio recordings, news articles, user profiles, weather patterns, and political views as vector embeddings, the semantic similarity of these objects and concepts can be quantified by how close they are to each other as points in vector spaces. Vector embedding representations are thus suitable for common machine learning tasks such as clustering, recommendation, and classification.

For example, in a clustering task, clustering algorithms assigns points that are similar to the same cluster while keeps points from different clusters as dissimilar as possible. In a recommendation task, when making recommendations for an unseen object, the recommender-system would look for objects that are most similar to the object in question, as measured by their similarity as vector embeddings. In a classification task, we classify the label of an unseen object by the major vote over labels of the most similar objects.

Creating Vector Embeddings

One way of creating vector embeddings is to engineer the vector values using domain knowledge. This is known as feature engineering. For example, in medical imaging, we use medical expertise to quantify a set of features such as shape, color, and regions in an image that capture the semantics. However, engineering vector embeddings requires domain knowledge, and it is too expensive to scale.

Instead of engineering vector embeddings, we often train models to translate objects to vectors. Deep neural network is a common tool for training such models. The resulting embeddings are usually high dimensional (up to two thousand dimensions) and dense (all values are non-zeros). For text data, models such as Word2Vec, GLoVE, and BERT transform words, sentences, or paragraphs into vector embeddings.

Images can be embedded using models such as convolutional neural networks (CNNs), Examples of CNNs include VGG, and Inception. Audio recordings can be transformed into vectors using image embedding transformations over the audio frequencies visual representation (e.g., using its Spectrogram).

Example: Image Embedding with a Convolutional Neural Network

Consider the following example, in which raw images are represented as greyscale pixels. This is equivalent to a matrix (or table) of integer values in the range 0 to 255. Wherein the value 0 corresponds to black color and 255 to white color. The image below depicts a greyscale image and its corresponding matrix.

grayscale image representation

The left sub-image depicts the grayscale pixels, the middle sub-image contains the pixel grayscale values, and the right most sub-image defines the matrix. Notice the matrix values define a vector embedding in which its first coordinate is the matrix upper left cell, then going left-to-right until the last coordinate which corresponds to the lower-right matrix cell.

Such embedding respects a semantic proximity. However, it is very sensitive to shifts, scales and other common image manipulation. Therefore such greyscale representations are usually treated as raw representations and are being used for learning robust embeddings.

Convolutional Neural Network (CNN or ConvNet) is a class of deep learning architectures that are usually applied to visual data transforming images into embeddings.

CNNs are processing the input via a hierarchical small local sub-inputs which are termed receptive-fields. Each neuron in each network layer processes a specific receptive field from the former layer. Each layer either applies a convolution on the receptive field or reduces the input size, which is called subsampling.

The image below depicts a typical CNN structure. Notice the receptive fields, depicted as sub-squares in each layer, service as an input to a single neuron within the preceding layer. Notice also the subsampling operations reduce the layer size, while the convolution operations extend the layer size. The resulting vector embedding is received via a fully connected layer.

Typical CNN architecture

Learning the network weights (i.e., the embedding model) requires a large set of labeled images. The weights are being optimized in a way that images with the same labels are embedded closer compared to images with different labels. Once we learn the CNN embedding model we can transform the images into vectors and store them with a K-Nearest-Neighbor index. Now, given a new unseen image, we can transform it with the CNN model, retrieve its k-most similar vectors and thus the corresponding similar images.