What are Vector Embeddings?
In machine learning, we often represent real-world objects and concepts as vectors of continuous numbers, which are also known as vector embeddings. The vector embedding representation makes it feasible to translate semantic similarity as perceived by humans to proximity in a vector space. In other words, when we represent real-world objects and concepts such as images, audio recordings, news articles, user profiles, weather patterns, and political views as vector embeddings, the semantic similarity of these objects and concepts can be quantified by how close they are to each other as points in vector spaces.
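The idea that semantic similarity maps to proximity in a vector space can be sketched with a few lines of numpy. The embeddings below are made-up illustrative values, not outputs of a real model, but they show how a similarity measure such as cosine similarity quantifies "closeness":

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for three concepts.
cat    = np.array([0.90, 0.80, 0.10, 0.00])
kitten = np.array([0.85, 0.75, 0.20, 0.05])
car    = np.array([0.10, 0.00, 0.90, 0.80])

# Semantically similar concepts should be closer in the vector space.
print(cosine_similarity(cat, kitten))  # high
print(cosine_similarity(cat, car))     # low
```

With real learned embeddings the dimensions are not interpretable individually; only the geometry (distances and angles between vectors) carries meaning.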
Vector embedding representations are thus well suited to common machine learning tasks such as clustering, recommendation, and classification. In a clustering task, the algorithm assigns similar points to the same cluster while keeping points from different clusters as dissimilar as possible. In a recommendation task, when making recommendations for an unseen object, the recommender system looks for the objects most similar to the object in question, as measured by their similarity as vector embeddings. In a classification task, we label an unseen object by a majority vote over the labels of its most similar objects.
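The majority-vote classification described above can be sketched in a few lines of numpy. This is a toy example with hand-picked 2-D embeddings, not a production classifier:

```python
import numpy as np
from collections import Counter

def knn_classify(query, embeddings, labels, k=3):
    """Label a query embedding by majority vote over its k nearest neighbors."""
    dists = np.linalg.norm(embeddings - query, axis=1)  # Euclidean distance to each stored point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest embeddings
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy 2-D embeddings: three points of class "A", three of class "B".
embeddings = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(np.array([0.05, 0.1]), embeddings, labels))  # "A"
```

The same nearest-neighbor machinery underlies the recommendation case: instead of voting on labels, you simply return the nearest objects themselves.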
One way of creating vector embeddings is to engineer the vector values using domain knowledge. This is known as feature engineering. For example, in medical imaging, we use medical expertise to quantify a set of features such as shape, color, and regions in an image that capture its semantics. However, feature engineering requires domain expertise and is too expensive to scale.
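As a minimal sketch of what feature engineering looks like in code, the function below builds a tiny vector of hand-picked image statistics. The three features are illustrative choices, not ones from any particular medical-imaging pipeline:

```python
import numpy as np

def handcrafted_features(image: np.ndarray) -> np.ndarray:
    """Engineer a small feature vector from a greyscale image (illustrative features only)."""
    return np.array([
        image.mean(),           # overall brightness
        image.std(),            # contrast
        (image > 128).mean(),   # fraction of bright pixels
    ])

image = np.array([[0, 255],
                  [255, 0]], dtype=np.uint8)
print(handcrafted_features(image))  # a 3-dimensional engineered embedding
```

Every feature here had to be chosen and coded by hand, which is exactly the scaling problem that learned embeddings avoid.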
Instead of engineering vector embeddings, we often train models to translate objects to vectors. Deep neural networks are a common tool for training such models. The resulting embeddings are usually high-dimensional (up to two thousand dimensions) and dense (all values are non-zero). For text data, models such as Word2Vec, GloVe, and BERT transform words, sentences, or paragraphs into vector embeddings. Images can be embedded using convolutional neural networks (CNNs) such as VGG and Inception. Audio recordings can be embedded by applying image embedding techniques to a visual representation of their frequencies (e.g., a spectrogram).
Example: Image Embedding with Convolutional Neural Network
Consider the following example, in which raw images are represented as greyscale pixels. This is equivalent to a matrix (or table) of integer values in the range [0, 255], where the value 0 corresponds to black and 255 to white. The image below depicts a greyscale image and its corresponding matrix. The leftmost sub-image depicts the greyscale pixels, the middle sub-image contains the pixel greyscale values, and the rightmost sub-image defines the matrix. Observe that the matrix values define a vector embedding whose first coordinate is the upper-left matrix cell, proceeding left to right until the last coordinate, which corresponds to the lower-right matrix cell.
Such an embedding preserves some semantic proximity, but it is very sensitive to shifts, scaling, and other common image manipulations. Greyscale pixel representations are therefore usually treated as raw representations and used as input for learning more robust embeddings.
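The matrix-to-vector reading order described above (upper-left cell first, lower-right cell last) is just a row-major flatten, sketched here on a tiny made-up 3x3 image:

```python
import numpy as np

# A tiny 3x3 greyscale image: 0 = black, 255 = white.
image = np.array([[  0, 128, 255],
                  [ 64, 192,  32],
                  [255,   0, 128]], dtype=np.uint8)

# Read the matrix row by row, left to right: the first coordinate is the
# upper-left cell, the last coordinate is the lower-right cell.
raw_embedding = image.flatten()
print(raw_embedding)        # [  0 128 255  64 192  32 255   0 128]
print(raw_embedding.shape)  # (9,)
```

Shifting the image by even one pixel changes almost every coordinate of this vector, which illustrates why raw pixel embeddings are so fragile.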
A Convolutional Neural Network (CNN or ConvNet) is a class of deep learning architectures, usually applied to visual data, that transforms images into embeddings. CNNs process the input hierarchically through small local sub-inputs termed receptive fields: each neuron in a layer processes a specific receptive field from the previous layer. Each layer either applies a convolution to its receptive fields or reduces the input size (subsampling). The image below depicts a typical CNN structure. Observe that the receptive fields, depicted as sub-squares in each layer, each serve as input to a single neuron in the following layer. Notice that subsampling operations reduce the layer size, while convolution operations extend the layer size. The final vector embedding is produced by a fully connected layer.
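The two building blocks above, convolution over receptive fields and subsampling, can be sketched in plain numpy. This is a minimal illustration of the operations themselves, not a trained network, and the kernel values are arbitrary:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D convolution: each output neuron sees one receptive field."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # One receptive field -> one output value.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def subsample(feature_map, size=2):
    """Max-pooling subsampling: keep the maximum of each size x size patch."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 "image"
features = convolve2d(image, np.array([[1., 0.],
                                       [0., -1.]]))    # 3x3 feature map
pooled = subsample(features)                           # reduced by 2x2 pooling
print(features.shape, pooled.shape)
```

In a real CNN these operations are stacked many times, with learned kernels, before the fully connected layer produces the embedding.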
Learning the network weights (i.e., the embedding model) requires a large set of labeled images. The weights are optimized so that images with the same label are embedded closer together than images with different labels. Once we have learned the CNN embedding model, we can transform images into vectors and store them in a K-Nearest-Neighbor (KNN) index. Given a new, unseen image, we transform it with the CNN model, retrieve its k most similar vectors, and thereby the corresponding most similar images.
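The retrieval step can be sketched with a brute-force nearest-neighbor search. The stored vectors below are hypothetical CNN outputs; a real system would use an approximate index for scale, but the query logic is the same:

```python
import numpy as np

def k_nearest(query, index, k=2):
    """Return indices of the k stored embeddings closest to the query."""
    dists = np.linalg.norm(index - query, axis=1)  # distance to every stored vector
    return np.argsort(dists)[:k]                   # closest first

# Hypothetical embeddings produced by a trained CNN for five stored images.
index = np.array([[0.1, 0.9],
                  [0.2, 0.8],
                  [0.9, 0.1],
                  [0.8, 0.2],
                  [0.5, 0.5]])

query = np.array([0.15, 0.85])   # embedding of a new, unseen image
print(k_nearest(query, index))   # indices of the most similar stored images
```

The returned indices point back to the stored images, so the nearest vectors directly give the visually most similar results.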