What is a Vector Database?

The meteoric rise in Machine Learning in the last few years has led to increasing use of vector embeddings. They are fundamental to many models and approaches, and are a potent tool for applications such as semantic search, similarity search, and anomaly detection.

The unique nature, growing volume, and rising importance of vector embeddings make it necessary to find new methods of storage and retrieval. We need a new kind of database.

Traditional vs Vector Data

Let’s explore what makes vector databases unique.

Vectors are Filtered Differently

Let’s begin with a simple database concept, the notion of a filter or a where query. When filtering your records with a where query, you apply a logical predicate to each record. A record should be included in the result set if predicate(query, record)==true

For example, you can think of a simple database record like {“id”:”8472“,” name”:” Joe”, “yob”:”1991”} and the query parameters being something like {"key": “yob”, “op":”<”,” value":2000”}. With operator: “< ” and value: “2000”, we want to fetch records where key “yob” (year of birth) value is less than 2000. The predicate will return true for records, like Joe’s, whose “yob” is 1991 (<2000). Note that the query values act as parameters of the predicate since we want to use the same predicate for all the records.

When replacing a structured record with a vector, the parameterized predicate needs to take in a vector and return a boolean. There is a name in Machine Learning for these kinds of functions. They are called binary classifiers. So, a predicate in a Vector Database is simply a binary classifier.

Let’s see what we get with the simplest possible classifier, the linear classifier. A linear classifier can be defined as $y = f(\vec{q}\vec{x})$, f(x) = True if $q.x > b$. Where $q$ is the weight vector and $x$ is the input vector and $b$ the threshold value. It classifies an input vector $x$ as true (or a match) if $x^T q > b$. Here $x^T$ means the transpose of $x$, and $x^T q$ is the inner dot product $x^T q =\sum_i x_i q_i$.

Think about the geometric representation of linear classifiers (below) for a minute. A linear classifier is defined by a half-space. This query will return all the vectors in the database that fall in the blue region.


Another very useful and well-known vector filter corresponds to a simple similarity/proximity classifier which returns true for all points inside some ball of radius $r$ around the query $q$. Or, return all $x$ in the data such that $||q-x|| < r$. Here $||q-x||^2 = \sum_i (q_i - x_i)^2$.

Note that nearest neighbor search is usually an excellent (and common) way to retrieve similar objects from your database and build semantic search applications if your vectors encode semantic similarity.


Another common and useful filter is based on cosine similarity. A cosine similarity classifier returns true for points $x$ such that the angle between $x$ and $q$ is less than some threshold angle. In other words, the points that lie in a cone such that $x^T q/||x|| ||q|| > t$.


Efficient Filtering of Vectors Requires a Vector Index

One of the key roles of a database is to be able to filter efficiently. Hopefully, much faster than applying the predicate to all the records one after another. Databases, therefore, create indexes. For numeric values, for example, relational indexes can (and do) keep numeric values in a sorted structure (usually a btree) such that range queries can be answered dynamically with a binary search rather than a linear scan.

A vector database does something similar. It “sorts” the vectors in space such that when it gets a query like above, it can quickly retrieve all the relevant points without applying the classifier to all data points.

There are many indexing algorithms for vectors. In low dimensions, space partitioning algorithms work well, such as KDTrees and its variants. Algorithms such as LSH, HNSW, and PQ are better suited for high-dimensional vectors, although they still suffer from reduced accuracy, slow indexing or retrieval, and significant (super linear) memory footprint.

The job of a vector index is to accelerate the retrieval step. To achieve that, vector indexing algorithms take into account the filter type, data characteristics (e.g., cardinality, dimension, distribution), and query load statistics. For the widely used half-space, ball, and cone vector-filters, there exist many vector index strategies. The notable ones that we also mentioned above are LSH, HNSW, and PQ. Each has a tradeoff, and the best choice depends on your use case.

A vector index is the core component of a vector database. A vector database needs to provide good performance consistently, regardless of the choice of indexing algorithm. Even better if it can abstract the algorithm choice from the user.

Arbitrary Filters and the Broken Two-Phase Filtering

Sometimes the filters we want to apply to our data aren’t very straightforward or simple. We can create a filtering model by training a deep neural network with a loss that takes business and domain parameters into account. This can give us a filter that is much better for our data than using something more generic like, say, cosine (cone filter) or Euclidean (ball filter). We call such filters “arbitrary filters.”

Then we need to create an index that accelerates searches using this arbitrary filter. However, the boundary of this filter—its geometric shape—is difficult to determine. What’s more, the filter may change as the data or embedding model changes. So how do we create an index for this unclear and moving target?

The common “solution” is to perform a two-phase filtering process:

  1. First, retrieve a set of candidate vectors using an easy-to-index surrogate filter (e.g., the cone filter).
  2. Second, apply the true (arbitrary) filter over the smaller candidate set on each candidate vector to retrieve the final result.

The former step is a brute-force search. The size of the candidate set should be large enough to ensure good recall and small enough to allow a feasible runtime.

However, such a two-phase process is broken! The decision to use the surrogate filter is driven by the implementation complexity rather than content. What we want is to run the true filter over all database points, rather than a small set of biased candidates.

The figure below depicts an example explaining why a two-phase process is sub-optimal. Assume the true filter is a cone-filter, yet we index the data using a surrogate ball filter and apply the cone-filter on the filtered data only. This could have resulted in a large candidate set containing almost all stored vectors (as depicted on the right-hand side of the figure) or a recall loss because a small ball will cover only a fraction of the cone volume.

Now, when the geometric structure of the true filter is “arbitrary,” why would applying “smooth” surrogate filters like the cone and ball filters make any sense?


The answer to this problem is vector index builders. The index builder is a meta-algorithm that learns how to index. Given a set of vectors, a filter (e.g., two-layer NN), and a set of sample queries the index builder creates an index that is optimized for that specific filter. In query time, the original filter is being applied over the whole database in an efficient way that provides the most accurate results.

The retrieval process is dominated by three main factors: the characteristics of the stored vectors, filter “structure,” and the query statistics. In other words, answering: what is the stored data? How does the filter “work”? and what are the query characteristics?

Combining this information provides a hint on the proper index “structure”. The index builder utilizes these signals to create a tailored index for the specific data-filter combination.

Vector Index to Vector Database

Vector indexes are a core component of a vector database. Vector indexes allow us to store vectors and apply filters to them efficiently, and index builders improve on the process.

Many real-world ML applications involving vector data deal with vectors to the scale of tens of millions to billions, and require excellent performance. This raises a few challenges:

  • Scalability: Vectors can exceed the limits of a single machine. This means having functionalities like sharding to split the load.
  • Latency: When dealing with the scale, query latency, write and update latency have to be great even when performing operations online.
  • Operations: Ensuring availability, monitoring, and overall infrastructure orchestration.
  • Flexibility: Support and possibly abstract the ability to choose different indexes and algorithms for nearest neighbor search and support different computing infrastructures (CPU, GPU, etc).

If a solution has to make the leap from being a vector index to a vector database, it should address these challenges. While an in-memory index will work for many cases, given the scale of data we can run into in machine learning, moving to a more complete and purpose-built solution is necessary.