This post explains what a Vector Index is: a core component of a Vector Database. We do so by analogy to traditional database indexes.

## Database Predicates and Filters

Let's begin with a simple database concept: the notion of a filter, or a `where` query. When filtering your records with a `where` query, you conceptually apply a logical predicate to each record. A record is included in the result set if `predicate(query, record) == true`.

For example, consider a simple database record like `{"id": "8472", "name": "Joe", "yob": 1991}` and query parameters like `{"key": "yob", "op": "<", "value": 2000}`. The predicate returns `true` for records, like Joe's, whose year of birth (`yob`) is earlier than `2000`. Note that the query values act as a parametrization of the predicate, since they are held fixed for all records.
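The parametrized predicate above can be sketched in a few lines of Python. This is a toy illustration, not how a real database evaluates `where` clauses; the records and the `ops` lookup table are made up for the example:

```python
import operator

# Hypothetical records, in the shape used above.
records = [
    {"id": "8472", "name": "Joe", "yob": 1991},
    {"id": "9134", "name": "Ann", "yob": 2004},
]

# Map the query's "op" string to an actual comparison.
ops = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def predicate(query, record):
    # The query parametrizes the predicate; it is held fixed across all records.
    return ops[query["op"]](record[query["key"]], query["value"])

query = {"key": "yob", "op": "<", "value": 2000}
matches = [r for r in records if predicate(query, r)]
print([r["name"] for r in matches])  # ['Joe']
```

Conceptually, a `where` query is just this loop; the rest of this post is about avoiding the loop.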

## Vector Filters Are Classifiers

When replacing a structured record with a vector, the parameterized predicate now needs to take in a vector and return a boolean. There is a name in Machine Learning for these kinds of functions. They are called binary classifiers. So, a predicate in a Vector Database is simply a binary classifier.

Let's see what we get with the simplest possible classifier: the linear classifier. A linear classifier is parametrized by a weights vector $q$ and a threshold $b$. It classifies a vector $x$ as `true` (a match) if $x^T q > b$. Here $x^T$ denotes the transpose of $x$, and $x^T q$ is the inner (dot) product $x^T q = \sum_i x_i q_i$.

Think about the geometric representation of linear classifiers for a minute. A linear classifier is defined by a half space: the query returns all the vectors in the database that fall on one side of a hyperplane.
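Applied by brute force, the half-space filter is a one-liner in NumPy. This is a toy sketch with random data standing in for a real database; the dimensions and threshold are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # 1000 "database" vectors in 8 dimensions
q = rng.normal(size=8)          # the query's weights vector
b = 1.0                         # the threshold

# A vector x matches if x^T q > b, i.e. x lies in the half space.
matches = X[X @ q > b]
```

Every matching row satisfies the predicate; the point of an index is to find these rows without computing all 1000 dot products.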

Another very useful and well-known vector filter corresponds to a simple similarity/proximity classifier, which returns `true` for all points inside a ball of radius $r$ around the query $q$. That is, return all $x$ in the data such that $||q-x|| < r$, where $||q-x||^2 = \sum_i (q_i - x_i)^2$. Note that if your vectors encode semantic similarity, nearest neighbor search is usually a very good (and common) way to retrieve similar objects from your database and build semantic search applications.
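The ball filter has an equally short brute-force sketch. Again the data and radius here are made up; note that comparing squared distances against $r^2$ avoids the square root:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))  # toy database vectors
q = rng.normal(size=8)          # query point
r = 4.0                         # ball radius

# ||q - x|| < r, computed via squared distances (no sqrt needed).
dists2 = ((X - q) ** 2).sum(axis=1)
matches = X[dists2 < r ** 2]
```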

Another common and useful filter is based on cosine similarity. A cosine similarity classifier returns `true` for points $x$ such that the angle between $x$ and $q$ is less than some threshold angle. In other words, it returns the points that lie in a cone around $q$: those for which $x^T q / (||x|| \, ||q||) > t$.
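The cone filter in brute-force NumPy, under the same toy setup as before (random data, arbitrary threshold $t$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 8))  # toy database vectors
q = rng.normal(size=8)          # query vector
t = 0.5                         # cosine threshold in (-1, 1)

# cos(angle between x and q) = x^T q / (||x|| ||q||); match if above t.
cos = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
matches = X[cos > t]
```

Note that cosine similarity ignores vector lengths entirely: scaling any row of `X` leaves its match status unchanged.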

## Going Faster with Vector Indexes

One of the key roles of a database is to filter efficiently, hopefully much faster than applying the predicate to all the records one after another. Databases therefore create indexes. For numeric values, for example, relational indexes can (and do) keep values in a sorted structure (usually a B-tree) so that range queries can be answered with a binary search rather than a linear scan.
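The speedup from sortedness can be sketched with Python's standard-library `bisect` module, standing in for a B-tree index on the `yob` column from the earlier example (the values are made up):

```python
from bisect import bisect_left

# Hypothetical index: years of birth kept sorted, as a B-tree index would.
yobs = sorted([1955, 1972, 1984, 1991, 2003, 2010])

# Range query "yob < 2000": binary search finds the cut point in O(log n)
# rather than scanning every record.
cut = bisect_left(yobs, 2000)
matches = yobs[:cut]
print(matches)  # [1955, 1972, 1984, 1991]
```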

A machine learning vector database does something similar. It "sorts" the vectors in space such that, given a query like the ones above, it can quickly retrieve all the matching points without applying the classifier to every data point. Note, however, that "sorting" high dimensional vectors doesn't have a literal meaning. Indeed, vector indexes that can retrieve arbitrary half spaces (linear classifier filters) are complex data structures that have been intensely studied and analyzed for decades. They are, in fact, still the subject of intense research. Vector databases are therefore tasked with very complex indexing and filtering operations.

Still, there are many indexing algorithms for vectors. In low dimensions, space partitioning algorithms work well; KD-trees and their many variants are good examples. In high dimensions, though, these algorithms tend to fail. Algorithms like LSH, HNSW, or PQ deal better with the so-called curse of dimensionality. Alas, they still suffer from reduced accuracy, slow indexing or retrieval, and/or a significant (super-linear) memory footprint.
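To make one of these concrete, here is a minimal sketch of random-hyperplane LSH for cosine similarity, with made-up sizes and random toy data. Each random hyperplane contributes one sign bit to a hash; vectors separated by a small angle tend to agree on all bits and thus land in the same bucket, so a query only inspects its own bucket instead of the whole database:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, n_bits = 1000, 8, 16
X = rng.normal(size=(n, d))      # toy database vectors

# n_bits random hyperplanes through the origin define the hash.
planes = rng.normal(size=(n_bits, d))

def lsh_hash(v):
    # One sign bit per hyperplane: which side of the plane is v on?
    return tuple((planes @ v > 0).astype(int))

# Index: group vector ids by their hash bucket.
buckets = {}
for i, x in enumerate(X):
    buckets.setdefault(lsh_hash(x), []).append(i)

# Query: a slight perturbation of a stored vector; only its bucket is scanned.
q = X[0] + 0.01 * rng.normal(size=d)
candidates = buckets.get(lsh_hash(q), [])
```

This sketch also shows the accuracy trade-off mentioned above: a true neighbor that falls on the other side of even one hyperplane is missed, which is why practical LSH schemes use many independent hash tables.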

As a pioneering vector database, Pinecone built a new kind of vector index that doesn't suffer from these shortcomings, and runs it at scale in the cloud to provide these capabilities to our customers. If you are interested in working with high dimensional vectors in production, Pinecone is a good place to start.

Author: Edo Liberty