What are Vector Index Builders?
Recall that vector indexes’ job is to accelerate the retrieval step. To achieve that, vector indexing algorithms take into account the filter type, data characteristics (e.g., cardinality, dimension, distribution), and the query load statistics. For the widely used half-space, ball, and cone vector-filters there exist many vector index strategies. The notable ones are LSH, HNSW, and PQ. Each has its pros and cons. Choosing the most suitable vector index algorithm is a data dependant task. A vector database needs to provide good performance consistently and should abstract this choice from the user.
Arbitrary Filters and the Broken Two Phase Filtering
In many practical scenarios we want to filter the data with an arbitrary filtering model. For example, we could create a filtering model by training a deep Neural Network with a loss that takes business and domain parameters into account. How should we index such a filter? Here, the “geometry” of the filter is the decision boundary of Neural Network which is in turn not clear. What happens if we choose to add a new layer to the model, update its training data, or loss-function? Should we really implement a different indexing strategy for every kind of filter “architecture”? This isn’t a sustainable approach.
The common “solution” for such scenarios is to perform a two phase filtering process. First, retrieving a set of candidate vectors using an easy-to-index surrogate filter (e.g., the cone filter). Then, apply the true (arbitrary) filter over each candidate vector to retrieve the final result. Observe that the former step is a brute-force search. The size of the candidate set should be large enough to ensure good recall and small enough to allow a feasible runtime.
However, such a two phase process is broken! The choice of the surrogate filter is driven by implementation complexity rather than content. Clearly, what we really want is to run the true filter over all database points rather than a small set of biased candidates. Moreover, choosing “smooth” geometric surrogate filters dictates having an identical embedding space for the stored and query vectors. Arbitrary filters don’t impose such a restriction. For example, they can filter document embeddings using image embedded queries.
The figure below depicts a motivating example explaining why a two phase process is sub-optimal. Assume the true filter is a cone-filter, yet, we index the data using a surrogate ball-filter, and apply the cone-filter on the filtered data only. Observe that this could be resulted with a large candidate set that contains almost all stored vectors (as depicted in the right hand side of the figure), or, a recall loss because a small ball will cover only a fraction of the cone volume. Now, when the geometric structure of the true filter is “arbitrary” why would applying “smooth” surrogate filters like the cone and ball filters make any sense?
The Solution: Vector Index Builders
Instead we introduce a new concept called vector index builder. The index builder is a meta-algorithm that learns how to index. Given a set of vectors, a filter (e.g., two layer NN), and a set of sample queries the index builder creates an index that is optimized for that specific filter. In query time the original filter is being applied over the whole database in an efficient way that provides the most accurate results.
The retrieval process is dominated by three main factors: the characteristics of the stored vectors, filter “structure,” and the query statistics. In other words, answering: what is the stored data? how does the filter “work”? and what are the query characteristics? Combining these information pieces together gives a solid hint on the proper index “structure”. The index builder utilizes these information signals to create a tailored index for the specific data/filter combination.
Pinecone is the first vector database supporting arbitrary filters the right way. Our database index builders technology handles any filtering model, including filters for non-overlapping query/item embedding-spaces. The resulting index is performant and provides accurate results.