
Abstract

We study how to serve retrieval models, particularly late-interaction retrievers such as ColBERT, to many concurrent users on a small budget, where the index may not fit in memory. We present ColBERT-serve, a serving system that (i) applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting deployment on cheap servers, and (ii) incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT’s query latency and supporting many concurrent queries in parallel.
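To make the two ideas in the abstract concrete, here is a minimal sketch, not the ColBERT-serve implementation. It assumes a toy index stored as one flat float32 file, read through `numpy.memmap` so that only the rows of the candidate documents are paged into RAM, and it assumes hybrid scoring is a linear interpolation between a first-stage score and the late-interaction (MaxSim) score. The file layout, the helper names (`write_toy_index`, `maxsim`, `rerank`), and the weight `alpha` are all hypothetical and chosen for illustration only.

```python
import os
import tempfile

import numpy as np


def write_toy_index(path, doc_embeddings):
    """Write all token embeddings to one flat float32 file.

    Returns the (start, end) row range of each document and the array shape.
    """
    offsets, start = [], 0
    for emb in doc_embeddings:
        offsets.append((start, start + emb.shape[0]))
        start += emb.shape[0]
    flat = np.concatenate(doc_embeddings).astype(np.float32)
    flat.tofile(path)
    return offsets, flat.shape


def maxsim(query_emb, doc_emb):
    """Late interaction: best-matching doc token per query token, summed."""
    return float((query_emb @ doc_emb.T).max(axis=1).sum())


def rerank(query_emb, candidates, index, offsets, alpha=0.5):
    """Second stage: rescore first-stage candidates with a hybrid score.

    candidates maps doc_id -> first-stage score (assumed already on a
    scale comparable to MaxSim; a real system would likely normalize).
    """
    scored = []
    for doc_id, first_score in candidates.items():
        start, end = offsets[doc_id]
        doc_emb = index[start:end]  # only these rows are read from disk
        hybrid = alpha * first_score + (1 - alpha) * maxsim(query_emb, doc_emb)
        scored.append((doc_id, hybrid))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: three documents with 2-dimensional token embeddings.
docs = [
    np.array([[1.0, 0.0]], dtype=np.float32),
    np.array([[0.0, 1.0]], dtype=np.float32),
    np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32),
]
path = os.path.join(tempfile.mkdtemp(), "toy_index.bin")
offsets, shape = write_toy_index(path, docs)

# Memory-map the index instead of loading it: the OS pages data in on demand.
index = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

query = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
first_stage = {0: 0.9, 2: 0.1}  # hypothetical scores from a cheap retriever
ranked = rerank(query, first_stage, index, offsets, alpha=0.5)
print(ranked)  # doc 2 ranks first: it matches both query tokens
```

In this sketch the cheap first stage limits how many documents the second stage touches, which is what keeps the number of pages read from the memory-mapped file small; the interpolation weight trades off trust between the two scorers.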
