Benchmarking a Vector Similarity Search Service

This tutorial walks you through the different performance measures and considerations for assessing your Pinecone’s similarity search deployment quality. We will evaluate the service’s performance aspects as well as accuracy.

This tutorial covers:

  • How to measure indexing runtime
  • How to measure query runtime
  • How replicas affect performance
  • The recall and approximation-loss accuracy measures

We will cover the above upon two search approaches (“exact” and “approximated”) and a single standard benchmarking dataset (“glove”). On that dataset, the “approximated” search achieves 5x query time acceleration and near-optimal accuracies. In addition, we show that query time acceleration grows linearly with the number of replicas. You are encouraged to use this notebook as a “cookbook” and evaluate your own dataset’s performance and accuracy metrics.

Open Notebook View Source

We assume you are familiar with similarity search and Pineconce’s key concepts.


Here we set up our environment and data. We will use the Glove dataset containing vector embeddings of natural language texts. These embeddings are dense 200-dimensional real-value vectors. We will use its standard format for benchmarking approximated nearest neighbor search indexing implementations.

First, we install Pinecone’s client python package and initialize Pinecone’s environment. Get your API key, if you don’t have one.

!pip install -qU pinecone-client
import pinecone
import os

# load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR-API-KEY"


Next, we install and import relevant python packages.

!pip install -qU h5py pd np
import timeit
import pandas as pd
import concurrent.futures

Let’s download the Glove dataset.

import requests, os
DATA_DIR = 'tmp'
URL = ""
FILE = f"{DATA_DIR}/glove-200-angular.hdf5"

def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(FILE):
        r = requests.get(URL)  # create HTTP response object
        with open(FILE, "wb") as f:            f.write(r.content)


Now, we load the vectors that we will index. Note that the number of vectors is 1,183,514 and their dimension is 200.

import h5py
f = h5py.File('tmp/glove-200-angular.hdf5', 'r')
(1183514, 200)

Performance Assessment

We consider two performance measures: indexing time and query time. We will show how controlling the index configuration affects these metrics.

Indexing time is measured by the number of vectors per second (VPS). Fast indexing time enables real-time index updates and makes large uploads feasible, or else you would be limited to batch updates.

Query time defines throughput and is measured by the number of queries per second (QPS).

Pinecone does Sharding and Replication for both scalability and better performance. Users can customize them. Sharding is a technique that controls the load by distributing the data among several servers. Each Pinecone shard handles up to one Gigabyte of data. Service replication supports several needs, such as backup and consistency. It also boosts performance because the load is distributed evenly among the different replicas.

Throughout the demonstration, we will use two indexing configurations. The first is an exact-search index. This indexing approach retrieves the optimal (i.e., exact) top-k query results. In turn, it requires computing the relevancy score (i.e., “metric”) of all item-query pairs. Thus, the exact-search index has low expected query throughput rates.

The second approach is an approximated search indexing. Here the accuracy requirement is loosened up, and the index can “sort” the indexed vectors using a limited number of score computations. We expect the approximated-search query throughputs to be higher than exact-search.

vps_results_df = pd.DataFrame(columns=["config", "measure", "value", "unit"])
qps_results_df = pd.DataFrame(columns=["config", "measure", "value", "unit"])

Creating the Exact and Approximated Indexes

Let’s start our performance exploration. We define exact-search and approximated-search indexes that use cosine-similarity rankings. Glove’s dataset has less than one Gigabytes memory footprint (1,183,514 x 200 x 4 = 881 Megabytes). Thus, it fits a single Pinecone’s shard. See the create_index docs.

for iname in pinecone.list_indexes(): pinecone.delete_index(iname)
index_name = "assess-demo"

pinecone.create_index(name=index_name, metric="cosine", engine_type="approximated", shards=1)
approx_index = pinecone.Index(name=index_name)

exact_index_name = "assess-demo-exact"

pinecone.create_index(name=exact_index_name, metric="cosine", engine_type="exact", shards=1)
exact_index = pinecone.Index(name=exact_index_name)

How to measure indexing runtime

Let’s insert 1,183,514 vectors which count to 1,183,514 x 200 x 4 = 881 Megabytes and therefore suit a single shard. We compute the number of vectors per second by dividing the number of inserted vectors and overlapping time.

def insert_vectors_and_compute_vps(index, X):
  delta_time = timeit.timeit(lambda: index.upsert( ( (f"{i}",x) for i,x in enumerate(X)) ), number=1 )
  insert_vps = round( X.shape[0] / delta_time )
  return insert_vps
vps_results_df.loc[vps_results_df.shape[0]] = "single shard / exact", "insert time", insert_vectors_and_compute_vps(exact_index, f['train']), "VPS"
vps_results_df.loc[vps_results_df.shape[0]] = "single shard / approximated", "insert time", insert_vectors_and_compute_vps(approx_index, f['train']), "VPS"

0single shard / exactinsert time10702VPS
1single shard / approximatedinsert time10549VPS

How to measure query runtime

Let’s compute the query throughput. Naturally, the networking overhead might add a significant factor to the query time. Therefore, we reduce this factor by averaging the throughput over a batch of queries. We run the queries in parallel, emphasizing a real-world scenario with distributed clients.

approx_index = pinecone.Index(name=index_name)
exact_index = pinecone.Index(name=exact_index_name)

query_func = lambda index: lambda i: index.unary_query( f['test'][i], top_k=100)

def emulate_distributed_queries(index):
  with concurrent.futures.ThreadPoolExecutor(10) as executor: 
    result =, range(NUM_TEST_QUERIES))  

query_time = timeit.timeit(lambda: emulate_distributed_queries(exact_index), number=1)
throughput_qps = NUM_TEST_QUERIES / query_time
qps_results_df.loc[qps_results_df.shape[0]] = "single shard / exact", "query throughput", round(throughput_qps), "QPS"

query_time = timeit.timeit(lambda: emulate_distributed_queries(approx_index), number=1)
throughput_qps = NUM_TEST_QUERIES / query_time
qps_results_df.loc[qps_results_df.shape[0]] = "single shard / approximated", "query throughput", round(throughput_qps), "QPS"

0single shard / exactquery throughput10QPS
1single shard / approximatedquery throughput50QPS

How replicas affect performance

Next, we demonstrate how we can increase the query throughput using replication. We will create a new index with three replicas with a single shard and a similar number of items per replica. Then, we will query the same test queries and compute the average throughput.

replicas_index_name = index_name + "-replicas"

pinecone.create_index(name=replicas_index_name, metric="cosine", engine_type="approximated", shards=1, replicas=3)
index_replicas = pinecone.Index(name=replicas_index_name)

# compute VPS
vps_results_df.loc[vps_results_df.shape[0]] = "three replicas / approximated", "insert time", insert_vectors_and_compute_vps(index_replicas, f['train']), "VPS"
0single shard / exactinsert time10702VPS
1single shard / approximatedinsert time10549VPS
2three replicas / approximatedinsert time10757VPS
# compute QPS
index_replicas = pinecone.Index(name=replicas_index_name)
delta_time = timeit.timeit(lambda: emulate_distributed_queries(index_replicas), number=1)
throughput_qps = NUM_TEST_QUERIES / delta_time

qps_results_df.loc[qps_results_df.shape[0]] = "three replicas / approximated", "query throughput", round(throughput_qps), "QPS"

0single shard / exactquery throughput10QPS
1single shard / approximatedquery throughput49QPS
2three replicas / approximatedquery throughput150QPS

Performance Assessment Summary

In this section, we explored the query throughput and indexing time performance measures. We used a single shard with single and multiple replicas deployments.

Retrieval time depends on a combination of the service’s throughput and networking overhead. In other words, under low network connectivity, the query throughput will be doomed. We suggest running performance tests over the cloud and reducing the networking overhead using a batch of queries in parallel. The latter also matches a real-world scenario in which distributed clients query the service in parallel.

Replicas allow supporting a theoretical “infinite” throughput. We demonstrated that the insertion time kept stable using three replicas while the query throughput was (almost) tripled. It means the replication-throughput ratio is linear, making it easy to determine the number of replicas you need for any desired throughput.

# delete the index with replications

Accuracy Assessment

In this section, we assess the accuracy of our deployed, single replica index. Achieving exact, optimal query results requires complete iteration over all stored vectors. It has a runtime overhead that is not suitable for typical applications. In everyday applications, we will deploy an approximated search index. Such indexes trade accuracy for performance.

The accuracy of an approximated solution is relative to the exact (optimal) solution. We introduce two accuracy measures and evaluate our approximated index accuracy performance.

Evaluating the Approximation: the recall and approximation-loss accuracy measures

We evaluate our approximated index deployment using rank-k recall and approximation loss.

Rank-k recall is widespread due to its usage in a standard approximated nearest neighbor search benchmarking tool. It calculates the fraction of approximated (top-k) results with a score (i.e., “metric” score) at least as high as the optimal, exact search, rank-k score. Usually, we robustify the threshold by adding a small “epsilon” number to the rank-k score.

Observe that the recall is a loose measure because high recall results might deviate significantly from the optimal solution. For example, when the optimal rank-k score is “significantly” smaller than the rank-k-1 score and the approximated solution contains rank-k scores only.

Approximation-loss computes the median top-k ordered score differences between the optimal and approximate-search scores for each test query. In other words, the differences between the optimal and approximate-search top-1 scores, top-2 scores, up to the top-k scores. The final measure is the average of these median values over a set of test queries.

Note the approximation-loss is more refined than the recall. It gives a better sense of the overall quality of the results.

index = pinecone.Index(name=index_name)
with concurrent.futures.ThreadPoolExecutor() as executor: 
    approx_res = i: index.unary_query( f['test'][i], top_k=100), range(NUM_TEST_QUERIES))  
exact_index = pinecone.Index(name=exact_index_name)
with concurrent.futures.ThreadPoolExecutor() as executor:     
    exact_res = i: exact_index.unary_query( f['test'][i], top_k=100), range(NUM_TEST_QUERIES))  
import numpy as np
def anns_recall(r_exact, r):
  assert(len(r_exact.scores) == len(r.scores))
  exact_rank_k_score = r_exact.scores[-1]
  indicator = [s >= exact_rank_k_score for s in r.scores]
  return sum(indicator) / len(indicator)

def approx_loss(r_exact, r):
  return np.quantile([ abs(ext_s - apprx_s) for ext_s, apprx_s in zip(r_exact.scores, r.scores)], 0.5)

recalls = []
a_loss = []
for exact_r, r in zip(exact_res, approx_res):
  recalls.append( anns_recall(exact_r, r) )
  a_loss.append(approx_loss(exact_r, r))

print("Accuracy results over 1000 test queries:")
print(f"The average recall @rank-k is {sum(recalls)/len(recalls)}")
print(f"The median approximation loss is {np.quantile(a_loss, 0.5)}")
Accuracy results over 1000 test queries:
The average recall @rank-k is 0.9577500000000042
The median approximation loss is 5.960464477539063e-08

The approximation-loss measure indicates that the approximated top-k items have nearly optimal scores. Note the recall value doesn’t reflect that. If we used the recall only, we’d conclude the approximated solution is not highly accurate.


Similarity search services like Pinecone can be benchmarked by performance and accuracy for any specific dataset. Performance is measured by indexing speed (VPS) and query throughput (QPS), and accuracy is measured by rank-k recall and approximation-loss.

The selected dataset, index type (exact vs approximate), number of shards, and number of replicas all affect the results, and can be configured to achieve the desired performance.

In this benchmarking tutorial we tested a sample dataset across both index types, one shard, and one or three replicas. It turned out the approximate-search index provided accurate results at high throughputs, and the throughput can be increased linearly by increasing the number of replicas.

You can run this notebook with your own dataset (of vector embeddings) and Pinecone API key to see expected performance for your own use case and configuration.

Contact us for help with benchmarking, optimizing, and deploying Pinecone.

Delete the Indexes

for name in pinecone.list_indexes():