Create Indexes

Overview

Creating a Pinecone index takes a single call. This example creates an index that uses Euclidean distance as its measure of similarity.

import pinecone

pinecone.init(">>>YOUR_API_KEY<<<")

pinecone.create_index("pinecone-index", metric="euclidean", engine_type="approximated", shards=1, replicas=1)

Once the index is created, you can start inserting vectors and getting query results.

import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    "id": [f"id-{ii}" for ii in range(10000)],
    "vector": [ii + np.zeros(2) for ii in range(10000)]
})

# connect to the index
index = pinecone.Index("pinecone-index")

# insert vectors
index.upsert(items=zip(df.id, df.vector))

# query the index and get similar vectors
index.query(queries=[[0, 1]], top_k=3)

When your similarity search service is no longer needed, you can delete the index and all of the data.

pinecone.delete_index("pinecone-index")

Creating an index with default settings is usually sufficient for millions of low-dimensional vectors with a moderate Queries Per Second (QPS) requirement.

An index can be partitioned into namespaces during upserts.

Keep reading to learn how to scale up your index or use a different measure of similarity.


Parameters

Engine types (Optional)

One of approximated or exact.

Pinecone currently supports two types of index search algorithms: approximate nearest neighbor search and exact nearest neighbor search.

The approximated engine uses fast approximate search algorithms developed by Pinecone; it is fast and highly accurate.

The exact engine performs an exhaustive search over the stored vectors, so it is usually slower than the approximated engine.
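
To make the idea of an exhaustive search concrete, here is a minimal sketch in plain Python. It is not Pinecone's implementation; the function name `exact_top_k` and the sample vectors are hypothetical and chosen only for illustration. Every stored vector is compared against the query, which is why the cost grows with the size of the index.

```python
import math

def exact_top_k(query, vectors, top_k):
    """Return the top_k (position, distance) pairs closest to query.

    Illustrative brute-force search using Euclidean distance:
    every stored vector is scored, then the results are sorted.
    """
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# Hypothetical toy data, not taken from a real index
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.0, 1.0]]
neighbors = exact_top_k([0.0, 0.9], vectors, top_k=2)
print(neighbors)
```

An approximate engine avoids scoring every vector, trading a small amount of accuracy for speed.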

Metrics

One of cosine, dotproduct, or euclidean. Defaults to cosine.

Use cosine for cosine similarity, dotproduct for max-dot-product, and euclidean for Euclidean distance.

The best metric depends on your application: for a given dataset, some metrics yield better recall and precision than others.
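
The three metrics can be sketched in a few lines of plain Python using their standard textbook definitions. The helper names below are hypothetical and are not part of the Pinecone client; the sketch only shows how the scores differ for the same pair of vectors.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two vectors pointing in the same direction with different lengths
a = [1.0, 0.0]
b = [2.0, 0.0]

print(cosine_similarity(a, b))   # direction only
print(dot_product(a, b))         # direction and magnitude
print(euclidean_distance(a, b))  # absolute separation
```

Here cosine similarity treats the two vectors as identical in direction, while the dot product and the Euclidean distance both reflect the difference in length.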

Shards (Optional)

By intelligently sharding your data, a Pinecone index can store billions of vectors and still achieve high accuracy and low latency.

As a general guideline, add 1 shard to the index for each additional GB of data.

For example, one million 32-dimensional vectors would take about 150MB of storage.
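
The guideline above can be turned into a rough back-of-the-envelope calculation. The sketch below assumes each vector component is stored as a 4-byte float and reads "one shard per additional GB" as rounding the data size up to whole gigabytes; both are assumptions rather than documented Pinecone behavior, and the helper names are hypothetical. The raw figure it prints is lower than the roughly 150MB quoted above because it ignores IDs and any index overhead.

```python
import math

BYTES_PER_COMPONENT = 4  # assumption: 32-bit floats

def raw_vector_bytes(num_vectors, dim):
    """Bytes needed for the vector values alone (no IDs, no index overhead)."""
    return num_vectors * dim * BYTES_PER_COMPONENT

def suggested_shards(total_bytes, bytes_per_shard=10**9):
    """Assumed reading of the guideline: one shard per started GB, at least one."""
    return max(1, math.ceil(total_bytes / bytes_per_shard))

raw = raw_vector_bytes(1_000_000, 32)
print(f"raw size: {raw / 10**6:.0f} MB")
print(f"suggested shards: {suggested_shards(raw)}")
```

Treat the result as a starting point and adjust it against the actual size of your data.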

Replicas (Optional)

Replicas duplicate your index to help with concurrent access. Increasing the number of replicas increases throughput (QPS). We recommend using at least 2 replicas if your application needs high availability (99.99% uptime) for querying.


Example

This is an example of a simple nearest-neighbor classifier. The data are sampled from two multivariate normal distributions.

Given an unknown vector, we will build a classifier to determine which multivariate normal this vector is more likely to belong to, using the majority class label of its nearest neighbors.



"""Generate data from multivariate normal distributions"""

import numpy as np
import pandas as pd
from collections import Counter

sample_size = 50000
dim = 10
A_mean = 0
B_mean = 2

# Create multivariate normal samples
A_vectors = A_mean + np.random.randn(sample_size, dim)
B_vectors = B_mean + np.random.randn(sample_size, dim)

# Query data generated from A distribution
query_size = 20
A_queries = A_mean + np.random.randn(query_size, dim)


"""Build a classifier using Pinecone"""

import pinecone

pinecone.init(">>>YOUR_API_KEY<<<")

# Create an index
index_name = 'simple-knn-classifier'
pinecone.create_index(index_name, metric="euclidean")

# Connect to the index
index = pinecone.Index(index_name)

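# Upload the sample data formatted as (id, vector) tuples.
# list(...) splits each 2-D array into one 1-D vector per row,
# because a DataFrame column cannot hold a 2-D array directly.
A_df = pd.DataFrame(data={
    "id": [f"A-{ii}" for ii in range(len(A_vectors))],
    "vector": list(A_vectors)
})
B_df = pd.DataFrame(data={
    "id": [f"B-{ii}" for ii in range(len(B_vectors))],
    "vector": list(B_vectors)
})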
acks = index.upsert(items=zip(A_df.id, A_df.vector))
acks = index.upsert(items=zip(B_df.id, B_df.vector))

# We expect most of a query's nearest neighbors to come from the A distribution
for result in index.query(queries=A_queries, top_k=10):
    cc = Counter(id_.split("-")[0] for id_ in result.ids)
    print(f"Count nearest neighbors' class labels: A={cc['A']}, B={cc['B']}")

# Delete the index
pinecone.delete_index(index_name)
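
For readers who want to try the classification logic without an API key, the following is a local brute-force sketch of the same idea using only the Python standard library. It is not how Pinecone searches; the smaller sample size, the fixed seed, and the helper names `sample` and `classify` are assumptions made to keep the sketch short and repeatable.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so repeated runs draw the same samples

dim = 10
sample_size = 200  # assumption: far smaller than above, to keep brute force fast
top_k = 10

def sample(mean, count):
    """Draw `count` vectors whose components are independent N(mean, 1)."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]

# Labelled reference points: class A centred at 0, class B centred at 2
labelled = [("A", v) for v in sample(0.0, sample_size)]
labelled += [("B", v) for v in sample(2.0, sample_size)]

def classify(query):
    """Return the majority label among the top_k nearest labelled points."""
    nearest = sorted(labelled, key=lambda item: math.dist(query, item[1]))[:top_k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

# Queries drawn from the A distribution should mostly be labelled "A"
predictions = [classify(q) for q in sample(0.0, 20)]
print(Counter(predictions))
```

Because the two distributions are well separated, most queries drawn from A should receive the label A.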