Insert data

After creating a Pinecone index, you can start inserting data (vector embeddings) into the index.

Learn more

Our Learn section explains the basics of vector databases and similarity search as a service.

Preparing the data

When you insert data into a Pinecone index, you upload (id, vector) pairs, much as you would write key-value pairs to a key-value store:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
    "id": ["A", "B", "C", "D", "E"],
    "vector": [ii + np.ones(2) for ii in range(5)]
})

Inserting the vectors

  1. Connect to the index:

Python

index = pinecone.Index("pinecone-index")

curl

# Not applicable

  2. Insert the data as a list of (id, vector) tuples. Use the Upsert operation to write vectors into a namespace:

Python

acks = index.upsert(vectors=zip(df.id, df.vector))
print(acks[:2])

curl
curl -i -X POST \
  'https://{index_name}-{project_name}.svc.{environment}.pinecone.io/vectors/upsert' \
  -H 'Api-Key: YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": [
      {
        "id": "id-0",
        "values": [
          1.22,
          2.234,
          3.73
        ],
        "metadata": {"mykey" : "myvalue"}
      }
    ]
  }'
note

UpsertResult(id='A') is an acknowledgement that the vector with id="A" has been inserted successfully.

Vectors may not be visible to queries immediately after the upsert response is received. In most situations, you can check whether the vectors have been received by calling describe_index_stats() and waiting for the returned vector counts to update. This technique may not work if the index has multiple replicas; the database is eventually consistent.
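
As a rough illustration, the describe_index_stats() check described above can be wrapped in a small polling helper. The helper below is a sketch, not part of the Pinecone client; in real use, get_count could wrap a call such as lambda: index.describe_index_stats()["total_vector_count"].

```python
import time

def wait_for_index_count(get_count, expected, timeout=30.0, interval=1.0):
    """Poll a zero-argument vector-count callable until it reaches
    `expected` or the timeout elapses. Returns True on success.

    Illustrative helper only; with a live index, `get_count` would wrap
    index.describe_index_stats().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_count() >= expected:
            return True
        time.sleep(interval)
    return False

# Demonstration with a stubbed counter standing in for the index:
counts = iter([0, 3, 5])
assert wait_for_index_count(lambda: next(counts), expected=5, interval=0.0)
```

Because the index is eventually consistent, treat a timeout as "not yet visible" rather than as a failed upsert.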

Batching upserts

When upserting larger amounts of data, insert the vectors in batches over multiple upsert requests. For example, this can be done as follows:

import random
import itertools

def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

vector_dim = 128
vector_count = 10000

# Example generator that generates many (id, vector) pairs
example_data = map(lambda i: (f'vec{i}', [random.random() for _ in range(vector_dim)]), range(vector_count))

# Upsert data with 100 vectors per upsert request
for ids_vectors_chunk in chunks(example_data, batch_size=100):
    index.upsert(vectors=ids_vectors_chunk)  # Assuming `index` is defined elsewhere

If you experience slow uploads, see Performance tuning for advice.

Partitioning an index into namespaces

You can organize the vectors added to an index into a small set of partitions, or "namespaces", in order to limit queries and other vector operations to only one such namespace at a time. For more information, see: Namespaces.
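
For illustration only, here is one way to route vectors into namespaces before upserting. The routing rule (the prefix of each id) and all names below are made up for this sketch; choose whatever partitioning fits your data.

```python
from collections import defaultdict

# Hypothetical (id, vector) pairs to be split across namespaces.
vectors = [
    ("user-1", [0.1, 0.2]),
    ("item-1", [0.3, 0.4]),
    ("user-2", [0.5, 0.6]),
]

# Group vectors by the id prefix, one group per namespace.
by_namespace = defaultdict(list)
for vec_id, values in vectors:
    namespace = vec_id.split("-")[0]  # "user" or "item"
    by_namespace[namespace].append((vec_id, values))

# One upsert call per namespace (commented out; requires a live index):
# for namespace, chunk in by_namespace.items():
#     index.upsert(vectors=chunk, namespace=namespace)
```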

Inserting vectors with metadata

You can insert vectors that contain metadata key-value pairs.

You can then use the metadata to filter for those criteria when sending the query. Pinecone will search for similar vector embeddings only among those items that match the filter. For more information, see: Metadata Filtering.

For example:

# upsert with metadata

index = pinecone.Index("example-index-name")

upsert_response = index.upsert(
    vectors=[
        ("vec1", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], {"genre": "drama", "year": 2020}),
        ("vec2", [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], {"genre": "action", "year": 2021}),
    ],
    namespace="example-namespace",
)
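
As a companion sketch, a later query could restrict results to vectors whose metadata matches a filter. The operators below ($eq, $gte) follow Pinecone's metadata filter syntax; the query call itself is commented out because it needs a live index, and the vector and parameter values are placeholders.

```python
# Build a metadata filter matching dramas from 2020 onward.
metadata_filter = {
    "genre": {"$eq": "drama"},
    "year": {"$gte": 2020},
}

# The query (commented out here) would pass the filter alongside the
# query vector and namespace:
# query_response = index.query(
#     vector=[0.1] * 8,
#     top_k=1,
#     filter=metadata_filter,
#     namespace="example-namespace",
# )
```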
