Metadata filtering

You can limit your vector search based on vector metadata. Pinecone lets you:

  • Attach "metadata" key-value pairs to vectors in an index, and
  • Specify filter expressions when you query the data.

Background: What is metadata filtering?

Suppose you want to run a semantic search over vector embeddings of documents, but you only want to include documents labeled "finance" from the past two years.

You can add the metadata to those document embeddings within Pinecone, and then filter for those criteria when sending the query. Pinecone will search for similar vector embeddings only among those items that match the filter.
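
As a rough sketch of the scenario above, the filter for "finance" documents from the past two years could be built as follows. The metadata keys `category` and `year` and the helper name `finance_filter` are hypothetical, chosen only for illustration; the sketch assumes each document was stored with a numeric publication year.

```python
from datetime import date


def finance_filter(today: date, years_back: int = 2) -> dict:
    """Build a filter for documents labeled "finance" from recent years.

    Assumes hypothetical metadata keys "category" (string) and "year" (number).
    """
    return {
        "category": {"$eq": "finance"},
        "year": {"$gte": today.year - years_back},
    }


# The resulting dictionary would be passed as the `filter` argument of a query.
query_filter = finance_filter(date(2022, 6, 1))
print(query_filter)
```

Passing the date in explicitly, rather than reading the clock inside the helper, keeps the result reproducible.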

Metadata filtering accepts arbitrary filter expressions on metadata, and a filtered query returns the requested number of nearest neighbors from among the vectors that match the filter. In most cases, search latency is even lower than for unfiltered searches.

For more information on the benefits of using metadata filtering, see The Missing WHERE Clause in Vector Search.

Adding metadata in Pinecone indexes

You can associate a metadata payload with each vector in an index.

The metadata takes the form of key-value pairs in a JSON object, where:

The keys are strings.

The values are simple types, one of:

  • String
  • Number (integer or floating point; integers are converted to floating point)
  • List of String
  • List of Number
info

For best performance, aim to use fewer than 10 fields and less than 2 kB of metadata per vector. Reserve metadata for the fields you need to filter on, since fewer fields and less data improve performance. If you want to fetch larger amounts of data associated with your vectors, consider keeping that data in an external key-value store.

warning

Null metadata values are not supported. Instead of setting a key to hold a null value, we recommend you remove that key from the metadata payload.

For example, the following would be valid metadata payloads:

{
    "genre": "action",
    "year": 2020,
    "length_hrs": 1.5
}

{
    "color": "blue",
    "fit": "straight",
    "price": 29.99
}
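
The rules above can be checked client-side before upserting. The following is a minimal sketch, not part of the Pinecone client: the helper name `clean_metadata` is hypothetical, and excluding booleans is an assumption based on the list of supported types given here, which names only strings and numbers.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int in Python. It is excluded here as an
    # assumption, because the supported types listed above are only
    # strings and numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_supported(value) -> bool:
    # Scalars: string or number.
    if isinstance(value, str) or _is_number(value):
        return True
    # Lists: all strings or all numbers.
    if isinstance(value, list):
        return all(isinstance(v, str) for v in value) or all(
            _is_number(v) for v in value
        )
    return False


def clean_metadata(payload: dict) -> dict:
    """Drop null values and verify that everything else has a supported type."""
    cleaned = {}
    for key, value in payload.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        if value is None:
            continue  # null values are not supported, so the key is removed
        if not _is_supported(value):
            raise TypeError(f"unsupported metadata value for {key!r}: {value!r}")
        cleaned[key] = value
    return cleaned


payload = {"genre": "action", "year": 2020, "length_hrs": 1.5, "director": None}
print(clean_metadata(payload))
# {'genre': 'action', 'year': 2020, 'length_hrs': 1.5}
```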

Vector search capabilities

info

Pinecone's filtering query language is based on MongoDB's query and projection operators. We currently support a subset of those selectors.

Pinecone supports the following comparison operators, which can be combined with $and and $or:

  • $eq - Equal to (number, string)
  • $ne - Not equal to (number, string)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string)
  • $nin - Not in array (string)

Using arrays of strings as metadata values or as metadata filters

A vector with metadata payload...

{"genre":["comedy","documentary"]}

...means the "genre" takes on both values.

For example, queries with the following filters will match the vector:

{"genre":"comedy"}

{"genre": {"$in":["documentary","action"]}}

{"$and": [{"genre": "comedy"}, {"genre":"documentary"}]}

Queries with the following filter will not match the vector:

{"$and": [{"genre": "comedy"}, {"genre":"drama"}]}

Queries with the following filters are invalid and result in a query compilation error:

# INVALID QUERY:
{"genre": ["comedy", "documentary"]}
# INVALID QUERY:
{"genre": {"$eq": ["comedy", "documentary"]}}
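
To make the matching rules for list-valued metadata concrete, here is a small local model of them in Python. It is only an illustration of the behavior described above, not Pinecone's implementation. The function name `matches` is hypothetical, and the sketch deliberately covers just `$eq`, `$in`, the numeric comparisons, `$and`, and `$or`.

```python
import operator

_COMPARISONS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def _field_matches(value, condition) -> bool:
    # A bare value is shorthand for {"$eq": value}.
    if not isinstance(condition, dict):
        condition = {"$eq": condition}
    # Treat a scalar as a one-element list so both shapes share one code path.
    if value is None:
        values = []
    elif isinstance(value, list):
        values = value
    else:
        values = [value]
    for op, operand in condition.items():
        if op == "$eq":
            if isinstance(operand, list):
                raise ValueError("invalid query: $eq does not accept a list")
            ok = operand in values
        elif op == "$in":
            ok = any(v in operand for v in values)
        elif op in _COMPARISONS:
            ok = any(
                isinstance(v, (int, float)) and _COMPARISONS[op](v, operand)
                for v in values
            )
        else:
            raise ValueError(f"operator {op} is not covered by this sketch")
        if not ok:
            return False
    return True


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies every clause of the filter `flt`."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif not _field_matches(metadata.get(key), condition):
            return False
    return True


vector_metadata = {"genre": ["comedy", "documentary"]}
print(matches(vector_metadata, {"genre": "comedy"}))
print(matches(vector_metadata, {"$and": [{"genre": "comedy"}, {"genre": "drama"}]}))
```

The two calls at the end reproduce one matching and one non-matching filter from the examples above. Passing a list as an equality operand raises `ValueError`, which stands in for the query compilation error.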

Example: Search for documentaries

In this example, we search a movie database for documentaries from a particular year.

First, let's insert vectors with metadata into an index:

Python:

import pinecone

pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("example-index")

ids = ["A", "B", "C", "D", "E"]
vectors = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
]
metadata = [
    {"genre": "comedy", "year": 2020},
    {"genre": "documentary", "year": 2019},
    {"genre": "comedy", "year": 2019},
    {"genre": "drama"},
    {"genre": "drama"}
]
index.upsert(vectors=zip(ids, vectors, metadata))

curl:

curl -i -X POST https://YOUR_INDEX-YOUR_PROJECT.svc.us-west1-gcp.pinecone.io/vectors/upsert \
  -H 'Api-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": [
      {
        "id": "A",
        "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        "metadata": {"genre": "comedy", "year": 2020}
      },
      {
        "id": "B",
        "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
        "metadata": {"genre": "documentary", "year": 2019}
      },
      {
        "id": "C",
        "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
        "metadata": {"genre": "comedy", "year": 2019}
      },
      {
        "id": "D",
        "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
        "metadata": {"genre": "drama"}
      },
      {
        "id": "E",
        "values": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
        "metadata": {"genre": "drama"}
      }
    ]
  }'

Then we can submit a query for documentaries from 2019. This also uses the include_metadata flag so that vector metadata is included in the response.

warning

For performance reasons, do not return vector values or metadata when top_k > 1000. Queries with top_k over 1000 should not set include_metadata=True or include_values=True.

Python:

index.query(
    queries=[([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])],
    filter={
        "genre": {"$eq": "documentary"},
        "year": 2019
    },
    top_k=1,
    include_metadata=True
)

# Returns:
# {'results': [{'matches': [{'id': 'B',
#                            'metadata': {'genre': 'documentary', 'year': 2019.0},
#                            'score': 0.0800000429,
#                            'values': []}],
#               'namespace': ''}]}

curl:

curl -i -X POST https://YOUR_INDEX-YOUR_PROJECT.svc.us-west1-gcp.pinecone.io/query \
  -H 'Api-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "queries": [
      {"values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}
    ],
    "filter": {"genre": {"$eq": "documentary"}, "year": 2019},
    "topK": 1,
    "includeMetadata": true
  }'

# Output:
# {
#   "results": [
#     {
#       "matches": [
#         {
#           "id": "B",
#           "score": 0.0800000429,
#           "values": [],
#           "metadata": {
#             "genre": "documentary",
#             "year": 2019
#           }
#         }
#       ],
#       "namespace": ""
#     }
#   ]
# }
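
The guidance about large top_k values can also be enforced client-side before a query is sent. The sketch below is one possible guard, not part of the Pinecone client: the helper name `build_query_kwargs` and the constant are hypothetical, and the flag names follow the `index.query` calls used in this document.

```python
MAX_TOP_K_WITH_PAYLOAD = 1000  # threshold taken from the warning in this section


def build_query_kwargs(
    top_k: int, include_values: bool = False, include_metadata: bool = False
) -> dict:
    """Return keyword arguments for a query, rejecting payloads on large top_k."""
    if top_k > MAX_TOP_K_WITH_PAYLOAD and (include_values or include_metadata):
        raise ValueError(
            f"top_k={top_k} exceeds {MAX_TOP_K_WITH_PAYLOAD}: "
            "do not request vector values or metadata"
        )
    return {
        "top_k": top_k,
        "include_values": include_values,
        "include_metadata": include_metadata,
    }


kwargs = build_query_kwargs(top_k=1, include_metadata=True)
print(kwargs)
# {'top_k': 1, 'include_values': False, 'include_metadata': True}
```

The returned dictionary could then be unpacked into a query call, for example `index.query(queries=..., filter=..., **kwargs)`.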

More example filter expressions

A comedy, documentary, or drama:

{
    "genre": {"$in": ["comedy", "documentary", "drama"]}
}

A drama from 2020 or later:

{
    "genre": {"$eq": "drama"},
    "year": {"$gte": 2020}
}

A drama from 2020 or later (equivalent to the previous example):

{
    "$and": [
        {"genre": {"$eq": "drama"}},
        {"year": {"$gte": 2020}}
    ]
}
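
The equivalence between the implicit form (several keys in one object) and the explicit `$and` form can be expressed mechanically. This is a minimal sketch for illustration only. The helper name `to_explicit_and` is hypothetical and not part of any Pinecone API.

```python
def to_explicit_and(flt: dict) -> dict:
    """Rewrite a multi-key filter as an explicit $and of single-key filters."""
    if len(flt) <= 1:
        return flt  # nothing to combine
    return {"$and": [{key: condition} for key, condition in flt.items()]}


implicit = {"genre": {"$eq": "drama"}, "year": {"$gte": 2020}}
explicit = to_explicit_and(implicit)
print(explicit)
# {'$and': [{'genre': {'$eq': 'drama'}}, {'year': {'$gte': 2020}}]}
```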

A drama, or any movie from 2020 or later:

{
    "$or": [
        {"genre": {"$eq": "drama"}},
        {"year": {"$gte": 2020}}
    ]
}

Example: Adding a per-query-vector metadata filter

In this example, each query vector carries its own metadata filter, so the filter applied to the first query vector's result set differs from the one applied to the second's:

Python:

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("example-index")

query_response = index.query(
    queries=[
        ([0.1, 0.2, 0.3, 0.4], {"genre": {"$in": ["comedy", "documentary", "drama"]}}),
        ([0.2, 0.3, 0.4, 0.5], {"genre": {"$nin": ["documentary", "drama"]}})
    ],
    namespace="example-namespace",
    top_k=10,
    include_values=True,
    include_metadata=True
)

curl:

curl -i -X POST \
  'https://{index_name}-{project_name}.svc.{environment}.pinecone.io/query' \
  -H 'Api-Key: YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "namespace": "example-namespace",
    "topK": 10,
    "includeValues": true,
    "includeMetadata": true,
    "queries": [
      {
        "values": [0.1, 0.2, 0.3, 0.4],
        "filter": {"genre": {"$in": ["comedy", "documentary", "drama"]}}
      },
      {
        "values": [0.2, 0.3, 0.4, 0.5],
        "filter": {"genre": {"$nin": ["documentary", "drama"]}}
      }
    ]
  }'
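
When there are many query vectors, the request body can be assembled from (vector, filter) pairs instead of written by hand. The sketch below is an assumption-laden illustration: the helper name `build_query_body` is hypothetical, and the JSON field names are taken from the curl example in this section rather than from a separate API reference.

```python
import json


def build_query_body(
    pairs,
    top_k: int = 10,
    namespace=None,
    include_values: bool = False,
    include_metadata: bool = False,
) -> dict:
    """Build a JSON-serializable query body with one filter per query vector."""
    body = {
        "topK": top_k,
        "includeValues": include_values,
        "includeMetadata": include_metadata,
        # Each query entry pairs a vector with its own filter.
        "queries": [
            {"values": list(vector), "filter": flt} for vector, flt in pairs
        ],
    }
    if namespace is not None:
        body["namespace"] = namespace
    return body


body = build_query_body(
    [
        ([0.1, 0.2, 0.3, 0.4], {"genre": {"$in": ["comedy", "documentary", "drama"]}}),
        ([0.2, 0.3, 0.4, 0.5], {"genre": {"$nin": ["documentary", "drama"]}}),
    ],
    namespace="example-namespace",
    include_values=True,
    include_metadata=True,
)
print(json.dumps(body, indent=2))
```

The serialized string is what would be sent as the `-d` payload of the request.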