Metadata filtering

You can limit your search based on vector metadata. Pinecone lets you:

  • Attach "metadata" key-value pairs to vectors in an index, and
  • Specify filter expressions when you query the data.

Our Learn section explains the basics of vector databases and similarity search as a service.

For more information on querying, see: Query data.

Background: What is metadata filtering?

Suppose you want to run a semantic search over vector embeddings of documents, but you only want to include documents labeled “finance” from the past two years.

You can add the metadata to those document embeddings within Pinecone, and then filter for those criteria when sending the query. Pinecone will search for similar vector embeddings only among those items that match the filter.

Metadata filtering accepts arbitrary filters on metadata, and it retrieves exactly the requested number of nearest neighbors from among the vectors that match the filters. In most cases, search latency is even lower than for unfiltered searches.

Why do you need it?

To provide more relevant results, you often need to combine a vector similarity search with an arbitrary filter.

For example: You might want to run a semantic search over a corpus of documents, but only within certain categories. You might also want to exclude certain authors.

In the past, you had two options:

  • Pre-filtering

    • This uses metadata to filter records first and then searches through all the matched vectors.
    • Because the matched subset is arbitrary, the vector index cannot be used. The search falls back to a brute-force scan of the matched vectors, which is very inefficient.
  • Post-filtering

    • You would first retrieve a large set of nearest neighbors and then apply your metadata filters on the results.
    • There is a high latency penalty for retrieving more items than needed, and there is no guarantee the result set would include all the items you actually wanted.

For many organizations, neither option was a good way to combine metadata filters with vector search.
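
The trade-off between the two approaches can be seen in a small, self-contained sketch. It scores a toy in-memory list by brute-force dot product. The names `RECORDS`, `pre_filter_search`, and `post_filter_search` are illustrative only; none of this is Pinecone code, and it does not show how Pinecone implements filtering.

```python
# Toy illustration of pre-filtering vs. post-filtering (not Pinecone's implementation).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical records: (id, vector, metadata)
RECORDS = [
    ("A", [1.0, 0.0], {"genre": "comedy"}),
    ("B", [0.9, 0.1], {"genre": "comedy"}),
    ("C", [0.8, 0.2], {"genre": "comedy"}),
    ("D", [0.1, 0.9], {"genre": "finance"}),
    ("E", [0.0, 1.0], {"genre": "finance"}),
]

def is_finance(metadata):
    return metadata.get("genre") == "finance"

def pre_filter_search(query, predicate, top_k):
    # Filter first, then brute-force score every matching vector.
    matched = [r for r in RECORDS if predicate(r[2])]
    ranked = sorted(matched, key=lambda r: dot(query, r[1]), reverse=True)
    return [r[0] for r in ranked[:top_k]]

def post_filter_search(query, predicate, top_k, fetch_k):
    # Retrieve the fetch_k nearest neighbors first, then drop non-matching ones.
    ranked = sorted(RECORDS, key=lambda r: dot(query, r[1]), reverse=True)
    candidates = ranked[:fetch_k]
    return [r[0] for r in candidates if predicate(r[2])][:top_k]

query = [1.0, 0.0]
pre_results = pre_filter_search(query, is_finance, top_k=2)
post_results = post_filter_search(query, is_finance, top_k=2, fetch_k=3)
print(pre_results)   # ['D', 'E']
print(post_results)  # []
```

Here the three nearest neighbors are all comedies, so post-filtering with `fetch_k=3` returns nothing even though two matching records exist. Pre-filtering finds them, but only by scoring every matched vector.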

Adding metadata in Pinecone indexes

You can associate a metadata payload with each vector in an index.

The metadata takes the form of key-value pairs in a JSON object, where:

  • The keys are strings
  • The values are simple types. Either:

    • String, or
    • Number (integer or floating point).

Warning: Currently, null metadata values are not supported. Instead of setting a key to hold a null value, we recommend you remove that key from the metadata payload.

For example, the following would be valid metadata payloads:

{
    "genre": "action",
    "year": 2020,
    "length_hrs": 1.5
}

{
    "color": "blue",
    "fit": "straight",
    "price": 29.99
}
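
One way to follow these rules before upserting is to clean each payload on the client side. The helper below is a minimal sketch: `clean_metadata` is a hypothetical name, not part of the Pinecone client, and rejecting booleans is an assumption based on the value types listed above (strings and numbers only).

```python
def clean_metadata(payload):
    """Return a copy of payload without None values; raise on unsupported types."""
    cleaned = {}
    for key, value in payload.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key must be a string: {key!r}")
        if value is None:
            continue  # null values are not supported, so drop the key instead
        # Assumption: only str, int, and float are allowed. bool is a subclass
        # of int in Python, so it has to be rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise TypeError(f"unsupported metadata value for {key!r}: {value!r}")
        cleaned[key] = value
    return cleaned

cleaned = clean_metadata({"genre": "action", "year": 2020, "director": None})
print(cleaned)  # {'genre': 'action', 'year': 2020}
```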

Vector search capabilities

Metadata filters can be combined with AND and OR. The following filter operators are supported, with the value types each one accepts:

  • $eq - Equal to (number, string)
  • $ne - Not equal to (number, string)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string)
  • $nin - Not in array (string)
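
To make the meaning of these operators concrete, the sketch below evaluates a filter expression against a single metadata payload in plain Python. It is an illustrative model only, not Pinecone's implementation. `matches` and `COMPARATORS` are hypothetical names, and the handling of missing keys (a missing key never matches) is an assumption.

```python
import operator

# Operators from the list above, mapped to Python equivalents.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, filter_expr):
    """Return True if metadata satisfies filter_expr (illustrative semantics only)."""
    for key, condition in filter_expr.items():
        if key == "$or":
            # At least one sub-expression must match.
            if not any(matches(metadata, sub) for sub in condition):
                return False
            continue
        if key not in metadata:
            return False  # assumption: a missing key never matches
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # a bare value is shorthand for $eq
        value = metadata[key]
        # Every operator on this key must hold; multiple keys are combined with AND.
        if not all(COMPARATORS[op](value, operand) for op, operand in condition.items()):
            return False
    return True

movie = {"genre": "documentary", "year": 2019}
print(matches(movie, {"genre": {"$eq": "documentary"}, "year": 2019}))  # True
print(matches(movie, {"genre": {"$in": ["comedy", "drama"]}}))          # False
print(matches(movie, {"$or": [{"genre": {"$eq": "drama"}}, {"year": {"$gte": 2019}}]}))  # True
```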

Example: Search for documentaries

In this example, we search a movie database for documentaries from a particular year.

First, let's insert vectors with metadata into an index:

Python:
import pandas as pd
import pinecone

pinecone.init(api_key="your-api-key")
index = pinecone.Index("example-index-name")

# Five 128-dimensional vectors, each with a metadata payload.
df = pd.DataFrame(data={
    "id": ["A", "B", "C", "D", "E"],
    "vector": [[0.1]*128, [0.2]*128, [0.3]*128, [0.4]*128, [0.5]*128],
    "metadata": [{"genre": "comedy", "year": 2020}, {"genre": "documentary", "year": 2019}, {"genre": "comedy", "year": 2019}, {"genre": "drama"}, {"genre": "drama"}],
})
index.upsert(vectors=zip(df.id, df.vector, df.metadata))

curl:
curl -i -X POST \
  -H 'Api-Key: YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  'https://example-index-name-example-project.svc.beta.pinecone.io/vectors/upsert' \
  -d '{
    "vectors": [
      {
        "id": "A",
        "values": [
          1.22, 2.23, 3.73  // ...
        ],
        "metadata": {"genre": "comedy", "year": 2020}
      },
      {
        "id": "B",
        "values": [
          2.23, 3.73, 4.84  // ...
        ],
        "metadata": {"genre": "documentary", "year": 2019}
      },
      {
        "id": "C",
        "values": [
          3.73, 4.84, 5.95  // ...
        ],
        "metadata": {"genre": "documentary"}
      }
    ]
  }'

Then we can submit a query for documentaries from 2019. This also uses the include_metadata flag so that vector metadata is included in the response.

Python:
query_response = index.query(
    # The query vector must have the same dimension (128) as the indexed vectors.
    queries=[[0.1]*128],
    filter={
        "genre": {"$eq": "documentary"},
        "year": 2019
    },
    top_k=3,
    include_metadata=True
)

curl:
curl -i -X POST \
  -H 'Api-Key: YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  'https://example-index-name-example-project.svc.beta.pinecone.io/query' \
  -d '{
    "queries": [
      {"values": [
          0.1, 0.2, 0.3, 0.4  // ...
      ]}
    ],
    "filter": {"genre": {"$eq": "documentary"}, "year": 2019},
    "topK": 3,
    "includeMetadata": true
  }'

More example filter expressions

A comedy, documentary, or drama:

{
    "genre": {"$in": ["comedy", "documentary", "drama"]}
}

A drama from 2020 or later:

{
    "genre": {"$eq": "drama"},
    "year": {"$gte": 2020}
}

A drama, or any movie from 2020 or later:

{
    "$or": [
        {"genre": {"$eq": "drama"}},
        {"year": {"$gte": 2020}}
    ]
}

A movie from the 1980s or 2000s:

{
    "year": {
        "$or": [
            {"$gte": 1980, "$lt": 1990},
            {"$gte": 2000, "$lt": 2010}
        ]
    }
}