

An open-source multimodal model from OpenAI, trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet.
Dimension: Size of a single vector supported by this model. 768 or 2048.
Distance Metric: Used to measure similarity between vectors. Cosine or dot product.
Max Seq. Length: Number of tokens the model can process at once.
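
As a small illustration of the two distance metrics listed above, the following sketch shows that cosine similarity and dot product agree once vectors are scaled to unit length. The helper names and the three-dimensional toy vectors are illustrative assumptions, not part of any library or of the model's real output.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (raises on the zero vector)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings (illustrative only).
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

cos_ab = cosine_similarity(a, b)
dot_unit = dot(l2_normalize(a), l2_normalize(b))
```

On unit-length vectors the two metrics rank results identically, so normalizing embeddings before storing them lets either metric be used.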


CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. It learns from unfiltered, highly varied, and highly noisy data, and is intended to be used in a zero-shot manner. CLIP struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo.

The model allows people to design their own classifiers and removes the need for task-specific training data.
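
One way to picture "designing your own classifier" is the following minimal sketch: each candidate label is written as a text prompt, and the image is assigned to the prompt whose embedding is most similar. The function names, the `scale` parameter and its default, and the three-dimensional toy vectors are all assumptions made for illustration; real embeddings would come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Return {label: probability} from scaled cosine similarities.

    `scale` is a hypothetical temperature; 100.0 is an illustrative default.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[lab]) for lab in labels]
    return dict(zip(labels, softmax([scale * s for s in sims])))


# Toy embeddings (illustrative only, not real model output).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
```

Changing the classifier then only means changing the list of prompts; no labeled training set is involved.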

Using the Model

Sample Multimodal Data:
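
The original sample data is not shown here, so the following is only a placeholder sketch of what paired (image, text) data might look like. The `Sample` class, the file paths, and the captions are hypothetical and would be replaced with real files.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Sample:
    """One (image, text) pair; a hypothetical container for illustration."""
    image_path: Path
    caption: str


# Placeholder paths and captions; substitute your own data.
samples = [
    Sample(Path("images/dog.jpg"), "a photo of a dog playing in a park"),
    Sample(Path("images/cat.jpg"), "a photo of a cat sleeping on a sofa"),
    Sample(Path("images/car.jpg"), "a photo of a red car on a street"),
]

# Parallel lists are convenient for batch embedding later on.
texts = [s.caption for s in samples]
image_paths = [s.image_path for s in samples]
```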

Instantiate Model:
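
The original snippet is not shown here. One possible way to load CLIP, assuming the Hugging Face `transformers` package is installed, is sketched below. The checkpoint name `openai/clip-vit-large-patch14` is an assumption chosen because it is expected to produce 768-dimensional embeddings; `load_clip` is a hypothetical helper name.

```python
# Assumed checkpoint; other CLIP checkpoints have different embedding sizes.
MODEL_ID = "openai/clip-vit-large-patch14"


def load_clip(model_id=MODEL_ID):
    """Load a CLIP model and its processor (downloads weights on first use)."""
    # Imported lazily so that defining this helper does not require
    # `transformers` (and its torch backend) to be installed.
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout and similar layers
    return model, processor
```

Calling `model, processor = load_clip()` returns both objects; `model.config.projection_dim` then reports the embedding size, which can be checked against the dimension configured for the vector index.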

Text Embeddings:
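
The original snippet is not shown here. A minimal sketch of text embedding, assuming `model` and `processor` are a `CLIPModel` and `CLIPProcessor` from Hugging Face `transformers`, and a release in which `get_text_features` returns a tensor, could look like this. `embed_texts` is a hypothetical helper name.

```python
def embed_texts(model, processor, texts):
    """Return one L2-normalized embedding (list of floats) per input string."""
    import torch  # lazy import: only needed when the function is called

    # Tokenize; pad to a common length and truncate over-long inputs.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # Assumption: returns a (batch, dim) tensor in this transformers release.
        features = model.get_text_features(**inputs)
    # Normalize so cosine similarity and dot product give the same ranking.
    features = features / features.norm(dim=-1, keepdim=True)
    return features.tolist()
```

The returned lists can be passed directly to a vector database as dense vectors.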

Image Embeddings:
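
The original snippet is not shown here. A minimal sketch of image embedding, assuming `model` and `processor` are a `CLIPModel` and `CLIPProcessor` from Hugging Face `transformers`, that Pillow is installed, and a release in which `get_image_features` returns a tensor, could look like this. `embed_images` is a hypothetical helper name.

```python
def embed_images(model, processor, image_paths):
    """Return one L2-normalized embedding (list of floats) per image file."""
    import torch  # lazy imports: only needed when the function is called
    from PIL import Image

    # Load each file and force RGB so grayscale or RGBA images also work.
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        # Assumption: returns a (batch, dim) tensor in this transformers release.
        features = model.get_image_features(**inputs)
    # Normalize so cosine similarity and dot product give the same ranking.
    features = features / features.norm(dim=-1, keepdim=True)
    return features.tolist()
```

Because CLIP projects text and images into a shared space, these vectors can be compared directly with text embeddings from the same model.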
