Using Semantic Search to Find GIFs

Vector search powers some of the most popular services in the world. It serves your Google results, delivers the best podcasts on Spotify, and accounts for at least 35% of consumer purchases on Amazon [1][2].

In this article, we will use vector search applied to language, called semantic search, to build a GIF search engine. Unlike more traditional search where we rely on keyword matching, semantic search enables search based on the human meaning behind text and images. That means we can find highly relevant GIFs with natural language prompts.

Preview of the GIF search app available here.

The pipeline for a project like this is simple, yet powerful. It can easily be adapted to tasks as diverse as video search or answering Super Bowl questions, or as we’ll see, finding GIFs.

All supporting notebooks and scripts can be found here.

GIF Dataset

We will be using the TGIF dataset found on GitHub here. To get the dataset we can use wget (alternatively, download it manually), and unzip.

wget https://github.com/raingo/TGIF-Release/archive/master.zip

unzip master.zip

In these unzipped files we should be able to find a file named tgif-v1.0.tsv inside the data directory. We’ll use pandas to load the file using \t as the field delimiter.

The dataset contains GIF URLs and their descriptions in natural language. We can take a look at the first five GIFs.

We will find that there are some duplicate URLs, but these do not necessarily indicate duplicate records as a single GIF can be assigned multiple descriptions.

With our data, we can move on to building the search pipeline.

Search

The search pipeline will at a high level take our natural language query like “a dog talking on the phone” and search through the existing GIF descriptions for anything that has a similar meaning to this query.

In this context, we describe meaning as semantic similarity, both of which are loaded terms and could refer to many things. For example, are the two phrases "the dog eats lunch" and "the dog does not eat lunch" similar? In this case, it would depend very much on our use case.

Another example: which of the following sentences which two are the most similar?

A: the stock market took a turn for the worse
B: how did the stock market do today?
C: the stock market performed worse than expected

If we wanted to find phrases with similar meaning then the obvious choice would be A and C. Matching those with B would make little sense. However, this is not the case if we are searching for similar question-answer pairs; in that case, B should match very closely with A and C.

It’s important to identify what your use case requires in it’s definition of “semantic similarity”. For us, we really want to identify generic similarity. That is, we want A and C to match, and B to not match either of those.

To do this we will transform our phrases into dense vector embeddings. These dense vectors can be stored in a vector database where we can very quickly compare vectors and identify those that are most similar based on metrics like Euclidean distance and cosine similarity.

euclidean-cosine Both of these metrics identify the similarity (proximity) of vectors, but they do it based on distance (left) or angular similarity (right).

The vector database handles the storage and fast search of our vector embeddings, but we still need a way to create these embeddings. To do that we use NLP transformer models called retrievers that are fine-tuned for creating sentence embeddings. These sentence embeddings/vectors are able to numerically represent the meaning behind the text that they represent.

retriever-to-vector-space Retriever models are able to take two semantically similar phrases and encode them as similar vectors.

Putting these two components together gives us a semantic search pipeline that we can use to retrieve semantically similar GIF descriptions given a query.

GIF search pipeline covering the one-time indexing step (left) and querying (right).

Let’s take a look at how we can put all of this together.

Initializing Components

We will start by initializing our retriever model. Many of the most powerful retrievers use a sentence transformer architecture, which are best supported via the sentence-transformers library, installed via a pip install sentence-transformers.

To find sentence transformer models we go to huggingface.co/models and search for sentence-transformers for the official sentence transformer models. However, there are other models we can use like the all-MiniLM-L6-v2 sentence transformer trained during a special event on over 1B training pairs. We will use this model.

There are a couple of important details here:

  • max_sequence_length=128 means the model can read up to 128 input tokens.
  • word_embedding_size=384 actually refers to the sentence embedding size. This means the model will output a 384-dimensional vector representation of the input text.

For the short several-word GIF descriptions of our dataset a maximum sequence length of 128 is more than enough.

We need to use the sentence embedding size when initializing our vector database, so we store that in the embed_dim variable above.

To initialize our vector database we first need to sign up for a free Pinecone API key and install the Pinecone Python client via pip install pinecone-client. Once ready, we initialize:

Here, we are specifying an index name of 'gif-search'; feel free to choose anything you like. It is simply a name. The metric is more important and depends on the model being used. For our chosen model we can see in its model card that it has been trained to use cosine similarity, hence why we have specified metric='cosine'. Alternative metrics include euclidean and dotproduct.

We’ve initialized both our vector database and retriever components, so we can move on to embedding and indexing our data.

Indexing

The embedding and indexing process is much faster when we perform these steps for multiple records in parallel. However, we cannot process all of our records at once as the retriever model must shift everything it is embedding into on-chip memory, which is limited.

To avoid this limit, while keeping indexing times as fast as possible, we process everything in batches of 64.

Here we are extracting the batch from our data df. We encode the descriptions via our retriever model, create metadata (covering both descriptions and url), and create some string format IDs. From this we have everything we need to create documents, which will look like this:

(
       "some-id-value",
   [0.1, 0.2, 0.1, 0.4 ...],
   {
       'description': "something descriptive",
       'url': "https://xyz.com"
   }
)

When we upsert these documents to the Pinecone index, we do so in batches of 64. After all of this, we use index.describe_index_stats() to check that we have inserted all 125,782 documents, which we have.

Querying

The final step of querying our data covers:

  1. Encoding a query like “dogs talking on the phone” to create a query vector,
  2. Retrieval of similar context vectors from Pinecone,
  3. Getting relevant GIFs from the URLs found in our metadata fields.
The querying pipeline.

Steps one and two will be performed by a function named search_gif:

To display the GIFs we display HTML <img> elements using the metadata URLs to point to the correct GIFs. We do this using the display_gif function:

Let’s test some queries.

That looks pretty accurate so we’ve managed to put this GIF search pipeline together very easily. With a little added effort we can translate these steps in creating a web app using something like Streamlit.

We’ve successfully built a GIF search tool using a simple semantic search pipeline with out-of-the-box models and Pinecone. This same pipeline can be applied across a variety of domains with very little tweaking.

The ease-of-use and potential of both vector and semantic search have led to a lot of research and applications of both technologies beyond the world of big tech. If you’re interested in seeing other applications of this technology, or would like to share your own, considering joining the Pinecone community.

Resources

Article Notebooks and Scripts

[1] M. Osborne, How Retail Brands Can Compete And Win Using Amazon’s Tactics (2017), Forbes

[2] L. Hardesty, The history of Amazon’s recommendation algorithm (2019), Amazon Science


Comments

What will you build?

Upgrade your search or recommendation systems with just a few lines of code, or contact us for help.