# Using Semantic Search to Find GIFs > In this article, we will use vector search applied to language, called semantic search, to build a GIF search engine. Unlike more traditional search where we rely on keyword matching, semantic search enables search based on the human meaning behind text and images. That means we can find highly relevant GIFs with natural language prompts. James Briggs · 2023-06-30 Vector search powers some of the most popular services in the world. It serves your Google results, delivers the [best podcasts on Spotify](https://www.pinecone.io/learn/series/wild/spotify-podcast-search/), and accounts for _at least_ 35% of consumer purchases on Amazon [1][2]. In this article, we will use vector search applied to language, called _semantic_ search, to build a GIF search engine. Unlike more traditional search where we rely on _keyword_ matching, semantic search enables search based on the _human meaning_ behind text and images. That means we can find highly relevant GIFs with natural language prompts. [Video](https://d33wubrfki0l68.cloudfront.net/6f2e76bbfaef3019cbc629e3bff67c9bb0d4f44b/e2898/images/gif-search-1.mp4) Preview of the GIF search app [available here](https://share.streamlit.io/pinecone-io/playground/gif-search/src/server.py). The pipeline for a project like this is simple, yet powerful. It can easily be adapted to tasks as diverse as [video search](https://www.pinecone.io/learn/youtube-search/) or [answering Super Bowl questions](https://www.pinecone.io/learn/series/nlp/question-answering/), or as we’ll see, finding GIFs. [All supporting notebooks and scripts can be found here](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/gif-search/). ## GIF Dataset [Video](https://www.youtube.com/watch?v=xXsDIK9z_fg) We will be using the TGIF dataset found [on GitHub here](https://github.com/raingo/TGIF-Release). To get the dataset we can use `wget` (alternatively, download it manually), and unzip. ```python wget https://github.com/raingo/TGIF-Release/archive/master.zip unzip master.zip ``` In these unzipped files we should be able to find a file named `tgif-v1.0.tsv` inside the `data` directory. We’ll use _pandas_ to load the file using `\t` as the field delimiter. ```json { "_key": "bfa8c8f918a5", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 3,\n \"metadata\": {\n \"id\": \"K8RvBSYSbvUb\"\n },\n \"outputs\": [\n {\n \"data\": {\n \"text/html\": [\n \"

\\n\",\n \"\\n\",\n \"\\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \" \\n\",\n \"

	url	description
0	https://38.media.tumblr.com/9f6c25cc350f12aa74...	a man is glaring, and someone with sunglasses ...
1	https://38.media.tumblr.com/9ead028ef62004ef6a...	a cat tries to catch a mouse on a tablet
2	https://38.media.tumblr.com/9f43dc410be85b1159...	a man dressed in red is dancing.
3	https://38.media.tumblr.com/9f659499c8754e40cf...	an animal comes close to another in the jungle
4	https://38.media.tumblr.com/9ed1c99afa7d714118...	a man in a hat adjusts his tie and makes a wei...

\\n\",\n \"

\"\n ],\n \"text/plain\": [\n \" url \\\\\\n\",\n \"0 https://38.media.tumblr.com/9f6c25cc350f12aa74... \\n\",\n \"1 https://38.media.tumblr.com/9ead028ef62004ef6a... \\n\",\n \"2 https://38.media.tumblr.com/9f43dc410be85b1159... \\n\",\n \"3 https://38.media.tumblr.com/9f659499c8754e40cf... \\n\",\n \"4 https://38.media.tumblr.com/9ed1c99afa7d714118... \\n\",\n \"\\n\",\n \" description \\n\",\n \"0 a man is glaring, and someone with sunglasses ... \\n\",\n \"1 a cat tries to catch a mouse on a tablet \\n\",\n \"2 a man dressed in red is dancing. \\n\",\n \"3 an animal comes close to another in the jungle \\n\",\n \"4 a man in a hat adjusts his tie and makes a wei... \"\n ]\n },\n \"execution_count\": 3,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n }\n ],\n \"source\": [\n \"import pandas as pd\\n\",\n \"\\n\",\n \"# Load dataset to a pandas dataframe\\n\",\n \"df = pd.read_csv(\\n\",\n \" \\\"./TGIF-Release-master/data/tgif-v1.0.tsv\\\",\\n\",\n \" delimiter=\\\"\\\\t\\\",\\n\",\n \" names=['url', 'description']\\n\",\n \")\\n\",\n \"df.head()\"\n ]\n }\n ],\n \"metadata\": {\n \"accelerator\": \"GPU\",\n \"colab\": {\n \"collapsed_sections\": [],\n \"name\": \"gif_search.ipynb\",\n \"provenance\": []\n },\n \"interpreter\": {\n \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n },\n \"kernelspec\": {\n \"display_name\": \"Python 3.9.12 ('ml')\",\n \"language\": \"python\",\n \"name\": \"python3\"\n },\n \"language_info\": {\n \"codemirror_mode\": {\n \"name\": \"ipython\",\n \"version\": 3\n },\n \"file_extension\": \".py\",\n \"mimetype\": \"text/x-python\",\n \"name\": \"python\",\n \"nbconvert_exporter\": \"python\",\n \"pygments_lexer\": \"ipython3\",\n \"version\": \"3.9.12\"\n },\n \"widgets\": {}\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}" } ``` The dataset contains GIF URLs and their descriptions in natural language. We can take a look at the first five GIFs. ```json { "_key": "563994d4e182", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 55,\n \"metadata\": {\n \"colab\": {\n \"base_uri\": \"https://localhost:8080/\",\n \"height\": 577\n },\n \"id\": \"m0_jfDW6hl4C\",\n \"outputId\": \"bcfb0ae3-4c44-4354-e42d-93a3ee35ff2d\"\n },\n \"outputs\": [\n {\n \"data\": {\n \"text/html\": [\n \"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 55,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"a man is glaring, and someone with sunglasses appears.\\n\"\n ]\n },\n {\n \"data\": {\n \"text/html\": [\n \"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 55,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"a cat tries to catch a mouse on a tablet\\n\"\n ]\n },\n {\n \"data\": {\n \"text/html\": [\n \"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 55,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"a man dressed in red is dancing.\\n\"\n ]\n },\n {\n \"data\": {\n \"text/html\": [\n \"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 55,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"an animal comes close to another in the jungle\\n\"\n ]\n },\n {\n \"data\": {\n \"text/html\": [\n \"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 55,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"a man in a hat adjusts his tie and makes a weird face.\\n\"\n ]\n }\n ],\n \"source\": [\n \"for _, gif in df[:5].iterrows():\\n\",\n \" HTML(f\\\"

\"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 7,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n },\n {\n \"name\": \"stdout\",\n \"output_type\": \"stream\",\n \"text\": [\n \"two girls dressed in black are singing.\\n\"\n ]\n }\n ],\n \"source\": [\n \"dupe_url = \\\"https://33.media.tumblr.com/88235b43b48e9823eeb3e7890f3d46ef/tumblr_nkg5leY4e21sof15vo1_500.gif\\\"\\n\",\n \"dupe_df = df[df['url'] == dupe_url]\\n\",\n \"\\n\",\n \"# let's take a look at this GIF and it's duplicated descriptions\\n\",\n \"for _, gif in dupe_df.iterrows():\\n\",\n \" HTML(f\\\"

\\\")\\n\",\n \" print(gif[\\\"description\\\"])\"\n ]\n }\n ],\n \"metadata\": {\n \"accelerator\": \"GPU\",\n \"colab\": {\n \"collapsed_sections\": [],\n \"name\": \"gif_search.ipynb\",\n \"provenance\": []\n },\n \"interpreter\": {\n \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n },\n \"kernelspec\": {\n \"display_name\": \"Python 3.9.12 ('ml')\",\n \"language\": \"python\",\n \"name\": \"python3\"\n },\n \"language_info\": {\n \"codemirror_mode\": {\n \"name\": \"ipython\",\n \"version\": 3\n },\n \"file_extension\": \".py\",\n \"mimetype\": \"text/x-python\",\n \"name\": \"python\",\n \"nbconvert_exporter\": \"python\",\n \"pygments_lexer\": \"ipython3\",\n \"version\": \"3.9.12\"\n },\n \"widgets\": {}\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}" } ``` With our data, we can move on to building the search pipeline. ## Search The search pipeline will at a high level take our natural language query like _“a dog talking on the phone”_ and search through the existing GIF descriptions for anything that has a similar _meaning_ to this query. In this context, we describe _meaning_ as _semantic similarity_, both of which are loaded terms and could refer to many things. For example, are the two phrases `"the dog eats lunch"` and `"the dog does not eat lunch"` similar? In this case, it would depend very much on our use case. Another example: which of the following two sentences are the most similar? A: the stock market took a turn for the worse B: how did the stock market do today? C: the stock market performed worse than expected If we wanted to find phrases with similar meaning then the obvious choice would be `A` and `C`. Matching those with `B` would make little sense. However, this is not the case if we are searching for similar _question-answer pairs_; in that case, `B` should match very closely with `A` and `C`. It’s important to identify what your use case requires in it’s definition of _“semantic similarity”_. For us, we really want to identify generic similarity. That is, we want `A` and `C` to match, and `B` to not match either of those. To do this, we will transform our phrases into [dense vector embeddings](https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/). These dense vectors can be stored in a [vector database](https://www.pinecone.io/learn/vector-database/) where we can very quickly compare vectors and identify those that are most similar based on metrics like Euclidean distance and cosine similarity. ![Euclidean Cosine](https://cdn.sanity.io/images/vr8gru94/production/9190296c44ceecc2ad22c1ae8e5efae71b956385-1725x724.png) The vector database handles the storage and fast search of our vector embeddings, but we still need a way to create these embeddings. To do that we use NLP transformer models called _retrievers_ that are [fine-tuned for creating](https://www.pinecone.io/learn/series/nlp/sentence-embeddings/) [sentence embeddings](https://www.pinecone.io/learn/series/nlp/sentence-embeddings/). These sentence embeddings/vectors are able to _numerically represent_ the _meaning_ behind the text that they represent. ![Retriever to Vector Space](https://cdn.sanity.io/images/vr8gru94/production/de76f1da638a1e4941ab3ff000512f2f7446f736-2457x904.png) Putting these two components together gives us a semantic search pipeline that we can use to retrieve semantically similar GIF descriptions given a query. [Video](https://d33wubrfki0l68.cloudfront.net/49ec0b831cc1f2c4f3a80863b5c24da15e3c723b/fef88/images/gif-search-4.mp4) GIF search pipeline covering the one-time indexing step (left) and querying (right). Let’s take a look at how we can put all of this together. ### Initializing Components We will start by initializing our retriever model. Many of the most powerful retrievers use a _sentence transformer_ architecture, which are best supported via the `sentence-transformers` library, installed via a `pip install sentence-transformers`. To find sentence transformer models we go to [huggingface.co/models](https://huggingface.co/models) and search for `sentence-transformers` for the _official_ sentence transformer models. However, there are other models we can use like the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) sentence transformer trained during a special event on [over 1B training pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We will use this model. ```json { "_key": "6038cfe7d221", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 11,\n \"metadata\": {\n \"colab\": {\n \"base_uri\": \"https://localhost:8080/\",\n \"height\": 465,\n \"referenced_widgets\": [\n \"3bbdb7ac9ddb4e61acbf10e0e322b464\",\n \"425825b26a384e158608f34c327e7be7\",\n \"9df1cc108ad74d84a52b25ca0e835197\",\n \"d9e3f2ddbf5e47ebb86fb20c39079354\",\n \"fe02dbbb561147198c3278388cb40d04\",\n \"02851f413fa34d12b412e036d785d38e\",\n \"1215630e8b53423390353cd56a044c6e\",\n \"beeffeb083574f0a8c702b6035474073\",\n \"a40d7b1c14944323937291f6a834061b\",\n \"2063122bcf15496987749c8cca733a8b\",\n \"02796d5819bd4b77b56c6f9cab93c908\",\n \"b69d8608edce436c826f87e269a3b1ec\",\n \"5e30b575e33843e9b68bc34a84c6e6b2\",\n \"a6941924044b4956afa6c9b458141007\",\n \"a2d42ce089df49b896ba9223f6df2ac2\",\n \"b329d7774a1a4817a4e9cea7adfb288d\",\n \"4bfd979f8e1a4bd889b846d6affb6d3f\",\n \"d2088a6432ed4403b9602c5fecb09246\",\n \"071f6d3e71694ddab08360cabf5e7ecd\",\n \"f8c2899ce8264b7fa7d5552783c71ad3\",\n \"850040d37e2b45ecaa53c82958dd9d6a\",\n \"08779212fc3c46629ea7b5ae778b282b\",\n \"97b11b05410f4b0682a223efd6eeb570\",\n \"b5732b4221f448a8b8ae2363849214eb\",\n \"c7c897d4905f4af28a4e0e8e72897f74\",\n \"9066e6a9f5384ceaba6d58353c3e0a6a\",\n \"5085c2de59b34ffe8902285e42f32401\",\n \"8387420e74e94768893a7415243d4a12\",\n \"b3fa9e20d0524f0bbbe65575ccd75126\",\n \"24d25ddd24084a49b47f02138db4b592\",\n \"b4e62b032b884838b46bcc6b62c0c3ca\",\n \"a54f0fc3dc184e20bf7660ddbf890a76\",\n \"b66d0aedfb414d4c8cac45e0f8d157ac\",\n \"d6e028d4612241fba005c9dd3fe12d30\",\n \"a765f4490ae44f959bc594bb3b3b719a\",\n \"0496b4ff5eec45f08c5a1fa1ef106777\",\n \"976ba55ccfc0419d9b357d83eb4e4ab9\",\n \"de5a4a6c3495413fa200ab92d876f509\",\n \"4fb7a30442f94068bdb7c9e434cd03b6\",\n \"242c1203c62b42b9ae32af653267f8c7\",\n \"780b68a97f184b8cb7a04e3412d050ab\",\n \"f6685ce8e5a843f0b856c64a5b71c7ee\",\n \"6d8f66edc20543b9baae3b474201861e\",\n \"7d4faf89e1a34c3592a8d8b5c332cf9a\",\n \"aaff54fb4ed64fbba356616ebad675c3\",\n \"1b85eb15f8584fd5afceddd1ea3ea2e0\",\n \"ba4b82c10e804d3ea76592f2d2a3832c\",\n \"6d2b705a90a447579864e91618f70676\",\n \"a0d67c99dee248f280cbe34fc3de48a6\",\n \"42ca2987c03f46e9985a98b4a137f95a\",\n \"3d044c94cbd741adbf826306bd882971\",\n \"9a612735a6b94182bafe9a0b42648455\",\n \"3965cf39aef641dbb2d0b2a362e1c8c2\",\n \"369f1ee2f1844327b3074a31c0e12519\",\n \"0e1839f9d1fc43b18f84d0f554a54cc2\",\n \"272faddd67724961a8bdcccba6b196b0\",\n \"922e13c3b6944e65a04d96147dd3c9ad\",\n \"aac3d96ceb444486b6f7f760b43ffc79\",\n \"04a7db113b11418ab275879f0f8bf162\",\n \"fd337d66d3a14867ae76908cd58e313b\",\n \"462563ff268e47d5a8c9efffe09604ed\",\n \"abbefd5572204fb4b3c0bf7c5597eb8e\",\n \"69e75d6d62494209b2ee828d1710a59e\",\n \"45240e5ad3a6426bbfc5e3c1880ac781\",\n \"67dfd1d8ff4043239f31bb3a53ce4ccc\",\n \"6aa46bc05e6b4a139009a4124b0f80ec\",\n \"9ab75ac0bf404362ab25e898b172110c\",\n \"005d8360dc824d1f869d0eca3b3ec9fd\",\n \"aa556238d9174c8e83187d079f229b97\",\n \"65e7a017905442ae98c6415bcdfe0bb4\",\n \"97302900f6604d7991df8c9854d95728\",\n \"0ebe49597de647a5a4da64d03c471b76\",\n \"40e00bdafe664e6082d5fc4392cbc4bc\",\n \"32719d55aa0e49d7bb65e7755fc2f572\",\n \"24eb84536c324805bf159da81d8ad509\",\n \"6565c0393fee4725abb04e245a9a6fb0\",\n \"2fe6b2e1c56b48fe9b056ee8d0d3697c\",\n \"e6e9f5860acc4fdda68eafe6abbc99db\",\n \"7f84fd3a40bb40dbb742101403e671c9\",\n \"48e79feebdc043adae21f571319493db\",\n \"2d07707632dc4c60bca7afbbeb928ca6\",\n \"71600400fa14406bb81e2e410d116681\",\n \"2540fac951cd4883833b586558c9081e\",\n \"759f82584cc84eedbbf8fb6142b0d657\",\n \"693abe44ba9a463ca9d08e1b6c54673b\",\n \"024be013b90d4a309b2f54d979f1269a\",\n \"0097b88603d8424f8fc1696b17bd165e\",\n \"a1d9893a7c7345a49cd336610e0ba3f5\",\n \"fefd20a4306b4dad840a79038db4c0a3\",\n \"c985a33ed0064c589e78a72080c382e7\",\n \"4e37e65238634a97a91e8449e24a99c2\",\n \"0bb8029a6b2e49de98586fa091033387\",\n \"fd631c1a5515475a873dceb33e21ac7f\",\n \"2b86f4f2dbf34220ab607cb010ad3ef7\",\n \"8da6bd57b0f44a65be01086f33313f2a\",\n \"24f2ae769e2c4d829dffa14ca8928d3b\",\n \"1e64de324773474d994258dbe9ed736f\",\n \"c7f564bd62324a048511ff82b8d4c369\",\n \"71e6d768200b48ddab9375d0266c6756\",\n \"fbd414ab7af64c2f840d19d1315085c4\",\n \"f2983bd01cab40469306b9bdea3b3b19\",\n \"825ae4cc8891473f8feb1b2820b0c0aa\",\n \"8d8f963d9b6c409d8c21e307958013d8\",\n \"8f2d54ebde5341808e65a705f15ba036\",\n \"1f7356b83d474a0f8ed49a4a5fae5d22\",\n \"55e099fba0fb41f999f690884f80843a\",\n \"2f0edbcbec1d44378f4bc17535a87b76\",\n \"6c709e1a7d174633b603df7bf0804756\",\n \"8b88e836a5884b3db7748d5971463e62\",\n \"9cb993dbaa5746249fed2df7d8db677c\",\n \"87b1f938065644cba745c0b0952526b5\",\n \"9d46acccbf1347c1bfd7b61e84591812\",\n \"8f240b5dc5274d63bb23ba9979d47163\",\n \"783b444819ab4454bceb16578c71df39\",\n \"b1c876108e6b4150bf3638fad1a56157\",\n \"6afdd5eede004470a8c7eb8709f6182b\",\n \"f4ae6887074f443299e95e6cebf971ce\",\n \"b3fbdd74fd764c65a25817690c961de6\",\n \"d141eee9a0f14702a82cfe1b194f9d89\",\n \"eeffd24564624122afc795be6bd1131b\",\n \"8075cad347314feabdb19d6902752b27\",\n \"b3c2f3d5b805460496a74f2009aaf81c\",\n \"4247478ac3b742a7a9407bc69ed179e6\",\n \"d3abd49739b2493b8d52a062e90ae3d6\",\n \"af1cdb5fdb31467aa54bc5b9c1ec255d\",\n \"4124c49ef7d9497187b8e29af694d54f\",\n \"1e7f1c9aa28e4dcd80f6b9c6b9c06fd6\",\n \"73c159f67bec448aa102470f1abd0ca7\",\n \"393b7df64cbd4072acf090f3dee2108c\",\n \"f2c52012dfd84e39851e297d56669213\",\n \"665cc77211d0466cbb2d4a2a95547547\",\n \"70ca64c48bad4c42ba0208eecfc3e40e\",\n \"176a01c7622143c38482371fe2f6cfaf\",\n \"3bf92cfae4b547ce9f91e6b4dd94f25a\",\n \"493c065a83ad48fc921b5b11c59c8fed\",\n \"c0b2ebd2a44845e885863ee18a4e121e\",\n \"c070c05c63ef48908126fac009b33ea4\",\n \"33c35634d79f4848a32e6bd46cb1a75b\",\n \"c0caab84bfcb4bb69ed70c73e5d20b59\",\n \"95fa10da3a1742ea9fb3ae3bf948618a\",\n \"00ba06783c7b484f91aceefd8be27782\",\n \"8b9224015a1a466f982f6d51e1a8591d\",\n \"e71eff0e7d2d4e6a9949c29395ac947d\",\n \"a05a15c5b5a34a5cbc7a0647a01826fb\",\n \"b4f4e484c3c94424a37f6746a590fffd\",\n \"e993d4821d984becbb06458f62252f1d\",\n \"93ba13309f954c669989147c0823c06b\",\n \"4c5388f0293f4bf384332ba0bc775b1c\",\n \"4d2d0d8ce7594980aef48885259dafef\",\n \"da474d5e97ef4947936ca0dd34a30126\",\n \"d0f5166f81f141a88d282b4eb322ec4c\",\n \"98bd1a3b82e74f8ab86645815d8ca526\",\n \"8a06eaff96f94f2fb8f95fc92ef87651\",\n \"123ad9cff4a74609a452474d73c6738c\"\n ]\n },\n \"id\": \"UB0rVxmppnkm\",\n \"outputId\": \"cd8ed4e8-69a0-4ce9-c974-cda0dca998a8\",\n \"scrolled\": true\n },\n \"outputs\": [\n {\n \"data\": {\n \"text/plain\": [\n \"SentenceTransformer(\\n\",\n \" (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel \\n\",\n \" (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\\n\",\n \" (2): Normalize()\\n\",\n \")\"\n ]\n },\n \"execution_count\": 11,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n }\n ],\n \"source\": [\n \"from sentence_transformers import SentenceTransformer\\n\",\n \"\\n\",\n \"# Initialize retriever with SentenceTransformer model \\n\",\n \"retriever = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\\n\",\n \"retriever\"\n ]\n }\n ],\n \"metadata\": {\n \"accelerator\": \"GPU\",\n \"colab\": {\n \"collapsed_sections\": [],\n \"name\": \"gif_search.ipynb\",\n \"provenance\": []\n },\n \"interpreter\": {\n \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n },\n \"kernelspec\": {\n \"display_name\": \"Python 3.9.12 ('ml')\",\n \"language\": \"python\",\n \"name\": \"python3\"\n },\n \"language_info\": {\n \"codemirror_mode\": {\n \"name\": \"ipython\",\n \"version\": 3\n },\n \"file_extension\": \".py\",\n \"mimetype\": \"text/x-python\",\n \"name\": \"python\",\n \"nbconvert_exporter\": \"python\",\n \"pygments_lexer\": \"ipython3\",\n \"version\": \"3.9.12\"\n },\n \"widgets\": {}\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}" } ``` There are a couple of important details here: - `max_sequence_length=128` means the model can read up to _128_ input tokens. - `word_embedding_size=384` actually refers to the _sentence_ embedding size. This means the model will output a 384-dimensional vector representation of the input text. For the short several-word GIF descriptions of our dataset a maximum sequence length of _128_ is _more than enough_. We need to use the _sentence_ embedding size when initializing our vector database, so we store that in the `embed_dim` variable above. To initialize our vector database we first need to sign up for a [free Pinecone API key](https://app.pinecone.io/) and install the Pinecone Python client via `pip install pinecone-client`. Once ready, we initialize: ```json { "_key": "6ca8db46b2b7", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 38,\n \"metadata\": {\n \"id\": \"Ngbs8wQQoePL\"\n },\n \"outputs\": [],\n \"source\": [\n \"import pinecone\\n\",\n \"\\n\",\n \"# Connect to pinecone environment\\n\",\n \"pinecone.init(\\n\",\n \" api_key=\\\"YOUR_API_KEY\\\",\\n\",\n \" environment=\\\"YOUR_ENV\\\" # find next to API key\\n\",\n \")\\n\",\n \"\\n\",\n \"index_name = 'gif-search'\\n\",\n \"\\n\",\n \"# check if the gif-search exists\\n\",\n \"if index_name not in pinecone.list_indexes():\\n\",\n \" # create the index if it does not exist\\n\",\n \" pinecone.create_index(\\n\",\n \" index_name,\\n\",\n \" dimension=384,\\n\",\n \" metric=\\\"cosine\\\"\\n\",\n \" )\\n\",\n \"\\n\",\n \"# Connect to gif-search index we created\\n\",\n \"index = pinecone.Index(index_name)\"\n ]\n }\n ],\n \"metadata\": {\n \"accelerator\": \"GPU\",\n \"colab\": {\n \"collapsed_sections\": [],\n \"name\": \"gif_search.ipynb\",\n \"provenance\": []\n },\n \"interpreter\": {\n \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n },\n \"kernelspec\": {\n \"display_name\": \"Python 3.9.12 ('ml')\",\n \"language\": \"python\",\n \"name\": \"python3\"\n },\n \"language_info\": {\n \"codemirror_mode\": {\n \"name\": \"ipython\",\n \"version\": 3\n },\n \"file_extension\": \".py\",\n \"mimetype\": \"text/x-python\",\n \"name\": \"python\",\n \"nbconvert_exporter\": \"python\",\n \"pygments_lexer\": \"ipython3\",\n \"version\": \"3.9.12\"\n },\n \"widgets\": {}\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}" } ``` Here, we are specifying an index name of `'gif-search'`; feel free to choose anything you like. It is simply a name. The `metric` is more important and depends on the model being used. For our chosen model we can see in its [model card](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) that it has been trained to use _cosine similarity_, hence why we have specified `metric='cosine'`. Alternative metrics include `euclidean` and `dotproduct`. We’ve initialized both our vector database and retriever components, so we can move on to embedding and indexing our data. ### Indexing The embedding and indexing process is much faster when we perform these steps for multiple records _in parallel_. However, we cannot process all of our records at once as the retriever model must shift everything it is embedding into on-chip memory, which is limited. To avoid this limit, while keeping indexing times as fast as possible, we process everything in batches of `64`. ```json { "_key": "7471ce824ee2", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 18,\n \"metadata\": {\n \"colab\": {\n \"base_uri\": \"https://localhost:8080/\",\n \"height\": 49,\n \"referenced_widgets\": [\n \"fd9832ba430449d6aefdf4f56603751a\",\n \"f18bf9eeb43f4dbcb3ad0a237da6308b\",\n \"8493032e71ed4437b1f221c6fcaac709\",\n \"702ec6e219d74d99a6c929dd65f549ec\",\n \"94843930b2ab44438afe2875c44fff96\",\n \"304c08deba6248da82971665e0caa20b\",\n \"10e79e8c05db4b36b5103116d1210587\",\n \"3cc568020c5a4f34992eec0d195ae4f9\",\n \"08d4c47abb954cc7b277769cb86bb9a9\",\n \"ef45e711623c4f9ba27743de2db32e07\",\n \"9752dafd1edb478fba9060f29882c47b\"\n ]\n },\n \"id\": \"TdwtDfIxw7Ik\",\n \"outputId\": \"32811e16-1934-474d-bd82-a12fccf29cf9\"\n },\n \"outputs\": [\n {\n \"data\": {\n \"application/vnd.jupyter.widget-view+json\": {\n \"model_id\": \"e1de609cb50e40cf953d312813e3c9d0\",\n \"version_major\": 2,\n \"version_minor\": 0\n },\n \"text/plain\": [\n \" 0%| | 0/1966 [00:00` elements using the metadata URLs to point to the correct GIFs. We do this using the `display_gif` function: ```python def display_gif(urls): figures = [] for url in urls: figures.append(f'''

''') return HTML(data=f'''

{''.join(figures)}

''') ``` Let’s test some queries. ```json { "_key": "101edb0d9e68", "_type": "colabBlock", "jsonContent": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 52,\n \"metadata\": {},\n \"outputs\": [\n {\n \"data\": {\n \"text/html\": [\n \"\\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 52,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n }\n ],\n \"source\": [\n \"gifs = search_gif(\\\"a dog being confused\\\")\\n\",\n \"display_gif(gifs)\"\n ]\n },\n {\n \"cell_type\": \"code\",\n \"execution_count\": 40,\n \"metadata\": {},\n \"outputs\": [\n {\n \"data\": {\n \"text/html\": [\n \"\\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 40,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n }\n ],\n \"source\": [\n \"gifs = search_gif(\\\"animals being cute\\\")\\n\",\n \"display_gif(gifs)\"\n ]\n },\n {\n \"cell_type\": \"code\",\n \"execution_count\": 49,\n \"metadata\": {},\n \"outputs\": [\n {\n \"data\": {\n \"text/html\": [\n \"\\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \\n\",\n \"

\\n\",\n \" \"\n ],\n \"text/plain\": [\n \"\"\n ]\n },\n \"execution_count\": 49,\n \"metadata\": {},\n \"output_type\": \"execute_result\"\n }\n ],\n \"source\": [\n \"gifs = search_gif(\\\"a fluffy dog being cute and dancing like a person\\\")\\n\",\n \"display_gif(gifs)\"\n ]\n }\n ],\n \"metadata\": {\n \"accelerator\": \"GPU\",\n \"colab\": {\n \"collapsed_sections\": [],\n \"name\": \"gif_search.ipynb\",\n \"provenance\": []\n },\n \"interpreter\": {\n \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n },\n \"kernelspec\": {\n \"display_name\": \"Python 3.9.12 ('ml')\",\n \"language\": \"python\",\n \"name\": \"python3\"\n },\n \"language_info\": {\n \"codemirror_mode\": {\n \"name\": \"ipython\",\n \"version\": 3\n },\n \"file_extension\": \".py\",\n \"mimetype\": \"text/x-python\",\n \"name\": \"python\",\n \"nbconvert_exporter\": \"python\",\n \"pygments_lexer\": \"ipython3\",\n \"version\": \"3.9.12\"\n },\n \"widgets\": {}\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 1\n}" } ``` That looks pretty accurate so we’ve managed to put this GIF search pipeline together very easily. With a [little added effort](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/gif-search/app.py) we can translate these steps in creating a web app using something like [Streamlit](https://www.youtube.com/watch?v=QpISF8gMsjQ). We’ve successfully built a GIF search tool using a simple semantic search pipeline with out-of-the-box models and Pinecone. This same pipeline can be applied across a variety of domains with very little tweaking. The ease-of-use and potential of both vector and semantic search have led to a lot of research and applications of both technologies beyond the world of big tech. If you’re interested in seeing other applications of this technology, or would like to share your own, considering joining the [Pinecone community](https://community.pinecone.io/). ## Resources [Article Notebooks and Scripts](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/gif-search/) [1] M. Osborne, [How Retail Brands Can Compete And Win Using Amazon’s Tactics](https://www.forbes.com/sites/forbesagencycouncil/2017/12/21/how-retail-brands-can-compete-and-win-using-amazons-tactics/) (2017), Forbes [2] L. Hardesty, [The history of Amazon’s recommendation algorithm](https://www.amazon.science/the-history-of-amazons-recommendation-algorithm) (2019), Amazon Science