# Making YouTube Search Better with NLP


James Briggs · 2023-06-30

YouTube is a cultural phenomenon. The first video, _“Me at the zoo”_, was uploaded in 2005: a 19-second clip of YouTube co-founder Jawed Karim at the zoo. It was a uniquely ordinary glimpse into another person’s life, and, back then, this type of content had not really been seen before.

Today’s world is different. 30,000 hours of video are uploaded to YouTube _every hour_, and more than one _billion_ hours of video are watched daily [1][2].

Technology and culture have advanced and become ever more entangled. Some of the most significant technological breakthroughs are integrated so tightly into our culture that we never even notice they’re there.

One of those is AI-powered search. It powers your Google results, your Netflix recommendations, and the ads you see everywhere, and it is being rapidly woven into all aspects of our lives. It is also a young technology whose full potential remains unknown.

This technology weaves directly into the cultural phenomenon of YouTube. Imagine a search engine like Google that gave you rapid access to billions of hours of YouTube content. No other source in the world offers that volume of highly engaging video content [3].

[All supporting notebooks and scripts can be found here](https://github.com/pinecone-io/examples/tree/master/learn/search/semantic-search/yt-search).

### Data for Search

[Video](https://www.youtube.com/watch?v=FzLIIwiaXSU)


To power this technology, we will need data. We will use the [YTTTS Speech Collection dataset from Kaggle](https://www.kaggle.com/datasets/ryanrudes/yttts-speech?resource=download). The dataset is organized into a set of directories containing folders named by video IDs.

Inside each video ID directory, we find more directories where each represents a timestamp start and end. Those timestamp directories contain a _subtitles.txt_ file containing the text from that timestamp range.

![yttts-dataset-structure](https://cdn.sanity.io/images/vr8gru94/production/389e919c361b74f5b587ef3b9ccc834a9abbe830-1580x902.png)


We can extract the transcriptions, their start/end timestamps, and even the video URL (using the ID).

The original dataset is excellent, but we do need to make some changes for it to better suit our use case. The code for downloading and processing this [dataset can be found here](https://github.com/pinecone-io/examples/tree/master/learn/search/semantic-search/yt-search/00-data-build.ipynb).

---

_If you prefer, this step can be skipped by downloading the processed dataset with:_

```python
from datasets import load_dataset  # pip install datasets

ytt = load_dataset(
    "pinecone/yt-transcriptions",
    split="train",
    revision="926a45"
)
```

First, we need to extract the data from the _subtitles.txt_ files. We do this by iterating through the directory names, structured by video IDs and timestamps.

```python
import os
import time

from tqdm.auto import tqdm

documents = []
for video_id in tqdm(video_ids):
    splits = sorted(os.listdir(f"data/{video_id}"))
    # we start at 00:00:00
    start_timestamp = "00:00:00"
    passage = ""
    for s in splits:
        with open(f"data/{video_id}/{s}/subtitles.txt") as f:
            # append text to the current chunk
            passage += " " + f.read()
        # the average sentence is 75-100 characters long, so we cut
        # off at roughly 3-4 sentences
        if len(passage) > 360:
            # we've hit the needed length, so we create a record;
            # first, extract the end timestamp from the directory name
            end_timestamp = s.split("-")[1].split(",")[0]
            # parse the timestamp strings into time structs
            start = time.strptime(start_timestamp, "%H:%M:%S")
            end = time.strptime(end_timestamp, "%H:%M:%S")
            # extract the second/minute/hour values and convert to
            # a total number of seconds
            start_second = start.tm_sec + start.tm_min*60 + start.tm_hour*3600
            end_second = end.tm_sec + end.tm_min*60 + end.tm_hour*3600
            # save this to the documents list
            documents.append({
                "video_id": video_id,
                "text": passage,
                "start_second": start_second,
                "end_second": end_second,
                "url": f"https://www.youtube.com/watch?v={video_id}&t={start_second}s",
            })
            # the next chunk starts where this one ended
            start_timestamp = end_timestamp
            # refresh the passage
            passage = ""
```

A look at `documents[:3]` shows the structure of each record (texts truncated for brevity):

```python
[{'video_id': 'ZPewmEu7644',
  'text': " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data ...",
  'start_second': 0,
  'end_second': 20,
  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s'},
 {'video_id': 'ZPewmEu7644',
  'text': ' often see them use for they can definitely generate other types of images but they can also work on tabular data ...',
  'start_second': 20,
  'end_second': 41,
  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=20s'},
 {'video_id': 'ZPewmEu7644',
  'text': " in the generator that actually generates the data another area that we are seeing ganz use for a great deal is in the area of semi supervised training ...",
  'start_second': 41,
  'end_second': 64,
  'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=41s'}]
```
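The trickiest part of the loop above is the timestamp arithmetic, which we can sanity-check in isolation. A small sketch, assuming directory names of the form `start-end` with comma-separated millisecond suffixes (the format implied by the `split("-")`/`split(",")` calls; the `to_seconds` helper below is our own, not part of the original notebook):

```python
import time

def to_seconds(ts: str) -> int:
    """Convert an 'HH:MM:SS' string into a total number of seconds."""
    t = time.strptime(ts, "%H:%M:%S")
    return t.tm_sec + t.tm_min * 60 + t.tm_hour * 3600

# the end timestamp is extracted exactly as in the loop above
dirname = "00:00:00,000-00:01:30,500"
end_timestamp = dirname.split("-")[1].split(",")[0]  # '00:01:30'
print(to_seconds(end_timestamp))  # 90
```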

We now have the _core_ data for building our search tool, but it would be nice to include video titles and thumbnails in search results.

Retrieving this data is as simple as scraping the title and thumbnail for each record using the `url` feature and Python’s _BeautifulSoup_ package.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

metadata = {}
for _id in tqdm(video_ids):
    r = requests.get(f"https://www.youtube.com/watch?v={_id}")
    soup = BeautifulSoup(r.content, 'lxml')  # the lxml parser is used here
    try:
        title = soup.find("meta", property="og:title").get("content")
        thumbnail = soup.find("meta", property="og:image").get("content")
        metadata[_id] = {"title": title, "thumbnail": thumbnail}
    except Exception as e:
        # a few videos are missing the og: tags, so we fall back
        # to empty strings
        print(e)
        print(_id)
        metadata[_id] = {"title": "", "thumbnail": ""}

len(metadata)
```

For the video behind our first record, this gives us:

```python
metadata['ZPewmEu7644']
{'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',
 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}
```

We need to merge the data we pulled from the YTTTS dataset and this metadata.

```python
for i, doc in enumerate(documents):
    _id = doc['video_id']
    meta = metadata[_id]
    # add metadata to the existing doc
    documents[i] = {**doc, **meta}
```

Each record now carries the full set of fields:

```python
documents[0]
{'video_id': 'ZPewmEu7644',
 'text': " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University ...",
 'start_second': 0,
 'end_second': 20,
 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s',
 'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',
 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}
```

That leaves us with _11,298_ sentence-to-paragraph-length video transcriptions. With these, we are ready to move on to building the video search pipeline.

## Retrieval Pipeline

Our video search relies on a subdomain of NLP called semantic search. There are many approaches to semantic search, but, at a high level, all of them retrieve _contexts_ (sentences or paragraphs) that seem to answer a _query_.

![indexing-querying](https://cdn.sanity.io/images/vr8gru94/production/f4944a9c5abde6c54872b09e04ff2d536b7e4b5f-3360x1897.png)


Retrieving contexts requires two components, a _vector database_ and a _retriever_ model, both of which are used for indexing and retrieving data.

### Vector Database

The vector database acts as our data storage and retrieval component. It stores vector representations of our text data that can be retrieved using another vector. We will use the Pinecone vector database.

Although we use a small sample here, any meaningful coverage of YouTube would require us to scale to billions of records. Pinecone’s vector database enables this through **A**pproximate **N**earest **N**eighbor **S**earch (ANNS). Using ANNS, we can restrict each search to a small subset of the index, avoiding the excessive cost of comparing the query against (potentially) billions of vectors.
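To see concretely what ANNS is approximating, consider the exhaustive alternative: scoring the query against every vector in the index. A minimal NumPy sketch with random stand-in vectors (toy data, not the actual index internals):

```python
import numpy as np

def exact_search(query: np.ndarray, index: np.ndarray, top_k: int = 3):
    """Exhaustive cosine-similarity search: O(n * d) work per query."""
    # normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    ix = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = ix @ q  # one similarity score per indexed vector
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 768))  # stand-in for 768-d embeddings
query = vectors[42] + rng.normal(scale=0.1, size=768)  # a noisy copy of #42
ids, scores = exact_search(query, vectors)
print(ids[0])  # 42, the nearest neighbor
```

Scanning 10,000 vectors this way is instant; scanning billions per query is not, which is why ANNS indexes limit each search to a promising subset of the index at the cost of occasionally missing an exact nearest neighbor.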

To initialize the database, we sign up for a [free Pinecone API key](https://app.pinecone.io/) and `pip install pinecone-client`. Once ready, we initialize our index with:

```python
import pinecone  # pip install pinecone-client

# connect to pinecone (get API key and env at app.pinecone.io)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
# create index
pinecone.create_index(
    'youtube-search',
    dimension=768,
    metric='cosine'
)
# connect to the new index
index = pinecone.Index('youtube-search')
```

When creating the index, we pass:

- The index name; we use `'youtube-search'`, but it can be anything.
- The vector `dimension`: the dimensionality of the embeddings stored in the index, which must match the output dimensionality of the _retriever_ (more on this soon).
- The retrieval `metric`, which defines how the proximity of vectors is calculated. We use `'cosine'` similarity, matching the metric the retriever was optimized for (again, more later).

We have our index, but we’re missing a key detail. How do we go from the transcription text we have now to vector representations for our vector database? We need a retriever model.

### Retriever Model

The retriever is a transformer model trained specifically to embed sentences and paragraphs into a meaningful vector space. By meaningful, we mean that passages with similar semantics (like question-answer pairs) are embedded into nearby vectors, while unrelated passages end up far apart.

![Retriever vectors](https://cdn.sanity.io/images/vr8gru94/production/9fe3d60bdfd5505fc5b97e5d12ab92d1b3073a13-2030x806.png)


From this, we can place these vectors into our vector database. When we have a query, we use the same retriever model to create a query vector. This query vector is used to retrieve the most similar (already indexed) context vectors.

![Similarity search](https://cdn.sanity.io/images/vr8gru94/production/65a145a375fcae632c722e5e894804ff5ab638e5-1662x800.png)


We can load a [pre-existing retriever model](https://huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base) from the _sentence-transformers_ library (`pip install sentence-transformers`).

```python
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer(
    'flax-sentence-embeddings/all_datasets_v3_mpnet-base'
)
retriever
```

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```

The model details show that it outputs vectors of dimensionality `768`. They do not include the similarity metric the model was optimized for; that information can usually be found on the [model card](https://huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base) (if in doubt, cosine is most common).

### Indexing

With both the vector database and retriever initialized, we can begin embedding our records and inserting them into the index. We do this in batches of `32`.

```python
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(ytt), batch_size)):
    i_end = min(i + batch_size, len(ytt))
    # extract batch from the YT transcriptions data
    batch = ytt[i:i_end]
    # encode the batch of text
    embeds = retriever.encode(batch['text']).tolist()
    # each snippet needs a unique ID; we merge the video ID
    # and start_second for this
    ids = [f"{x[0]}-{x[1]}" for x in zip(batch['video_id'], batch['start_second'])]
    # create metadata records
    meta = [{
        'video_id': x[0],
        'title': x[1],
        'text': x[2],
        'start_second': x[3],
        'end_second': x[4],
        'url': x[5],
        'thumbnail': x[6]
    } for x in zip(
        batch['video_id'],
        batch['title'],
        batch['text'],
        batch['start_second'],
        batch['end_second'],
        batch['url'],
        batch['thumbnail']
    )]
    # create a list of (ID, vector, metadata) tuples and upsert
    to_upsert = list(zip(ids, embeds, meta))
    # add to pinecone
    index.upsert(vectors=to_upsert)
```

```python
index.describe_index_stats()
{'dimension': 768,
 'index_fullness': 0.01,
 'namespaces': {'': {'vector_count': 11298}}}
```

Once we’re finished indexing our data, we can check that all records have been added using `index.describe_index_stats()` or via the [Pinecone dashboard](https://app.pinecone.io/).

![Pinecone dashboard](https://cdn.sanity.io/images/vr8gru94/production/5623204a5841f51093f58895ae237e8817c83442-1920x1080.png)


## Querying

Everything has been initialized and indexed. All that is left to do is query. To do this, we create a query like `"what is deep learning?"`, embed it using our retriever, and query via `index.query`.

```python
query = "What is deep learning?"

xq = retriever.encode([query]).tolist()

xc = index.query(xq, top_k=3, include_metadata=True)
for context in xc['results'][0]['matches']:
    print(context['metadata']['text'], end="\n---\n")
```

```
 terms of optimization but what's the algorithm for updating the parameters or updating whatever the state of the network is and then the the last part is the the data set like how do you actually represent the world as it comes into your machine learning system so I think of deep learning as telling us something about what does the model look like and basically to qualify as deep I
---
 any theoretical components any theoretical things that you need to understand about deep learning can be sick later for that link again just watched the word doc file again in that I mentioned the link also the second channel is my channel because deep learning might be complete deep learning playlist that I have created is completely in order okay to the other
---
 under a rock for the last few years you have heard of the deep networks and how they have revolutionised computer vision and kind of the standard classic way of doing this is it's basically a classic supervised learning problem you are giving a network which you can think of as a big black box a pairs of input images and output labels XY pairs okay and this big black box essentially you
---
```

Within the `index.query` method, we pass our query vector `xq`, the _top_k_ number of similar context vectors to return, and that we’d like to return metadata.

Inside that metadata, we have several important features: `title`, `url`, `thumbnail`, and `start_second`. We can build a user-friendly interface using these features and a framework like Streamlit with [straightforward code](https://github.com/pinecone-io/examples/tree/master/learn/search/semantic-search/yt-search/app.py).
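To illustrate how those fields might come together, here is a hedged sketch: a hypothetical `render_result_card` helper (not from the linked app code) that formats one Pinecone match as Markdown, which Streamlit could display via `st.markdown`:

```python
def render_result_card(match: dict) -> str:
    """Format one query match as a Markdown result card."""
    meta = match['metadata']
    # convert start_second into an mm:ss label for display
    mins, secs = divmod(int(meta['start_second']), 60)
    return (
        f"### [{meta['title']}]({meta['url']})\n"
        f"![thumbnail]({meta['thumbnail']})\n\n"
        f"*Starts at {mins}:{secs:02d}* {meta['text'][:120]}..."
    )

# example match in the metadata shape used above
match = {'metadata': {
    'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',
    'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s',
    'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg',
    'start_second': 0,
    'text': ' hi this is Jeff Dean welcome to applications of deep neural networks',
}}
card = render_result_card(match)
```

The timestamped `url` means that clicking a result drops the user into the video at the exact second the matching passage begins.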

[Video](https://d33wubrfki0l68.cloudfront.net/7401d33b8ebaffda495786422a02c621d59fbcaa/da439/images/youtube-search-6.mp4)


The Streamlit-built YouTube search demo; try it yourself [here](https://share.streamlit.io/pinecone-io/playground/yt-search/src/server.py).

The fields of NLP and vector search are experiencing a renaissance as increasing interest and application generate more research, which fuels even greater interest and application of the technology.

In this walkthrough, we have demoed one use case that, despite its simplicity, can be incredibly useful and engaging. As the adoption of NLP and vector search continues to grow, more use cases will appear and embed themselves into our daily lives, just as Google search and Netflix recommendations have done in the past, becoming an ever-greater influence in the world.

## Resources

[Article Notebooks and Scripts](https://github.com/pinecone-io/examples/tree/master/learn/search/semantic-search/yt-search)

[1] L. Ceci, [Hours of video uploaded to YouTube every minute](https://www.statista.com/statistics/259477/hours-of-video-uploaded-to-youtube-every-minute/) (2022), Statista

[2] C. Goodrow, [You know what’s cool? A billion hours](https://blog.youtube/news-and-events/you-know-whats-cool-billion-hours/) (2017), YouTube Blog

[3] A. Hayes, [State of Video Marketing report](https://www.wyzowl.com/video-marketing-statistics/) (2022), Wyzowl