YouTube is a cultural phenomenon. The first video “Me at the zoo” was uploaded in 2005. It is a 19 second clip of YouTube’s co-founder Jawed Karim at the zoo. This was a uniquely ordinary insight into another person’s life, and, back then, this type of content had not really been seen before.
Today’s world is different. 30,000 hours of video are uploaded to YouTube every hour, and more than one billion hours of video are watched daily.
Technology and culture have advanced and become ever more entangled. Some of the most significant technological breakthroughs are integrated so tightly into our culture that we never even notice they’re there.
One of those is AI-powered search. It powers your Google results, your Netflix recommendations, and the ads you see everywhere. It is rapidly being woven into all aspects of our lives. Further, this is a new technology; its full potential is unknown.
This technology weaves directly into the cultural phenomenon of YouTube. Imagine a search engine like Google that allows you to rapidly access the billions of hours of YouTube content. There is no comparison to that level of highly engaging video content in the world.
Data for Search
To power this technology, we will need data. We will use the YTTTS Speech Collection dataset from Kaggle. The dataset is organized into a set of directories containing folders named by video IDs.
Inside each video ID directory, we find more directories where each represents a timestamp start and end. Those timestamp directories contain a subtitles.txt file containing the text from that timestamp range.
Dataset directory structure. Containing video IDs > timestamps > subtitles.
We can extract the transcriptions, their start/end timestamps, and even the video URL (using the ID).
The original dataset is excellent, but we do need to make some changes for it to better suit our use case. The code for downloading and processing this dataset can be found here.
If you prefer, this step can be skipped by downloading the processed dataset with:
First, we need to extract the data from the subtitles.txt files. We do this by iterating through the directory names, structured by video IDs and timestamps.
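The directory walk described above could be sketched roughly as follows. This is a minimal illustration rather than the original processing code: the function name `read_transcripts`, the record keys, the `youtu.be` URL form, and the demo directory names are all assumptions, and the timestamp directory name is kept as a raw string because its exact format is not given here.

```python
import tempfile
from pathlib import Path


def read_transcripts(root):
    """Collect one record per <video_id>/<timestamp>/subtitles.txt file."""
    records = []
    for video_dir in sorted(Path(root).iterdir()):
        if not video_dir.is_dir():
            continue
        for ts_dir in sorted(video_dir.iterdir()):
            subtitle_file = ts_dir / "subtitles.txt"
            if not subtitle_file.is_file():
                continue
            records.append({
                "video_id": video_dir.name,
                # Raw directory name; parsing it depends on the real format.
                "timestamp": ts_dir.name,
                "text": subtitle_file.read_text(encoding="utf-8").strip(),
                # Assumed URL form built from the video ID.
                "url": f"https://youtu.be/{video_dir.name}",
            })
    return records


# Build a tiny stand-in tree so the sketch runs without the real dataset.
# The video ID and timestamp folder names below are made up.
demo_root = Path(tempfile.mkdtemp())
clip_dir = demo_root / "abc123XYZ_0" / "00-00-05_00-00-12"
clip_dir.mkdir(parents=True)
(clip_dir / "subtitles.txt").write_text("hello and welcome\n", encoding="utf-8")

records = read_transcripts(demo_root)
print(records[0])
```

Pointing `read_transcripts` at the real dataset root should yield one record per timestamp directory, provided the layout matches the structure described above.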
We now have the core data for building our search tool, but it would be nice to include video titles and thumbnails in search results.
Retrieving this data is as simple as scraping the title and thumbnail for each record using the url feature and Python’s BeautifulSoup package.
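As a rough, dependency-free illustration of the scraping step, the sketch below swaps BeautifulSoup for the standard library's `html.parser` and reads the title and thumbnail from Open Graph `<meta>` tags. It assumes the video page exposes `og:title` and `og:image` tags, and it parses a made-up HTML string instead of fetching a live page; `OpenGraphParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]


def extract_metadata(html):
    """Return the title and thumbnail URL found in the page, or None."""
    parser = OpenGraphParser()
    parser.feed(html)
    return {
        "title": parser.properties.get("og:title"),
        "thumbnail": parser.properties.get("og:image"),
    }


# Stand-in page source; a real run would download the HTML for each url.
sample_html = """
<html><head>
<meta property="og:title" content="Intro to Deep Learning">
<meta property="og:image" content="https://example.com/thumb.jpg">
</head><body></body></html>
"""
metadata = extract_metadata(sample_html)
print(metadata)
```

With BeautifulSoup, the same lookup would be a search for the matching `meta` tags in the parsed page; the logic is otherwise unchanged.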
We need to merge the data we pulled from the YTTTS dataset and this metadata.
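One way to picture the merge is a join on the video ID. The sketch below is an assumption about how that could look with plain dictionaries; the key names and the choice to drop transcripts that have no scraped metadata are illustrative, not taken from the original code.

```python
def merge_metadata(records, metadata_by_id):
    """Attach scraped metadata to each transcript record by video ID."""
    merged = []
    for record in records:
        meta = metadata_by_id.get(record["video_id"])
        if meta is None:
            # Illustrative choice: skip transcripts without metadata.
            continue
        merged.append({**record, **meta})
    return merged


# Made-up sample data standing in for the two sources.
transcripts = [
    {"video_id": "vid_a", "text": "deep learning is a subset of ML"},
    {"video_id": "vid_b", "text": "this clip has no scraped metadata"},
]
scraped = {
    "vid_a": {"title": "Intro to Deep Learning",
              "thumbnail": "https://example.com/a.jpg"},
}

merged = merge_metadata(transcripts, scraped)
print(merged)
```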
That leaves us with 11298 sentence-to-paragraph length video transcriptions. Using this, we’re now ready to move on to developing the video search pipeline.
Our video search relies on a subdomain of NLP called semantic search. There are many approaches to semantic search, but at a high level it is the retrieval of contexts (sentences/paragraphs) that seem to answer a query.
Indexing and querying pipeline with the retriever and vector database components.
Retrieving contexts requires two components, a vector database and a retriever model, both of which are used for indexing and retrieving data.
The vector database acts as our data storage and retrieval component. It stores vector representations of our text data that can be retrieved using another vector. We will use the Pinecone vector database.
Although we use a small sample here, any meaningful coverage of YouTube would require us to scale to billions of records. Pinecone’s vector database allows this through Approximate Nearest Neighbors Search (ANNS). Using ANNS, we can restrict our search scope to a small subset of the index, avoiding the excessive complexity of comparing (potentially) billions of vectors.
To initialize the database, we sign up for a free Pinecone API key and pip install pinecone-client. Once ready, we initialize our index with:
When creating the index, we pass:
- The index name, here we use 'youtube-search' but it can be anything.
- dimension, the dimensionality of vector embeddings stored in the index, which must align with the retriever dimensionality (more on this soon).
- metric, describing the method for calculating the proximity of vectors; here we use 'cosine' similarity, which aligns to the retriever output (again, more later).
We have our index, but we’re missing a key detail. How do we go from the transcription text we have now to vector representations for our vector database? We need a retriever model.
The retriever is a transformer model specially trained to embed sentences/paragraphs into a meaningful vector space. By meaningful, we mean that sentences with similar semantic meaning (like question-answer pairs) are embedded close to one another in that vector space.
The retriever model encodes semantically related phrases into a similar vector space.
From this, we can place these vectors into our vector database. When we have a query, we use the same retriever model to create a query vector. This query vector is used to retrieve the most similar (already indexed) context vectors.
When given a query vector, the vector database handles the search and retrieval of similar context vectors.
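To make the retrieval step concrete, here is a toy brute-force version of what the vector database does conceptually: score every stored vector against the query with cosine similarity and return the closest ones. This is only an illustration with made-up two-dimensional vectors; the function names are hypothetical, and a real vector database would use approximate search rather than comparing against every record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the k (id, score) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), record_id)
        for record_id, vector in indexed.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(record_id, score) for score, record_id in scored[:k]]


# Toy "index": real context vectors would come from the retriever model.
indexed = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
results = top_k([1.0, 0.0], indexed, k=2)
print(results)
```

Here the query points in the same direction as `"a"` and almost the same direction as `"b"`, so those two are returned ahead of `"c"`.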
We can load a pre-existing retriever model from the sentence-transformers library (pip install sentence-transformers).
Now we can see the model details, including that it outputs vectors of dimensionality 768. These details do not include the similarity metric that the model is optimized to use. That information can often be found via the model card (if in doubt, cosine is most common).
With both our vector database and retriever initialized, we can begin embedding our transcriptions and inserting the resulting vectors into the vector database. We will do this in batches.
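The batching loop might look something like the sketch below. To keep it runnable without an API key or a downloaded model, the embedding and upsert steps are passed in as plain callables; in the walkthrough these would be the retriever's encoding call and the index's upsert call. The helper names, the record keys, the `(id, vector, metadata)` tuple layout, and the batch size of 2 are assumptions for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(records, embed, upsert, batch_size):
    """Embed and insert records batch by batch; return the count indexed."""
    total = 0
    for batch in batched(records, batch_size):
        vectors = embed([record["text"] for record in batch])
        upsert([
            (record["id"], vector, {"text": record["text"]})
            for record, vector in zip(batch, vectors)
        ])
        total += len(batch)
    return total


# Stand-ins for the retriever and the vector database.
def fake_embed(texts):
    return [[float(len(text))] for text in texts]


store = []
records = [{"id": f"clip-{i}", "text": "word " * (i + 1)} for i in range(5)]
count = index_in_batches(records, fake_embed, store.extend, batch_size=2)
print(count, len(store))
```

Batching keeps each request small and lets the retriever encode several passages at once, which is usually faster than embedding them one by one.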
Once we’re finished indexing our data, we can check that all records have been added using index.describe_index_stats() or via the Pinecone dashboard.
We can see the index details from the Pinecone dashboard.
Everything has been initialized and indexed. All that is left to do is query. To do this, we create a query like "what is deep learning?", embed it using our retriever, and query via index.query. Within the index.query method, we pass our query vector xq, the top_k number of similar context vectors to return, and a flag indicating that we’d like to return metadata.
Inside that metadata, we have several important features, including start_second. We can build a user-friendly interface using these features and a framework like Streamlit with straightforward code.
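As a small illustration of how returned metadata could feed an interface, the sketch below turns matches into display lines with a link that jumps to the right moment in the video. Only `start_second` is named in the text above; the `title` and `url` keys, the shape of each match (a dictionary with `score` and `metadata`), and both function names are assumptions.

```python
def timestamped_link(url, start_second):
    """Append a t=<seconds> parameter so the video opens at that moment."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}t={int(start_second)}"


def format_matches(matches):
    """Build one display line per match: score, title, timestamped link."""
    lines = []
    for match in matches:
        meta = match["metadata"]
        link = timestamped_link(meta["url"], meta["start_second"])
        lines.append(f"{match['score']:.2f}  {meta['title']}  {link}")
    return lines


# Made-up matches standing in for a query response.
matches = [
    {"score": 0.8123, "metadata": {
        "title": "Intro to Deep Learning",
        "url": "https://youtu.be/abc123XYZ_0",
        "start_second": 42,
    }},
]
lines = format_matches(matches)
print("\n".join(lines))
```

A front end such as Streamlit could render each line as a card, using the thumbnail alongside the title and link.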
The fields of NLP and vector search are experiencing a renaissance as increasing interest and application generate more research, which fuels even greater interest and application of the technology.
In this walkthrough, we have demoed one use case that, despite its simplicity, can be incredibly useful and engaging. As the adoption of NLP and vector search continues to grow, more use cases will appear and embed themselves into our daily lives, just as Google search and Netflix recommendations have done in the past, becoming an ever-greater influence in the world.
L. Ceci, Hours of video uploaded to YouTube every minute (2022), Statista
C. Goodrow, You know what’s cool? A billion hours (2017), YouTube Blog
A. Hayes, State of Video Marketing report (2022), Wyzowl