Audio Search (Example)

This notebook shows how to use Pinecone's similarity search as a service to build an audio search application. Audio search could be used for things like finding songs and metadata within a catalog based on a sample, finding similar sounds in an audio library, or identifying who is speaking in an audio file.

We will index a set of audio recordings from YouTube videos in the form of vector embeddings. These vector embeddings are rich, mathematical representations of the audio recordings which make it possible to algorithmically determine how similar the recordings are to one another. We will then take some new (unseen) audio recordings and search through the index to find the most similar matches, along with their YouTube links.
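As a toy illustration of what "similarity between embeddings" means, a common measure is cosine similarity between two vectors. The vectors below are made up for the example; real AudioSet embeddings are 1280-dimensional:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the magnitudes;
    # 1.0 means the vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])   # identical direction -> similarity 1.0
c = np.array([0.0, 1.0, 0.0])   # orthogonal -> similarity 0.0

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```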

Dependencies

!pip install --quiet tensorflow
!pip install --quiet tensorflow_hub
!pip install --quiet progressbar2
!pip install --quiet tf_slim 
!pip install --quiet soundfile
!pip install --quiet resampy
!pip install --quiet pandas
import os
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from progressbar import progressbar
from datetime import timedelta
from IPython.display import Audio
from IPython.display import YouTubeVideo
import platform
import requests
# SoundFile depends on the system library libsndfile.
# If you are running on Linux, install it with apt.
if platform.system() == 'Linux':
    !sudo apt-get -q -y install libsndfile1

Load data

We will use the AudioSet dataset and precomputed embeddings generated with a VGG-inspired acoustic model (VGGish).

!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz -q --show-progress
features.tar.gz     100%[===================>]   2.41G  74.9MB/s    in 35s     
!tar xzf features.tar.gz

Here you can decide whether you want to use a smaller balanced dataset, which contains 22,176 segments from distinct videos, or the larger unbalanced dataset with 2,042,985 segments.

If you want a quick demo of how the Pinecone index works for audio search, we suggest using the balanced dataset. The unbalanced dataset requires more time and local memory to complete several steps in the notebook, and more pods to index all the audio.

USE_FULL_DATASET = False

if USE_FULL_DATASET:
    DIR = "audioset_v1_embeddings/unbal_train/*"
else:
    DIR = "audioset_v1_embeddings/bal_train/*"

Pinecone Installation and Setup

!pip install --quiet -U pinecone-client
import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

Get a Pinecone API key if you don’t have one already.

Define a New Pinecone Index

# Pick a name for the new index
index_name = 'audio-search-demo'
# If an index with the same name already exists, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create an index

Pinecone distributes the indexed items over a set of pods. Each p1 pod, which we use here, can handle about 1GB of data. Thus, depending on the amount of data, we will create the index with an appropriate number of pods.

if USE_FULL_DATASET:
    NUM_OF_PODS = 7
else:
    NUM_OF_PODS = 1
    
pinecone.create_index(name=index_name, dimension=1280, pods=NUM_OF_PODS)

Connect to the new index

index = pinecone.Index(index_name)
index.describe_index_stats()
{'dimension': 1280, 'namespaces': {}}

Upload

As the majority of files contain 10 frames (more than 96% in both the balanced and unbalanced datasets), we will exclude all the records with a different number of frames, since Pinecone's vector index requires all vector embeddings to have the same dimension.
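Each AudioSet segment stores one 128-dimensional embedding per roughly one-second frame, so a 10-frame segment flattens to the 1280-dimensional vector our index expects. A quick shape check with random stand-in data:

```python
import numpy as np

n_frames, frame_dim = 10, 128

# Stand-in for one decoded segment: 10 frames of 128 int8 values each,
# like the arrays np.frombuffer produces from the raw AudioSet bytes.
audio_frame = np.random.randint(-128, 128, size=(n_frames, frame_dim)).astype(np.float32)

flat = audio_frame.flatten()
print(flat.shape)  # (1280,) -- matches the index dimension defined above
```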

raw_dataset = tf.data.TFRecordDataset(tf.data.Dataset.list_files(DIR))
items_to_upload = []

for _, raw_record in progressbar(enumerate(raw_dataset)):
    record = tf.train.SequenceExample()
    record.ParseFromString(raw_record.numpy())

    video_id = "{}_{}_{}".format(record.context.feature['video_id'].bytes_list.value[0].decode("utf-8") ,
                                 record.context.feature['start_time_seconds'].float_list.value[0],
                                 record.context.feature['end_time_seconds'].float_list.value[0])
    
    n_frames = len(record.feature_lists.feature_list['audio_embedding'].feature)
    if n_frames == 10:
        audio_frame = []
        for i in range(n_frames):
            audio_frame.append(np.frombuffer(record.feature_lists.feature_list["audio_embedding"].feature[i].bytes_list.value[0], dtype=np.int8))
      
        audio_frame = np.array(audio_frame, dtype=np.float32)

        items_to_upload.append((video_id, audio_frame.flatten().tolist()))
| |                        #                      | 22159 Elapsed Time: 0:00:21

Upload the items to the index. We perform an upsert operation, which inserts a vector if its ID is new and overwrites the vector if the ID already exists.
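Conceptually, an upsert behaves like key assignment in a map: writing the same ID twice overwrites the vector rather than creating a duplicate. A toy illustration in plain Python (not the Pinecone API):

```python
# A dict stands in for the index: IDs map to vectors.
toy_index = {}

toy_index["vid_0.0_10.0"] = [0.1, 0.2]   # first upsert inserts the record
toy_index["vid_0.0_10.0"] = [0.3, 0.4]   # second upsert overwrites it

print(len(toy_index))              # 1 -- still a single record
print(toy_index["vid_0.0_10.0"])   # [0.3, 0.4]
```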

import itertools

def chunks(iterable, batch_size=50):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))
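As a quick sanity check, the batching helper splits an iterable into fixed-size tuples, with a final shorter batch if the length does not divide evenly. The helper is redefined here so the snippet is self-contained:

```python
import itertools

def chunks(iterable, batch_size=50):
    # Same helper as above: yield successive batches as tuples.
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

sizes = [len(batch) for batch in chunks(range(7), batch_size=3)]
print(sizes)  # [3, 3, 1]
```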
# Upload items
for batch in chunks(items_to_upload, 100):
    index.upsert(vectors=batch)
index.describe_index_stats()
{'dimension': 1280, 'namespaces': {'': {'vector_count': 21782}}}

Search

Let's search for a few audio recordings. First, we will pick three vector embeddings from our index collection, use them as queries, retrieve the most relevant items, and present the related YouTube videos. Later, we will pick an arbitrary audio recording, transform it into a vector embedding, and query our index.

# Sort items to make the test selection deterministic
items_to_upload = sorted(items_to_upload, key=lambda item: item[0], reverse=True)
# Define test audios
test_audios = []

if USE_FULL_DATASET:
    test_audios = items_to_upload[1000::20000][:3]
else:
    test_audios = items_to_upload[::2500][:3]
def check_video_url(video_id):
    checker_url = "https://www.youtube.com/oembed?url=http://www.youtube.com/watch?v="
    video_url = checker_url + video_id
    response = requests.get(video_url)
    return response.status_code == 200 and '"status":"UNPLAYABLE"' not in requests.get("http://www.youtube.com/watch?v="+video_id).text

def play_video(video_id, start_time):
    start=int(timedelta(seconds=start_time).total_seconds())
    return YouTubeVideo(video_id, start=start, autoplay=0, theme="light", color="red", width=400, height=300)
    
def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
    
def get_similar_audios(audio_embedding: np.ndarray):
    # Query the vector index
    query_results = index.query(queries=[audio_embedding], top_k=10)
    result = query_results.results[0].matches
    ids = [res.id for res in result]
    scores = [res.score for res in result]
    df_result = pd.DataFrame({
              'id':['_'.join(id.split('_')[:-2]) for id in ids],
              'start_time': [str(id.split('_')[-2]) for id in ids],
              'end_time': [str(id.split('_')[-1]) for id in ids],
              'score': scores,
              'url': ['https://www.youtube.com/watch?v={}&t={}'.format('_'.join(id.split('_')[:-2]), 
                                                                  int(float(id.split('_')[-2]))) for id in ids]})
    
    # Exclude all the videos that are not available or private
    # Skip the first result, as the same record exists in the vector index,
    # and keep the next three playable videos
    df_result = df_result[df_result['id'].map(check_video_url) == True].reset_index(drop=True)[1:4]

    # Make url column clickable
    df_result_styler = df_result.style.format({'url': make_clickable})
      
    print('\n\n Most similar audios based on Pinecone vector search: \n')
    display(df_result_styler)
      
    for i, row in df_result.iterrows():
        print(f'\n{i}.')
        display(play_video(row.id, int(float(row.start_time))))
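The helpers above recover the video ID and times by joining all but the last two underscore-separated parts, since YouTube IDs can themselves contain underscores. An equivalent, arguably clearer way to split a record ID from the right (a standalone sketch; `parse_record_id` is a hypothetical helper, not used elsewhere in this notebook):

```python
def parse_record_id(record_id: str):
    # Record IDs have the form "<video_id>_<start>_<end>"; splitting from
    # the right preserves any underscores inside the YouTube video ID.
    video_id, start, end = record_id.rsplit('_', 2)
    return video_id, float(start), float(end)

print(parse_record_id("WwFYNmTS41I_10.0_20.0"))   # ('WwFYNmTS41I', 10.0, 20.0)
print(parse_record_id("a_b_c_30.0_40.0"))         # ('a_b_c', 30.0, 40.0)
```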
for i, test_audio in enumerate(test_audios):
    vn = '_'.join(test_audio[0].split('_')[:-2])
    vs = test_audio[0].split('_')[-2]

    vs = int(float(vs))
    print(f'\n\n\n (Example {i+1})\n Audio example: https://www.youtube.com/watch?v={vn}&t={vs}\n')
    display(play_video(vn, vs))
    get_similar_audios(test_audio[1])
 (Example 1)
 Audio example: https://www.youtube.com/watch?v=zzya4dDVRLk&t=30
 Most similar audios based on Pinecone vector search: 
  id start_time end_time score url
1 WwFYNmTS41I 10.0 20.0 0.239653 https://www.youtube.com/watch?v=WwFYNmTS41I&t=10
2 JoP-iqBMmi4 10.0 20.0 0.236945 https://www.youtube.com/watch?v=JoP-iqBMmi4&t=10
3 jllMYE8-NVE 280.0 290.0 0.228751 https://www.youtube.com/watch?v=jllMYE8-NVE&t=280
1.
2.
3.
 (Example 2)
 Audio example: https://www.youtube.com/watch?v=rTbY6xcjV34&t=510
 Most similar audios based on Pinecone vector search: 
  id start_time end_time score url
1 3eEeMSPta40 150.0 160.0 0.195225 https://www.youtube.com/watch?v=3eEeMSPta40&t=150
2 4fwUzavktVI 420.0 430.0 0.168778 https://www.youtube.com/watch?v=4fwUzavktVI&t=420
3 SQIGFcCMVKo 60.0 70.0 0.159955 https://www.youtube.com/watch?v=SQIGFcCMVKo&t=60
1.
2.
3.
 (Example 3)
 Audio example: https://www.youtube.com/watch?v=jfxTOlXF3Kk&t=100
 Most similar audios based on Pinecone vector search: 
  id start_time end_time score url
1 zSq2D_GF00o 90.0 100.0 0.192909 https://www.youtube.com/watch?v=zSq2D_GF00o&t=90
2 sGM6xX5laFU 30.0 40.0 0.191072 https://www.youtube.com/watch?v=sGM6xX5laFU&t=30
3 Q4pQKIHhsJk 130.0 140.0 0.186766 https://www.youtube.com/watch?v=Q4pQKIHhsJk&t=130
1.
2.
3.

Once finished with testing, clear the embeddings list to free up RAM.

items_to_upload.clear()

Test on an arbitrary wav file

Here we pick an arbitrary audio recording, transform it into a vector embedding, query our index, and present the related YouTube videos.

Clone the TensorFlow models repository, which we will use to preprocess the wav file.

!git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Enumerating objects: 68944, done.
remote: Counting objects: 100% (84/84), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 68944 (delta 50), reused 75 (delta 42), pack-reused 68860
Receiving objects: 100% (68944/68944), 577.23 MiB | 31.86 MiB/s, done.
Resolving deltas: 100% (48579/48579), done.

Download the model checkpoint and parameters needed for preprocessing.

!wget https://storage.googleapis.com/audioset/vggish_model.ckpt -q --show-progress
!wget https://storage.googleapis.com/audioset/vggish_pca_params.npz -q --show-progress
vggish_model.ckpt   100%[===================>] 277.62M  27.3MB/s    in 9.0s    
vggish_pca_params.n 100%[===================>]  71.31K  --.-KB/s    in 0.01s   

Download the test file.

!curl -o sample-file.wav https://storage.googleapis.com/audioset/yamalyzer/audio/acoustic-guitar.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2411k  100 2411k    0     0  7932k      0 --:--:-- --:--:-- --:--:-- 7906k
# Define wav file name parameter
wav_file_name = 'sample-file.wav'
# Listen to the wav file
Audio(wav_file_name)