Audio Similarity Search

This notebook shows how to use Pinecone’s similarity search as a service to build an audio search application. Audio search could be used for things like finding songs and their metadata in a catalog based on a sample, finding similar sounds in an audio library, or detecting who is speaking in an audio file.

We will index a set of audio recordings from YouTube videos in the form of vector embeddings. These vector embeddings are rich, mathematical representations of the audio recordings which make it possible to determine algorithmically how similar the recordings are to one another.
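For intuition, similarity between two embeddings is typically measured with cosine similarity, the metric we will use for the index below. A toy sketch with made-up vectors (the vectors and function name are illustrative, not part of the dataset):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical direction, near 0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy "embeddings": similar recordings point in similar directions.
clip_a = np.array([0.9, 0.1, 0.3])
clip_b = np.array([0.8, 0.2, 0.25])   # close in direction to clip_a
clip_c = np.array([-0.7, 0.9, -0.4])  # very different

assert cosine_similarity(clip_a, clip_b) > cosine_similarity(clip_a, clip_c)
```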

We will then take some new (unseen) audio recordings and search through the index to find the most similar matches, along with their YouTube links.

Open Notebook in Google Colab

Dependencies

!pip install --quiet tensorflow
!pip install --quiet tensorflow_hub
!pip install --quiet progressbar2
!pip install --quiet tf_slim 
!pip install --quiet soundfile
!pip install --quiet resampy
!pip install --quiet pandas
import os
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from progressbar import progressbar
from datetime import timedelta
from IPython.display import Audio
from IPython.display import YouTubeVideo
import platform
import requests
# SoundFile depends on the system library libsndfile. 
# In case you are using Linux, this package needs to be installed.
if platform.system() == 'Linux':
    !sudo apt-get -q -y install libsndfile1
Reading package lists...
Building dependency tree...
Reading state information...
libsndfile1 is already the newest version (1.0.28-4ubuntu0.18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.

Load data

We will use the AudioSet dataset and premade embeddings that are generated using the VGG-inspired acoustic model.

!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz -q --show-progress
features.tar.gz     100%[===================>]   2.41G  51.9MB/s    in 50s     
!tar xzf features.tar.gz

Here you can decide whether you want to use a smaller balanced dataset, which contains 22,176 segments from distinct videos, or the larger unbalanced dataset with 2,042,985 segments.

If you want a quick demo of how the Pinecone index works for audio search, we suggest using the balanced dataset. The unbalanced dataset requires more time and local memory to complete several steps in the notebook, and more shards to index all the audio.

USE_FULL_DATASET = False

if USE_FULL_DATASET:
    DIR = "audioset_v1_embeddings/unbal_train/*"
else:
    DIR = "audioset_v1_embeddings/bal_train/*"

Pinecone Installation and Setup

!pip install --quiet -U pinecone-client
import pinecone

# Load Pinecone API key
api_key = '<<< YOUR API KEY>>>'
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.

Define a New Pinecone Index

# Pick a name for the new index
index_name = 'audio-search-demo'
# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create an index

Pinecone distributes the indexed items over a set of nodes, which are called index shards. Each index shard can handle 1GB of data. Thus, depending on the amount of data, we will create an index with an appropriate number of shards.
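As a back-of-the-envelope check (our own arithmetic, ignoring index overhead, metadata, and the service's actual capacity accounting): each vector holds 10 × 128 = 1280 float32 values, about 5 KB, so the balanced dataset's raw payload fits comfortably in one shard.

```python
import math

BYTES_PER_VECTOR = 10 * 128 * 4  # 1280 float32 values per embedding
SHARD_CAPACITY = 1024 ** 3       # 1 GB per shard

def estimate_shards(num_vectors):
    """Rough raw-payload shard estimate; real capacity accounting differs."""
    return max(1, math.ceil(num_vectors * BYTES_PER_VECTOR / SHARD_CAPACITY))

estimate_shards(22_176)  # balanced dataset: a single shard suffices
```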

if USE_FULL_DATASET:
    NUM_OF_SHARDS = 7
else:
    NUM_OF_SHARDS = 1
    
pinecone.create_index(name=index_name, metric='cosine', shards=NUM_OF_SHARDS)
{'msg': '', 'success': True}

Connect to the new index

index = pinecone.Index(name = index_name, response_timeout=300)
index.info()
InfoResult(index_size=0)

Upload

As the majority of records contain 10 frames (more than 96% in both the balanced and unbalanced datasets), we will exclude all records with a different number of frames. Pinecone’s vector index requires all vector embeddings to have the same dimension.
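Concretely, each record we keep has 10 frames of 128-dimensional VGGish features, so every flattened vector has 10 × 128 = 1280 dimensions. A quick sketch with a zero-filled placeholder record:

```python
import numpy as np

N_FRAMES = 10    # roughly one second of audio per frame in AudioSet
FRAME_DIM = 128  # VGGish embedding size per frame

# Each record's frames are stacked and flattened into one fixed-length vector.
frames = np.zeros((N_FRAMES, FRAME_DIM), dtype=np.float32)
vector = frames.flatten()
assert vector.shape == (1280,)  # every indexed vector shares this dimension
```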

raw_dataset = tf.data.TFRecordDataset(tf.data.Dataset.list_files(DIR))
items_to_upload = []

for raw_record in progressbar(raw_dataset):
    record = tf.train.SequenceExample()
    record.ParseFromString(raw_record.numpy())

    video_id = "{}_{}_{}".format(record.context.feature['video_id'].bytes_list.value[0].decode("utf-8") ,
                                 record.context.feature['start_time_seconds'].float_list.value[0],
                                 record.context.feature['end_time_seconds'].float_list.value[0])
    
    n_frames = len(record.feature_lists.feature_list['audio_embedding'].feature)
    if n_frames == 10:
        audio_frame = []
        for i in range(n_frames):
            audio_frame.append(np.frombuffer(record.feature_lists.feature_list["audio_embedding"].feature[i].bytes_list.value[0], dtype=np.int8))
      
        audio_frame = np.array(audio_frame,  dtype=np.float32)

        items_to_upload.append((video_id, audio_frame.flatten()))

Upload the items to the index. We perform an upsert operation, which inserts a new vector or updates an existing vector if its ID is already in the index.

# Upload items
acks = index.upsert(items=items_to_upload)
index.info()
InfoResult(index_size=21782)
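For the full (unbalanced) dataset, uploading everything in one call can strain memory. A minimal batching sketch that reuses the same index.upsert(items=...) call shown above; the helper name and batch size are our own choices:

```python
def upsert_in_batches(index, items, batch_size=1000):
    """Upload items in fixed-size batches.

    Hypothetical helper: assumes `index` exposes the same
    upsert(items=...) method used elsewhere in this notebook.
    """
    for start in range(0, len(items), batch_size):
        index.upsert(items=items[start:start + batch_size])
```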

Let’s search for a few audio recordings. First, we will pick three vector embeddings from our indexed collection, use them as queries, retrieve the most relevant items, and present the related YouTube videos. Later, we will pick an arbitrary audio recording, transform it into a vector embedding, and query our index.

# Sort items to avoid randomness when testing
items_to_upload = sorted(items_to_upload, key=lambda item: item[0], reverse=True)
# Define test audios
test_audios = []

if USE_FULL_DATASET:
    test_audios = items_to_upload[1000::20000][:3]
else:
    test_audios = items_to_upload[::2500][:3]
test_audios
[('zzya4dDVRLk_30.0_40.0',
  array([-100.,   92.,   66., ...,  100.,   51., -125.], dtype=float32)),
 ('rTbY6xcjV34_510.0_520.0',
  array([  54., -124.,  -70., ...,  106.,  -75.,   -1.], dtype=float32)),
 ('jfxTOlXF3Kk_100.0_110.0',
  array([-101.,   77.,  112., ..., -118.,   -1., -111.], dtype=float32))]
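The item IDs above encode '<video_id>_<start>_<end>'. Since YouTube video IDs can themselves contain underscores, the helpers below split from the right. A small illustrative parser (parse_item_id is our own name, not part of the notebook's code):

```python
def parse_item_id(item_id: str):
    """Split '<video_id>_<start>_<end>' from the right, since YouTube
    video IDs may contain underscores."""
    video_id, start, end = item_id.rsplit('_', 2)
    return video_id, float(start), float(end)

assert parse_item_id('zzya4dDVRLk_30.0_40.0') == ('zzya4dDVRLk', 30.0, 40.0)
```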
def check_video_url(video_id):
    checker_url = "https://www.youtube.com/oembed?url=http://www.youtube.com/watch?v="
    video_url = checker_url + video_id
    response = requests.get(video_url)
    return response.status_code == 200 and '"status":"UNPLAYABLE"' not in requests.get("http://www.youtube.com/watch?v="+video_id).text

def play_video(video_id, start_time):
    start=int(timedelta(seconds=start_time).total_seconds())
    return YouTubeVideo(video_id, start=start, autoplay=0, theme="light", color="red", width=400, height=300)
    
def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
    
def get_similar_audios(audio_embedding: np.ndarray):
    # Query the vector index
    query_results = index.query(queries=[audio_embedding], top_k=10, disable_progress_bar=True)

    for res in query_results:
        df_result = pd.DataFrame({
              'id': [vid.rsplit('_', 2)[0] for vid in res.ids],
              'start_time': [vid.rsplit('_', 2)[1] for vid in res.ids],
              'end_time': [vid.rsplit('_', 2)[2] for vid in res.ids],
              'score': res.scores,
              'url': ['https://www.youtube.com/watch?v={}&t={}'.format(vid.rsplit('_', 2)[0],
                                                                       int(float(vid.rsplit('_', 2)[1]))) for vid in res.ids]})
      
        # Exclude all the videos that are not available or private
        # Keep top 4 videos that can be played
        # Exclude first result as the same record exists in vector index
        df_result = df_result[df_result['id'].map(check_video_url) == True].reset_index(drop=True)[1:4]

        # Make url column clickable
        df_result_styler = df_result.style.format({'url': make_clickable})
      
        print('\n\n Most similar audios based on Pinecone vector search: \n')
        display(df_result_styler)
      
        for i, row in df_result.iterrows():
            print(f'\n{(i)}.')
            display(play_video(row.id, int(float(row.start_time))))
for i, test_audio in enumerate(test_audios):
    vn, vs, _ = test_audio[0].rsplit('_', 2)

    vs = int(float(vs))
    print(f'\n\n\n (Example {i+1})\n Audio example: https://www.youtube.com/watch?v={vn}&t={vs}\n')
    display(play_video(vn, vs))
    get_similar_audios(test_audio[1])
 (Example 1)
 Audio example: https://www.youtube.com/watch?v=zzya4dDVRLk&t=30
 Most similar audios based on Pinecone vector search: 
    id           start_time  end_time  score     url
1   WwFYNmTS41I  10.0        20.0      0.239653  https://www.youtube.com/watch?v=WwFYNmTS41I&t=10
2   JoP-iqBMmi4  10.0        20.0      0.236945  https://www.youtube.com/watch?v=JoP-iqBMmi4&t=10
3   jllMYE8-NVE  280.0       290.0     0.228751  https://www.youtube.com/watch?v=jllMYE8-NVE&t=280
 (Example 2)
 Audio example: https://www.youtube.com/watch?v=rTbY6xcjV34&t=510
 Most similar audios based on Pinecone vector search: 
    id           start_time  end_time  score     url
1   3eEeMSPta40  150.0       160.0     0.195225  https://www.youtube.com/watch?v=3eEeMSPta40&t=150
2   4fwUzavktVI  420.0       430.0     0.168778  https://www.youtube.com/watch?v=4fwUzavktVI&t=420
3   SQIGFcCMVKo  60.0        70.0      0.159955  https://www.youtube.com/watch?v=SQIGFcCMVKo&t=60
 (Example 3)
 Audio example: https://www.youtube.com/watch?v=jfxTOlXF3Kk&t=100
 Most similar audios based on Pinecone vector search: 
    id           start_time  end_time  score     url
1   zSq2D_GF00o  90.0        100.0     0.192909  https://www.youtube.com/watch?v=zSq2D_GF00o&t=90
2   sGM6xX5laFU  30.0        40.0      0.191072  https://www.youtube.com/watch?v=sGM6xX5laFU&t=30
3   LNoDqTBH4QU  30.0        40.0      0.186682  https://www.youtube.com/watch?v=LNoDqTBH4QU&t=30

Once finished with testing, delete the embeddings list to free up RAM.

items_to_upload.clear()

Test on Arbitrary WAV File

Here we pick an arbitrary audio recording, transform it into a vector embedding, query our index, and present the related YouTube videos.

Clone the TensorFlow models repository, which we will use to preprocess the WAV file.

!git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Enumerating objects: 56913, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 56913 (delta 11), reused 32 (delta 0), pack-reused 56859
Receiving objects: 100% (56913/56913), 572.46 MiB | 29.07 MiB/s, done.
Resolving deltas: 100% (39330/39330), done.

Download models and parameters needed for preprocessing.

!wget https://storage.googleapis.com/audioset/vggish_model.ckpt -q --show-progress
!wget https://storage.googleapis.com/audioset/vggish_pca_params.npz -q --show-progress
vggish_model.ckpt   100%[===================>] 277.62M  44.5MB/s    in 6.2s    
vggish_pca_params.n 100%[===================>]  71.31K  --.-KB/s    in 0s      

Download the test file.

!curl -o sample-file.wav https://storage.googleapis.com/audioset/yamalyzer/audio/acoustic-guitar.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2411k  100 2411k    0     0  5013k      0 --:--:-- --:--:-- --:--:-- 5013k
# Define wav file name parameter
wav_file_name = 'sample-file.wav'
# Listening to the wav file
Audio(wav_file_name)

Create a .tfrecord file that contains embeddings for our sample wav file.

%%capture
!python models/research/audioset/vggish/vggish_inference_demo.py --wav_file "$wav_file_name" --tfrecord_file "sample-audio.tfrecord"

Create embeddings

# Create embeddings from tfrecord file
raw_dataset = tf.data.TFRecordDataset("sample-audio.tfrecord")

for raw_record in raw_dataset.take(1):
    example = tf.train.SequenceExample()
    example.ParseFromString(raw_record.numpy())
audio_frame = []

# We used 10 frames in uploaded audios
# Query vector must have the same dimensions
for i in range(10):
    audio_frame.append(np.frombuffer(example.feature_lists.feature_list["audio_embedding"].feature[i].bytes_list.value[0], dtype=np.int8))

audio_frame = np.array(audio_frame,  dtype=np.float32)
sample_embedding = audio_frame.flatten()

Query

Here we query the index with the recording’s vector embedding and present the related YouTube videos. Recall that we care about the audio of these videos: listen for how similar their audio is to the query audio.

# Query the vector index and display the results
query_results = index.query(queries=[sample_embedding], top_k=10)

print('\n Test audio file:\n')
display(Audio(wav_file_name))

for res in query_results:
    df_result = pd.DataFrame({
        'id': [vid.rsplit('_', 2)[0] for vid in res.ids],
        'start_time': [vid.rsplit('_', 2)[1] for vid in res.ids],
        'end_time': [vid.rsplit('_', 2)[2] for vid in res.ids],
        'score': res.scores,
        'url': ['https://www.youtube.com/watch?v={}&t={}'.format(vid.rsplit('_', 2)[0],
                                                                 int(float(vid.rsplit('_', 2)[1]))) for vid in res.ids]})
    
    df_result = df_result[df_result['id'].map(check_video_url) == True].reset_index(drop=True)[:5]
    df_result.index += 1 
    df_result_styler = df_result.style.format({'url': make_clickable})
    print('\n\n Most similar audios based on Pinecone vector search: \n')
    display(df_result_styler)
        
    for i, row in df_result.iterrows():
        print(f'\n{(i)}.')
        display(play_video(row.id, int(float(row.start_time))))