Audio Similarity Search

This notebook shows how to use Pinecone’s similarity search as a service to build an audio search application. Audio search could be used for things like finding songs and metadata within a catalog based on a sample, finding similar sounds in an audio library, or detecting who’s speaking in some audio file.

We will index a set of audio recordings from YouTube videos in the form of vector embeddings. These vector embeddings are rich mathematical representations of the audio recordings that make it possible to algorithmically measure how similar the recordings are to one another.
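
For intuition, the similarity between two embeddings is computed with a vector metric; we will build the index below with cosine similarity. Here is a minimal NumPy sketch (the toy 2-dimensional vectors are for illustration only; the real embeddings will have 1,280 dimensions):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated) vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707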

We will then take some new (unseen) audio recordings and search through the index to find the most similar matches, along with their YouTube links.


Dependencies

!pip install --quiet tensorflow
!pip install --quiet tensorflow_hub
!pip install --quiet progressbar2
!pip install --quiet tf_slim 
!pip install --quiet soundfile
!pip install --quiet resampy
!pip install --quiet pandas
import os
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from progressbar import progressbar
from datetime import timedelta
from IPython.display import Audio
from IPython.display import YouTubeVideo
import platform
import requests

# SoundFile depends on the system library libsndfile.
# If you are running on Linux, install it with apt:
if platform.system() == 'Linux':
    !sudo apt-get -q -y install libsndfile1

Load data

We will use the AudioSet dataset and its premade embeddings, which were generated with VGGish, a VGG-inspired acoustic model.

!wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/features/features.tar.gz -q --show-progress

!tar xzf features.tar.gz

Here you can decide whether you want to use the smaller balanced dataset, which contains 22,176 segments from distinct videos, or the larger unbalanced dataset with 2,042,985 segments.

If you want a quick demo of how the Pinecone index works for audio search, we suggest using the balanced dataset. The unbalanced dataset requires more time and local memory to complete several steps in the notebook, and more shards to index all of the audio.

USE_FULL_DATASET = False

if USE_FULL_DATASET:
    DIR = "audioset_v1_embeddings/unbal_train/*"
else:
    DIR = "audioset_v1_embeddings/bal_train/*"

Pinecone Installation and Setup

!pip install --quiet -U pinecone-client
import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

Get a Pinecone API key if you don’t have one already.

Define a New Pinecone Index

# Pick a name for the new index
index_name = 'audio-search-demo'
# Check whether an index with the same name already exists, and delete it if so
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create an index

Pinecone distributes the indexed items over a set of nodes, which are called index shards. Each index shard can handle 1GB of data. Thus, depending on the amount of data, we will create an index with an appropriate number of shards.
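
For a rough sense of the arithmetic, you can estimate a shard count from the number of vectors. The sketch below assumes each vector costs its raw float32 size (1,280 values × 4 bytes = 5KB), which is only a crude upper bound on what the index actually needs, so treat the result as a ballpark rather than a rule:

import math

# Back-of-the-envelope sizing -- assumes raw float32 storage, not Pinecone's
# actual on-disk format, so this is a rough upper bound only
def estimate_shards(num_vectors, dimension=1280, bytes_per_value=4, shard_bytes=1e9):
    return max(1, math.ceil(num_vectors * dimension * bytes_per_value / shard_bytes))

print(estimate_shards(22_176))  # balanced dataset: ~114MB of raw floats -> 1 shard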

if USE_FULL_DATASET:
    NUM_OF_SHARDS = 7
else:
    NUM_OF_SHARDS = 1

pinecone.create_index(name=index_name, dimension=1280, metric='cosine', shards=NUM_OF_SHARDS)

Connect to the new index

index = pinecone.Index(index_name)
index.describe_index_stats()
InfoResult(index_size=0)

Upload

As the majority of files contain 10 frames (more than 96% in both the balanced and unbalanced datasets), we will exclude all records with a different number of frames. Pinecone’s vector index requires all vector embeddings to have the same dimension.
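
Each frame in the AudioSet features is a 128-dimensional embedding covering roughly one second of audio, so a 10-frame segment flattens into a 1,280-dimensional vector, matching the dimension=1280 we gave the index. A quick toy shape check:

# Toy sanity check: 10 frames of 128 dims flatten into one 1280-dim vector
frames = np.zeros((10, 128), dtype=np.float32)
assert frames.flatten().shape == (1280,)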

raw_dataset = tf.data.TFRecordDataset(tf.data.Dataset.list_files(DIR))

items_to_upload = []

for raw_record in progressbar(raw_dataset):
    record = tf.train.SequenceExample()
    record.ParseFromString(raw_record.numpy())

    # Build a unique item id: <YouTube video id>_<start time>_<end time>
    video_id = "{}_{}_{}".format(record.context.feature['video_id'].bytes_list.value[0].decode("utf-8"),
                                 record.context.feature['start_time_seconds'].float_list.value[0],
                                 record.context.feature['end_time_seconds'].float_list.value[0])

    n_frames = len(record.feature_lists.feature_list['audio_embedding'].feature)
    if n_frames == 10:
        # Decode the ten quantized 128-dimensional frame embeddings
        audio_frame = []
        for i in range(n_frames):
            audio_frame.append(np.frombuffer(record.feature_lists.feature_list["audio_embedding"].feature[i].bytes_list.value[0], dtype=np.int8))

        audio_frame = np.array(audio_frame, dtype=np.float32)

        # Flatten to a single 1280-dimensional vector and store with its id
        items_to_upload.append((video_id, audio_frame.flatten().tolist()))

Now we upload the items to the index. We use the upsert operation, which inserts each new vector and overwrites any vector whose id already exists.

import itertools

def chunks(iterable, batch_size=50):
    """Break an iterable into tuples of up to batch_size items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Upload items
for batch in chunks(items_to_upload, 100):
    index.upsert(vectors=batch)

index.describe_index_stats()

Let’s search for a few audio recordings. First, we will pick three vector embeddings from our index collection, use them as queries to retrieve the most relevant items, and present the related YouTube videos. Later, we will pick an arbitrary audio file, transform it into a vector embedding, and query our index with it.

# Sort items so that test selection is deterministic
items_to_upload = sorted(items_to_upload, key=lambda item: item[0], reverse=True)

# Define test audios
test_audios = []

if USE_FULL_DATASET:
    test_audios = items_to_upload[1000::20000][:3]
else:
    test_audios = items_to_upload[::2500][:3]

def check_video_url(video_id):
    # A video counts as usable if YouTube's oEmbed endpoint recognizes it
    # and the watch page does not mark it as unplayable
    oembed_url = "https://www.youtube.com/oembed?url=http://www.youtube.com/watch?v=" + video_id
    if requests.get(oembed_url).status_code != 200:
        return False
    watch_page = requests.get("http://www.youtube.com/watch?v=" + video_id).text
    return '"status":"UNPLAYABLE"' not in watch_page

def play_video(video_id, start_time):
    # Embed the YouTube player, starting playback at the segment's start time
    return YouTubeVideo(video_id, start=int(start_time), autoplay=0, theme="light", color="red", width=400, height=300)
    
def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)
    
def get_similar_audios(audio_embedding: np.array):
    # Query the vector index for the 10 nearest neighbors
    query_results = index.query(queries=[audio_embedding], top_k=10)
    result = query_results.results[0].matches
    ids = [res.id for res in result]
    scores = [res.score for res in result]
    # Each item id has the form <video_id>_<start>_<end>; video ids may themselves
    # contain underscores, so only the last two fields are split off
    df_result = pd.DataFrame({
              'id': [id.rsplit('_', 2)[0] for id in ids],
              'start_time': [id.rsplit('_', 2)[1] for id in ids],
              'end_time': [id.rsplit('_', 2)[2] for id in ids],
              'score': scores,
              'url': ['https://www.youtube.com/watch?v={}&t={}'.format(id.rsplit('_', 2)[0],
                                                                       int(float(id.rsplit('_', 2)[1]))) for id in ids]})
    
    # Exclude all the videos that are unavailable or private,
    # keep only playable results, and drop the first match,
    # since it is the query item itself (the top 3 remain)
    df_result = df_result[df_result['id'].map(check_video_url)].reset_index(drop=True)[1:4]

    # Make url column clickable
    df_result_styler = df_result.style.format({'url': make_clickable})
      
    print('\n\n Most similar audios based on Pinecone vector search: \n')
    display(df_result_styler)
      
    for i, row in df_result.iterrows():
        print(f'\n{i}.')
        display(play_video(row.id, int(float(row.start_time))))

for i, test_audio in enumerate(test_audios):
    # Recover the video id and start time from the item id
    vn = test_audio[0].rsplit('_', 2)[0]
    vs = int(float(test_audio[0].rsplit('_', 2)[1]))

    print(f'\n\n\n (Example {i+1})\n Audio example: https://www.youtube.com/watch?v={vn}&t={vs}\n')
    display(play_video(vn, vs))
    get_similar_audios(test_audio[1])
 (Example 1)
 Audio example: https://www.youtube.com/watch?v=zzya4dDVRLk&t=30


 Most similar audios based on Pinecone vector search:
id           start_time  end_time  score     url
WwFYNmTS41I  10.0        20.0      0.239653  https://www.youtube.com/watch?v=WwFYNmTS41I&t=10
JoP-iqBMmi4  10.0        20.0      0.236945  https://www.youtube.com/watch?v=JoP-iqBMmi4&t=10
jllMYE8-NVE  280.0       290.0     0.228751  https://www.youtube.com/watch?v=jllMYE8-NVE&t=280
 (Example 2)
 Audio example: https://www.youtube.com/watch?v=rTbY6xcjV34&t=510

 Most similar audios based on Pinecone vector search:
id           start_time  end_time  score     url
3eEeMSPta40  150.0       160.0     0.195225  https://www.youtube.com/watch?v=3eEeMSPta40&t=150
4fwUzavktVI  420.0       430.0     0.168778  https://www.youtube.com/watch?v=4fwUzavktVI&t=420
SQIGFcCMVKo  60.0        70.0      0.159955  https://www.youtube.com/watch?v=SQIGFcCMVKo&t=60
 (Example 3)
 Audio example: https://www.youtube.com/watch?v=jfxTOlXF3Kk&t=100


 Most similar audios based on Pinecone vector search:
id           start_time  end_time  score     url
zSq2D_GF00o  90.0        100.0     0.192909  https://www.youtube.com/watch?v=zSq2D_GF00o&t=90
sGM6xX5laFU  30.0        40.0      0.191072  https://www.youtube.com/watch?v=sGM6xX5laFU&t=30
LNoDqTBH4QU  30.0        40.0      0.186682  https://www.youtube.com/watch?v=LNoDqTBH4QU&t=30

Once finished with testing, clear the embeddings list to free up RAM.

items_to_upload.clear()

Test on Arbitrary WAV File

Here we pick an arbitrary audio recording, transform it into a vector embedding, query our index, and present the related YouTube videos.

Clone the TensorFlow models repository, which contains the VGGish code we will use to preprocess the wav file.

!git clone https://github.com/tensorflow/models.git

Download the VGGish model checkpoint and PCA parameters needed for preprocessing.

!wget https://storage.googleapis.com/audioset/vggish_model.ckpt -q --show-progress
!wget https://storage.googleapis.com/audioset/vggish_pca_params.npz -q --show-progress

Download the test file.

!curl -o sample-file.wav https://storage.googleapis.com/audioset/yamalyzer/audio/acoustic-guitar.wav

# Define wav file name parameter
wav_file_name = 'sample-file.wav'

# Listen to the wav file
Audio(wav_file_name)

Create a .tfrecord file that contains embeddings for our sample wav file.

%%capture
!python models/research/audioset/vggish/vggish_inference_demo.py --wav_file "$wav_file_name" --tfrecord_file "sample-audio.tfrecord"

Create embeddings

# Create embeddings from the tfrecord file
raw_dataset = tf.data.TFRecordDataset("sample-audio.tfrecord")

# Read the single SequenceExample written by the inference script
for raw_record in raw_dataset.take(1):
    example = tf.train.SequenceExample()
    example.ParseFromString(raw_record.numpy())

audio_frame = []

# The indexed vectors were built from 10 frames,
# so the query vector must have the same dimension
for i in range(10):
    audio_frame.append(np.frombuffer(example.feature_lists.feature_list["audio_embedding"].feature[i].bytes_list.value[0], dtype=np.int8))

audio_frame = np.array(audio_frame, dtype=np.float32)
sample_embedding = audio_frame.flatten().tolist()

Query

Here we query the index with the recording’s vector embedding and present the related YouTube videos. Recall that the similarity is computed on the videos’ audio tracks, not their visual content.

# Query the vector index and display the results
query_results = index.query(queries=[sample_embedding], top_k=10)

print('\n Test audio file:\n')
display(Audio(wav_file_name))

result = query_results.results[0].matches
ids = [res.id for res in result]
scores = [res.score for res in result]
# Ids have the form <video_id>_<start>_<end>; video ids may contain underscores,
# so only the last two fields are split off
df_result = pd.DataFrame({
              'id': [id.rsplit('_', 2)[0] for id in ids],
              'start_time': [id.rsplit('_', 2)[1] for id in ids],
              'end_time': [id.rsplit('_', 2)[2] for id in ids],
              'score': scores,
              'url': ['https://www.youtube.com/watch?v={}&t={}'.format(id.rsplit('_', 2)[0],
                                                                       int(float(id.rsplit('_', 2)[1]))) for id in ids]})

# Keep the top 5 playable videos
df_result = df_result[df_result['id'].map(check_video_url)].reset_index(drop=True)[:5]
df_result.index += 1
df_result_styler = df_result.style.format({'url': make_clickable})
print('\n\n Most similar audios based on Pinecone vector search: \n')
display(df_result_styler)
        
for i, row in df_result.iterrows():
    print(f'\n{i}.')
    display(play_video(row.id, int(float(row.start_time))))
 Most similar audios based on Pinecone vector search:
id           start_time  end_time  score     url
hbCaMcbT8to  30.0        40.0      0.261676  https://www.youtube.com/watch?v=hbCaMcbT8to&t=30
ZVvX2-ldhvY  30.0        40.0      0.223604  https://www.youtube.com/watch?v=ZVvX2-ldhvY&t=30
XWVGQbfpA0k  130.0       140.0     0.184721  https://www.youtube.com/watch?v=XWVGQbfpA0k&t=130
QS3DabGF41Y  120.0       130.0     0.178976  https://www.youtube.com/watch?v=QS3DabGF41Y&t=120
n27GpYJ_2Hs  30.0        40.0      0.171386  https://www.youtube.com/watch?v=n27GpYJ_2Hs&t=30

Delete the Index

Delete the index once you are sure that you no longer need it. Once the index is deleted, it cannot be recovered.

pinecone.delete_index(index_name)