Video Recommendation Using Similarity Search

In this notebook we will create a movie recommendation system on the Movielens dataset.

In this dataset we have a collection of movies, a bunch of users, and movie ratings from users that range from 1 to 5. These ratings are sparse because each user rates only a small percentage of the total movies, and they are biased because users' ratings are distributed differently. Our goal is to take any user ID and search for recommended movies for that user.


There are four parts to this recommender system:

  • The dataset of movie ratings
  • Two deep learning models for embedding movies and users
  • A vector index to perform similarity search on those embeddings
  • A custom deep ranking model to score user-movie pairs and further improve relevance of the recommended movies.

We will use Pinecone to tie everything together and expose the recommender as a real-time service that will take any user ID and return relevant movie recommendations.

The architecture of our recommender system is shown below. In the “write” path (load), we start with 1,682 movie IDs and transform each into vector embeddings. The embedding function is trained such that proximity between movies in the multi-dimensional space represents the likelihood that a single user will rate both movies similarly. The 1,682 embeddings are then stored in the vector index.

[Figure: recommender system architecture with explicit pairwise scoring]

In the “read” path (query), an embedding function transforms a given user ID into an embedding in the same vector space as the movies, representing the user’s movie preference. Candidate movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.

Finally, the candidates are ranked based on the custom deep ranking model.

Let’s get started.

Pinecone Setup

!pip install -qU pinecone-client
import pinecone

Get your API key here.

# Load Pinecone API key
import os 

api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key)

#List all current indexes for your API key
pinecone.list_indexes()

Prepare Data and Models

Install and import relevant python packages

!pip install -qU scikit-learn pandas matplotlib tensorflow tfds-nightly
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

Prepare the Movie Rating Data

We are using the standard Movielens recommendation dataset containing 100,000 ratings (1-5) from 943 users on 1,682 movies.
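To make the sparsity concrete, the short calculation below uses only the three figures quoted above. It is a back-of-the-envelope sketch and does not load the dataset.

```python
# Figures quoted in the text for the MovieLens 100K dataset
n_ratings = 100_000
n_users = 943
n_movies = 1_682

# Every possible user-movie combination
possible_pairs = n_users * n_movies

# Fraction of those combinations that actually carry a rating
density = n_ratings / possible_pairs

print(f"possible user-movie pairs: {possible_pairs:,}")            # 1,586,126
print(f"rated fraction: {density:.1%}")                            # 6.3%
print(f"average ratings per user: {n_ratings / n_users:.0f}")      # 106
```

Roughly 94% of the user-movie matrix is empty, which is why the recommender has to generalize from learned embeddings rather than look ratings up directly.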

# Load data
_ratings = tfds.load("movielens/100k-ratings", split="train")
_movies = tfds.load("movielens/100k-movies", split="train")

# Collect data that we'd use throughout this notebook
class DATA:
    ratings = None
    train = None
    test = None
    
DATA.ratings = _ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]  # We keep the ratings of a user-movie pair
})

# Split data into training and testing
shuffled = DATA.ratings.shuffle(100_000, reshuffle_each_iteration=False)
DATA.train = shuffled.take(80_000)
DATA.test = shuffled.skip(80_000).take(20_000)

Load Pre-made Movie and User Embedding Models

We use premade movie and user embedding models following this tutorial by the TensorFlow recommender library. The models receive user id or movie title and output a corresponding 32-dimensional vector representation. The models were trained to preserve “similarity” such that similar user or movie ranking profiles will induce high inner-product values. For more information, follow the TensorFlow recommender tutorial.
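The idea of retrieving candidates by inner product can be sketched with plain NumPy. The vectors below are random stand-ins rather than outputs of the premade models, and `top_k_by_inner_product` is a hypothetical helper name. A vector index performs this kind of search at scale; this brute-force version only illustrates the scoring rule.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32  # same dimensionality as the premade embedding models

# Random stand-ins for movie embeddings and one user embedding (assumption)
movie_vectors = rng.normal(size=(1000, dim)).astype(np.float32)
user_vector = rng.normal(size=dim).astype(np.float32)

def top_k_by_inner_product(query, items, k):
    """Return the indices and scores of the k items with the largest inner product."""
    scores = items @ query                        # one dot product per item
    top = np.argpartition(-scores, k - 1)[:k]     # unordered k best
    top = top[np.argsort(-scores[top])]           # order them best-first
    return top, scores[top]

ids, scores = top_k_by_inner_product(user_vector, movie_vectors, k=5)
print(ids)
print(scores)
```

Because the score is a raw inner product, vector length matters as well as direction, which is consistent with the "dotproduct" metric chosen for the index later in the notebook.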

#Downloading the pretrained models from Drive. If running outside Colab please install gdown.
!gdown --id 1OPflMW6yzBcSlaxWWOSa_SAFSe1iiTZl
# extract the models
import tarfile

tfile = tarfile.open("premade-models.tgz")
tfile.extractall()
tfile.close()
user_model = tf.saved_model.load('tf_retrieval_user_model')
movie_model = tf.saved_model.load('tf_retrieval_movie_model')

Load Premade Pairwise Ranking Model

We also prepared a pairwise ranking model following this TensorFlow recommender library tutorial. The model receives a pair of user and movie embeddings and returns a score indicating how relevant the movie is to the user.

scoring_model = tf.saved_model.load('tf_ranking_pairwise_score')

Sanity-Check the Models

Let’s run the three premade models locally. The user and movie models receive a corresponding id or title and output a vector. The scoring model receives two vectors and outputs a single score value. Later, we will upload these models to Pinecone’s model hub.

user_embedding = user_model(np.array(["42"]))
movie_embedding = movie_model(np.array(["One Flew Over the Cuckoo's Nest (1975)"]))
pairwised_score = scoring_model(tf.concat([user_embedding, movie_embedding], axis=1))
print(f"user embedding: {user_embedding}")
print(f"movie embedding: {movie_embedding}")
print(f"pairwise score: {pairwised_score}")
user embedding: [[ 0.01683676  0.4281152   0.4133582  -0.12968868  0.12420761  0.50607353
   0.08321043 -0.1869202  -0.1490686  -0.4416855   0.10145921  0.29046464
   0.29867113  0.5125411  -0.19146752  0.0899531  -0.47919908 -0.35008752
  -0.49349973 -0.21783432 -0.10766218 -0.32919145 -0.37802044 -0.39593038
  -0.19613707 -0.06320453 -0.05047337 -0.42567044 -0.20733188  0.45100415
   0.03500173 -0.15795036]]
movie embedding: [[ 0.60710007 -0.48249206 -0.35593712  0.13949828  0.40926376  0.31134516
   0.2955666   0.08130229  0.29382494 -0.14615472 -0.17722447  0.51746374
   0.1466727   0.50280887 -0.09506559  0.57231975  0.35222292  0.35850292
  -0.16756673 -0.17023738 -0.29335937  0.1672158  -0.2980432  -0.5691722
   0.25732362 -0.35206133  0.395624    0.29205453  0.24314538  0.13837588
   0.15536724  0.10586588]]
pairwise score: [[3.7656188]]

Create a Vector Similarity Search Service

This section shows how to use Pinecone to easily build and deploy a movie recommendation engine that turns raw data into vector embeddings, maintains a live index of those vectors, and returns recommended movies on demand.

The starting point is the premade movie and user embedding models and the pairwise deep ranking model. Next, we show how to upload the movie embeddings into Pinecone’s vector index. Finally, we query the index and retrieve recommendations for an arbitrary user ID.

The typical workflow of using Pinecone:

  1. Create an index.
  2. Create a connection to the index, and start sending insert and query requests.

Create an index

movielens_index_name = 'movielens-demo-simple'

# Delete any existing index with the same name so we start from a clean state
if movielens_index_name in pinecone.list_indexes():
    pinecone.delete_index(movielens_index_name)

pinecone.create_index(movielens_index_name, metric="dotproduct", shards=1)

Create a connection to the index service using the index’s name.

index = pinecone.Index(name = movielens_index_name)
index.info()

Upload Movie Embeddings

Our recommender service will index and fetch movie vector embeddings. This means we will use the premade movie model to generate embeddings for the movies.

Transform movies into embeddings, then prepare items to upload as a list of tuples in the form (id, vector).

# Get all of the movies
all_movies = [title.decode()[:64] for title in DATA.ratings.map(lambda xx: xx['movie_title']).as_numpy_iterator()]

# Transform movies into embeddings
movie_embeddings = movie_model(np.array(all_movies)).numpy()

# Prepare movie embeddings for upload
items_to_insert = list(zip(all_movies, movie_embeddings))
display(items_to_insert[:2])
[("One Flew Over the Cuckoo's Nest (1975)",
  array([ 0.60710007, -0.48249206, -0.35593712,  0.13949828,  0.40926376,
          0.31134516,  0.2955666 ,  0.08130229,  0.29382494, -0.14615472,
         -0.17722447,  0.51746374,  0.1466727 ,  0.50280887, -0.09506559,
          0.57231975,  0.35222292,  0.35850292, -0.16756673, -0.17023738,
         -0.29335937,  0.1672158 , -0.2980432 , -0.5691722 ,  0.25732362,
         -0.35206133,  0.395624  ,  0.29205453,  0.24314538,  0.13837588,
          0.15536724,  0.10586588], dtype=float32)),
 ('Strictly Ballroom (1992)',
  array([ 1.1002023e-02,  1.6174538e-01,  3.3386484e-01, -1.3717607e-01,
          3.5857826e-01, -5.8230944e-04,  6.1979485e-01,  3.0144712e-01,
          1.7912011e-01, -1.4533635e-02, -3.1142405e-01,  3.9141589e-01,
         -1.4907111e-01,  1.7742743e-01, -1.0792779e-01,  2.7480638e-02,
          5.1916885e-01, -1.9283614e-01, -1.0605173e-02, -3.9342603e-01,
         -2.7115941e-01,  2.0567685e-01, -2.9449952e-01,  7.1734749e-03,
          9.2954300e-02,  4.0293813e-01, -5.4571800e-02,  1.6616797e-01,
         -1.8021818e-01,  1.8495591e-02, -1.9566132e-01, -5.0525314e-01],
        dtype=float32))]
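Because the titles are read from the ratings table, the same title appears many times in the list, while the index ends up holding one vector per unique ID. A minimal sketch of de-duplicating and batching (id, vector) pairs before upload is shown below. The helper names `dedupe_by_id` and `batched` and the batch size are illustrative assumptions, and no Pinecone call is made here.

```python
from itertools import islice

def dedupe_by_id(items):
    """Keep the first (id, vector) pair seen for each id, preserving order."""
    seen = {}
    for item_id, vector in items:
        seen.setdefault(item_id, vector)
    return list(seen.items())

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Toy data standing in for items_to_insert (assumption)
toy_items = [("A (1990)", [0.1]), ("B (1991)", [0.2]), ("A (1990)", [0.1]), ("C (1992)", [0.3])]

unique_items = dedupe_by_id(toy_items)
for batch in batched(unique_items, batch_size=2):
    print([item_id for item_id, _ in batch])
```

Each batch could then be passed to the upsert call used in this notebook. Whether batching is needed depends on the request size limits of the service.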

Insert items into the index service.

print('Index size before upsert:', index.info())

upsert_acks = index.upsert(items=[(ii[:64],x) for ii,x in items_to_insert])

print('Index size after upsert:', index.info())
print()

print(f'Sample upsert responses:')
pd.DataFrame(upsert_acks[:3])
Index size before upsert: InfoResult(index_size=0)

Index size after upsert: InfoResult(index_size=1664)

Sample upsert responses:
   id
0  One Flew Over the Cuckoo's Nest (1975)
1  Strictly Ballroom (1992)
2  Very Brady Sequel, A (1996)

Hurray! Your movie recommender service is up and running and all items are uploaded. You can now search for recommended movies for any user directly from your notebook or app, and get ranked results in real-time. Let’s add one more function that shows a movie poster for each result, just for fun.

Include Movie Posters in Results

The following utility functions help visualize the results using movie posters scraped from the web. This is completely optional and just for fun.

import requests
from IPython.display import HTML, Image, display

def movie_title_to_poster_url(title):
    POSTERS_URL_TEMPLATE = "http://www.omdbapi.com/?i=tt3896198&apikey=4a3604bd&t="
    parsed_title = ' '.join(title.split()[:-1])
    if parsed_title.lower().endswith(', the'):
        parsed_title = "The "+parsed_title[:-len(', the')]
    try:
        r = requests.get(POSTERS_URL_TEMPLATE+parsed_title).json()
        return r['Poster']
    except Exception:
        # Network error, bad JSON, or no 'Poster' field: fall back to a generic image
        return "https://cdn4.iconfinder.com/data/icons/small-n-flat/24/movie-alt2-512.png"


def path_to_image_html(path):
    return '<img src="'+ path + '" width="100" >'


def show_image_tiles(images_urls, n_col=6):
    import math

    rows = [images_urls[ii * n_col: (1+ii) * n_col] for ii in range(math.ceil(len(images_urls) / n_col))]
    for ii, rr in enumerate(rows):
        if len(rr) < n_col:
            rows[ii] = rr + [''] * (n_col - len(rr))
    df = pd.DataFrame(rows)
    display(HTML(df.to_html(escape=False, formatters=[path_to_image_html]*df.shape[1])))
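To see what the title clean-up inside `movie_title_to_poster_url` does, the same two steps are isolated below in a standalone helper. The name `parse_title` is hypothetical and no network request is made. The sample titles are taken from this notebook.

```python
def parse_title(title):
    """Mirror the title clean-up used before the poster lookup."""
    # Drop the last whitespace-separated token, which is the "(year)" suffix
    parsed = ' '.join(title.split()[:-1])
    # Move a trailing ", The" to the front: "Rock, The" -> "The Rock"
    if parsed.lower().endswith(', the'):
        parsed = "The " + parsed[:-len(', the')]
    return parsed

for title in ["Rock, The (1996)",
              "Lost World: Jurassic Park, The (1997)",
              "Die Hard (1988)",
              "Very Brady Sequel, A (1996)"]:
    print(f"{title!r} -> {parse_title(title)!r}")
```

Only a trailing ", The" is moved. A title such as "Very Brady Sequel, A (1996)" keeps its trailing article, so its poster lookup may miss and return the fallback image.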


Re-ranking Recommendations

After we query a list of users and get recommendations from our vector index, we can rerank them using the custom pairwise-scoring model.

from pinecone import QueryResult

def rerank_recommendations(user_embeddings, query_results):
    # Re-score every retrieved movie with the pairwise model, then sort by the new score
    output = []

    for q, res in zip(user_embeddings, query_results):
      
        updated_scores = [(i, scoring_model(tf.concat([[q], [movie_model(np.array([movie]))[0]]], axis=1)).numpy().flatten()[0]) for i, movie in enumerate(res.ids)]
        sorted_inx_score = sorted(updated_scores, key=lambda i_s: -i_s[1])
        new_scores = [s for _,s in sorted_inx_score]
        new_ids = [list(res.ids)[i] for i,_ in sorted_inx_score]
        new_data = np.array([list(res.data)[i] for i,_ in sorted_inx_score])
        output.append(QueryResult(ids=new_ids, scores=new_scores, data=new_data))

    return output

Let’s see what our movie recommender picks for user 55. Note that we retrieve 50 movies but show only the top five after re-ranking. Setting top_k=50 gives the re-ranking step (the ranking model) a sufficiently large set of candidates to choose from; once the candidates are re-ranked, we only care about the top five.
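The retrieve-then-rerank pattern can be sketched independently of the index and the premade models. Here `toy_score` is a hypothetical stand-in for the deep ranking model, and the vectors are random. Only the control flow (score 50 candidates, keep the best five) reflects the step described above.

```python
import numpy as np

def rerank(candidate_ids, candidate_vectors, user_vector, score_fn, keep=5):
    """Re-score each candidate for this user and return the best `keep`, best first."""
    scores = np.array([score_fn(user_vector, v) for v in candidate_vectors])
    order = np.argsort(-scores)[:keep]
    return [candidate_ids[i] for i in order], scores[order]

def toy_score(user_vector, movie_vector):
    # Hypothetical scorer (assumption): closer vectors get a higher score
    return float(-np.linalg.norm(user_vector - movie_vector))

rng = np.random.default_rng(1)
user = rng.normal(size=32)
candidates = rng.normal(size=(50, 32))          # pretend these came back with top_k=50
candidate_ids = [f"movie-{i}" for i in range(50)]

top_ids, top_scores = rerank(candidate_ids, candidates, user, toy_score)
print(top_ids)
print(top_scores)
```

A larger candidate pool gives the scorer more room to reorder results, at the cost of one scoring call per candidate.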

# Define a list of users
user_ids = ['55']

# Retrieve user embeddings
user_embeddings = [user_model(np.array([user]))[0] for user in user_ids]

# Query by user embeddings
query_results = index.query(queries=user_embeddings, top_k=50, include_data=True)
query_results_reranked = rerank_recommendations(user_embeddings, query_results)

Comparing Recommendations With and Without Re-ranking

You may be wondering if the re-ranking step is worth the effort. Let’s see the query results with and without a reranking step and compare the results.

from IPython.display import display_html
from itertools import chain,cycle

def display_side_by_side(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:center"><td style="vertical-align:top">'
        html_str+=f'<h2>{title}</h2>'
        html_str+=df.to_html(escape=False, formatters=dict(image=path_to_image_html)).replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

# Print results
for _id, res, res_ranked in zip(user_ids, query_results, query_results_reranked):
    print(f'user_id={_id}')
    
    # Not ranked
    df = pd.DataFrame({'ids': res.ids[:5], 'scores': res.scores[:5]})
    df['image'] = list(map(movie_title_to_poster_url, df['ids']))

    # Ranked
    df_r = pd.DataFrame({'ids': res_ranked.ids[:5], 'scores': res_ranked.scores[:5]})
    df_r['image'] = list(map(movie_title_to_poster_url, df_r['ids']))

    # Show both
    display_side_by_side(df_r, df, titles=['With re-ranking','Without re-ranking'])

user_id=55

With Re-ranking

   ids                          scores
0  Sudden Death (1995)          4.426783
1  Thinner (1996)               4.384661
2  Mr. Holland's Opus (1995)    4.271734
3  Die Hard (1988)              4.262465
4  Star Wars (1977)             4.259742

Without Re-ranking

   ids                                      scores
0  Lost World: Jurassic Park, The (1997)    2.879311
1  Men in Black (1997)                      2.787736
2  Executive Decision (1996)                2.739666
3  Con Air (1997)                           2.685612
4  Rock, The (1996)                         2.578943

User Ranking Profile

Note that the quality of the results depends on the quality of the premade models and on the accuracy of Pinecone’s vector index. To judge how well the recommendations fit this user, we list the ratings user 55 actually gave.

user_ratings = [r for r in DATA.ratings if r["user_id"] == '55']
content = [dict(title="".join(map(chr,r["movie_title"].numpy())), movie=movie_title_to_poster_url("".join(map(chr,r["movie_title"].numpy()))), rating=r['user_rating'].numpy()) for r in user_ratings]
df = pd.DataFrame(content).sort_values('rating', ascending=False)
display(HTML(df.to_html(escape=False, index=False, formatters=dict(movie=path_to_image_html))))
title                            rating
Fugitive, The (1993)             5.0
Die Hard (1988)                  5.0
Twister (1996)                   5.0
Blade Runner (1982)              5.0
Heat (1995)                      5.0
Braveheart (1995)                5.0
Raiders of the Lost Ark (1981)   4.0
Return of the Jedi (1983)        4.0
Star Wars (1977)                 4.0
Pulp Fiction (1994)              4.0
Volcano (1997)                   3.0
Men in Black (1997)              3.0
Twelve Monkeys (1995)            3.0
Independence Day (ID4) (1996)    3.0
Rock, The (1996)                 3.0
Eraser (1996)                    2.0
Batman & Robin (1997)            2.0
Speed 2: Cruise Control (1997)   1.0
Mission: Impossible (1996)       1.0
Executive Decision (1996)        1.0
Con Air (1997)                   1.0

In this example, it appears the re-ranking step resulted in much better recommendations.
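One rough way to check that impression is to look up the ratings user 55 actually gave to the recommended titles. The sketch below copies the values from the tables above. `known_ratings` is a hypothetical helper, and titles the user never rated are simply skipped, since an unrated movie is unknown rather than bad.

```python
from statistics import mean

# Ratings given by user 55, copied from the rating profile above
user_55_ratings = {
    "Fugitive, The (1993)": 5.0, "Die Hard (1988)": 5.0, "Twister (1996)": 5.0,
    "Blade Runner (1982)": 5.0, "Heat (1995)": 5.0, "Braveheart (1995)": 5.0,
    "Raiders of the Lost Ark (1981)": 4.0, "Return of the Jedi (1983)": 4.0,
    "Star Wars (1977)": 4.0, "Pulp Fiction (1994)": 4.0, "Volcano (1997)": 3.0,
    "Men in Black (1997)": 3.0, "Twelve Monkeys (1995)": 3.0,
    "Independence Day (ID4) (1996)": 3.0, "Rock, The (1996)": 3.0,
    "Eraser (1996)": 2.0, "Batman & Robin (1997)": 2.0,
    "Speed 2: Cruise Control (1997)": 1.0, "Mission: Impossible (1996)": 1.0,
    "Executive Decision (1996)": 1.0, "Con Air (1997)": 1.0,
}

# Top-five lists copied from the comparison above
with_reranking = ["Sudden Death (1995)", "Thinner (1996)", "Mr. Holland's Opus (1995)",
                  "Die Hard (1988)", "Star Wars (1977)"]
without_reranking = ["Lost World: Jurassic Park, The (1997)", "Men in Black (1997)",
                     "Executive Decision (1996)", "Con Air (1997)", "Rock, The (1996)"]

def known_ratings(recommendations, ratings):
    """Ratings the user gave to the recommended titles they have rated."""
    return [ratings[title] for title in recommendations if title in ratings]

rated_with = known_ratings(with_reranking, user_55_ratings)
rated_without = known_ratings(without_reranking, user_55_ratings)

print(f"with re-ranking:    {len(rated_with)} rated, mean {mean(rated_with):.1f}")        # 2 rated, mean 4.5
print(f"without re-ranking: {len(rated_without)} rated, mean {mean(rated_without):.1f}")  # 4 rated, mean 2.0
```

The sample is tiny (two and four rated titles for a single user), so this is a sanity check rather than an evaluation. A fuller comparison would average a ranking metric over many users on the held-out test split.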


Turn Off the Recommender Service

Usually, you will deploy the service and keep it running for as long as it’s being used. In some cases you may want to turn it off, for example if you want to deploy a different version of the service with upgraded models.

pinecone.delete_index(movielens_index_name)

Summary

We showed how to build a movie recommendation service using Pinecone. The service embeds user IDs and movie titles, saves them in a vector index, receives and embeds queries, then retrieves, ranks, and displays personalized movie recommendations for any given user.

What will you build?

Upgrade your search or recommendation systems with just a few lines of code, or contact us for help.
