Movie Recommender System With a Deep Ranking Model
In this notebook we will create a movie recommendation system on the MovieLens dataset using Pinecone. The dataset contains a collection of movies, a set of users, and movie ratings from users that range from 1 to 5. These ratings are sparse, because each user rates only a small fraction of the movies, and biased, because different users distribute their ratings differently. Our goal is to take any user ID and return recommended movies for that user.
There are five parts to this recommender system: the dataset of movie ratings, two deep learning models for embedding movies and users, a vector index to perform similarity search on those embeddings, and a custom deep ranking model that scores user-movie pairs to further improve the relevance of the recommended movies. We will use Pinecone to tie everything together and expose the recommender as a real-time service that takes any user ID and returns relevant movie recommendations.
The architecture of our recommender system is shown below. In the "write" path (load), we start with 1,682 movies and transform each title into a vector embedding. The embedding function is trained such that proximity between movies in the multi-dimensional space represents the likelihood that a single user will rate both movies similarly. The movie embeddings are then stored in the vector index.
In the "read" path (query), an embedding function transforms a given user ID into an embedding in the same vector space as the movies, representing the user’s movie preference. Candidate movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space. Finally, these candidates are ranked based on the custom deep ranking model.
Pinecone Setup
!pip install -qU pinecone-client
import pinecone
Get your API key here.
# Load Pinecone API key
import os
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')
# List all current indexes for your API key
pinecone.list_indexes()
[]
Prepare Data and Models
Install and import relevant python packages
!pip install -q scikit-learn pandas matplotlib==3.2.2 tensorflow tfds-nightly pandas-profiling unidecode
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
Prepare the Movie Rating Data
We are using the standard MovieLens 100K dataset, containing 100,000 ratings (1-5) from 943 users on 1,682 movies.
# Load data
_ratings = tfds.load("movielens/100k-ratings", split="train")
_movies = tfds.load("movielens/100k-movies", split="train")
# Collect data that we'd use throughout this notebook
class DATA:
    ratings = None
    train = None
    test = None

DATA.ratings = _ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]  # We keep the ratings of a user-movie pair
})
# Split data into training and testing
shuffled = DATA.ratings.shuffle(100_000, reshuffle_each_iteration=False)
DATA.train = shuffled.take(80_000)
DATA.test = shuffled.skip(80_000).take(20_000)
Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0...
Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0. Subsequent calls will reuse this data.
Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /root/tensorflow_datasets/movielens/100k-movies/0.1.0...
Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-movies/0.1.0. Subsequent calls will reuse this data.
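As a quick sanity check of the split, we can peek at a single training example; each record carries the three fields kept by the map above.
# Peek at one training example (fields: movie_title, user_id, user_rating)
for example in DATA.train.take(1):
    print({key: value.numpy() for key, value in example.items()})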
Load Pre-made Movie and User Embedding Models
We use premade movie and user embedding models built by following this tutorial from the TensorFlow Recommenders library. Each model takes a user ID or a movie title and outputs a corresponding 32-dimensional vector representation. The models were trained to preserve "similarity": users and movies with similar rating profiles produce high inner-product values. For more information, follow the TensorFlow Recommenders tutorial.
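For reference, a user tower in the style of that tutorial is just a string lookup followed by an embedding layer. The sketch below is illustrative only; the downloaded models may differ in their exact architecture.
# Illustrative sketch of a 32-dimensional user tower in the style of the
# TensorFlow Recommenders retrieval tutorial (not the exact downloaded model).
unique_user_ids = np.unique(
    list(DATA.ratings.map(lambda x: x["user_id"]).as_numpy_iterator()))
sketch_user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
])
# The movie tower is built the same way over the unique movie titles.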
# Download the pretrained models from Google Drive. If running outside Colab, please install gdown first.
!gdown --id 1OPflMW6yzBcSlaxWWOSa_SAFSe1iiTZl
Downloading...
From: https://drive.google.com/uc?id=1OPflMW6yzBcSlaxWWOSa_SAFSe1iiTZl
To: /content/premade-models.tgz
100% 492k/492k [00:00<00:00, 90.4MB/s]
# extract the models
import tarfile
tfile = tarfile.open("premade-models.tgz")
tfile.extractall()
tfile.close()
user_model = tf.saved_model.load('tf_retrieval_user_model')
movie_model = tf.saved_model.load('tf_retrieval_movie_model')
Load Premade Pairwise Ranking Model
We premade a pairwise scoring model by following this TensorFlow Recommenders ranking tutorial. The scoring model receives a concatenated user-movie embedding pair and returns a relevance score indicating how relevant the movie is to the user.
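For reference, a scoring head in the style of that tutorial is a small feed-forward network over the concatenated 64-dimensional user-movie pair. Again, this is only a sketch; the downloaded model may differ in detail.
# Illustrative sketch of a pairwise scoring head in the style of the
# TensorFlow Recommenders ranking tutorial (not the exact downloaded model).
sketch_scoring_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # single relevance score
])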
scoring_model = tf.saved_model.load('tf_ranking_pairwise_score')
Sanity-Check the Models
Let's run the three premade models locally. The user and movie models take a corresponding ID or title and output a vector. The scoring model takes the two vectors (concatenated) and outputs a single score. Later, we will use these models to generate the embeddings we store in and query against Pinecone's vector index.
user_embedding = user_model(np.array(["42"]))
movie_embedding = movie_model(np.array(["One Flew Over the Cuckoo's Nest (1975)"]))
pairwise_score = scoring_model(tf.concat([user_embedding, movie_embedding], axis=1))
print(f"user embedding: {user_embedding}")
print(f"movie embedding: {movie_embedding}")
print(f"pairwise score: {pairwise_score}")
user embedding: [[ 0.01683676 0.4281152 0.4133582 -0.12968868 0.12420761 0.50607353
0.08321043 -0.1869202 -0.1490686 -0.4416855 0.10145921 0.29046464
0.29867113 0.5125411 -0.19146752 0.0899531 -0.47919908 -0.35008752
-0.49349973 -0.21783432 -0.10766218 -0.32919145 -0.37802044 -0.39593038
-0.19613707 -0.06320453 -0.05047337 -0.42567044 -0.20733188 0.45100415
0.03500173 -0.15795036]]
movie embedding: [[ 0.60710007 -0.48249206 -0.35593712 0.13949828 0.40926376 0.31134516
0.2955666 0.08130229 0.29382494 -0.14615472 -0.17722447 0.51746374
0.1466727 0.50280887 -0.09506559 0.57231975 0.35222292 0.35850292
-0.16756673 -0.17023738 -0.29335937 0.1672158 -0.2980432 -0.5691722
0.25732362 -0.35206133 0.395624 0.29205453 0.24314538 0.13837588
0.15536724 0.10586588]]
pairwise score: [[3.7656188]]
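Since the index we create below uses the "dotproduct" metric, the retrieval score for this user-movie pair is simply the inner product of the two 32-dimensional vectors, which is different from the learned pairwise score printed above:
# Retrieval score = inner product of the user and movie embeddings
retrieval_score = np.dot(np.asarray(user_embedding)[0], np.asarray(movie_embedding)[0])
print(f"dot-product retrieval score: {retrieval_score:.4f}")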
Configure Pinecone
This section shows how to use Pinecone to easily build and deploy a movie recommendation engine that turns raw data into vector embeddings, maintains a live index of those vectors, and returns recommended movies on demand. The starting point is the premade user and movie embedding models and the pairwise deep ranking model. Next, we show how to upload the movie vector embeddings into Pinecone's vector index. Finally, we query the index and retrieve recommendations for an arbitrary user ID.
Create a Managed Service
The typical workflow with Pinecone is:
- Create an index.
- Create a connection to the index, and start sending insert and query requests.
Create an index
movielens_index_name = 'movielens-demo-simple'
# Check whether the index with the same name already exists
if movielens_index_name in pinecone.list_indexes():
    pinecone.delete_index(movielens_index_name)
pinecone.create_index(movielens_index_name, dimension=32, metric="dotproduct")
Create a connection to the index service using the index's name.
index = pinecone.Index(movielens_index_name)
index.describe_index_stats()
{'dimension': 32, 'namespaces': {}}
Upload movie embeddings
Our recommender service will index and fetch movie vector embeddings. This means we will use the premade movie model to generate embeddings for the movies.
Transform movies into embeddings. Prepare items to upload as a list of tuples in the form (id, vector).
from unidecode import unidecode
# Get all of the movies
all_movies = list(set(
    unidecode(title.decode()[:64])  # titles truncated to 64 characters; they double as the vector IDs
    for title in DATA.ratings.map(lambda xx: xx['movie_title']).as_numpy_iterator()
))
# Transform movies into embeddings
movie_embeddings = movie_model(np.array(all_movies)).numpy().tolist()
# Prepare movie embeddings for upload
items_to_insert = list(zip(all_movies, movie_embeddings))
display(items_to_insert[:1])
[('Next Karate Kid, The (1994)',
[-0.2438640594482422,
0.3711131513118744,
0.1843249350786209,
-0.2965642511844635,
0.2615542709827423,
0.032611653208732605,
-0.42649367451667786,
-0.30155664682388306,
0.16028240323066711,
-0.406296044588089,
0.34872621297836304,
-0.4027850329875946,
0.31748971343040466,
0.30967098474502563,
-0.1959797888994217,
0.36000943183898926,
-0.20974202454090118,
-0.3767372965812683,
-0.23246760666370392,
-0.2351655662059784,
0.16198723018169403,
0.03874821215867996,
0.2895303964614868,
-0.294216513633728,
-0.30399253964424133,
0.29395079612731934,
-0.38212960958480835,
-0.21649977564811707,
-0.3853921890258789,
0.07490364462137222,
0.30172020196914673,
0.24804189801216125])]
Insert items into the index service.
import itertools
def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))
print('Index statistics before upsert:', index.describe_index_stats())
# Upsert data
for batch in chunks([(ii[:64], x) for ii, x in items_to_insert], 1000):
    index.upsert(vectors=batch)
print('Index statistics after upsert:', index.describe_index_stats())
Index statistics before upsert: {'dimension': 32, 'namespaces': {}}
Index statistics after upsert: {'dimension': 32, 'namespaces': {'': {'vector_count': 1664}}}
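As an optional check (assuming the fetch endpoint of the same legacy pinecone-client used above), we can read one of the vectors back by its ID:
# Optional sanity check: fetch one upserted vector back by its ID
fetched = index.fetch(ids=[all_movies[0]])
print(list(fetched.vectors.keys()))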
Search for Recommended Movies by User ID
Hurray! Your movie recommender service is up and running and all items are uploaded. You can now search for recommended movies for any user directly from your notebook or app, and get ranked results in real-time. Let's add one more function that shows a movie poster for each result, just for fun.
Include Movie Posters in Results
The following utility functions help visualize the results using movie posters fetched from the OMDb API. This is completely optional and just for fun.
import requests
from IPython.display import Image, display
from IPython.core.display import HTML
def movie_title_to_poster_url(title):
    POSTERS_URL_TEMPLATE = "http://www.omdbapi.com/?i=tt3896198&apikey=4a3604bd&t="
    # Strip the trailing year, e.g. "Die Hard (1988)" -> "Die Hard"
    parsed_title = ' '.join(title.split()[:-1])
    # Move a trailing ", The" to the front, e.g. "Rock, The" -> "The Rock"
    if parsed_title.lower().endswith(', the'):
        parsed_title = "The " + parsed_title[:-len(', the')]
    try:
        r = requests.get(POSTERS_URL_TEMPLATE + parsed_title).json()
        return r['Poster']
    except Exception:
        # Fallback image
        return "https://cdn4.iconfinder.com/data/icons/small-n-flat/24/movie-alt2-512.png"
def path_to_image_html(path):
    return '<img src="' + path + '" width="100" >'
def show_image_tiles(images_urls, n_col=6):
    import math
    rows = [images_urls[ii * n_col: (1 + ii) * n_col] for ii in range(math.ceil(len(images_urls) / n_col))]
    for ii, rr in enumerate(rows):
        if len(rr) < n_col:
            rows[ii] = rr + [''] * (n_col - len(rr))
    df = pd.DataFrame(rows)
    display(HTML(df.to_html(escape=False, formatters=[path_to_image_html] * df.shape[1])))
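For example (this calls the OMDb endpoint hard-coded above, so it needs network access):
# Quick usage example: fetch a couple of posters and render them as tiles
example_urls = [movie_title_to_poster_url(t) for t in ["Die Hard (1988)", "Star Wars (1977)"]]
show_image_tiles(example_urls)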
Re-ranking recommendations
After we query a list of users and get recommendations from our vector index, we can rerank them using the custom pairwise-scoring model.
def rerank_recommendations(user_embeddings, query_results):
    result_matches = []
    for q, res in zip(user_embeddings, query_results):
        ids = [match.id for match in res.matches]
        data = [match.values for match in res.matches]
        # Score every candidate movie with the pairwise scoring model
        updated_scores = [(i, scoring_model(tf.concat([[q], [movie_model(np.array([movie]))[0]]], axis=1)).numpy().flatten()[0]) for i, movie in enumerate(ids)]
        # Sort the candidates by descending pairwise score
        sorted_inx_score = sorted(updated_scores, key=lambda i_s: -i_s[1])
        new_scores = [float(s) for _, s in sorted_inx_score]
        new_ids = [ids[i] for i, _ in sorted_inx_score]
        new_values = [data[i] for i, _ in sorted_inx_score]
        for new_id, new_score, new_value in zip(new_ids, new_scores, new_values):
            result_matches.append({
                'id': new_id,
                'score': new_score,
                'values': new_value
            })
    query_reranked_results = [{'matches': result_matches}]
    return query_reranked_results
Search for Recommended Movies
Let's see what our movie recommender picks for User 55. Note that we retrieve 50 movies yet only show the top five after re-ranking. Setting top_k=50 gives the re-ranking postprocessor (the ranking model) a sufficiently large pool of candidate movies; once re-ranked, we only keep the top five results.
# Define a list of users
user_ids = ['55']
# Retrieve user embeddings
user_embeddings = [user_model(np.array([user]))[0].numpy().tolist() for user in user_ids]
# Query by user embeddings
query_results = []
for xq in user_embeddings:
    res = index.query(xq, top_k=50, include_values=True)
    query_results.append(res)
query_results_reranked = rerank_recommendations(user_embeddings, query_results)
Comparing Recommendations With and Without Re-ranking
You may be wondering whether the re-ranking step is worth the effort. Let's compare the query results with and without the re-ranking step.
from IPython.display import display_html
from itertools import chain,cycle
def display_side_by_side(*args, titles=cycle([''])):
    html_str = ''
    for df, title in zip(args, chain(titles, cycle(['</br>']))):
        html_str += '<th style="text-align:center"><td style="vertical-align:top">'
        html_str += f'<h2>{title}</h2>'
        html_str += df.to_html(escape=False, formatters=dict(image=path_to_image_html)).replace('table', 'table style="display:inline"')
        html_str += '</td></th>'
    display_html(html_str, raw=True)
# Print results
for _id, res, res_ranked in zip(user_ids, query_results, query_results_reranked):
    print(f'user_id={_id}')
    # Not ranked
    df = pd.DataFrame({'ids': [match['id'] for match in res['matches']][:5], 'scores': [match['score'] for match in res['matches']][:5]})
    df['image'] = list(map(movie_title_to_poster_url, df['ids']))
    # Ranked
    df_r = pd.DataFrame({'ids': [match['id'] for match in res_ranked['matches'][:5]], 'scores': [match['score'] for match in res_ranked['matches'][:5]]})
    df_r['image'] = list(map(movie_title_to_poster_url, df_r['ids']))
    # Show both
    display_side_by_side(df_r, df, titles=['With re-ranking', 'Without re-ranking'])
user_id=55
With re-ranking

ids | scores
---|---
Sudden Death (1995) | 4.426783
Thinner (1996) | 4.384661
Mr. Holland's Opus (1995) | 4.271734
Die Hard (1988) | 4.262465
Star Wars (1977) | 4.259742

Without re-ranking

ids | scores
---|---
Lost World: Jurassic Park, The (1997) | 2.879311
Men in Black (1997) | 2.787736
Executive Decision (1996) | 2.739666
Con Air (1997) | 2.685612
Rock, The (1996) | 2.578943
User Rating Profile
Note that the quality of the results depends on the quality of the premade models and on the accuracy of Pinecone's vector index. This example demonstrates that the recommendations fit well with the user's rating profile.
user_ratings = [r for r in DATA.ratings if r["user_id"] == '55']
content = [dict(title="".join(map(chr, r["movie_title"].numpy())),
                movie=movie_title_to_poster_url("".join(map(chr, r["movie_title"].numpy()))),
                rating=r['user_rating'].numpy()) for r in user_ratings]
df = pd.DataFrame(content).sort_values('rating', ascending=False)
display(HTML(df.to_html(escape=False, index=False, formatters=dict(movie=path_to_image_html))))
title | rating
---|---
Fugitive, The (1993) | 5.0
Die Hard (1988) | 5.0
Twister (1996) | 5.0
Blade Runner (1982) | 5.0
Heat (1995) | 5.0
Braveheart (1995) | 5.0
Raiders of the Lost Ark (1981) | 4.0
Return of the Jedi (1983) | 4.0
Star Wars (1977) | 4.0
Pulp Fiction (1994) | 4.0
Volcano (1997) | 3.0
Men in Black (1997) | 3.0
Twelve Monkeys (1995) | 3.0
Independence Day (ID4) (1996) | 3.0
Rock, The (1996) | 3.0
Eraser (1996) | 2.0
Batman & Robin (1997) | 2.0
Speed 2: Cruise Control (1997) | 1.0
Mission: Impossible (1996) | 1.0
Executive Decision (1996) | 1.0
Con Air (1997) | 1.0
In this example, the re-ranking step appears to produce noticeably better recommendations: the re-ranked list includes movies this user rated 4-5 (e.g., Die Hard, Star Wars), while the results without re-ranking include several titles the user rated 1 (e.g., Executive Decision, Con Air).
Delete the index
Usually, you will deploy the index and keep it running for as long as it's being used. In some cases you may want to delete it, for example if you want to deploy a different version of the index with upgraded models.
pinecone.delete_index(movielens_index_name)
Summary
We showed how to build a movie recommendation service using Pinecone. The service embeds movie titles and stores them in a vector index, embeds the queried user ID into the same vector space, then retrieves, re-ranks, and displays personalized movie recommendations for that user.