Personalized Article Recommender

This notebook demonstrates how to use Pinecone’s similarity search to create a simple personalized article or content recommender.

The goal is to create a recommendation engine that retrieves the best article recommendations for each user. When making recommendations with content-based filtering, we evaluate the user’s past behavior and the content items themselves. So in this example, users will be recommended articles that are similar to those they’ve already read.
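To make the idea concrete, here is a minimal sketch of content-based filtering on toy data. The three-dimensional vectors, the article names, and the helper `recommend` are invented for illustration; the notebook itself uses model-generated embeddings and delegates the similarity search to Pinecone.

```python
import numpy as np

# Toy article embeddings (hypothetical values, for illustration only)
articles = {
    "tennis-final": np.array([0.9, 0.1, 0.0]),
    "wimbledon-preview": np.array([0.8, 0.2, 0.1]),
    "stock-rally": np.array([0.0, 0.1, 0.9]),
    "console-review": np.array([0.1, 0.9, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend(read_ids, top_k=2):
    """Rank unread articles by similarity to the mean of the read ones."""
    user_vector = np.mean([articles[i] for i in read_ids], axis=0)
    scores = {i: cosine(user_vector, v)
              for i, v in articles.items() if i not in read_ids}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(recommend(["tennis-final"]))  # ['wimbledon-preview', 'console-review']
```

A reader of the tennis article is matched with the other tennis article first, because its vector points in nearly the same direction.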

Install and Import Python Packages

!pip install --quiet sentence-transformers
!pip install --quiet wordcloud
!pip install --quiet pandas==1.2.3
!pip install --quiet swifter
import pandas as pd
import numpy as np
import time
import swifter
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

In the following sections, we will use Pinecone to build and deploy an article recommendation engine. Pinecone will be responsible for storing the article embeddings, maintaining a live index of those vectors, and returning recommended articles on demand.

Pinecone Installation and Setup

!pip install --quiet -U pinecone-client
import pinecone.graph
import pinecone.service
import pinecone.connector
import pinecone.hub
# load Pinecone API key

api_key = '<YOUR_API_KEY>'
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.

Create a New Service

The typical workflow when using Pinecone:

  1. Create a graph.
  2. Deploy the graph and wait for the corresponding named-service to become live.
  3. Create a connection to the service, and start sending read/write requests.
service_name = 'articles-recommendation'
if service_name in pinecone.service.ls():
    pinecone.service.stop(service_name)

Create a graph

graph = pinecone.graph.IndexGraph(metric='cosine')

graph.view()

Similarity search service for personalized content recommendations

Deploy the graph

pinecone.service.deploy(service_name, graph, timeout=300)
{'success': True, 'msg': ''}

Create the connection to the new service

conn = pinecone.connector.connect(service_name)
conn.info()
InfoResult(index_size=0)

Upload articles

Next, we will prepare data for the Pinecone vector index, and insert it in batches.
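Inserting in batches simply means slicing the list of items into fixed-size chunks and sending one chunk per request. The sketch below shows that slicing on placeholder data; the helper name `batches` and the dummy `(id, vector)` pairs are assumptions made for illustration, and no Pinecone call is involved.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder (id, vector) pairs standing in for real article embeddings
items = [(i, [0.0, 0.0]) for i in range(2500)]
sizes = [len(batch) for batch in batches(items, 1000)]
print(sizes)  # [1000, 1000, 500]
```

The last batch is smaller whenever the number of items is not a multiple of the batch size.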

Load data

The dataset used throughout this example contains 2.7 million news articles and essays from 27 American publications.

Let’s download the dataset.

!rm -f all-the-news-2-1.zip
!rm -f all-the-news-2-1.csv
!wget https://www.dropbox.com/s/cn2utnr5ipathhh/all-the-news-2-1.zip -q --show-progress
!unzip -q all-the-news-2-1.zip
all-the-news-2-1.zi 100%[===================>]   3.13G  83.2MB/s    in 36s

Use Ready Made Vector Embedding Model

The model used in this example is an Average Word Embeddings model. It allows us to create a vector embedding for each article, which we will build from the article's title and content.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('average_word_embeddings_komninos')
/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
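An average word embeddings model represents a text as the mean of the vectors of its words. The sketch below illustrates that idea with a made-up two-dimensional vocabulary; `toy_vectors` and `embed` are hypothetical names, and the real model relies on pretrained vectors and its own tokenizer.

```python
# Hypothetical word vectors, for illustration only
toy_vectors = {
    "nadal": [1.0, 0.0],
    "wins": [0.5, 0.5],
    "stocks": [0.0, 1.0],
}

def embed(text):
    """Average the vectors of known words; in this sketch unknown words are skipped."""
    known = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

print(embed("Nadal wins"))  # [0.75, 0.25]
```

Because word order is ignored, such models are fast, which suits a dataset of this size.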

Using the complete dataset may require more time for the model to generate vector embeddings. We will use only a sample, but if you want to try uploading the whole dataset, set the NROWS flag to None.

NROWS = 200000      # number of rows to be loaded from the csv, set to None for loading all rows, reduce if you have a low amount of RAM or want a faster execution
BATCH_SIZE = 1000   # batch size for upserting in batches

Upload Data in Batches

Let’s prepare data for upload.

Uploading the data may take a while, and depends on the network you use.

#%%time

def prepare_data(data) -> pd.DataFrame:
    'Preprocesses data and prepares it for upsert.'

    # rename id column and remove unnecessary columns
    print("Preparing data...")
    data.rename(columns={"Unnamed: 0": "id"}, inplace = True)
    data.drop(columns=['Unnamed: 0.1', 'date'], inplace = True)

    # extract only first few sentences of each article for quicker vector calculations
    data['article'] = data['article'].fillna('')
    data['article'] = data.article.swifter.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4]))
    # separate title and article with a space so that words do not get fused
    data['title_article'] = data['title'].fillna('') + ' ' + data['article']

    # create a vector embedding based on title and article columns
    print('Encoding articles...')
    encoded_articles = model.encode(data['title_article'], show_progress_bar=True)
    data['article_vector'] = pd.Series(encoded_articles.tolist())

    return data


def upload_items(data):
    'Uploads data in batches.'
    print("Uploading items")

    # create a list of items for upload
    items_to_upload = [(row.id, row.article_vector) for i,row in data.iterrows()]

    # upsert
    for i in range(0, len(items_to_upload), BATCH_SIZE):
        conn.upsert(items=items_to_upload[i:i+BATCH_SIZE]).collect()


def process_file(filename: str) -> pd.DataFrame:
    'Reads the first NROWS rows of a csv file, prepares and uploads the data.'

    data = pd.read_csv(filename, nrows=NROWS)
    data = prepare_data(data)
    upload_items(data)
    return data

uploaded_data = process_file(filename='all-the-news-2-1.csv')
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3338: DtypeWarning: Columns (6,10) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
Preparing data...
Pandas Apply:   0%|          | 0/200000 [00:00<?, ?it/s]
Encoding articles...
Batches:   0%|          | 0/6250 [00:00<?, ?it/s]
Uploading items
conn.info()
InfoResult(index_size=200000)
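The preprocessing step in `prepare_data` keeps only the first four segments of each article, splitting on whitespace that follows a period, colon, or semicolon. The sketch below isolates that expression on sample strings; the helper name `first_segments` and the sample texts are assumptions made for illustration.

```python
import re

def first_segments(text, n=4):
    """Keep the first n segments, splitting on whitespace after '.', ':' or ';'."""
    return ' '.join(re.split(r'(?<=[.:;])\s', text)[:n])

sample = ("Rain stopped play. The match resumed: slowly; fans cheered. "
          "Then it ended. More text.")
print(first_segments(sample))
# Rain stopped play. The match resumed: slowly; fans cheered.

# Abbreviations also end a segment, so "U.S." counts as one
print(first_segments("U.S. stocks rose. Oil fell.", n=2))
# U.S. stocks rose.
```

The split is a rough heuristic rather than true sentence detection, which is acceptable here because the goal is only to shorten the text before encoding.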

Query the Pinecone Service

We will query the index for specific users. Each user is defined by the set of articles they previously read. More specifically, we will pick 10 articles for each user and average their article embeddings to obtain a single embedding for the user.
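The user embedding is the element-wise mean of the vectors of the articles the user read. The queries in this notebook compute it with `[*map(mean, zip(*a))]`; the sketch below applies the same expression to three made-up vectors and compares it with a NumPy equivalent. The values in `read_vectors` are invented for illustration.

```python
from statistics import mean
import numpy as np

# Three hypothetical article vectors read by one user
read_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0], [2.0, 4.0, 2.0]]

# zip(*read_vectors) groups values by dimension; mean averages each group
user_vector = [*map(mean, zip(*read_vectors))]

# Equivalent NumPy form: average over the article axis
user_vector_np = np.mean(np.array(read_vectors), axis=0)

print(user_vector)  # [2.0, 2.0, 2.0]
```

The NumPy form is usually faster for long vectors, while the standard-library form works directly on a pandas Series of lists.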

We will create three users and query Pinecone for each of them:

  • User who likes to read Sports News
  • User who likes to read Entertainment News
  • User who likes to read Business News

Let’s define mappings for titles, sections, and publications for each article.

titles_mapped = dict(zip(uploaded_data.id, uploaded_data.title))
sections_mapped = dict(zip(uploaded_data.id, uploaded_data.section))
publications_mapped = dict(zip(uploaded_data.id, uploaded_data.publication))

Also, we will define a function that uses wordcloud to visualize results.

def get_wordcloud_for_user(recommendations):

    # extend the default stopwords; WordCloud lowercases them, so keep only strings
    stopwords = set(STOPWORDS).union(['NaN', 'S'])

    wordcloud = WordCloud(
                   max_words=50000,
                   min_font_size=12,
                   max_font_size=50,
                   relative_scaling=0.9,
                   stopwords=stopwords,
                   normalize_plurals=True
    )

    # skip missing titles (NaN is a float, not a string)
    clean_titles = [title for title in recommendations.title.values if isinstance(title, str)]
    title_wordcloud = wordcloud.generate(' '.join(clean_titles))

    plt.imshow(title_wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Let’s query the Pinecone service using three users.

Query Sports User

from statistics import mean


# first create a user who likes to read sports news about tennis
sport_user = uploaded_data.loc[((uploaded_data['section'] == 'Sports News' ) |
                                (uploaded_data['section'] == 'Sports')) &
                                (uploaded_data['article'].str.contains('Tennis'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(sport_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = sport_user['article_vector']
sport_user_vector = [*map(mean, zip(*a))]

# query Pinecone
query_results = conn.query(queries=[sport_user_vector], top_k=10).collect()

# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
index | title | article | section | publication
2261 | Son of Borg makes quiet debut on London grassc... | LONDON (Reuters) - A blonde-haired, blue-eyed ... | Sports News | Reuters
12373 | Cilic offers Nadal a Wimbledon reality check | LONDON (Reuters) - Spaniard Rafael Nadal got a... | Sports News | Reuters
17124 | Perth confirmed as host for Fed Cup final | (Reuters) - Perth has been named host city for... | Sports News | Reuters
18411 | Fed Cup gets revamp with 12-nation Finals in B... | LONDON (Reuters) - The Fed Cup’s existing form... | Sports News | Reuters
26574 | Nadal to prepare for Wimbledon at Hurlingham e... | (Reuters) - World number two Rafa Nadal has en... | Sports News | Reuters
34957 | Tennis Legend Margaret Court Went Off the Rail... | Margaret Court, the most decorated tennis play... | Sports | Vice
35508 | Puck City: The Enduring Success of Ice Hockey ... | This article originally appeared on VICE Sport... | Sports | Vice
38393 | As if by royal command, seven Britons make it ... | LONDON (Reuters) - Tennis fan the Duchess of C... | Sports News | Reuters
62445 | Williams fined $17,000 for U.S. Open code viol... | NEW YORK (Reuters) - Serena Williams has been ... | Sports News | Reuters
84122 | Kyrgios still wrestling with his tennis soul a... | LONDON (Reuters) - Timothy Gallwey’s million-s... | Sports News | Reuters
This table contains recommended articles for the user:
index | id | score | title | section | publication
0 | 138865 | 0.966407 | Federer survives first-set wobble to down Wimb... | Sports News | Reuters
1 | 26574 | 0.965867 | Nadal to prepare for Wimbledon at Hurlingham e... | Sports News | Reuters
2 | 12373 | 0.965307 | Cilic offers Nadal a Wimbledon reality check | Sports News | Reuters
3 | 155913 | 0.963684 | U.S. men likely to wander Wimbledon wilderness... | Sports News | Reuters
4 | 60613 | 0.962414 | Auger-Aliassime powers past Tsitsipas into Que... | Sports News | Reuters
5 | 22764 | 0.962373 | Serena headed to Wimbledon seeking return to form | Sports News | Reuters
6 | 71768 | 0.962168 | Venus, Serena, and the Power of Believing | Sports | Vice
7 | 2261 | 0.961590 | Son of Borg makes quiet debut on London grassc... | Sports News | Reuters
8 | 45469 | 0.961451 | Tennis: Barty a win away from world number one | Sports News | Reuters
9 | 55061 | 0.960677 | Warrior on court, diplomat off it, classy Bart... | Sports News | Reuters
A word-cloud representing the results:

Wordcloud of recommended sports articles

Query Entertainment User

# first create a user who likes to read news about Xbox
entertainment_user = uploaded_data.loc[((uploaded_data['section'] == 'Entertainment') |
                                        (uploaded_data['section'] == 'Games') |
                                        (uploaded_data['section'] == 'Tech by VICE')) &
                                        (uploaded_data['article'].str.contains('Xbox'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(entertainment_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = entertainment_user['article_vector']
entertainment_user_vector = [*map(mean, zip(*a))]

# query Pinecone
query_results = conn.query(queries=[entertainment_user_vector], top_k=10).collect()

# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
index | title | article | section | publication
4977 | A Canadian Man Is Pissed That His Son Ran Up a... | A Pembroke, Ontario, gun shop owner is "mad as... | Games | Vice
12016 | 'I Expect You to Die' is One of Virtual Realit... | The reason I bought a Vive over and Oculus ear... | Games | Vice
16078 | Windows 10's Killer App? Xbox One Games | Microsoft's crusade to get the world to instal... | Tech by VICE | Vice
20318 | Black Friday Not Your Thing? Play These Free G... | It's Black Friday, the oh-so-American shopping... | Games | Vice
25785 | Nintendo’s Win at E3 Shows That It's a Console... | E3 has come and gone for 2016, the LA expo o... | Games | Vice
29653 | You Can Smell Like a Gamer With Lynx’s New Xbo... | Gamers in Australia and New Zealand will soon ... | Games | Vice
33234 | It’s Old and It’s Clunky, But You Really Must ... | When Dragon's Dogma first popped up in 2012, t... | Games | Vice
34617 | Nintendo’s Win at E3 Shows That It's a Console... | E3 has come and gone for 2016, the LA expo of ... | Games | Vice
38608 | PC Gaming Is Still Way Too Hard | Here's Motherboard's super simple guide to bui... | Tech by VICE | Vice
41444 | Here’s Everything That Happened at the Xbox E3... | That's Xbox's Big Show for E3 2016 over and do... | Games | Vice
This table contains recommended articles for the user:
index | id | score | title | section | publication
0 | 34617 | 0.966390 | Nintendo’s Win at E3 Shows That It's a Console... | Games | Vice
1 | 63293 | 0.965053 | A Title Card vs Six Teraflops: How Metroid Sto... | Games | Vice
2 | 25785 | 0.964193 | Nintendo’s Win at E3 Shows That It's a Console... | Games | Vice
3 | 16771 | 0.963487 | The Lo-Fi Flaws That Define Our Favorite Old G... | Games | Vice
4 | 38608 | 0.960349 | PC Gaming Is Still Way Too Hard | Tech by VICE | Vice
5 | 121140 | 0.960174 | Microsoft’s New Direction All Started With the... | Tech by VICE | Vice
6 | 160409 | 0.959801 | Sometimes a David Bowie Song Gets Your Favorit... | Tech by VICE | Vice
7 | 29653 | 0.959628 | You Can Smell Like a Gamer With Lynx’s New Xbo... | Games | Vice
8 | 156585 | 0.959381 | Google Takes Aim at PlayStation, Xbox With Gam... | Games | Vice
9 | 185864 | 0.958857 | The Switch Succeeds on Nintendo's Historic "To... | Games | Vice
A word-cloud representing the results:

Wordcloud of recommended entertainment articles

Query Business User

# first create a user who likes to read about Wall Street business news
business_user = uploaded_data.loc[((uploaded_data['section'] == 'Business News')|
                                   (uploaded_data['section'] == 'business')) &
                                   (uploaded_data['article'].str.contains('Wall Street'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
display(business_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = business_user['article_vector']
business_user_vector = [*map(mean, zip(*a))]

# query Pinecone
query_results = conn.query(queries=[business_user_vector], top_k=10).collect()

# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
display(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:
index | title | article | section | publication
370 | Wall St. falls as investors eye a united hawki... | NEW YORK (Reuters) - Wall Street’s major index... | Business News | Reuters
809 | Oil surges on tanker attacks; stocks rise on F... | NEW YORK (Reuters) - Oil futures rose on Thurs... | Business News | Reuters
885 | A look at Tesla's nine-member board | (Reuters) - Tesla Inc’s board has named a spec... | Business News | Reuters
1049 | Home Depot posts rare sales miss as delayed sp... | (Reuters) - Home Depot Inc (HD.N) on Tuesday m... | Business News | Reuters
1555 | PepsiCo's mini-sized sodas boost quarterly res... | (Reuters) - PepsiCo Inc’s (PEP.O) quarterly re... | Business News | Reuters
1638 | Wall Street extends rally on U.S.-China trade ... | NEW YORK (Reuters) - U.S. stocks rallied on Fr... | Business News | Reuters
1900 | U.S. plans limits on Chinese investment in U.S... | WASHINGTON (Reuters) - The U.S. Treasury Depar... | Business News | Reuters
2109 | Exxon Mobil, Chevron dogged by refining, chemi... | HOUSTON (Reuters) - Exxon Mobil Corp and Chevr... | Business News | Reuters
2286 | Wall Street soars on U.S. rate cut hopes | NEW YORK (Reuters) - Wall Street’s three major... | Business News | Reuters
2563 | Apple shares drop on iPhone suppliers' warnings | (Reuters) - Apple Inc (AAPL.O) shares fell to ... | Business News | Reuters
This table contains recommended articles for the user:
index | id | score | title | section | publication
0 | 131603 | 0.970930 | US STOCKS-Wall Street muted as rate cut bets t... | Market News | Reuters
1 | 93287 | 0.970408 | MONEY MARKETS-U.S. rate-cut bets in June slip ... | Bonds News | Reuters
2 | 159587 | 0.970357 | Wall Street ekes out gain, Apple cuts revenue ... | Business News | Reuters
3 | 53602 | 0.969962 | US STOCKS-Wall St drops on trade worries, Fed ... | Market News | Reuters
4 | 45533 | 0.969199 | Wall Street wavers as tech gives ground and in... | Business News | Reuters
5 | 147320 | 0.968577 | Dented Fed rate cut hopes drag on stocks; doll... | Davos | Reuters
6 | 152313 | 0.968503 | MIDEAST - Factors to watch - July 9 | Earnings Season | Reuters
7 | 34583 | 0.968178 | Global stocks rally after speech by Fed's Powe... | Business News | Reuters
8 | 89976 | 0.968088 | Stocks, yields rise after deal announced to en... | Business News | Reuters
9 | 96107 | 0.968017 | Wall Street surges on higher oil after U.S. qu... | Business News | Reuters
A word-cloud representing the results:

Wordcloud of recommended business articles

Query Results

Each user’s recommendations closely match what that user reads. The user who reads tennis news gets mostly tennis articles, the user who reads about Xbox gets gaming and console articles, and the business user gets mostly Wall Street and market news.

From the word-cloud, you can see the most frequent words that appear in the recommended articles' titles.
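A word cloud is a visual form of a word-frequency count. As a rough text-only counterpart, the sketch below counts words across a few of the recommended sports titles; the helper `top_words` and its tiny stopword set are assumptions for illustration and do not reproduce the wordcloud library's own tokenization.

```python
from collections import Counter
import re

def top_words(titles, stopwords=frozenset({"the", "a", "to", "on", "of"}), n=2):
    """Count lowercase alphabetic tokens across titles, ignoring stopwords."""
    words = (w for t in titles for w in re.findall(r"[a-z]+", t.lower()))
    return Counter(w for w in words if w not in stopwords).most_common(n)

titles = [
    "Nadal to prepare for Wimbledon",
    "Cilic offers Nadal a Wimbledon reality check",
    "Serena headed to Wimbledon",
]
print(top_words(titles))  # [('wimbledon', 3), ('nadal', 2)]
```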

Since we used only the title and the content of each article to build the embeddings, and did not take publications and sections into account, a user may get recommendations from a publication or section that they do not regularly read. You can try adding this information when creating the embeddings and compare the query results.
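One simple way to fold that information in is to prepend the publication and section to the text before encoding it. The sketch below only builds the combined string; `build_text` is a hypothetical helper, and whether this actually improves the recommendations has not been tested here.

```python
def build_text(title, article, section=None, publication=None):
    """Join the available fields into one string to be embedded."""
    parts = [publication, section, title, article]
    # Missing values (None or NaN floats) and empty strings are skipped
    return ' '.join(p for p in parts if isinstance(p, str) and p)

text = build_text("Wall Street soars", "Stocks rallied.",
                  section="Business News", publication="Reuters")
print(text)  # Reuters Business News Wall Street soars Stocks rallied.
```

The resulting string could be passed to the model in place of the title and article text alone.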

Also, you may notice that some articles appear in the recommendations, although the user has already read them. These articles could be removed as part of postprocessing the query results, in case you prefer not to see them in the recommendations.
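Such a postprocessing step could be sketched as follows. The helper `drop_already_read` is hypothetical, and the ids come from the sports example above. The sketch assumes both lists hold ids of the same type, so ids returned as strings would need converting first, as the `int(_id)` calls in the query code suggest.

```python
def drop_already_read(result_ids, read_ids, top_k=10):
    """Return up to top_k result ids, in ranked order, that the user has not read."""
    read = set(read_ids)
    return [i for i in result_ids if i not in read][:top_k]

ranked = [138865, 26574, 12373, 155913, 60613]  # ranked ids from the sports query
read = [2261, 12373, 26574]                     # ids the user already read
print(drop_already_read(ranked, read, top_k=3))  # [138865, 155913, 60613]
```

Since filtering shrinks the list, it helps to query for more results than you intend to show.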

Turn Off the Service

Turn off the service once you are sure that you do not want to use it anymore. Once the service is stopped, you cannot use it again.

pinecone.service.stop(service_name)
{'success': True}