Personalized Article Recommender

This notebook demonstrates how to use Pinecone's similarity search to create a simple personalized article or content recommender.

The goal is to create a recommendation engine that retrieves the best article recommendations for each user. When making recommendations with content-based filtering, we evaluate the user’s past behavior and the content items themselves. So in this example, users will be recommended articles that are similar to those they've already read.
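The core idea can be sketched in a few lines of plain Python: represent the user and each candidate article as vectors, then rank candidates by cosine similarity to the user. The vectors below are toy values for illustration only; the notebook computes real embeddings later.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: a user's history vector and three candidate articles.
user = [0.9, 0.1, 0.0]
articles = {
    "tennis_recap": [0.8, 0.2, 0.1],
    "xbox_review":  [0.1, 0.9, 0.2],
    "market_wrap":  [0.0, 0.2, 0.9],
}

# Rank candidates by similarity to the user vector, highest first.
ranked = sorted(articles, key=lambda k: cosine(user, articles[k]), reverse=True)
print(ranked[0])  # → tennis_recap
```

This is exactly what the Pinecone index does for us at scale: it stores the article vectors and returns the nearest ones to a query vector.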

Install and Import Python Packages

!pip install -qU wordcloud swifter pinecone-client
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np
import time
import swifter
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

In the following sections, we will use Pinecone to easily build and deploy an article recommendation engine. Pinecone will be responsible for storing embeddings for articles, maintaining a live index of those vectors, and returning recommended articles on-demand.

Pinecone Setup

!pip install --quiet -U pinecone-client
import pinecone

# load Pinecone API key
api_key = 'YOUR-API-KEY'
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.

index_name = 'articles-recommendation'
# If index of the same name exists, then delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create the index.

pinecone.create_index(name=index_name, metric='cosine')
{'msg': '', 'success': True}

Connect to the new index.

index = pinecone.Index(name=index_name)
print(pinecone.describe_index(index_name))
ResourceDescription(name='articles-recommendation', kind='index', status={'ready': True, 'host': 'articles-recommendation-d358397.svc.beta.pinecone.io', 'port': 443, 'waiting': [], 'crashed': []}, config=OrderedDict([('index_type', 'approximated'), ('metric', 'cosine'), ('shards', 1), ('replicas', 1), ('gateway_replicas', 1), ('node_type', 'STANDARD'), ('engine_args', {'engine_cpus': 1})]))

Upload Articles

Next, we will prepare data for the Pinecone vector index, and insert it in batches.

The dataset used throughout this example contains 2.7 million news articles and essays from 27 American publications.

Let's download the dataset:

!rm all-the-news-2-1.zip
!rm all-the-news-2-1.csv
!wget https://www.dropbox.com/s/cn2utnr5ipathhh/all-the-news-2-1.zip -q --show-progress
!unzip -q all-the-news-2-1.zip
rm: cannot remove 'all-the-news-2-1.zip': No such file or directory
rm: cannot remove 'all-the-news-2-1.csv': No such file or directory
all-the-news-2-1.zi 100%[===================>]   3.13G  29.6MB/s    in 53s

Create Vector Embeddings

The model used in this example is `average_word_embeddings_komninos`, one of the Average Word Embeddings models from the Sentence Transformers library. It lets us create a vector embedding for each article from its title and content.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('average_word_embeddings_komninos')

Generating vector embeddings for the complete dataset would take a long time, so we will use only a sample. If you want to try uploading the whole dataset, set the NROWS flag to None.

NROWS = 200000  # number of rows to load from the csv; set to None to load all rows, or reduce it for lower RAM usage or faster execution

Let's prepare data for upload.

Uploading the data may take a while, and depends on the network you use.
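As a quick aside, this is what the sentence-truncation step inside prepare_data does: it splits on whitespace that follows '.', ':' or ';' (the lookbehind keeps the punctuation attached to its sentence) and keeps only the first four pieces.

```python
import re

# Same split used in prepare_data: break after '.', ':' or ';' and keep
# only the first 4 pieces, so the model encodes less text per article.
text = ("First sentence. Second sentence. Third sentence. "
        "Fourth sentence. Fifth sentence.")
truncated = ' '.join(re.split(r'(?<=[.:;])\s', text)[:4])
print(truncated)  # → First sentence. Second sentence. Third sentence. Fourth sentence.
```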


def prepare_data(data) -> pd.DataFrame:
    'Preprocesses data and prepares it for upsert.'

    # rename id column and remove unnecessary columns
    print("Preparing data...")
    data.rename(columns={"Unnamed: 0": "id"}, inplace = True)
    data.drop(columns=['Unnamed: 0.1', 'date'], inplace = True)

    # extract only first few sentences of each article for quicker vector calculations
    data['article'] = data['article'].fillna('')
    data['article'] = data.article.swifter.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4]))
    data['title_article'] = data['title'] + data['article']

    # create a vector embedding based on title and article columns
    print('Encoding articles...')
    encoded_articles = model.encode(data['title_article'], show_progress_bar=True)
    data['article_vector'] = pd.Series(encoded_articles.tolist())

    return data


def upload_items(data):
    'Uploads data in batches.'
    print("Uploading items")

    # create a list of items for upload
    items_to_upload = [(row.id, row.article_vector) for _, row in data.iterrows()]

    # upsert
    index.upsert(items=items_to_upload)


def process_file(filename: str) -> pd.DataFrame:
    'Reads csv files in chunks, prepares and uploads data.'

    data = pd.read_csv(filename, nrows=NROWS)
    data = prepare_data(data)
    upload_items(data)
    return data

uploaded_data = process_file(filename='all-the-news-2-1.csv')
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2822: DtypeWarning: Columns (6,10) have mixed types.Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):


Preparing data...

Encoding articles...

Uploading items

Check the index size to confirm the upsert succeeded:

index.info()
InfoResult(index_size=200000)

Query the Pinecone Index

We will query the index for specific users. Each user is defined by the set of articles they have previously read. More specifically, we will select 10 articles for each user and, based on those article embeddings, compute a single embedding that represents the user.
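The user embedding is the element-wise mean of the article embeddings. The query cells below use a compact idiom for this; here it is on toy 3-dimensional vectors:

```python
from statistics import mean

# Toy "embeddings" of three articles the user has read.
read_article_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]

# zip(*vectors) groups values by dimension; mean collapses each group.
user_vector = [*map(mean, zip(*read_article_vectors))]
print(user_vector)  # → [2.0, 2.0, 2.0]
```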

We will create three users and query Pinecone for each of them:

  • User who likes to read Sports News
  • User who likes to read Entertainment News
  • User who likes to read Business News

Let's define mappings for titles, sections, and publications for each article.

titles_mapped = dict(zip(uploaded_data.id, uploaded_data.title))
sections_mapped = dict(zip(uploaded_data.id, uploaded_data.section))
publications_mapped = dict(zip(uploaded_data.id, uploaded_data.publication))

Also, we will define a function that uses wordcloud to visualize results.

def get_wordcloud_for_user(recommendations):

    stopwords = set(STOPWORDS).union(['NaN', 'S'])

    wordcloud = WordCloud(
                   max_words=50000,
                   min_font_size=12,
                   max_font_size=50,
                   relative_scaling=0.9,
                   stopwords=stopwords,
                   normalize_plurals=True
    )

    # drop missing titles before building the word cloud
    clean_titles = [title for title in recommendations.title.values if isinstance(title, str)]
    title_wordcloud = wordcloud.generate(' '.join(clean_titles))

    plt.imshow(title_wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Let's query the Pinecone service using three users.

Query Sports User

from statistics import mean


# first create a user who likes to read sport news about tennis
sport_user = uploaded_data.loc[((uploaded_data['section'] == 'Sports News' ) |
                                (uploaded_data['section'] == 'Sports')) &
                                (uploaded_data['article'].str.contains('Tennis'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
print(sport_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = sport_user['article_vector']
sport_user_vector = [*map(mean, zip(*a))]

# query the pinecone index
query_results = index.query(queries=[sport_user_vector], top_k=10)
# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
print(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:

                                                   title  ... publication
2261   Son of Borg makes quiet debut on London grassc...  ...     Reuters
12373       Cilic offers Nadal a Wimbledon reality check  ...     Reuters
17124          Perth confirmed as host for Fed Cup final  ...     Reuters
18411  Fed Cup gets revamp with 12-nation Finals in B...  ...     Reuters
26574  Nadal to prepare for Wimbledon at Hurlingham e...  ...     Reuters
34957  Tennis Legend Margaret Court Went Off the Rail...  ...        Vice
35508  Puck City: The Enduring Success of Ice Hockey ...  ...        Vice
38393  As if by royal command, seven Britons make it ...  ...     Reuters
62445  Williams fined $17,000 for U.S. Open code viol...  ...     Reuters
84122  Kyrgios still wrestling with his tennis soul a...  ...     Reuters

[10 rows x 4 columns]

This table contains recommended articles for the user:

       id     score  ...      section publication
0  138865  0.966407  ...  Sports News     Reuters
1   26574  0.965867  ...  Sports News     Reuters
2   12373  0.965307  ...  Sports News     Reuters
3  155913  0.963684  ...  Sports News     Reuters
4   60613  0.962414  ...  Sports News     Reuters
5   22764  0.962373  ...  Sports News     Reuters
6   71768  0.962168  ...       Sports        Vice
7    2261  0.961590  ...  Sports News     Reuters
8   45469  0.961451  ...  Sports News     Reuters
9   55061  0.960677  ...  Sports News     Reuters

[10 rows x 5 columns]

A word-cloud representing the results:

Wordcloud of recommended sports articles

Query Entertainment User

# first create a user who likes to read news about Xbox
entertainment_user = uploaded_data.loc[((uploaded_data['section'] == 'Entertainment') |
                                        (uploaded_data['section'] == 'Games') |
                                        (uploaded_data['section'] == 'Tech by VICE')) &
                                        (uploaded_data['article'].str.contains('Xbox'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
print(entertainment_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = entertainment_user['article_vector']
entertainment_user_vector = [*map(mean, zip(*a))]

# query the pinecone index
query_results = index.query(queries=[entertainment_user_vector], top_k=10)

# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
print(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:

                                                   title  ... publication
4977   A Canadian Man Is Pissed That His Son Ran Up a...  ...        Vice
12016  'I Expect You to Die' is One of Virtual Realit...  ...        Vice
16078            Windows 10's Killer App? Xbox One Games  ...        Vice
20318  Black Friday Not Your Thing? Play These Free G...  ...        Vice
25785  Nintendo’s Win at E3 Shows That It's a Console...  ...        Vice
29653  You Can Smell Like a Gamer With Lynx’s New Xbo...  ...        Vice
33234  It’s Old and It’s Clunky, But You Really Must ...  ...        Vice
34617  Nintendo’s Win at E3 Shows That It's a Console...  ...        Vice
38608                    PC Gaming Is Still Way Too Hard  ...        Vice
41444  Here’s Everything That Happened at the Xbox E3...  ...        Vice

[10 rows x 4 columns]


This table contains recommended articles for the user:

       id     score  ...       section publication
0   34617  0.966390  ...         Games        Vice
1   63293  0.965053  ...         Games        Vice
2   25785  0.964193  ...         Games        Vice
3   16771  0.963487  ...         Games        Vice
4   38608  0.960349  ...  Tech by VICE        Vice
5  121140  0.960174  ...  Tech by VICE        Vice
6  160409  0.959802  ...  Tech by VICE        Vice
7   29653  0.959628  ...         Games        Vice
8  156585  0.959381  ...         Games        Vice
9  185864  0.958857  ...         Games        Vice

[10 rows x 5 columns]

A word-cloud representing the results:

Wordcloud of recommended entertainment articles

Query Business User

# first create a user who likes to read about Wall Street business news
business_user = uploaded_data.loc[((uploaded_data['section'] == 'Business News')|
                                   (uploaded_data['section'] == 'business')) &
                                   (uploaded_data['article'].str.contains('Wall Street'))][:10]

print('\nHere is the example of previously read articles by this user:\n')
print(business_user[['title', 'article', 'section', 'publication']])

# then create a vector for this user
a = business_user['article_vector']
business_user_vector = [*map(mean, zip(*a))]

# query the pinecone index
query_results = index.query(queries=[business_user_vector], top_k=10)

# print results
res = query_results[0]
df = pd.DataFrame({'id':res.ids,
                   'score':res.scores,
                   'title': [titles_mapped[int(_id)] for _id in res.ids],
                   'section': [sections_mapped[int(_id)] for _id in res.ids],
                   'publication': [publications_mapped[int(_id)] for _id in res.ids]
                    })

print("\nThis table contains recommended articles for the user:\n")
print(df)
print("\nA word-cloud representing the results:\n")
get_wordcloud_for_user(df)
Here is the example of previously read articles by this user:

                                                  title  ... publication
370   Wall St. falls as investors eye a united hawki...  ...     Reuters
809   Oil surges on tanker attacks; stocks rise on F...  ...     Reuters
885                 A look at Tesla's nine-member board  ...     Reuters
1049  Home Depot posts rare sales miss as delayed sp...  ...     Reuters
1555  PepsiCo's mini-sized sodas boost quarterly res...  ...     Reuters
1638  Wall Street extends rally on U.S.-China trade ...  ...     Reuters
1900  U.S. plans limits on Chinese investment in U.S...  ...     Reuters
2109  Exxon Mobil, Chevron dogged by refining, chemi...  ...     Reuters
2286           Wall Street soars on U.S. rate cut hopes  ...     Reuters
2563    Apple shares drop on iPhone suppliers' warnings  ...     Reuters

[10 rows x 4 columns]

This table contains recommended articles for the user:

       id     score  ...          section publication
0  131603  0.970930  ...      Market News     Reuters
1   93287  0.970408  ...       Bonds News     Reuters
2  159587  0.970357  ...    Business News     Reuters
3   53602  0.969963  ...      Market News     Reuters
4   45533  0.969199  ...    Business News     Reuters
5  147320  0.968577  ...            Davos     Reuters
6  152313  0.968503  ...  Earnings Season     Reuters
7   34583  0.968178  ...    Business News     Reuters
8   89976  0.968087  ...    Business News     Reuters
9   96107  0.968018  ...    Business News     Reuters

[10 rows x 5 columns]

A word-cloud representing the results:

Wordcloud of recommended business articles

Results

We can see that each user's recommendations closely match what the user actually reads. The user who likes tennis news gets plenty of tennis recommendations, the user who likes to read about Xbox gets gaming news, and the business user gets plenty of Wall Street news.

From the word-cloud, you can see the most frequent words that appear in the recommended articles' titles.

Since we used only the title and the content of each article to define the embeddings, and did not take publications and sections into account, a user may get recommendations from a publication or section that they do not regularly read. You can try adding this information when creating the embeddings and compare the query results.
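One way to do that, sketched as a small helper (hypothetical, not part of the notebook above), is to fold the section and publication names into the text that gets embedded:

```python
def build_embed_text(title, article, section='', publication=''):
    """Compose the text to embed. Including section and publication nudges
    articles from the same source and section closer in vector space."""
    parts = [title, section, publication, article]
    return ' '.join(p for p in parts if p)

text = build_embed_text('Cilic offers Nadal a Wimbledon reality check',
                        'Marin Cilic beat Rafael Nadal in straight sets.',
                        section='Sports News', publication='Reuters')
print(text)
```

In prepare_data, the equivalent change would be concatenating the section and publication columns into title_article before encoding.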

Also, you may notice that some recommended articles are ones the user has already read. If you prefer not to show these, they can be removed by postprocessing the query results.
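A minimal postprocessing sketch, assuming the query results are available as (id, score) pairs and the user's history as a set of ids (the ids below are taken from the sports example above):

```python
# ids the user has already read (from their history)
read_ids = {26574, 12373, 2261}

# (id, score) pairs as returned for one query, highest score first
results = [(138865, 0.966), (26574, 0.966), (12373, 0.965), (155913, 0.964)]

# keep only articles the user has not read yet
fresh = [(i, s) for i, s in results if i not in read_ids]
print(fresh)  # → [(138865, 0.966), (155913, 0.964)]
```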

Turn Off the Service

Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.

pinecone.delete_index(index_name)
{'success': True}