Semantic textual search with vector embeddings

This notebook demonstrates how to create a simple semantic text search using Pinecone’s similarity search service.

The goal is to create a search application that retrieves news articles based on short description queries (e.g., article titles). To achieve that, we will store vector representations of the articles in a Pinecone index. These vectors and their proximity capture semantic relations: vectors that are close together represent similar content, while distant vectors represent dissimilar content.
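
To make "nearby" concrete: we measure the similarity of two embeddings with cosine similarity, the same metric we configure for the index below. Here is a minimal sketch with toy vectors (illustration only, not the actual article embeddings):

import numpy as np

def cosine_similarity(a, b):
    'Cosine similarity between two vectors: 1.0 means identical direction.'
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: the first two point in similar directions, the third does not
article_a = np.array([0.9, 0.1, 0.0])
article_b = np.array([0.8, 0.2, 0.1])
article_c = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(article_a, article_b))  # high score -> similar content
print(cosine_similarity(article_a, article_c))  # low score  -> dissimilar content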

Semantic textual search is also a building block for many other text-based applications. For example, our deduplication, question-answering, and personalized article recommendation demos all use semantic textual search.


Pinecone Setup

!pip install -qU pinecone-client ipywidgets
import pinecone

# Load Pinecone API key
import os
api_key = os.getenv("PINECONE_API_KEY") or "YOUR-API-KEY"
pinecone.init(api_key=api_key)
# List all indexes currently present for your key
pinecone.list_indexes()

Get a Pinecone API key if you don’t have one already.

Install and Import Python Packages

!pip install -qU wordcloud pandas-profiling
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np
import time
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import sqlite3

pd.set_option('display.max_colwidth', 200)

Create a New Service

# Pick a name for the new index
index_name = 'semantic-text-search'

# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, metric='cosine', shards=1)

index = pinecone.Index(name=index_name, response_timeout=300)
index.info()
InfoResult(index_size=0)

Upload

We will define two separate sub-indexes using Pinecone's namespace feature: one that indexes articles by content, and another that indexes them by title. At query time, we will return an aggregation of the results from the content and title namespaces.

First, we will load data and the model, and then create embeddings and upsert them into the namespaces.

Load data

The dataset used throughout this example contains 204,135 articles from 18 American publications.

Let's download the dataset and load data.

import requests, os
DATA_DIR = 'tmp'
URL = "https://www.dropbox.com/s/b2cyb85ib17s7zo/all-the-news.db?dl=1"
FILE = f"{DATA_DIR}/all-the-news.db"

def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(FILE):
        r = requests.get(URL)  # download the dataset
        with open(FILE, "wb") as f:
            f.write(r.content)

download_data()
cnx = sqlite3.connect(FILE)
data = pd.read_sql_query("SELECT * FROM longform", cnx)
data.set_index('id', inplace=True)
data.head()
id | title | author | date | content | year | month | publication | category | digital | section | url
1 | Agent Cooper in Twin Peaks is the audience: once delighted, now disintegrating | Tasha Robinson | 2017-05-31 | And never more so than in Showtime’s new series revival Some spoilers ahead through episode 4 of season 3 of Twin Peaks. On May 21st, Showtime brought back David Lynch’s groundbreaking TV se... | 2017 | 5 | Verge | Longform | 1.0 | None | None
2 | AI, the humanity! | Sam Byford | 2017-05-30 | AlphaGo’s victory isn’t a defeat for humans — it’s an opportunity A loss for humanity! Man succumbs to machine! If you heard about AlphaGo’s latest exploits last week — crushing the world’s ... | 2017 | 5 | Verge | Longform | 1.0 | None | None
3 | The Viral Machine | Kaitlyn Tiffany | 2017-05-25 | Super Deluxe built a weird internet empire. Can it succeed on TV? When Wolfgang Hammer talks about the future of entertainment, people listen. Hammer is the mastermind behind the American re... | 2017 | 5 | Verge | Longform | 1.0 | None | None
4 | How Anker is beating Apple and Samsung at their own accessory game | Nick Statt | 2017-05-22 | Steven Yang quit his job at Google in the summer of 2011 to build the products he felt the world needed: a line of reasonably priced accessories that would be better than the ones you could ... | 2017 | 5 | Verge | Longform | 1.0 | None | None
5 | Tour Black Panther’s reimagined homeland with Ta-Nehisi Coates | Kwame Opam | 2017-05-15 | Ahead of Black Panther’s 2018 theatrical release, Marvel turned to Ta-Nehisi Coates to breathe new life into the nation of Wakanda. “I made most of my career analyzing the forces of racism a... | 2017 | 5 | Verge | Longform | 1.0 | None | None
# Define number of test articles
NUM_OF_TEST_ARTICLES = 2

# Hold out test articles in a separate dataframe and remove them from the data
# (take every 81st article, starting from position 97)
test_articles = data[['title','content']][97::81][:NUM_OF_TEST_ARTICLES]
data.drop(list(test_articles.index), inplace=True)

Use Ready-Made Vector Embedding Model

We will use an Average Word Embeddings Model to create both title and content embeddings. Pinecone allows you to create partitions within an index, called namespaces. This lets us maintain separate embeddings of the same data for different tasks.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('average_word_embeddings_komninos')
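
As a quick sanity check, you can encode a sample sentence and inspect the resulting vector (a minimal sketch; the exact dimensionality depends on the checkpoint, e.g. 300 for this average-word-embeddings model):

# Encode a sample sentence and inspect the embedding
sample_vector = model.encode('Virtual reality is making a comeback.')
print(sample_vector.shape)   # dimensionality of the embedding, e.g. (300,)
print(sample_vector[:5])     # first few components of the vector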

Upload Vectors of Titles

Here we index articles by title only. Note that we create a 'title' namespace for this purpose.

# Fill in missing titles
data['title'] = data['title'].fillna('')

# Create vector embeddings based on the title column
print('Encoding titles...')
encoded_titles = model.encode(data['title'].tolist(), show_progress_bar=True)
data['title_vector'] = encoded_titles.tolist()

# Upsert title vectors in title namespace
print("Uploading vectors to title namespace..")
acks_titles = index.upsert(items=zip(data.index, data.title_vector), namespace='title', batch_size=1000)

Upload Vectors of Content

Now we index articles by their content. Because we want to maintain separate embeddings for title and content, we use a different namespace in the same index.

# Fill missing data
data['content'] = data['content'].fillna('')

# Extract only first few sentences of each article for quicker vector calculations
data['content'] = data.content.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:10]))
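# For example (illustration only), the split keeps just the leading sentences:
#   ' '.join(re.split(r'(?<=[.:;])\s', 'One. Two: three; four. Five.')[:3])  ->  'One. Two: three;'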

# Create vector embeddings based on the content column
print('Encoding content...')
encoded_content = model.encode(data['content'].tolist(), show_progress_bar=True)
data['content_vector'] = encoded_content.tolist()

# Upsert content vectors in content namespace
acks_content = index.upsert(items=zip(data.index, data.content_vector), namespace='content', batch_size=1000)

Now that we have upserted data, we can check the size of each namespace.

# Check index size for each namespace
print(index.info(namespace='title'))
print(index.info(namespace='content'))
InfoResult(index_size=204133)
InfoResult(index_size=204133)

Query

Let's see what our test articles look like first.

# Print test articles
display(test_articles)
id | title | content
111 | The Rise and Fall and Rise of Virtual Reality | In the wake of Facebook's purchase of Oculus VR, can this revolutionary technology triumph anew?
6467 | Who should go to Mars? | Elon Musk laid out his plan to colonize Mars at a conference on Tuesday, but it was during the Q&A session that a woman asked one of the key questions: who will be chosen to embark on a ...

The following utility functions help us process and present the results.

titles_mapped = dict(zip(data.index, data.title))
content_mapped = dict(zip(data.index, data.content))
def get_wordcloud_for_article(recommendations, namespace):
    'Generates word cloud for the recommendations (titles or content).'

    stopwords = set(STOPWORDS).union([np.nan, 'NaN', 'S'])
    wordcloud = WordCloud(
        max_words=50000,
        min_font_size=12,
        max_font_size=50,
        relative_scaling=0.9,
        stopwords=set(STOPWORDS),
        normalize_plurals=True
    )

    if namespace == 'title':
        clean_titles = [title for title in recommendations.title.values if title not in stopwords]
        wordcloud = wordcloud.generate(' '.join(clean_titles))
    else:
        clean_content = [content for content in recommendations.content.values if content not in stopwords]
        wordcloud = wordcloud.generate(' '.join(clean_content))

    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()


def print_query_results(query_result, query, namespace, show_options={'wordcloud':True, 'tabular':True}):
    'Prints query result with wordcloud.'

    print(f'\nMost similar results querying {query} in "{namespace}" namespace:\n')

    res = query_result[0]
    df = pd.DataFrame({'id':res.ids,
                       'score':res.scores,
                       'title': [titles_mapped[int(_id)] if int(_id) in titles_mapped else ' '  for _id in res.ids],
                       'content': [content_mapped[int(_id)] if int(_id) in content_mapped else ' '  for _id in res.ids],
                       })
    if show_options['tabular']:
        display(df.head(5))
    if show_options['wordcloud']:
        get_wordcloud_for_article(df, namespace)
    print('\n')

We use the following two functions to query a test article's title or content against either of the namespaces we created. This means we can query a title against the "title" namespace or the "content" namespace, and likewise for the article's content.

def query_article_title(test_article, namespace, top_k=5, show_options={'wordcloud':True, 'tabular':True}):
    '''Queries an article using its title in the specified
     namespace and prints results.'''

    # Create vector embeddings based on the title column
    encoded_titles = model.encode(test_article['title'],
                                  show_progress_bar=False)
    test_article['title_vector'] = pd.Series(encoded_titles.tolist())

    # Query namespace passed as parameter using title vector
    query_result_titles = index.query(queries=[test_article.title_vector],
                                      namespace=namespace,
                                      top_k=top_k,
                                      disable_progress_bar=True)

    # Print query results
    if show_options['wordcloud'] or show_options['tabular']:
        print_query_results(query_result_titles, query='title',
                            namespace=namespace,
                            show_options=show_options)

    return query_result_titles

When querying content, we will first create the article content vector and search for the most similar vectors in the "title" or the "content" namespace.

def query_article_content(test_article, namespace, top_k=5, show_options={'wordcloud':True, 'tabular':True}):
    '''Queries an article using its content in the specified
    namespace and prints results.'''

    # Create vector embeddings based on the content column
    encoded_content = model.encode(test_article['content'],
                                   show_progress_bar=False)
    test_article['content_vector'] = pd.Series(encoded_content.tolist())

    # Query content namespace using content vector
    query_result_content = index.query(queries=[test_article.content_vector],
                                       namespace=namespace,
                                       top_k=top_k,
                                       disable_progress_bar=True)

    # Print query results
    if show_options['wordcloud'] or show_options['tabular']:
        print_query_results(query_result_content,
                            query='content',
                            namespace=namespace,
                            show_options=show_options)

    return query_result_content

Now it's time to do the cross-namespace querying and aggregate the results. The following functions run four queries (title against the title namespace, title against the content namespace, content against the title namespace, and content against the content namespace), then aggregate the results to count how many of the queries each article appears in and to compute its average score. Articles are ranked accordingly, and the most similar ones are returned.

def aggregate_results(article):
    '''Aggregates results after querying both namespaces
       for both the article's title and content.'''

    results = []

    results.append(query_article_title(article, namespace='title', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_title(article, namespace='content', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_content(article, namespace='title', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_content(article, namespace='content', top_k=30, show_options={'wordcloud':False, 'tabular':False}))

    articles_scores = {}
    articles_count = {}

    for res in results:
        for id, score in zip(res[0].ids, res[0].scores):
            if id not in articles_scores:
                articles_scores[id] = score
                articles_count[id] = 1
            else:
                articles_scores[id] += score
                articles_count[id] += 1

    return articles_scores , articles_count

def show_aggregated_results(results_dict, counts, show_options={'wordcloud':True, 'tabular':True}):
    '''Shows results after aggregation. Values are sorted based
    on the number of queries they appear (1-4) and based on their
    average score.'''

    df = pd.DataFrame({'id':results_dict.keys(),
                       'count': counts.values(),
                       'average_score':[round(r/c, 3) for r, c in zip(results_dict.values(),counts.values())],
                       'title': [titles_mapped[int(_id)] if int(_id) in titles_mapped else ' '  for _id in results_dict.keys()],
                       'content': [content_mapped[int(_id)] if int(_id) in content_mapped else ' '  for _id in results_dict.keys()],
                       })
    df.sort_values(by=['count', 'average_score'], ascending=False, inplace=True)

    if show_options['tabular']:
        print('\nMost similar results after aggregation:\n')
        display(df.head(5))
    if show_options['wordcloud']:
        print('\nWordcloud for titles and content after aggregation:')
        print('-Titles:')
        get_wordcloud_for_article(df[:10], 'title')
        print('-Content:')
        get_wordcloud_for_article(df[:10], 'content')
    print('\n')

Query by Aggregation

We are ready to query our service! We will use the auxiliary functions above to query the test articles, applying our cross-namespace approach that combines four query results into one.

Note that you can add the tabular data results for each query by changing the show_options flags below.

# Query index using simple and cross namespace approach
for e, (_, test_article) in enumerate(test_articles.iterrows()):
    print(f'\nArticle {e+1}')
    print(f'\n Title: {test_article.title}')
    print(f' Content: {test_article.content[:200].strip()}' + ('...' if len(test_article.content) > 200 else ''))

    # Uncomment to query the titles in title namespace
    #query_article_title(test_article, 'title',  show_options={'wordcloud':True, 'tabular':False})

    # Uncomment to query the content in content namespace
    #query_article_content(test_article, namespace='content', show_options={'wordcloud':True, 'tabular':False})

    # Cross namespace query
    aggregated_results, counts = aggregate_results(test_article)
    show_aggregated_results(aggregated_results, counts, show_options={'wordcloud':False, 'tabular':True})
Article 1

 Title: The Rise and Fall and Rise of Virtual Reality
 Content: In the wake of Facebook's purchase of Oculus VR, can this revolutionary technology triumph anew?

Most similar results after aggregation:
id | average_score | title | content
19720 | 0.809 | Oculus Founder, at Center of Legal Battle Over VR, Departs Facebook - The New York Times | SAN FRANCISCO — Palmer Luckey, a founder of the virtual-reality technology company Oculus, has left Facebook three years after the social network acquired his company for close to $3 billion. Mr. ...
7201 | 0.808 | Flush with cash, Oculus plans ambitious new VR headset | According to Oculus Rift inventor Palmer Luckey, virtual reality is near and dear to Marc Andreessen’s heart. Twenty years ago — before he created the Mosaic we...
34611 | 0.806 | Microsoft Introducing VR Headsets at Half the Price of Oculus Rift - Breitbart | On October 26, Microsoft doubled down on virtual reality by announcing their own VR headsets at the Windows 10 event.[Unless you’ve got $599 for the Oculus Rift, or $799 for Valve’s HTC Vive, your...
13199 | 0.800 | Oculus VR founder Palmer Luckey talks GoPro, 'Minecraft' and eSports - LA Times | Oculus VR founder Palmer Luckey answers questions at the Loews Hollywood Hotel on Sept. 24. ', 'A few years ago, journalism major Palmer Luckey dropped out of Cal State Long Beach to work on a dev...
9533 | 0.800 | Virtual reality visionary Palmer Luckey leaves Facebook 3 years after $2-billion Oculus deal - LA Times | Palmer Luckey, the Long Beach entrepreneur whose zeal for virtual reality kickstarted mass investment in the technology, has left Facebook three years after selling his start-up Oculus VR to the s...

Article 2

 Title: Who should go to Mars?
 Content: Elon Musk laid out his plan to colonize Mars at a conference on Tuesday, but it was during the Q&A session that a woman asked one of the key questions: who will be chosen to embark on a ri...

Most similar results after aggregation:
id | average_score | title | content
206370 | 0.703 | How Mars lost its atmosphere, and why Earth didn’t | Mars was once wetter and warmer, and very possibly a congenial environment for life as we know it. Today it looks mighty dead, with all due respect. If there's life, it's cryptic. Mars ju...
83533 | 0.679 | Mars Reconnaissance Orbiter celebrates 10 years at red planet | [Sign in to comment!, NASA’s Mars Reconnaissance Orbiter (MRO) arrived at the red planet 10 years ago today and has since completed 45,000 orbits and generated a vast amount of scientific data., O...
83531 | 0.664 | Buzz Aldrin eyes 2040 for manned Mars mission | [Sign in to comment!, Former astronaut Buzz Aldrin is eyeing 2040 for the first manned mission to Mars, noting that the red planet’s moon Phobos could play a vital role for astronauts., “I think t...
16615 | 0.641 | NASA orbiters watch as comet flies safely past Mars - LA Times | Comet Siding Spring sailed past Mars on Sunday, coming 10 times closer to the Red Planet than any comet on record has come to Earth.', "At the time of the comet's closest approach at 11:27 a.m., i...
158293 | 0.640 | Mars makes closest approach to Earth for 11 years | Mars reaches its closest approach to Earth for 11 years this evening at 21:35 GMT. The red planet will be just 75 million kilometres away., Mars has been steadily approaching, tripling its apparen...

Summary

We demonstrated a simple semantic textual search approach that aggregates results from two different news article representations: titles only and content only. We did this by using Pinecone's namespace feature to create two namespaces within a single index.

The aggregation mechanism is simple: we query both namespaces with both the article's title and content representations, then rank the combined results by how many of the four queries each article appears in and by its average similarity score. Our examples show the effectiveness of this approach.
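
As a toy illustration of the ranking (hypothetical article ids and scores, not taken from the results above), articles are sorted first by how many of the four queries they appear in and then by their average score:

# Hypothetical per-article totals accumulated over the four queries
scores = {'a': 2.4, 'b': 1.5, 'c': 0.9}   # summed similarity scores
counts = {'a': 3, 'b': 2, 'c': 1}         # number of queries the article appeared in

# Rank by (count, average score), most similar first
ranked = sorted(scores, key=lambda i: (counts[i], scores[i] / counts[i]), reverse=True)
print(ranked)  # ['a', 'b', 'c']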

We encourage you to try the code with your own data. You might also want to try other embedding models or aggregation mechanisms; a minimal example of swapping the embedding model is sketched below. Working with a similarity search service makes such experimentation easy. Have fun, and let us know if you have any questions or interesting findings.
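
For example, here is a minimal sketch of swapping in a different embedding model (assuming the 'all-MiniLM-L6-v2' checkpoint is available in your sentence-transformers installation; everything else stays the same as long as queries and articles are encoded with the same model, though you would need to re-create the index and re-upsert the new embeddings):

from sentence_transformers import SentenceTransformer

# Any other sentence-transformers checkpoint can replace the average-word-embeddings model
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.encode('test sentence').shape)  # check the new embedding dimensionality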

Delete the index

Delete the index once you are sure you no longer need it. Once an index is deleted, it cannot be recovered. Use this as a cleanup step when you are done working with a specific index.

pinecone.delete_index(index_name)