Semantic Textual Search with Vector Embeddings

This notebook demonstrates how to create a simple semantic text search using Pinecone’s similarity search service.

The goal is to create a search application that retrieves news articles based on short description queries (e.g., article titles). To achieve that, we will store vector representations of the articles in a Pinecone index. Proximity between these vectors captures semantic relatedness: nearby vectors indicate similar content, and distant vectors indicate dissimilar content.
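
To make "nearby" concrete, here is a minimal sketch of cosine similarity, the metric this notebook selects when creating the index. The three-dimensional vectors and their names are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
vr_headset = [0.9, 0.1, 0.0]
oculus_news = [0.8, 0.2, 0.1]
mars_mission = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(vr_headset, oculus_news)
sim_unrelated = cosine_similarity(vr_headset, mars_mission)
print(sim_related)    # close to 1: vectors point the same way
print(sim_unrelated)  # close to 0: vectors are nearly orthogonal
```

A score near 1 means the two vectors point in almost the same direction, which is what we interpret as similar content.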

Semantic textual search also serves as a building block for other text-based applications. For example, our deduplication, question-answering, and personalized article recommendation demos all use it.

Pinecone Setup

!pip install -qU pinecone-client ipywidgets
import pinecone

# Load Pinecone API key
import os
api_key = os.getenv("PINECONE_API_KEY") or "YOUR-API-KEY"
pinecone.init(api_key=api_key)
# List all indexes currently present for your key
pinecone.list_indexes()

Get a Pinecone API key if you don’t have one already.

Install and Import Python Packages

!pip install -qU wordcloud pandas-profiling
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np
import time
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import sqlite3

pd.set_option('display.max_colwidth', 200)

Create a New Service

# Pick a name for the new index
index_name = 'semantic-text-search'

# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, metric='cosine', shards=1)

index = pinecone.Index(name = index_name, response_timeout=300)
index.info()
InfoResult(index_size=0)

Upload

We will define two separate sub-indexes using Pinecone’s namespace feature: one indexes articles by content, and the other by title. At query time, we will return an aggregation of the results from the content and title namespaces.
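
The namespace idea can be sketched with a small in-memory stand-in. `ToyIndex` below is a hypothetical class written for illustration only; it is not the Pinecone client and makes no claim about how Pinecone stores or searches vectors. It shows that the same article ID can carry a different vector in each namespace, and that a query only looks inside the namespace it names.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyIndex:
    """In-memory stand-in for an index with namespaces (hypothetical, not Pinecone)."""

    def __init__(self):
        # namespace name -> {item id -> vector}
        self._namespaces = defaultdict(dict)

    def upsert(self, items, namespace):
        # Insert new ids or overwrite existing ones in the given namespace only
        for item_id, vector in items:
            self._namespaces[namespace][item_id] = vector

    def query(self, vector, namespace, top_k=5):
        # Brute-force scan: score every vector in the namespace, keep the best top_k
        scored = [(item_id, cosine(vector, v))
                  for item_id, v in self._namespaces[namespace].items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def size(self, namespace):
        return len(self._namespaces[namespace])

toy = ToyIndex()
toy.upsert([(1, [1.0, 0.0]), (2, [0.0, 1.0])], namespace="title")
toy.upsert([(1, [0.6, 0.8]), (2, [0.8, 0.6])], namespace="content")

query = [1.0, 0.1]
print(toy.query(query, namespace="title", top_k=1))    # best match among title vectors
print(toy.query(query, namespace="content", top_k=1))  # best match among content vectors
```

With these made-up vectors, the same query returns article 1 from the title namespace and article 2 from the content namespace, which is why aggregating across namespaces can surface different candidates.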

First, we will load data and the model, and then create embeddings and upsert them into the namespaces.

Load data

The dataset used throughout this example contains 204,135 articles from 18 American publications.

Let’s download the dataset and load it.

import requests, os
DATA_DIR = 'tmp'
URL = "https://www.dropbox.com/s/b2cyb85ib17s7zo/all-the-news.db?dl=1"
FILE = f"{DATA_DIR}/all-the-news.db"

def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(FILE):
        r = requests.get(URL)  # download the whole file into memory
        r.raise_for_status()   # fail early instead of saving an error page
        with open(FILE, "wb") as f:
            f.write(r.content)

download_data()
cnx = sqlite3.connect(FILE)
data = pd.read_sql_query("SELECT * FROM longform", cnx)
data.set_index('id', inplace=True)
data.head()
| id | title | author | date | content | year | month | publication | category | digital | section | url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Agent Cooper in Twin Peaks is the audience: once delighted, now disintegrating\n | Tasha Robinson\n | 2017-05-31 | And never more so than in Showtime’s new series revival Some spoilers ahead through episode 4 of season 3 of Twin Peaks. On May 21st, Showtime brought back David Lynch’s groundbreaking TV se... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
| 2 | AI, the humanity!\n | Sam Byford\n | 2017-05-30 | AlphaGo’s victory isn’t a defeat for humans — it’s an opportunity A loss for humanity! Man succumbs to machine! If you heard about AlphaGo’s latest exploits last week — crushing the world’s ... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
| 3 | The Viral Machine\n | Kaitlyn Tiffany\n | 2017-05-25 | Super Deluxe built a weird internet empire. Can it succeed on TV? When Wolfgang Hammer talks about the future of entertainment, people listen. Hammer is the mastermind behind the American re... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
| 4 | How Anker is beating Apple and Samsung at their own accessory game\n | Nick Statt\n | 2017-05-22 | Steven Yang quit his job at Google in the summer of 2011 to build the products he felt the world needed: a line of reasonably priced accessories that would be better than the ones you could ... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
| 5 | Tour Black Panther’s reimagined homeland with Ta-Nehisi Coates\n | Kwame Opam\n | 2017-05-15 | Ahead of Black Panther’s 2018 theatrical release, Marvel turned to Ta-Nehisi Coates to breathe new life into the nation of Wakanda. “I made most of my career analyzing the forces of racism a... | 2017 | 5 | Verge | Longform | 1.0 | None | None |

# Define number of test articles
NUM_OF_TEST_ARTICLES = 2

# Remove test articles from data and keep them in separate dataframe
test_articles = data[['title','content']][97::81][:NUM_OF_TEST_ARTICLES]
data.drop(list(test_articles.index), inplace=True)
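
The slice `[97::81][:NUM_OF_TEST_ARTICLES]` works on row positions, not on article IDs. As a small sketch with a plain list standing in for the dataframe rows (the 1,000-row size is an arbitrary assumption):

```python
rows = list(range(1000))       # stand-in for row positions in the dataframe
held_out = rows[97::81][:2]    # start at position 97, step by 81, keep the first 2
remaining = [r for r in rows if r not in held_out]

print(held_out)        # [97, 178]
print(len(remaining))  # 998
```

Because the held-out rows are dropped before encoding, the test articles themselves never appear in the index, so a query cannot trivially match its own article.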

Use Ready-Made Vector Embedding Model

We will use an Average Word Embeddings Model to create both title and content embeddings. Pinecone lets you create partitions in an index, called namespaces. Namespaces allow us to maintain separate sets of embeddings for the same data and use them for different tasks.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('average_word_embeddings_komninos')
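
As the name suggests, an average word embeddings model represents a text by averaging the vectors of its words. The sketch below illustrates that idea only. The two-dimensional vocabulary, the whitespace tokenization, and the handling of unknown words are all assumptions made for the example and may differ from what the pretrained model actually does.

```python
# Hypothetical 2-dimensional word vectors; a real model ships a large pretrained vocabulary
WORD_VECTORS = {
    "mars": [0.0, 1.0],
    "mission": [0.2, 0.8],
    "virtual": [1.0, 0.0],
    "reality": [0.8, 0.2],
}

def average_embedding(text, dim=2):
    """Average the vectors of known words; unknown words are skipped (an assumption)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0] * dim  # nothing recognized: fall back to a zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]

mars_vec = average_embedding("Mars mission")
vr_vec = average_embedding("the virtual reality")
print(mars_vec)
print(vr_vec)
```

One consequence of averaging is that word order is ignored: "mission Mars" and "Mars mission" produce the same vector in this sketch.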

Upload Vectors of Titles

Here we index articles by title only, using a dedicated title namespace.

# Fill missing titles with empty strings
data['title'] = data['title'].fillna('')

# Create vector embeddings based on the title column
print('Encoding titles...')
encoded_titles = model.encode(data['title'].tolist(), show_progress_bar=True)
data['title_vector'] = encoded_titles.tolist()

# Upsert title vectors in title namespace
print("Uploading vectors to title namespace..")
acks_titles = index.upsert(items=zip(data.index, data.title_vector), namespace='title', batch_size=1000)
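
The `batch_size=1000` argument suggests that vectors are sent in chunks rather than all at once. How the client does this internally is not shown here, so the following is only a sketch of the general batching idea; `batched` is a hypothetical helper, not part of the Pinecone client.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 2,500 made-up (id, vector) pairs, built lazily like the zip(...) passed to upsert
pairs = zip(range(2500), ([0.0, 0.0] for _ in range(2500)))
sizes = [len(batch) for batch in batched(pairs, 1000)]
print(sizes)  # [1000, 1000, 500]
```

Batching keeps each request small and works with lazy iterables, so the full list of pairs never has to be materialized at once.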

Upload Vectors of Content

Now we index articles by their content. Because we want to maintain title and content embeddings separately, we use a second namespace in the same index.

# Fill missing data
data['content'] = data['content'].fillna('')

# Keep only the first 10 sentence-like chunks of each article for quicker vector calculations
data['content'] = data.content.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:10]))

# Create vector embeddings based on the content column
print('Encoding content...')
encoded_content = model.encode(data['content'].tolist(), show_progress_bar=True)
data['content_vector'] = encoded_content.tolist()

# Upsert content vectors in content namespace
acks_content = index.upsert(items=zip(data.index, data.content_vector), namespace='content', batch_size=1000)
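
To see what the truncation step keeps, here is the same regular expression applied to a short made-up sample. `first_sentences` is a helper name introduced for this sketch.

```python
import re

def first_sentences(text, limit=10):
    """Keep the first `limit` chunks, splitting on whitespace that follows '.', ':' or ';'."""
    chunks = re.split(r'(?<=[.:;])\s', text)
    return ' '.join(chunks[:limit])

sample = "Mars is cold. It has two moons: Phobos and Deimos. Nobody lives there; yet."
truncated = first_sentences(sample, limit=2)
print(truncated)  # Mars is cold. It has two moons:
```

Note that colons and semicolons also count as split points, so the ten chunks kept per article can be shorter than ten full sentences.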

Now that we have upserted data, we can check the size of each namespace.

# Check index size for each namespace
print(index.info(namespace='title'))
print(index.info(namespace='content'))
InfoResult(index_size=204133)
InfoResult(index_size=204133)

Query

Let’s see what our test articles look like first.

# Print test articles
display(test_articles)
| id | title | content |
|---|---|---|
| 111 | The Rise and Fall and Rise of Virtual Reality | In the wake of Facebook's purchase of Oculus VR, can this revolutionary technology triumph anew? |
| 6467 | Who should go to Mars? | Elon Musk laid out his plan to colonize Mars at a conference on Tuesday, but it was during the Q&A session that a woman asked one of the key questions: who will be chosen to embark on ... |

The following utility functions help us process and present the results.

titles_mapped = dict(zip(data.index, data.title))
content_mapped = dict(zip(data.index, data.content))
def get_wordcloud_for_article(recommendations, namespace):
    'Generates word cloud for the recommendations (titles or content).'

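    # Entries to drop before joining, such as missing values. This filters whole
    # titles/contents; WordCloud itself removes stop words via its stopwords argument.
    stopwords = set(STOPWORDS).union([np.nan, 'NaN', 'S'])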
    wordcloud = WordCloud(
                   max_words=50000, 
                   min_font_size =12, 
                   max_font_size=50, 
                   relative_scaling = 0.9, 
                   stopwords=set(STOPWORDS),
                   normalize_plurals= True
                  )
    
    if namespace == 'title':
        clean_titles = [word for word in recommendations.title.values if word not in stopwords]
        wordcloud = wordcloud.generate(' '.join(clean_titles))
    else:
        clean_content = [word for word in recommendations.content.values if word not in stopwords]
        wordcloud = wordcloud.generate(' '.join(clean_content))

    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()


def print_query_results(query_result, query, namespace, show_options={'wordcloud':True, 'tabular':True}):
    'Prints query result with wordcloud.'
      
    print(f'\nMost similar results querying {query} in "{namespace}" namespace:\n')

    res = query_result[0]
    df = pd.DataFrame({'id':res.ids, 
                       'score':res.scores,
                       'title': [titles_mapped[int(_id)] if int(_id) in titles_mapped else ' '  for _id in res.ids],
                       'content': [content_mapped[int(_id)] if int(_id) in content_mapped else ' '  for _id in res.ids],
                       })
    if show_options['tabular']:
        display(df.head(5))
    if show_options['wordcloud']:
        get_wordcloud_for_article(df, namespace)
    print('\n')
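
A word cloud is driven by word frequencies after stop words are removed. The sketch below shows only that counting step with the standard library. The tiny stop-word list is made up, and the layout and font scaling performed by the WordCloud package are not reproduced.

```python
from collections import Counter

STOP = {"the", "of", "and", "a", "at", "to"}  # tiny hypothetical stop-word list

def word_frequencies(texts, stopwords=STOP):
    """Count lowercase words across texts, skipping missing values and stop words."""
    words = []
    for text in texts:
        if not isinstance(text, str):  # skip missing values such as NaN
            continue
        words.extend(w for w in text.lower().split() if w not in stopwords)
    return Counter(words)

titles = [
    "The Rise and Fall and Rise of Virtual Reality",
    float("nan"),
    "Virtual reality at home",
]
freqs = word_frequencies(titles)
print(freqs.most_common(3))
```

Words that occur most often across the recommended titles or contents are the ones drawn largest in the cloud.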

We use the following two functions to query a test article’s title or content in either of the namespaces we created. That is, we can query with the title vector in either the “title” or the “content” namespace, and likewise with the content vector.

def query_article_title(test_article, namespace, top_k=5, show_options={'wordcloud':True, 'tabular':True}):
    '''Queries an article using its title in the specified
     namespace and prints results.'''

    # Create vector embeddings based on the title column
    encoded_titles = model.encode(test_article['title'], 
                                  show_progress_bar=False)
    test_article['title_vector'] = pd.Series(encoded_titles.tolist())

    # Query namespace passed as parameter using title vector
    query_result_titles = index.query(queries=[test_article.title_vector], 
                                      namespace=namespace, 
                                      top_k=top_k, 
                                      disable_progress_bar=True)

    # Print query results 
    if show_options['wordcloud'] or show_options['tabular']:
        print_query_results(query_result_titles, query='title', 
                            namespace=namespace, 
                            show_options=show_options)

    return query_result_titles

When querying content, we will first create the article content vector and search for the most similar vectors in the “title” or the “content” namespace.

def query_article_content(test_article, namespace, top_k=5, show_options={'wordcloud':True, 'tabular':True}):
    '''Queries an article using its content in the specified 
    namespace and prints results.'''

    # Create vector embeddings based on the content column
    encoded_content = model.encode(test_article['content'], 
                                   show_progress_bar=False)
    test_article['content_vector'] = pd.Series(encoded_content.tolist())

    # Query namespace passed as parameter using content vector
    query_result_content = index.query(queries=[test_article.content_vector], 
                                       namespace=namespace, 
                                       top_k=top_k, 
                                       disable_progress_bar=True)

    # Print query results 
    if show_options['wordcloud'] or show_options['tabular']:
        print_query_results(query_result_content, 
                            query='content', 
                            namespace=namespace, 
                            show_options=show_options)

    return query_result_content

Now it’s time to query across namespaces and aggregate the results. The following functions run four queries, one per combination of query vector and namespace: title vector in the title namespace, title vector in the content namespace, content vector in the title namespace, and content vector in the content namespace. They then aggregate the four result sets, counting how often each article occurs and computing its average score. Articles are ranked by occurrence count and then by average score, and the most similar ones are returned.

def aggregate_results(article):
    '''Aggregates results after querying both namespaces
       for both the article's title and content.'''

    results = []
    
    results.append(query_article_title(article, namespace='title', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_title(article, namespace='content', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_content(article, namespace='title', top_k=30, show_options={'wordcloud':False, 'tabular':False}))
    results.append(query_article_content(article, namespace='content', top_k=30, show_options={'wordcloud':False, 'tabular':False}))

    articles_scores = {}
    articles_count = {}

    for res in results:
        for id, score in zip(res[0].ids, res[0].scores):
            if id not in articles_scores:
                articles_scores[id] = score
                articles_count[id] = 1
            else:
                articles_scores[id] += score
                articles_count[id] += 1
    
    return articles_scores , articles_count

def show_aggregated_results(results_dict, counts, show_options={'wordcloud':True, 'tabular':True}):
    '''Shows results after aggregation. Values are sorted by
    the number of queries in which they appear (1-4) and then
    by their average score.'''
    
    df = pd.DataFrame({'id':results_dict.keys(), 
                       'count': counts.values(),
                       'average_score':[round(r/c, 3) for r, c in zip(results_dict.values(),counts.values())],
                       'title': [titles_mapped[int(_id)] if int(_id) in titles_mapped else ' '  for _id in results_dict.keys()],
                       'content': [content_mapped[int(_id)] if int(_id) in content_mapped else ' '  for _id in results_dict.keys()],
                       })
    df.sort_values(by=['count', 'average_score'], ascending=False, inplace=True)
    
    if show_options['tabular']:
        print('\nMost similar results after aggregation:\n')
        display(df.head(5))
    if show_options['wordcloud']:
        print('\nWordcloud for titles and content after aggregation:')
        print('-Titles:')
        get_wordcloud_for_article(df[:10], 'title')
        print('-Content:')
        get_wordcloud_for_article(df[:10], 'content')
    print('\n')
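
The ranking rule used above (occurrence count first, average score second) can be tried in isolation on made-up data. The IDs and scores below are invented for illustration and are not results from the index; `aggregate` and `rank` are simplified stand-ins for the two functions above.

```python
from collections import defaultdict

def aggregate(result_lists):
    """Sum scores and count appearances of each id across several result lists."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for results in result_lists:
        for item_id, score in results:
            totals[item_id] += score
            counts[item_id] += 1
    return totals, counts

def rank(totals, counts):
    """Sort ids by appearance count, then by average score, both descending."""
    rows = [(item_id, counts[item_id], round(totals[item_id] / counts[item_id], 3))
            for item_id in totals]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)

# Hypothetical (id, score) results from the four queries
four_queries = [
    [("a", 0.82), ("b", 0.80)],
    [("a", 0.80), ("c", 0.81)],
    [("b", 0.82), ("a", 0.81)],
    [("c", 0.80), ("d", 0.95)],
]
ranking = rank(*aggregate(four_queries))
for row in ranking:
    print(row)
```

In this example, article "d" has the highest single score but appears in only one query, so it ranks last. That is the intended effect of sorting by count first: articles that match under several views of the query are preferred over one-off hits.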

Query by Aggregation

We are ready to query our service! We will use all the above auxiliary functions to query the test articles. We will be using our cross-namespace approach that combines four query results into one.

Note that you can add the tabular data results for each query by changing the show_options flags below.

# Query index using simple and cross namespace approach
for e, (_, test_article) in enumerate(test_articles.iterrows()):
    print(f'\nArticle {e+1}')
    print(f'\n Title: {test_article.title}')
    print(f' Content: {test_article.content[:200].strip()}' + ('...' if len(test_article.content) > 200 else ''))
    
    # Uncomment to query the titles in title namespace
    #query_article_title(test_article, 'title',  show_options={'wordcloud':True, 'tabular':False})

    # Uncomment to query the content in content namespace
    #query_article_content(test_article, namespace='content', show_options={'wordcloud':True, 'tabular':False})

    # Cross namespace query
    aggregated_results, counts = aggregate_results(test_article)
    show_aggregated_results(aggregated_results, counts, show_options={'wordcloud':False, 'tabular':True})
Article 1

 Title: The Rise and Fall and Rise of Virtual Reality
 Content: In the wake of Facebook's purchase of Oculus VR, can this revolutionary technology triumph anew?

Most similar results after aggregation:
| id | average_score | title | content |
|---|---|---|---|
| 19720 | 0.809 | Oculus Founder, at Center of Legal Battle Over VR, Departs Facebook - The New York Times | SAN FRANCISCO — Palmer Luckey, a founder of the virtual-reality technology company Oculus, has left Facebook three years after the social network acquired his company for close to $3 billion. Mr. ... |
| 7201 | 0.808 | Flush with cash, Oculus plans ambitious new VR headset | According to Oculus Rift inventor Palmer Luckey, virtual reality is near and dear to Marc Andreessen’s heart. Twenty years ago — before he created the Mosaic we... |
| 34611 | 0.806 | Microsoft Introducing VR Headsets at Half the Price of Oculus Rift - Breitbart | On October 26, Microsoft doubled down on virtual reality by announcing their own VR headsets at the Windows 10 event.[Unless you’ve got $599 for the Oculus Rift, or $799 for Valve’s HTC Vive, your... |
| 13199 | 0.800 | Oculus VR founder Palmer Luckey talks GoPro, 'Minecraft' and eSports - LA Times | Oculus VR founder Palmer Luckey answers questions at the Loews Hollywood Hotel on Sept. 24. ', 'A few years ago, journalism major Palmer Luckey dropped out of Cal State Long Beach to work on a dev... |
| 9533 | 0.800 | Virtual reality visionary Palmer Luckey leaves Facebook 3 years after $2-billion Oculus deal - LA Times | Palmer Luckey, the Long Beach entrepreneur whose zeal for virtual reality kickstarted mass investment in the technology, has left Facebook three years after selling his start-up Oculus VR to the s... |

Article 2

 Title: Who should go to Mars?
 Content: Elon Musk laid out his plan to colonize Mars at a conference on Tuesday, but it was during the Q&A session that a woman asked one of the key questions: who will be chosen to embark on a ri...

Most similar results after aggregation:
| id | average_score | title | content |
|---|---|---|---|
| 206370 | 0.703 | How Mars lost its atmosphere, and why Earth didn’t | Mars was once wetter and warmer, and very possibly a congenial environment for life as we know it. Today it looks mighty dead, with all due respect. If there's life, it's cryptic. Mars ju... |
| 83533 | 0.679 | Mars Reconnaissance Orbiter celebrates 10 years at red planet | [Sign in to comment!, NASA’s Mars Reconnaissance Orbiter (MRO) arrived at the red planet 10 years ago today and has since completed 45,000 orbits and generated a vast amount of scientific data., O... |
| 83531 | 0.664 | Buzz Aldrin eyes 2040 for manned Mars mission | [Sign in to comment!, Former astronaut Buzz Aldrin is eyeing 2040 for the first manned mission to Mars, noting that the red planet’s moon Phobos could play a vital role for astronauts., “I think t... |
| 16615 | 0.641 | NASA orbiters watch as comet flies safely past Mars - LA Times | Comet Siding Spring sailed past Mars on Sunday, coming 10 times closer to the Red Planet than any comet on record has come to Earth.', "At the time of the comet's closest approach at 11:27 a.m., i... |
| 158293 | 0.640 | Mars makes closest approach to Earth for 11 years | Mars reaches its closest approach to Earth for 11 years this evening at 21:35 GMT. The red planet will be just 75 million kilometres away., Mars has been steadily approaching, tripling its apparen... |

Summary

We demonstrated a simple semantic textual search approach that aggregates results from two different news article representations: titles only, and content only. We do that by using Pinecone’s namespace feature to keep the two sets of embeddings in separate namespaces of one index.

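The aggregation mechanism is simple: we use the query’s title and content representations to query both namespaces, then rank results by how many of the four queries returned them and by their average score. In both test queries above, the top-ranked results are on topic: Oculus and virtual reality for the first article, and Mars for the second.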

We encourage you to try the code with your own data. You might want to try other embedding models or aggregation mechanisms. Working with a similarity search service makes such experimentation easy. Have fun, and let us know if you have any questions or interesting findings.

Delete the index

Delete the index once you are sure that you no longer need it. A deleted index cannot be used again, so treat this as a final cleanup step when you are done working with it.

pinecone.delete_index(index_name)