Semantic Textual Search with Vector Index
This notebook demonstrates how to create a simple semantic text search using Pinecone’s similarity search service.
The goal is to create a search application that retrieves news articles based on short description queries (e.g., article titles). To achieve that, we will store vector representations of the articles in Pinecone's index. The proximity of these vectors captures semantic relations: nearby vectors indicate similar content, and distant vectors indicate dissimilar content.
Semantic textual search is a technique that also underlies other text-based applications. For example, our deduplication, question-answering, and personalized article recommendation demos all build on semantic textual search.
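To make the idea of vector proximity concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration. Real embeddings have hundreds of dimensions, and cosine similarity is only one common choice of proximity measure.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" invented for this example; not produced by any model.
article_politics = [0.9, 0.1, 0.0]
article_election = [0.8, 0.2, 0.1]
article_cooking = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(article_politics, article_election)
sim_unrelated = cosine_similarity(article_politics, article_cooking)

# The two related articles score higher than the unrelated pair.
print(sim_related > sim_unrelated)
```

A similarity search service performs this kind of comparison between the query vector and the stored vectors, at scale.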
Pinecone Setup
!pip install -qU pinecone-client ipywidgets
import pinecone
# Load Pinecone API key
import os
api_key = os.getenv("PINECONE_API_KEY") or "YOUR-API-KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')
# List all indexes currently present for your key
pinecone.list_indexes()
[]
Get a Pinecone API key if you don’t have one already.
Install and Import Python Packages
!pip install -qU wordcloud pandas-profiling
!pip install -qU sentence-transformers --no-cache-dir
import pandas as pd
import numpy as np
import time
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import sqlite3
pd.set_option('display.max_colwidth', 200)
Create a New Service
# Pick a name for the new index
index_name = 'semantic-text-search'
# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
pinecone.create_index(name=index_name, dimension=300)
index = pinecone.Index(index_name=index_name)
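Because the index is created with `dimension=300`, every vector stored in it needs exactly 300 components. The sketch below shows one way to check that before uploading. The helper `check_dimension` and the random placeholder vectors are hypothetical and stand in for real embeddings.

```python
import numpy as np

INDEX_DIMENSION = 300  # same value as dimension=300 in create_index


def check_dimension(vectors, expected_dim=INDEX_DIMENSION):
    """Return vectors as a 2-D array, or raise if the width is wrong."""
    arr = np.asarray(vectors, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != expected_dim:
        raise ValueError(
            f"expected vectors of shape (n, {expected_dim}), got {arr.shape}"
        )
    return arr


# Random placeholders standing in for real article embeddings.
fake_embeddings = np.random.rand(4, INDEX_DIMENSION)
checked = check_dimension(fake_embeddings)
print(checked.shape)
```

Running a check like this early tends to give a clearer error than a rejected upload later on.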
Upload
We will define two separate sub-indexes using Pinecone's namespace feature: one that indexes articles by content and another that indexes them by title. At query time, we will return an aggregation of the results from the content and title namespaces.
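One possible way to aggregate title and content results is sketched below. It assumes each result list has already been reduced to `(article_id, score)` pairs where a higher score means a closer match, and it keeps the best score per article. The function name `aggregate_results`, the pair format, and the max-score rule are assumptions made for illustration, not necessarily the scheme used later in this notebook.

```python
def aggregate_results(title_matches, content_matches, top_k=5):
    """Merge two result lists, keeping the highest score per article id."""
    best = {}
    for article_id, score in list(title_matches) + list(content_matches):
        if article_id not in best or score > best[article_id]:
            best[article_id] = score
    # Sort by score, best first, and keep the top_k articles.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up scores for illustration only.
title_matches = [("12", 0.91), ("7", 0.80), ("3", 0.42)]
content_matches = [("7", 0.88), ("45", 0.75), ("12", 0.60)]

merged = aggregate_results(title_matches, content_matches, top_k=3)
print(merged)  # [('12', 0.91), ('7', 0.88), ('45', 0.75)]
```

Other rules, such as summing or averaging the two scores, would favor articles that match on both title and content.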
First, we will load data and the model, and then create embeddings and upsert them into the namespaces.
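With roughly 200,000 articles, uploading in batches is usually more practical than sending everything in one request. The sketch below only shows the batching step. The helper `chunks`, the batch size of 100, and the placeholder records are assumptions for illustration.

```python
def chunks(items, batch_size=100):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder (id, vector) pairs; real vectors would come from the model.
records = [(str(i), [0.0, 0.0, 0.0]) for i in range(250)]

batch_sizes = [len(batch) for batch in chunks(records, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]

# Assumption: each batch would then be passed to the index's upsert call
# together with the target namespace (title or content).
```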
Load data
The dataset used throughout this example contains 204,135 articles from 18 American publications.
Let's download the dataset and load it.
import requests, os
DATA_DIR = 'tmp'
URL = "https://www.dropbox.com/s/b2cyb85ib17s7zo/all-the-news.db?dl=1"
FILE = f"{DATA_DIR}/all-the-news.db"
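def download_data():
    """Download the dataset into DATA_DIR unless it is already there."""
    os.makedirs(DATA_DIR, exist_ok=True)
    if not os.path.exists(FILE):
        r = requests.get(URL)  # create HTTP response object
        r.raise_for_status()  # fail early instead of saving an error page
        with open(FILE, "wb") as f:
            f.write(r.content)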
download_data()
cnx = sqlite3.connect(FILE)
data = pd.read_sql_query("SELECT * FROM longform", cnx)
data.set_index('id', inplace=True)
data.head()
id | title | author | date | content | year | month | publication | category | digital | section | url |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Agent Cooper in Twin Peaks is the audience: once delighted, now disintegrating | \nTasha Robinson\n | 2017-05-31 | And never more so than in Showtime’s new series revival Some spoilers ahead through episode 4 of season 3 of Twin Peaks. On May 21st, Showtime brought back David Lynch’s groundbreaking TV se... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
2 | AI, the humanity! | \nSam Byford\n | 2017-05-30 | AlphaGo’s victory isn’t a defeat for humans — it’s an opportunity A loss for humanity! Man succumbs to machine! If you heard about AlphaGo’s latest exploits last week — crushing the world’s ... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
3 | The Viral Machine | \nKaitlyn Tiffany\n | 2017-05-25 | Super Deluxe built a weird internet empire. Can it succeed on TV? When Wolfgang Hammer talks about the future of entertainment, people listen. Hammer is the mastermind behind the American re... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
4 | How Anker is beating Apple and Samsung at their own accessory game | \nNick Statt\n | 2017-05-22 | Steven Yang quit his job at Google in the summer of 2011 to build the products he felt the world needed: a line of reasonably priced accessories that would be better than the ones you could ... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
5 | Tour Black Panther’s reimagined homeland with Ta-Nehisi Coates | \nKwame Opam\n | 2017-05-15 | Ahead of Black Panther’s 2018 theatrical release, Marvel turned to Ta-Nehisi Coates to breathe new life into the nation of Wakanda. “I made most of my career analyzing the forces of racism a... | 2017 | 5 | Verge | Longform | 1.0 | None | None |
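The preview above shows author values wrapped in newline characters (for example `\nTasha Robinson\n`). If a clean author column is needed, one option is to strip the surrounding whitespace, as in this sketch on a two-row sample copied from the preview. This cleaning step is a suggestion and may not be part of the original pipeline.

```python
import pandas as pd

# Two rows copied from the preview above, with only the relevant columns.
sample = pd.DataFrame(
    {
        "title": ["AI, the humanity!", "The Viral Machine"],
        "author": ["\nSam Byford\n", "\nKaitlyn Tiffany\n"],
    },
    index=pd.Index([2, 3], name="id"),
)

# Remove leading and trailing whitespace, including the newlines.
sample["author"] = sample["author"].str.strip()
print(sample["author"].tolist())  # ['Sam Byford', 'Kaitlyn Tiffany']
```

The same `.str.strip()` call could be applied to `data["author"]` on the full dataset.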