This notebook demonstrates how to use Pinecone's similarity search to create a simple application to identify duplicate documents.
The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we will deduplicate a given text in two steps. First, we will retrieve a small set of candidate texts using a similarity-search service. Then, we will apply a near-duplicate detector to these candidates.
The similarity search will use a vector representation of the texts. With this, semantic similarity is translated to proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text.
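As a quick illustration of this idea (not part of the original pipeline), the snippet below embeds a few short, made-up texts with the same GloVe-based model used later in this notebook and compares them with cosine similarity. Near-duplicate texts should score noticeably higher than unrelated ones.

import numpy as np
from sentence_transformers import SentenceTransformer

# Made-up example texts: the first two are near-duplicates, the third is unrelated
texts = [
    "deep learning methods for image classification",
    "image classification using deep learning techniques",
    "economic effects of monetary policy in small open economies",
]

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")
vectors = model.encode(texts)

def cosine(a, b):
    # Cosine similarity: higher means closer in the vector space
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # near-duplicates -> high similarity
print(cosine(vectors[0], vectors[2]))  # unrelated texts -> lower similarity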
import os
import json
import math
import statistics
import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH
Now let us look at the columns in the dataset that are relevant for our task.
core_id - Unique identifier for each article
processed_abstract - Obtained by applying text preprocessing steps to the original abstract of the article from the original abstract column
processed_title - Same as the abstract, but for the title of the article
cat - Every article falls into one of three possible categories: 'exactdup', 'neardup', 'non_dup'
labelled_duplicates - A list of core_ids of articles that are duplicates of the current article
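If you are following along, load the dataset into a pandas DataFrame before continuing. Below is a minimal sketch under two assumptions: the file name is a placeholder for wherever you saved the dataset, and labelled_duplicates is stored as a JSON-encoded list of core_ids.

# Hypothetical file name; replace with the actual path to the dataset
df = pd.read_csv("deduplication_dataset.csv", dtype={"core_id": str})

# Assumption: labelled_duplicates is stored as a JSON-encoded list of core_ids
df["labelled_duplicates"] = df["labelled_duplicates"].apply(json.loads)

df.head()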
Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.
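The distribution can be computed directly from the labelled_duplicates column; a minimal sketch (assuming labelled_duplicates holds a Python list in each row):

# Number of labelled duplicates per article
dup_counts = df["labelled_duplicates"].apply(len)

# Distribution of duplicate counts across the dataset
print(dup_counts.value_counts().sort_index())

# Fraction of articles with no duplicates at all
print((dup_counts == 0).mean())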
We will make use of the text data to create vectors for every article. We combine the processed_abstract and processed_title of the article to create a new combined_text column.
# Define a new column for calculating embeddings
df["combined_text"]= df.apply(lambda x:str(x.processed_title)+" "+str(x.processed_abstract), axis=1)
Load the Model
We will use the Average Word Embedding GloVe model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.
model = SentenceTransformer("average_word_embeddings_glove.6B.300d")
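This section assumes the article embeddings have already been upserted into a Pinecone index named index_name. A minimal sketch of how that could look with the 2.x pinecone-client (the API key variable, environment, index name, and batch size are all assumptions):

import pinecone

# Placeholders: set your own API key and environment
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-west1-gcp")

index_name = "deduplication"
if index_name not in pinecone.list_indexes():
    # 300 dimensions to match the GloVe model loaded above
    pinecone.create_index(index_name, dimension=300, metric="cosine")
index = pinecone.Index(index_name)

# Embed the combined text and upsert (id, vector) pairs in batches
BATCH_SIZE = 500
items = list(zip(df.core_id.astype(str), df.combined_text))
for i in tqdm(range(0, len(items), BATCH_SIZE)):
    batch = items[i:i + BATCH_SIZE]
    ids = [core_id for core_id, _ in batch]
    embeds = model.encode([text for _, text in batch]).tolist()
    index.upsert(vectors=list(zip(ids, embeds)))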
Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we will query the index to get the most similar articles; these are the candidates on which we will perform the next classification step.
Below, we list statistics of the number of duplicates per article in the resulting test set.
# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
    df.groupby(df["labelled_duplicates"].map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:
0 100
1 73
2 16
3 7
4 3
5 2
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test articles, which will be the query vectors
queries = model.encode(test_documents.combined_text.to_list()).tolist()
# Query the vector index in batches
def query_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

query_results = []
for chunk in query_chunks(queries, 50):
    query_res = index.query(queries=chunk, top_k=100)
    query_results.extend(query_res.results)
# Save all retrieval recalls into a list
recalls = []
for id, res in tqdm(list(zip(test_documents.core_id.values, query_results))):
    # Find the document with this id in the labelled dataset
    labeled_df = df[df.core_id == str(id)]
    # Calculate the retrieval recall
    top_k_list = set([match.id for match in res.matches])
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_list.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))

print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is 0.16219287104729735
Running the Classifier
We mentioned earlier in the article that we perform deduplication in two steps: searching to produce candidates, and performing classification on them.
We will use a deduplication classifier based on Locality Sensitive Hashing (LSH) to detect duplicates among the results from the previous step. We will run it on a sample of the query results; feel free to try it on the entire set of query results.
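Before applying it to the query results, here is a small, self-contained sketch of how MinHash LSH flags near-duplicates. The example strings are made up; the threshold and num_perm values mirror the ones used in the code below.

from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH

def minhash(text):
    # Build a MinHash signature from the set of tokens in the text
    m = MinHash(num_perm=128, seed=5)
    for token in set(tokenize(text)):
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "we study deep learning methods for large scale image classification",
    "b": "we study deep learning techniques for large scale image classification",  # near-duplicate of "a"
    "c": "monetary policy shocks and their effect on small open economies",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, text in docs.items():
    lsh.insert(doc_id, minhash(text))

# Querying with "a" returns the keys of likely near-duplicates ("a" itself and, very likely, "b")
print(lsh.query(minhash(docs["a"])))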
# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results[::10]
ids_sample = test_documents.core_id.to_list()[::10]

for id, res in zip(ids_sample, query_sample):
    # Find the document with this id in the labelled dataset
    labeled_df = df[df.core_id == str(id)]
    # For every article in the result set, store the ids, abstracts, and scores of the
    # most similar articles returned by the search in the previous step
    df_result = pd.DataFrame({
        "id": [match.id for match in res.matches],
        "document": [
            df[df.core_id == _id].processed_abstract.values[0]
            for _id in [match.id for match in res.matches]
        ],
        "score": [match.score for match in res.matches],
    })
    print(df_result.head())
    # We need content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)
    # Create a MinHash for each of the documents in the result set
    min_hashes = {}
    for label, text in zip(labels, content):
        m = MinHash(num_perm=128, seed=5)
        tokens = set(tokenize(text))
        for d in tokens:
            m.update(d.encode('utf8'))
        min_hashes[label] = m
    # Create the LSH index
    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    for i, j in min_hashes.items():
        lsh.insert(str(i), j)
    query_minhash = min_hashes[id]
    duplicates = lsh.query(query_minhash)
    duplicates.remove(str(id))
    # Check whether the prediction matches the labelled duplicates. Here the ground truth
    # is the set of duplicates from our original dataset
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )
    # Add to all predictions
    all_predictions[prediction] += 1
    # Create and/or add to the category based on the number of duplicates in the original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]
    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1
    # Print the results for a document
    print("{}: expected: {}, predicted: {}, prediction: {}".format(
        id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
    ))
# Overall accuracy on a test
accuracy = round(
    all_predictions["Correct"] / (all_predictions["Correct"] + all_predictions["False"]), 4
)
accuracy
1.0
# Print the prediction count for each class depending on the number of duplicates in the labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)
   Correct  False
0       10      0
1        8      0
2        1      0
3        1      0
5        1      0
Delete the Index
Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.
# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)
Summary
In this notebook we demonstrated how to perform a deduplication task over 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then use an LSH classifier on the similar articles to identify duplicates. Overall, we show that it is easy to incorporate Pinecone with article embedding models and duplication classifiers to build a deduplication service.