Document Deduplication with Similarity Search

This notebook demonstrates how to use Pinecone’s similarity search to create a simple application to identify duplicate documents.

The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we will perform the deduplication of a given text in two steps. First, we will use a similarity-search service to narrow the collection down to a small set of candidate texts. Then, we will apply a near-duplicate detector to these candidates.

The similarity search will use vector representations of the texts, so that semantic similarity translates into proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text.
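
As a quick, self-contained illustration of this idea (not part of the pipeline below; it assumes the sentence-transformers and numpy packages installed under Dependencies, and uses the same GloVe model we load later), texts with related meanings map to nearby vectors:

import numpy as np
from sentence_transformers import SentenceTransformer

demo_model = SentenceTransformer("average_word_embeddings_glove.6B.300d")
a, b, c = demo_model.encode([
    "deep learning for image recognition",
    "neural networks for visual classification",
    "medieval european trade routes",
])

def cosine(u, v):
    # Cosine similarity: proximity measure in the embedding space
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # related topics: relatively high similarity
print(cosine(a, c))  # unrelated topics: noticeably lower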

Open Notebook in Google Colab

Dependencies

!pip install -qU snapy mmh3 pinecone-client ipywidgets
!pip install -qU sentence-transformers --no-cache-dir
import os
import json
import math
import statistics
import pandas as pd
from tqdm import tqdm
from snapy import MinHash, LSH
from sentence_transformers import SentenceTransformer

Pinecone Setup

import pinecone

# Load Pinecone API key

api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.

Define a New Pinecone Index

# Pick a name for the new index
index_name = "deduplication"
# Check whether an index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create Index

pinecone.create_index(name=index_name, metric="cosine")

{'msg': '', 'success': True}

Create Index object

The index object, a class instance of pinecone.Index, will be reused for optimal performance.

index = pinecone.Index(name=index_name)

Upload

In this tutorial, we will use the Deduplication Dataset 2020 that consists of 100,000 scholarly documents.

Load data

import requests, os, zipfile

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/deduplication_dataset_2020.zip"
DATA_URL = "https://core.ac.uk/exports/custom_datasets/deduplication_dataset_2020.zip"

def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL)  # download the zip archive
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)
        with zipfile.ZipFile(DATA_FILE, "r") as zip_ref:
            zip_ref.extractall(DATA_DIR)


download_data()
DATA_PATH = os.path.join(DATA_DIR, "deduplication_dataset_2020/Ground_Truth_data.jsonl")

with open(DATA_PATH, encoding="utf8") as json_file:
    data = list(json_file)

Here is a sample of the data.

data_json = [json.loads(json_str) for json_str in data]
df = pd.DataFrame.from_dict(data_json)
df.head()
  | core_id   | doi                            | original_abstract                                  | original_title                                     | processed_title                                    | processed_abstract                                  | cat       | labelled_duplicates
0 | 11251086  | 10.1016/j.ajhg.2007.12.013     | Unobstructed vision requires a particular refr...  | Mutation of solute carrier SLC16A12 associates...  | mutation of solute carrier slc16a12 associates...  | unobstructed vision refractive lens differenti...  | exact_dup | [82332306]
1 | 11309751  | 10.1103/PhysRevLett.101.193002 | Two-color multiphoton ionization of atomic hel...  | Polarization control in two-color above-thresh...  | polarization control in two-color above-thresh...  | multiphoton ionization helium combining extrem...  | exact_dup | [147599753]
2 | 11311385  | 10.1016/j.ab.2011.02.013       | Lectin’s are proteins capable of recognising a...  | Optimisation of the enzyme-linked lectin assay...  | optimisation of the enzyme-linked lectin assay...  | lectin’s capable recognising oligosaccharide t...  | exact_dup | [147603441]
3 | 11992240  | 10.1016/j.jpcs.2007.07.063     | In this work, we present a detailed transmissi...  | Vertical composition fluctuations in (Ga,In)(N...  | vertical composition fluctuations in (ga,in)(n...  | microscopy interfacial uniformity wells grown ...  | exact_dup | [148653623]
4 | 11994990  | 10.1016/S0169-5983(03)00013-3  | Three-dimensional (3D) oscillatory boundary la...  | Three-dimensional streaming flows driven by os...  | three-dimensional streaming flows driven by os...  | oscillatory attached deformable walls boundari...  | exact_dup | [148656283]

Now let us look at the columns in the dataset that are relevant for our task.

core_id - Unique identifier for each article.

processed_abstract - Obtained by applying preprocessing steps to the original abstract of the article from the original_abstract column (a rough sketch of this kind of normalization follows this list).

processed_title - Same as the abstract, but for the title of the article.

cat - Every article falls into one of three possible categories: ‘exact_dup’, ‘near_dup’, ‘non_dup’.

labelled_duplicates - A list of core_ids of articles that are duplicates of the current article.
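
The dataset’s exact preprocessing pipeline is not reproduced here. Purely as an illustration of the kind of normalization visible in the sample rows (lowercasing and stopword removal), a minimal sketch might look like this; the small stopword list is a stand-in assumption, not the dataset’s actual list:

# Illustrative only: approximates the normalization seen in the sample rows.
# The stopword list is a tiny stand-in, not the dataset's actual list.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "this", "to", "in"}

def preprocess(text: str) -> str:
    tokens = str(text).lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(preprocess("Unobstructed vision requires a particular refractive lens"))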

Let’s calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.

lens = df.labelled_duplicates.apply(len)
lens.value_counts()
0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64

We will make use of the text data to create vectors for every article. We combine the processed_abstract and processed_title of the article to create a new combined_text column.

# Define a new column for calculating embeddings
df["combined_text"] = df.apply(
    lambda x: str(x.processed_title) + " " + str(x.processed_abstract), axis=1
)

Load model

We will use the Average Word Embedding GloVe model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")
df["vectors"] = list(model.encode(df.combined_text.to_list(), show_progress_bar=True))
upsert_acks = index.upsert(items=zip(df.core_id, df.vectors))
index.info()
InfoResult(index_size=100000)
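
Note that we upserted all 100,000 items in a single call above. If you prefer to control memory use or add retry logic per chunk, you can upsert in explicit batches instead. A minimal sketch using the same index object (the batch size is an arbitrary choice, not a requirement of the client):

BATCH_SIZE = 1000  # arbitrary; tune for your environment

items = list(zip(df.core_id, df.vectors))
for i in range(0, len(items), BATCH_SIZE):
    # Upsert one chunk at a time instead of the whole collection at once
    index.upsert(items=items[i : i + BATCH_SIZE])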

Searching for Candidates

Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we will query the index to retrieve the most similar articles; these are the candidates on which we will perform the classification step.

Below, we list statistics of the number of duplicates per article in the resulting test set. We will then measure retrieval recall: for each test article, the fraction of its labelled duplicates that appear among its top-k query results.

# Create a sample from the dataset
SAMPLE_FRACTION = 0.01
test_documents = (
    df.groupby(df["labelled_duplicates"].map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:

0     500
1     362
2      77
3      32
4      14
5       8
6       5
7       3
8       2
13      1
12      1
11      1
10      1
9       1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test articles, which will be the query vectors
queries = model.encode(test_documents.combined_text.to_list())
# Query the vector index
query_results = index.query(queries=queries, top_k=100)
# Save all retrieval recalls into a list
recalls = []

for id, res in tqdm(list(zip(test_documents.core_id.values, query_results))):

    # Find document with id in labelled dataset
    labeled_df = df[df.core_id == str(id)]

    # Calculate the retrieval recall
    top_k_list = set(res.ids)
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_list.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))
print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9936172751133381
Standard Deviation is  0.07584971503312801

Running the Classifier

We mentioned earlier that we perform deduplication in two steps: searching to produce candidates, then running classification on them.

We will use a deduplication classifier based on LSH (locality-sensitive hashing) to detect duplicates among the candidates from the previous step. We will run it on a sample of the query results; feel free to try it out on the entire set of query results.
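
Before running the classifier over the candidates, here is a toy, self-contained sketch of how snapy flags near-duplicates: documents are shingled into character n-grams, MinHashed, and banded with LSH so that pairs above a Jaccard-similarity threshold are returned. The three example sentences are made up for illustration:

toy_content = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over a lazy dog",
    "an entirely different sentence about something else",
]
toy_labels = ["doc_a", "doc_b", "doc_c"]

# Same parameters as the classifier below
toy_minhash = MinHash(toy_content, n_gram=4, permutations=100, hash_bits=64, seed=5)
toy_lsh = LSH(toy_minhash, toy_labels, no_of_bands=50)

# Near-duplicates of doc_a above the Jaccard threshold; we expect ["doc_b"]
print(toy_lsh.query("doc_a", min_jaccard=0.3))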

# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results[::50]
ids_sample = test_documents.core_id.to_list()[::50]

for id, res in zip(ids_sample, query_sample):

    # Find document with id from the labelled dataset
    labeled_df = df[df.core_id == str(id)]

    """
    For every article in the reuslt set, we store the scores and abstract of the articles most similar
    to it, according to search in the previous step.
    """

    df_result = pd.DataFrame(
        {
            "id": res.ids,
            "document": [
                df[df.core_id == _id].processed_abstract.values[0] for _id in res.ids
            ],
            "score": res.scores,
        }
    )

    print(df_result.head())

    # We need content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)

    # Create MinHash object
    minhash = MinHash(content, n_gram=4, permutations=100, hash_bits=64, seed=5)

    # Create LSH model
    lsh = LSH(minhash, labels, no_of_bands=50)

    # Query to find near duplicates for the query document
    duplicates = lsh.query(id, min_jaccard=0.3)

    # Check whether the prediction matches the labelled duplicates. Here the ground truth is the set of duplicates from our original dataset
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )

    # Add to all predictions
    all_predictions[prediction] += 1

    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]

    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print(
        "{}: expected: {}, predicted: {}, prediction: {}".format(
            id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
        )
    )
         id                                           document     score
0  15080768  analyse centred methodology. discretisation so...  1.000000
1  52900859  audiencethe tissues pulses modelled compartmen...  0.787797
2  52682462  audiencethe tissues pulses modelled compartmen...  0.787797
3   2553555  multilayered illuminated acoustic electromagne...  0.781398
4  48261378  heterostructure schr dinger poisson numericall...  0.778778
15080768: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   15070865  drink heavily nonstudent peers unaware drinkin...  1.000000
1  154671698  updated alcohol suicidal level. searches retri...  0.889408
2   52132897  updated alcohol suicidal level. searches retri...  0.889408
3   43606482  fulltext .pdf publisher effectiveness drinking...  0.883402
4   82484980  abstractthe effectiveness drinking motive tail...  0.883145
15070865: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  12204526  nonobstructing colonic dilatation commonly tra...  1.000000
1  12204465  february underwent orthotopic transplantation ...  0.781499
2  82030348  abstractintroductionrare adenosquamous carcino...  0.779890
3  62715217  rare adenosquamous carcinomas incidence. nonsp...  0.773131
4  12205293  mucosal injury ischemia reperfusion documented...  0.760542
12204526: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  38283834  publisher please click hyperlink links click h...  1.000000
1  18161776  older adults coronary traditional aged adults....  0.782105
2  61370917  courses idiopathic interstitial pneumonia pred...  0.765892
3  59037046  courses idiopathic interstitial pneumonia pred...  0.765892
4  37830673  describes prevalence anomalies demographic con...  0.746488
38283834: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   84058394  pityriasis rubra pilaris solely resemblance ps...  1.000000
1   39304288  関節リウマチから新たなヘルパーt細胞を同定 慢性炎症のメカニズム解明に期待 京都大学プレスリ...  0.885882
2   43577553  abundant rheumatoid arthritis pathogenesis poo...  0.877882
3   70340267  antigen epcam epithelial adhesion molecule car...  0.859223
4  188208537  rare autosomal encoding keratinocyte molecule ...  0.858379
84058394: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  83829582  comprehensive timely burden adolescents improv...  1.000000
1  82047193  backgroundnon fatal injury increasingly detrac...  0.927517
2  76997320  focuses younger years. comparable nonfatal fat...  0.927468
3  83638883  fatal injury increasingly detract live largely...  0.927394
4  77028407  fatal injury increasingly detract world’s live...  0.925706
83829582: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  80166671  rainfall erosivity rain erosion rainfall power...  1.000000
1  84815499  presents bivariate capable rainfall e.g. rainf...  0.809825
2   9697454  quantifying precipitation extremes challenge a...  0.788008
3  15566207  kalman filter merge weather radar rain rainfal...  0.779109
4  62473370  madeira island portugal experienced intense ra...  0.776462
80166671: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0  146466133  researchers hybrid halide perovskites exhibit ...  1.000000
1   42128223  photophysics thermally delayed fluorescence ta...  0.818589
2   98113376  breakthrough electronics anticipated emerging ...  0.816006
3   93951557  neutral nanophotonic offer versatile platform ...  0.814093
4   52169849  poly ethylenedioxypyrrole –gold nanoparticle –...  0.812512
146466133: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  77618383  maintaining hierarchy recognized oriented desi...  1.000000
1  12096773  denotation hierarchical hierarchical interpret...  0.805681
2  10874659  enterprise configured organizational operation...  0.803281
3  55607270  increasingly popular formal foundations rigoro...  0.797738
4  48228496  audienceweb orchestrations conventionally empl...  0.796659
77618383: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  14987378  interpersonal similarity perceived persuasibil...  1.000000
1  20543615  examining conceptual personal perception cruci...  0.775047
2  80717702  intriguing centuries philosophers scientists a...  0.771628
3  16140694  fulltext .pdf publisher situations personally ...  0.770426
4  29819006  behavioral episodes judgments groups. paradigm...  0.763183
14987378: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  11251086  unobstructed vision refractive lens differenti...  1.000000
1  82332306  unobstructed vision refractive lens differenti...  1.000000
2  59036307  aims osmotic oxidative progression advancement...  0.839049
3  61371524  aims osmotic oxidative progression advancement...  0.839049
4  11249430  dysfunction cilia nearly ubiquitously solitary...  0.796623
11251086: expected: ['82332306'], predicted: ['82332306'], prediction: Correct
          id                                           document     score
0  158351487  installed rapidly decade deployments deeper wa...  1.000000
1  188365084  installed rapidly decade deployments deeper wa...  1.000000
2   83926778  offshore turbine reliability biggest paper. un...  0.853790
3  158370190  offshore turbine reliability biggest paper. un...  0.853790
4   74226591  investigates overruns underruns occurring onsh...  0.834363
188365084: expected: ['158351487'], predicted: ['158351487'], prediction: Correct
         id                                           document     score
0  51931521  audiencewe boron doped diamond epilayers dopan...  1.000000
1  52670633  audiencewe boron doped diamond epilayers dopan...  1.000000
2  42741620  broadband conductivity superconducting crystal...  0.764618
3  51954784  audiencehomoepitaxial films boron doped diamon...  0.759308
4  52681545  audiencehomoepitaxial films boron doped diamon...  0.759308
51931521: expected: ['52670633'], predicted: ['52670633'], prediction: Correct
          id                                           document     score
0  148657542  manure conveyor belt partially slatted floor f...  1.000000
1   11996167  manure conveyor belt partially slatted floor f...  1.000000
2   80905050  belt conveyor drifting away maintenance readju...  0.683395
3   52715438  audience plasmapause belt boundaries. mission ...  0.661620
4  185669216  constructing conveyor developed. conveyor tech...  0.648793
148657542: expected: ['11996167'], predicted: ['11996167'], prediction: Correct
         id                                           document     score
0  52193278  audienceobjectivesdespite improvements extensi...  1.000000
1  48180068  audienceobjectivesdespite improvements extensi...  1.000000
2  54032397  audiencecommon toxicity neglects fate intracel...  0.798693
3  48172374  audiencecommon toxicity neglects fate intracel...  0.798693
4  80841240  growing myeloid dendritic monocyte modc immuno...  0.798068
52193278: expected: ['48180068'], predicted: ['48180068'], prediction: Correct
          id                                           document     score
0  148659725  efforts multimedia bridge socalled “semantic g...  1.000000
1   11998348  efforts multimedia bridge socalled “semantic g...  1.000000
2   48333582  merging conceptual systems. kind studying expr...  0.794105
3   53843205  multilingual characterize semantic understanda...  0.785507
4   52105035  searching frustrating. reasons ambiguity words...  0.784552
148659725: expected: ['11998348'], predicted: ['11998348'], prediction: Correct
         id                                           document     score
0  51942555  audiencethe aortic stent grafts plays success ...  1.000000
1  52617545  audiencethe aortic stent grafts plays success ...  1.000000
2  48160472  audienceendovascular repair abdominal aortic a...  0.893296
3  52636266  audienceendovascular repair abdominal aortic a...  0.893296
4  51936106  audienceendovascular repair abdominal aortic a...  0.893296
51942555: expected: ['52617545'], predicted: ['52617545'], prediction: Correct
          id                                           document     score
0  160271518  book insight supporting sustainability industr...  1.000000
1  159415624  book insight supporting sustainability industr...  1.000000
2   52136704  ambition decarbonize electricity. potentially ...  0.828932
3   11997042  contributes debate analysing barriers enablers...  0.821604
4  148659345  contributes debate analysing barriers enablers...  0.821604
160271518: expected: ['159415624'], predicted: ['159415624'], prediction: Correct
         id                                           document     score
0  52659710  audiencein planetary weather performing compar...  1.000000
1  51930649  audiencein planetary weather performing compar...  1.000000
2  52708825  audiencein planetary weather performing compar...  1.000000
3   2744052  peer reviewed submitted publication earth land...  0.810740
4  52709135  audienceto earth iron outer indirect available...  0.799998
52708825: expected: ['51930649', '52659710'], predicted: ['52659710', '51930649'], prediction: Correct
          id                                           document     score
0   48184849  audienceit recognized nanoparticles great effe...  1.000000
1   52679424  audienceit recognized nanoparticles great effe...  1.000000
2   52998783  audienceit recognized nanoparticles great effe...  1.000000
3   51946573  audienceit recognized nanoparticles great effe...  1.000000
4  161266545  copolymers bcps directed assembly emerged real...  0.843905
51946573: expected: ['48184849', '52679424', '52998783'], predicted: ['48184849', '52679424', '52998783'], prediction: Correct
          id                                           document  score
0   47118923  aims.our rays distant galactic nuclei derive e...    1.0
1  152270421  aims.our rays distant galactic nuclei derive e...    1.0
2   52701875  aims.our rays distant galactic nuclei derive e...    1.0
3   52912706  aims.our rays distant galactic nuclei derive e...    1.0
4   52663940  aims.our rays distant galactic nuclei derive e...    1.0
52761216: expected: ['46776360', '47118923', '47309787', '152270421', '52663940', '52701875', '52912706'], predicted: ['47118923', '152270421', '52701875', '52912706', '52663940', '46776360', '47309787'], prediction: Correct
all_predictions
{'Correct': 21, 'False': 0}
# Overall accuracy on the test sample
accuracy = round(
    all_predictions["Correct"]
    / (all_predictions["Correct"] + all_predictions["False"]),
    4,
)
accuracy
1.0
# Print the prediction count for each class depending on the number of duplicates in labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)
   Correct  False
0       10      0
1        8      0
2        1      0
3        1      0
7        1      0

Delete the Index

Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.

# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)

{'success': True}

Summary

In this notebook, we demonstrated how to perform a deduplication task over 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone’s vector index to find similar articles. For each query article, we then used an LSH classifier on the similar articles to identify duplicates. Overall, we showed that it is easy to combine Pinecone with article embedding models and duplicate-detection classifiers to build a deduplication service.