Document Deduplication with Similarity Search

This notebook demonstrates how to use Pinecone's similarity search to create a simple application to identify duplicate documents.

The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we will perform the deduplication of a given text in two steps. First, we will use a similarity-search service to retrieve a small set of candidate texts. Then, we will apply a near-duplicate detector over these candidates.

The similarity search will use a vector representation of the texts. With this, semantic similarity is translated to proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text.
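Before diving in, here is a minimal sketch of the pipeline we will assemble. The three helpers (embed, search_candidates, is_near_duplicate) are hypothetical placeholders, stubbed out here, that the GloVe encoder, the Pinecone query, and the LSH classifier will fill in later in this notebook.

def embed(text):  # placeholder: replaced by a 300-d GloVe sentence embedding below
    return [0.0] * 300

def search_candidates(vector, top_k):  # placeholder: replaced by a Pinecone query below
    return []  # candidate document ids

def is_near_duplicate(text, candidate_id):  # placeholder: replaced by MinHash LSH below
    return False

def deduplicate(text, top_k=100):
    """Two-step deduplication: retrieve candidates, then classify them."""
    vector = embed(text)
    candidates = search_candidates(vector, top_k)
    return [c for c in candidates if is_near_duplicate(text, c)]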


Dependencies

!pip install -qU datasketch gensim mmh3 pinecone-client ipywidgets
!pip install -qU sentence-transformers --no-cache-dir

import os
import json
import math
import statistics
import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH

Pinecone Setup

import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

Get a Pinecone API key if you don’t have one already.

Define a New Pinecone Index

# Pick a name for the new index
index_name = "deduplication"

# Check whether an index with the same name already exists; if so, delete it to start fresh
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Create Index

pinecone.create_index(name=index_name, dimension=300, metric="cosine")

Create Index object

The index object, an instance of the pinecone.Index class, will be reused throughout for optimal performance.

index = pinecone.Index(index_name)

Upload

In this tutorial, we will use the Deduplication Dataset 2020, which consists of 100,000 scholarly documents.

Load data

import requests, os, zipfile

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/deduplication_dataset_2020.zip"
DATA_URL = "https://core.ac.uk/exports/custom_datasets/deduplication_dataset_2020.zip"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL)  # download the zip archive
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)
        with zipfile.ZipFile(DATA_FILE, "r") as zip_ref:
            zip_ref.extractall(DATA_DIR)

download_data()

DATA_PATH = os.path.join(DATA_DIR, "deduplication_dataset_2020/Ground_Truth_data.jsonl")

with open(DATA_PATH, encoding="utf8") as json_file:
    data = list(json_file)

Here is a sample of the data.

data_json = [json.loads(json_str) for json_str in data]
df = pd.DataFrame.from_dict(data_json)
df.head()
core_id doi original_abstract original_title processed_title processed_abstract cat labelled_duplicates
0 11251086 10.1016/j.ajhg.2007.12.013 Unobstructed vision requires a particular refr... Mutation of solute carrier SLC16A12 associates... mutation of solute carrier slc16a12 associates... unobstructed vision refractive lens differenti... exact_dup [82332306]
1 11309751 10.1103/PhysRevLett.101.193002 Two-color multiphoton ionization of atomic hel... Polarization control in two-color above-thresh... polarization control in two-color above-thresh... multiphoton ionization helium combining extrem... exact_dup [147599753]
2 11311385 10.1016/j.ab.2011.02.013 Lectin’s are proteins capable of recognising a... Optimisation of the enzyme-linked lectin assay... optimisation of the enzyme-linked lectin assay... lectin’s capable recognising oligosaccharide t... exact_dup [147603441]
3 11992240 10.1016/j.jpcs.2007.07.063 In this work, we present a detailed transmissi... Vertical composition fluctuations in (Ga,In)(N... vertical composition fluctuations in (ga,in)(n... microscopy interfacial uniformity wells grown ... exact_dup [148653623]
4 11994990 10.1016/S0169-5983(03)00013-3 Three-dimensional (3D) oscillatory boundary la... Three-dimensional streaming flows driven by os... three-dimensional streaming flows driven by os... oscillatory attached deformable walls boundari... exact_dup [148656283]

Now let us look at the columns in the dataset that are relevant for our task.

core_id - Unique identifier for each article

processed_abstract - Obtained by applying preprocessing steps, such as lowercasing and stop-word removal, to the article's original abstract from the original_abstract column (a rough sketch of such preprocessing follows this list).

processed_title - Same as the abstract, but for the article's title.

cat - Every article falls into one of three possible categories: 'exact_dup', 'near_dup', 'non_dup'

labelled_duplicates - A list of core_ids of articles that are duplicates of the current article
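The dataset does not document its exact preprocessing pipeline, but judging from the sample rows above it includes at least lowercasing and the removal of common words. A rough, hypothetical approximation:

import re

# NOTE: a hypothetical approximation only; the dataset's actual preprocessing
# steps are not documented in this notebook.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "this", "that"}

def rough_preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

rough_preprocess("Unobstructed vision requires a particular refractive index.")
# -> 'unobstructed vision requires particular refractive index'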

Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.

lens = df.labelled_duplicates.apply(len)
lens.value_counts()
0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64

We will make use of the text data to create vectors for every article. We combine the processed_abstract and processed_title of the article to create a new combined_text column.

# Define a new column for calculating embeddings
df["combined_text"] = df.apply(
    lambda x: str(x.processed_title) + " " + str(x.processed_abstract), axis=1
)

Load model

We will use the Average Word Embedding GloVe model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")

df["vectors"] = list(model.encode(df.combined_text.to_list(), show_progress_bar=True).tolist())

Index the Vectors

import itertools

def chunks(iterable, batch_size):
    """Yield successive batch_size-sized tuples from an iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Upsert the (id, vector) pairs in batches of 500
for batch in chunks(zip(df.core_id.astype(str), df.vectors), 500):
    index.upsert(vectors=batch)

index.describe_index_stats()
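
As a quick sanity check, the batching helper splits any iterable into fixed-size tuples, emitting a final short batch when the length is not a multiple of the batch size:

list(chunks(range(5), batch_size=2))
# -> [(0, 1), (2, 3), (4,)]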

Searching for Candidates

Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we will query the index to retrieve the most similar articles; these are the candidates on which we will perform the next classification step. For each query, we measure retrieval recall: the fraction of an article's labeled duplicates that appear among its top-k results. For example, if an article has four labeled duplicates and three of them appear in the top 100 results, the recall for that query is 3/4 = 0.75.

Below, we list statistics of the number of duplicates per article in the resulting test set.

# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
    df.groupby(df["labelled_duplicates"].map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:

0     500
1     362
2      77
3      32
4      14
5       8
6       5
7       3
8       2
13      1
12      1
11      1
10      1
9       1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test articles, which will be the query vectors
queries = model.encode(test_documents.combined_text.to_list()).tolist()

# Query the vector index
query_results = index.query(queries=queries, top_k=100)

# Save all retrieval recalls into a list
recalls = []

for id, res in tqdm(list(zip(test_documents.core_id.values, query_results.results))):

    # Find document with id in labelled dataset
    labeled_df = df[df.core_id == str(id)]

    # Calculate the retrieval recall
    top_k_list = set([match.id for match in res.matches])
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_list.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))

print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is  0.16219287104729735

Running the Classifier

As mentioned earlier, deduplication consists of two steps: searching to produce candidates, and classifying those candidates.

We will use a deduplication classifier based on MinHash LSH to detect duplicates among the results from the previous step. We will run it on a sample of those query results; feel free to try it on the entire set.
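
If MinHash LSH is new to you, here is a minimal, self-contained illustration on toy strings, separate from our query results (it reuses the MinHash, MinHashLSH, and tokenize imports from the top of the notebook, and the threshold and num_perm settings mirror those used below). MinHash sketches approximate the Jaccard similarity between token sets, and the LSH index returns keys whose estimated similarity exceeds the threshold:

# Toy example: "b" is a near-duplicate of "a", while "c" is unrelated
toy_docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the very lazy dog",
    "c": "vector similarity search scales to millions of documents",
}

toy_lsh = MinHashLSH(threshold=0.7, num_perm=128)
toy_minhashes = {}
for key, text in toy_docs.items():
    m = MinHash(num_perm=128)
    for token in set(tokenize(text)):
        m.update(token.encode("utf8"))
    toy_minhashes[key] = m
    toy_lsh.insert(key, m)

# Should return ["a", "b"]: their token sets have Jaccard similarity ~0.89,
# above the 0.7 threshold, while "c" shares almost no tokens with "a".
toy_lsh.query(toy_minhashes["a"])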

# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results.results[::10]
ids_sample = test_documents.core_id.to_list()[::10]

for id, res in zip(ids_sample, query_sample):
    
    # Find document with id from the labelled dataset
    labeled_df = df[df.core_id == str(id)]

    """
    For every article in the reuslt set, we store the scores and abstract of the articles most similar 
    to it, according to search in the previous step.
    """

    df_result = pd.DataFrame(
        {
            "id": [match.id for match in res.matches],
            "document": [
                df[df.core_id == match.id].processed_abstract.values[0]
                for match in res.matches
            ],
            "score": [match.score for match in res.matches],
        }
    )

    print(df_result.head())

    # We need content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)
    
    # Create MinHash for each of the documents in result set
    min_hashes = {}
    for label, text in zip(labels, content):
        m = MinHash(num_perm=128, seed=5)
        tokens = set(tokenize(text))
        for d in tokens:
            m.update(d.encode('utf8'))
        min_hashes[label] = m
    
    # Create LSH index
    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    for i, j in min_hashes.items():
        lsh.insert(str(i), j)
    
    query_minhash = min_hashes[id]
    duplicates = lsh.query(query_minhash)
    duplicates.remove(str(id))
    
    # Check whether the prediction matches the labeled duplicates. Here the ground truth is the set of duplicates from our original dataset
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )
    
    # Add to all predictions
    all_predictions[prediction] += 1
    
    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]

    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print(
        "{}: expected: {}, predicted: {}, prediction: {}".format(
            id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
        )
    )
id                                           document     score
0  15080768  analyse centred methodology. discretisation so...  1.000000
1  52682462  audiencethe tissues pulses modelled compartmen...  0.787797
2  52900859  audiencethe tissues pulses modelled compartmen...  0.787797
3   2553555  multilayered illuminated acoustic electromagne...  0.781398
4  48261378  heterostructure schr dinger poisson numericall...  0.778777
15080768: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   55110306  latrepirdine orally administered molecule init...  1.000000
1  188404434  cysteamine potentially numerous huntington dis...  0.903965
2   81634102  deutetrabenazine molecule deuterium attenuates...  0.880078
3   42021224  comorbidities. safe drugs available. efficacy ...  0.857741
4   78271101  promising prevent onset ultrahigh psychosis di...  0.849158
55110306: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   10914205  read objectives schoolchildren sunscreen morni...  1.000000
1   77409456  overeating harmful alcohol tobacco aetiology c...  0.669037
2   10896024  sunlight cutaneous vitamin production. highlig...  0.633516
3   15070865  drink heavily nonstudent peers unaware drinkin...  0.633497
4  154670695  dette siste tekst versjon artikkelen inneholde...  0.627933
10914205: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  43096919  publishedcomparative studymulticenter tcontext...  1.000000
1  77165332  cerebral amyloid aggregation pathological alzh...  0.871247
2  70343569  neurodegenerative heterogeneous disorders prog...  0.867806
3  18448676  beta amyloid beta deposition hallmarks alzheim...  0.855655
4  46964510  alzheimer unexplained. sought loci detect robu...  0.855137
43096919: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  12203626  hypernatremia recipients homografts postoperat...  1.000000
1  82542813  abstractobjectivesto intravenous maintenance f...  0.800283
2  81206306  uromodulin tamm–horsfall abundant excreted uri...  0.794892
3  36026525  drinking sodium bicarbonated mineral cardiovas...  0.793452
4  83567081  drinking sodium bicarbonated mineral cardiovas...  0.793252
12203626: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   15070865  drink heavily nonstudent peers unaware drinkin...  1.000000
1   52132897  updated alcohol suicidal level. searches retri...  0.889408
2  154671698  updated alcohol suicidal level. searches retri...  0.889408
3   43606482  fulltext .pdf publisher effectiveness drinking...  0.883402
4   82484980  abstractthe effectiveness drinking motive tail...  0.883145
15070865: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   80341690  potentially inappropriate medicines pims older...  1.000000
1   39320843  elderly receive medications adverse effects. e...  0.807533
2   82162292  abstractbackgroundrisk assessments widely pred...  0.780006
3   77027179  assessments widely predict opioid disorder unc...  0.779405
4  153514317  yesbackground challenging person dementia. beh...  0.757255
80341690: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0   9066821  commotio retinae opacification retina blunt oc...  1.000000
1  78051578  neovascular macular degeneration anti–vascular...  0.731147
2  86422032  automated lesions challenging diagnostic lesio...  0.703925
3  52434306  audiencewe propose voxelwise images. relies ge...  0.699708
4  48174418  audiencewe propose voxelwise images. relies ge...  0.699708
9066821: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   15052827  indirect schizophrenia australia incidence cos...  1.000000
1  154860392  illness schizophrenia bipolar disorder depress...  0.795662
2   51964867  audiencebackground cholesterol lowering jupite...  0.791904
3   75913230  thesis characterize burden cardiovascular deme...  0.775635
4   52133218  aims depression anxiety myocardial infarction ...  0.765936
15052827: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  12203661  glomerulonephritis serious hemoptysis. antiglo...  1.000000
1  12204810  twenty alagille syndrome underwent transplanta...  0.811871
2  47112592  audiencepatients autoimmune polyendocrine synd...  0.810457
3  52460385  audiencepatients autoimmune polyendocrine synd...  0.810457
4  52198725  audiencepatients autoimmune polyendocrine synd...  0.810457
12203661: expected: [], predicted: [], prediction: Correct
         id                                           document     score
0  11251086  unobstructed vision refractive lens differenti...  1.000000
1  82332306  unobstructed vision refractive lens differenti...  1.000000
2  59036307  aims osmotic oxidative progression advancement...  0.839049
3  61371524  aims osmotic oxidative progression advancement...  0.839049
4  82072232  dysfunction cilia nearly ubiquitously solitary...  0.796623
11251086: expected: ['82332306'], predicted: ['82332306'], prediction: Correct
          id                                           document     score
0  148662402  presents vision successfully discriminates wee...  1.000000
1   12001088  presents vision successfully discriminates wee...  1.000000
2  148666025  proposes oriented crop maize weed pressure. vi...  0.904243
3   18424329  proposes oriented crop maize weed pressure. vi...  0.904243
4   18424394  proposes oriented identifying crop rows maize ...  0.861464
12001088: expected: ['148662402'], predicted: ['148662402'], prediction: Correct
          id                                           document     score
0  147595688  reflectance exciton–polariton film polycrystal...  1.000000
1   11307919  reflectance exciton–polariton film polycrystal...  1.000000
2   11307922  photoluminescence reflectance oriented polycry...  0.816958
3  147595695  photoluminescence reflectance oriented polycry...  0.816958
4   33106913  macroscopic dielectric polycrystalline commonl...  0.804686
147595688: expected: ['11307919'], predicted: ['11307919'], prediction: Correct
          id                                           document     score
0  148663921  thanks inherent probabilistic graphical prime ...  1.000000
1   12002296  thanks inherent probabilistic graphical prime ...  1.000000
2   52634130  audienceobject oriented brms platform automati...  0.869993
3   52294731  audienceobject oriented brms platform automati...  0.869993
4   34403460  acceptance artificial intelligence aims learn ...  0.865815
148663921: expected: ['12002296'], predicted: ['12002296'], prediction: Correct
          id                                           document     score
0  151641478  stabilised soems unstable aircraft presented. ...  1.000000
1   11874260  stabilised soems unstable aircraft presented. ...  1.000000
2   29528077  projection snapshot balanced truncation unstab...  0.724496
3   77005252  projection snapshot balanced truncation unstab...  0.724496
4  148663435  ideas robust computationally amenable industri...  0.722027
151641478: expected: ['11874260'], predicted: ['11874260'], prediction: Correct
          id                                           document     score
0  188365084  installed rapidly decade deployments deeper wa...  1.000000
1  158351487  installed rapidly decade deployments deeper wa...  1.000000
2  158370190  offshore turbine reliability biggest paper. un...  0.853790
3   83926778  offshore turbine reliability biggest paper. un...  0.853790
4   74226591  investigates overruns underruns occurring onsh...  0.834363
188365084: expected: ['158351487'], predicted: ['158351487'], prediction: Correct
         id                                           document     score
0   9030380  propose vulnerability network. analogy balls l...  1.000000
1   2097371  propose vulnerability network. analogy balls l...  1.000000
2  49270269  audiencethis introduces validates sensor propa...  0.754055
3  43094896  peer reviewed brownjohn displacement sensor co...  0.745553
4  82328418  abstractthe triple deformable variational gree...  0.725793
2097371: expected: ['9030380'], predicted: ['9030380'], prediction: Correct
          id                                           document     score
0  148674298  race segments swimmers. analysed finals sessio...  1.000000
1   33176265  race segments swimmers. analysed finals sessio...  1.000000
2  148674300  swimming race parameters. hundred fifty eight ...  0.886608
3   33176267  swimming race parameters. hundred fifty eight ...  0.886608
4  143900637  swimmers swimmers coaches trainers. video sens...  0.736030
33176265: expected: ['148674298'], predicted: ['148674298'], prediction: Correct
         id                                           document     score
0  52844591  audiencehere geochemical lopevi volcano volcan...  1.000000
1  52722823  audiencehere geochemical lopevi volcano volcan...  1.000000
2  52308905  audiencehere geochemical lopevi volcano volcan...  1.000000
3  52717537  audiencethe volcanism cameroon volcanic mantle...  0.893717
4  52840980  audiencethe volcanism cameroon volcanic mantle...  0.893717
52308905: expected: ['52722823', '52844591'], predicted: ['52844591', '52722823'], prediction: Correct
         id                                           document     score
0  35078501  lagrangian formalism supermembrane supergravit...  1.000000
1   2531039  lagrangian formalism supermembrane supergravit...  1.000000
2  35093363  lagrangian formalism supermembrane supergravit...  1.000000
3  44119402  lagrangian formalism supermembrane supergravit...  1.000000
4  35089833  supergravity correlators worldsheet analogous ...  0.847565
44119402: expected: ['2531039', '35078501', '35093363'], predicted: ['2531039', '35078501', '35093363'], prediction: Correct
          id                                           document  score
0   46770666  microlensing surveys tens millions stars. unpr...    1.0
1   52456923  microlensing surveys tens millions stars. unpr...    1.0
2  152091185  microlensing surveys tens millions stars. unpr...    1.0
3   52695218  microlensing surveys tens millions stars. unpr...    1.0
4   47110549  microlensing surveys tens millions stars. unpr...    1.0
47110549: expected: ['46770666', '52456923', '152091185', '52695218', '52739626'], predicted: ['52695218', '46770666', '52739626', '152091185', '52456923'], prediction: Correct
all_predictions
{'Correct': 21, 'False': 0}
# Overall accuracy on a test
accuracy = round(
    all_predictions["Correct"]
    / (all_predictions["Correct"] + all_predictions["False"]),
    4,
)
accuracy
1.0
# Print the prediction counts for each class, keyed by the number of duplicates in the labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)
Correct False
0 10 0
1 8 0
2 1 0
3 1 0
5 1 0

Delete the Index

Delete the index once you are sure that you no longer need it. Once an index is deleted, it cannot be used again.

# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)

Summary

In this notebook, we demonstrate how to deduplicate a dataset of 100,000 scholarly articles using Pinecone. With the articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then apply an LSH classifier to the retrieved similar articles to identify duplicates. Overall, we show that it is easy to combine Pinecone with article embedding models and deduplication classifiers to build a deduplication service.