Document Deduplication

This notebook demonstrates how to use Pinecone’s similarity search to create a simple data deduplication application.

The goal is to build a deduplication application that eliminates near-duplicate copies of academic texts. Running a near-duplication detector over every pair of documents in a large collection is not feasible. Thus, in this example, we perform the deduplication of a given text in two steps. First, we use a similarity-search service to sift out a small set of candidate texts. Then, we apply a near-duplication detector to those candidates.

The similarity search uses a vector representation of the texts, so that semantic similarity is translated into proximity in a vector space. For detecting near-duplicates, we employ a classification model that examines the raw text.
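
Schematically, the two-step pipeline looks like this (a minimal sketch with hypothetical helper names; the concrete implementations follow in the rest of this notebook):

# Hypothetical outline of the two-step deduplication pipeline
def find_duplicates(text, embed, vector_index, near_dup_classifier, top_k=100):
    # Step 1: retrieve a small set of candidates by vector similarity (fast)
    candidate_ids = vector_index.query(embed(text), top_k=top_k)
    # Step 2: run the slower near-duplication classifier only on those candidates
    return near_dup_classifier(text, candidate_ids)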

Dependencies

!pip install --quiet sentence-transformers
!pip install --quiet progressbar2
!pip install --quiet snapy
!pip install --quiet mmh3
import os
import json
import math
import statistics
import pandas as pd
from progressbar import progressbar
from snapy import MinHash, LSH
from sentence_transformers import SentenceTransformer

Pinecone Installation and Setup

!pip install --quiet -U pinecone-client
import pinecone.graph
import pinecone.service
import pinecone.connector
import pinecone.hub
# Load Pinecone API key

api_key = '<YOUR-API-KEY>'
pinecone.init(api_key=api_key)

Get a Pinecone API key if you don’t have one already.
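
If you prefer not to hard-code the key, you can read it from an environment variable instead (a minimal sketch; PINECONE_API_KEY is just an assumed variable name):

import os

# Assumes the key was exported beforehand, e.g. export PINECONE_API_KEY=...
api_key = os.environ.get('PINECONE_API_KEY', '<YOUR-API-KEY>')
pinecone.init(api_key=api_key)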

Define a New Pinecone Service

# Pick a name for the new service
service_name = 'deduplication'
# Stop the service if one with the same name already exists
if service_name in pinecone.service.ls():
    pinecone.service.stop(service_name)

Create a graph

graph = pinecone.graph.IndexGraph(metric='cosine', shards=1)
graph.view()

[Diagram of the document deduplication service graph]

Deploy the graph

pinecone.service.deploy(service_name, graph, timeout=300)
{'success': True, 'msg': ''}

Create a connection to the new service

conn = pinecone.connector.connect('deduplication')
conn.info()
InfoResult(index_size=0)

Upload

The Deduplication Dataset 2020, published by CORE, consists of 100,000 scholarly documents.

Load data

!wget https://core.ac.uk/exports/custom_datasets/deduplication_dataset_2020.zip -q --show-progress
!unzip -q deduplication_dataset_2020.zip
deduplication_datas 100%[===================>]  59.10M  1.67MB/s    in 37s
ROOT_PATH = os.getcwd()
DATA_PATH = os.path.join(ROOT_PATH, "deduplication_dataset_2020", "Ground_Truth_data.jsonl")

with open(DATA_PATH, encoding="utf8") as json_file:
    data = list(json_file)

Here is an example of the data.

data_json = [json.loads(json_str) for json_str in data]
df = pd.DataFrame.from_dict(data_json)
df.head()
|   | core_id | doi | original_abstract | original_title | processed_title | processed_abstract | cat | labelled_duplicates |
|---|---------|-----|-------------------|----------------|-----------------|--------------------|-----|---------------------|
| 0 | 11251086 | 10.1016/j.ajhg.2007.12.013 | Unobstructed vision requires a particular refr... | Mutation of solute carrier SLC16A12 associates... | mutation of solute carrier slc16a12 associates... | unobstructed vision refractive lens differenti... | exact_dup | [82332306] |
| 1 | 11309751 | 10.1103/PhysRevLett.101.193002 | Two-color multiphoton ionization of atomic hel... | Polarization control in two-color above-thresh... | polarization control in two-color above-thresh... | multiphoton ionization helium combining extrem... | exact_dup | [147599753] |
| 2 | 11311385 | 10.1016/j.ab.2011.02.013 | Lectin’s are proteins capable of recognising a... | Optimisation of the enzyme-linked lectin assay... | optimisation of the enzyme-linked lectin assay... | lectin’s capable recognising oligosaccharide t... | exact_dup | [147603441] |
| 3 | 11992240 | 10.1016/j.jpcs.2007.07.063 | In this work, we present a detailed transmissi... | Vertical composition fluctuations in (Ga,In)(N... | vertical composition fluctuations in (ga,in)(n... | microscopy interfacial uniformity wells grown ... | exact_dup | [148653623] |
| 4 | 11994990 | 10.1016/S0169-5983(03)00013-3 | Three-dimensional (3D) oscillatory boundary la... | Three-dimensional streaming flows driven by os... | three-dimensional streaming flows driven by os... | oscillatory attached deformable walls boundari... | exact_dup | [148656283] |
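
Since the file is in JSON Lines format, the two loading steps above could equivalently be collapsed into a single pandas call (an alternative sketch, not used in the rest of the notebook):

# Equivalent one-step load of the JSON Lines file
df = pd.read_json(DATA_PATH, lines=True)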

Let’s calculate the frequency of duplicates per article. Observe that half of the items have no duplicates, and only a small fraction of the items have more than ten duplicates.

lens = df.labelled_duplicates.apply(len)
lens.value_counts()
0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64
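As a quick check of the claim above, the fraction of documents with no labelled duplicates can be computed directly:

# Fraction of documents with an empty labelled_duplicates list
print((lens == 0).mean())  # 0.5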
# Define a new column for calculating embeddings
df['combined_text'] = df.apply(lambda x: str(x.processed_title) +" "+ str(x.processed_abstract), axis=1)

Load model

We will use the Average Word Embeddings GloVe model (average_word_embeddings_glove.6B.300d from sentence-transformers) to transform each text into a 300-dimensional vector embedding. We then upload the embeddings into the Pinecone vector index.

model = SentenceTransformer('average_word_embeddings_glove.6B.300d')
vectors = model.encode(df.combined_text.to_list(), show_progress_bar=True)
upsert_acks = conn.upsert(items=zip(df.core_id.values, vectors)).collect()
conn.info()
InfoResult(index_size=100000)
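
As an optional sanity check (this just reuses the same conn.query call that appears later in the notebook, with a single vector), a document's own vector should return that document as its top match:

# The nearest neighbor of a document's own vector should be the document itself
sample_vector = model.encode([df.combined_text.iloc[0]])
sample_result = conn.query(queries=sample_vector, top_k=3).collect()[0]
print(sample_result.ids[0], df.core_id.iloc[0])  # the two ids should match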

Let’s create a set of test queries from the dataset. Recall that, for each text, the dataset lists its labelled duplicates. Below, we show the distribution of the number of duplicates per query in the resulting test set.

# Create a test sample: group the documents by their number of labelled duplicates
# and take the first ~1% of each group
SAMPLE_FRACTION = 0.01
test_documents = df.groupby(df['labelled_duplicates'].map(len)).apply(lambda x: x.head(math.ceil(len(x)*SAMPLE_FRACTION))).reset_index(drop=True)

print('Number of documents with specified number of duplicates:')
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()
Number of documents with specified number of duplicates:
0     500
1     362
2      77
3      32
4      14
5       8
6       5
7       3
8       2
9       1
10      1
11      1
12      1
13      1
Name: labelled_duplicates, dtype: int64
# Use the model to create embeddings for test documents
vectors = model.encode(test_documents.combined_text.to_list())
# Query the vector index
query_results = conn.query(queries=vectors, top_k=100).collect()
# Save all retrieval recalls into a list
recalls = []

for id, res in progressbar(list(zip(test_documents.core_id.values, query_results))):

    # Find document with id in labelled dataset
    labeled_df = df[df.core_id == str(id)]

    # Calculate the retrieval recall
    top_k_list = set(res.ids)
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_list.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))
    else:
        recalls.append(1.0)
100% (1008 of 1008) |####################| Elapsed Time: 0:00:13 Time:  0:00:13
print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))
Mean for the retrieval recall is 0.9967833092833093
Standard Deviation is  0.0539145625573685
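
For reference, the retrieval recall of a single query is the fraction of its labelled duplicates that appear among the top-100 results, i.e. recall = |top_k ∩ labelled_duplicates| / |labelled_duplicates|. For example, a document with three labelled duplicates of which two are retrieved gets a recall of 2/3, while a document with no labelled duplicates counts as a recall of 1.0.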

Running the Classifier

We will use a MinHash/LSH near-duplication classifier (via the snapy library) to detect duplicates among the query results, but only on a sample this time, as it runs slowly. Feel free to check the results on the complete test dataset.
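
To see how the classifier behaves in isolation, here is a minimal, self-contained MinHash/LSH example using the same snapy calls and parameters as below (the strings and labels are made up for illustration):

from snapy import MinHash, LSH

# Two near-duplicate strings and one unrelated string
content = [
    'deep learning methods for natural language processing',
    'deep learning methods for natural language processing tasks',
    'a study of coral reef ecosystems under climate change',
]
labels = ['doc_a', 'doc_b', 'doc_c']

minhash = MinHash(content, n_gram=4, permutations=100, hash_bits=64, seed=5)
lsh = LSH(minhash, labels, no_of_bands=50)

# doc_b should be reported as a near duplicate of doc_a; doc_c should not
print(lsh.query('doc_a', min_jaccard=0.3))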

# Counters for correct/false predictions
all_predictions = {'Correct':0, 'False':0}
predictions_per_category = {}

# Create samples
query_sample = query_results[::50]
ids_sample = test_documents.core_id.to_list()[::50]

for id, res in zip(ids_sample, query_sample):

    # Find document with id in labelled dataset
    labeled_df = df[df.core_id == str(id)]

    # Create dataframe with most similar documents based on pinecone vector search
    df_result = pd.DataFrame({'id':res.ids,
                              'document': [df[df.core_id == _id].processed_abstract.values[0] for _id in res.ids],
                              'score':res.scores})

    # Define content and labels from query results
    content = df_result.document.values
    labels = list(df_result.id.values)

    # Create MinHash object
    minhash = MinHash(content, n_gram=4, permutations=100, hash_bits=64, seed=5)

    # Create LSH model
    lsh = LSH(minhash, labels, no_of_bands=50)

    # Query to find near duplicates for the query document
    duplicates = lsh.query(id, min_jaccard=0.3)

    # Check whether prediction matches labeled duplicates
    prediction = 'Correct' if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates) else 'False'

    # Add to all predictions
    all_predictions[prediction] += 1

    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0,0]

    if prediction == 'Correct':
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print('{}: expected: {}, predicted: {}, prediction: {}'.format(id,
                                                                   labeled_df.labelled_duplicates.values[0],
                                                                   duplicates,
                                                                   prediction))
all_predictions
{'Correct': 21, 'False': 0}
# Overall accuracy on a test
accuracy = round(all_predictions['Correct'] / (all_predictions['Correct'] + all_predictions['False']), 4)
accuracy
1.0
# Print the prediction counts, grouped by the number of labelled duplicates per document
pd.DataFrame.from_dict(predictions_per_category, orient='index',
                       columns=['Correct', 'False'])
|   | Correct | False |
|---|---------|-------|
| 0 | 10 | 0 |
| 1 | 8 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 7 | 1 | 0 |

Summary

We ran a deduplication task over 100,000 text documents. The results indicate the effectiveness of our two-step process. The filtering step is fast and produces a small set of candidate texts, with a mean retrieval recall of 99.7%. The classification step is slower, yet detects the near-duplicates accurately. It’s worth noting that you can improve the filtering step with a more refined vector embedding model, and the classification step with a more robust classifier. Feel free to explore how changing these models affects the final results.
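
For instance, a transformer-based sentence-embedding model could replace the averaged GloVe vectors (a sketch only; 'all-MiniLM-L6-v2' is just one available sentence-transformers model, and since it produces 384-dimensional embeddings instead of 300, the vectors would need to be upserted into a freshly deployed service):

from sentence_transformers import SentenceTransformer

# Example alternative embedding model (384-dimensional output)
alt_model = SentenceTransformer('all-MiniLM-L6-v2')
alt_vectors = alt_model.encode(df.combined_text.to_list(), show_progress_bar=True)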

Turn off the Service

Turn off the service once you are sure that you do not want to use it anymore. Once the service is stopped, you cannot use it again.

# Stop the service if it is not going to be used anymore
pinecone.service.stop("deduplication")
{'success': True}