Extreme Classification with Similarity Search

This demo aims to label new texts automatically when the number of possible labels is enormous. This scenario is known as extreme classification, a supervised learning variant that deals with multi-class and multi-label problems involving many choices.

Applications of extreme classification include labeling a new article with Wikipedia's topical labels, matching web content with a set of relevant advertisements, classifying product descriptions with catalog labels, and classifying a resume into a collection of pertinent job titles.

[Figure: Article labeling with extreme classification.]

Here's how we'll perform extreme classification:

  1. We'll transform 250,000 labels into vector embeddings using a publicly available embedding model and upload them into a managed vector index.
  2. Then we'll take an article that requires labeling and transform it into a vector embedding using the same model.
  3. We'll use that article's vector embedding as the query to search the vector index. In effect, this will retrieve the most similar labels to the article's semantic content.
  4. With the most relevant labels retrieved, we can automatically apply them to the article.
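
The four steps above can be sketched end to end with NumPy alone. This is only an illustration: it assumes toy 3-dimensional vectors in place of real model embeddings and a brute-force cosine search in place of the managed vector index. The names `label_names`, `label_vectors`, and `top_k_labels` are hypothetical and do not appear in the notebook.

```python
import numpy as np

def top_k_labels(query_vec, label_vectors, label_names, k=2):
    """Return the k labels whose vectors are most cosine-similar to query_vec."""
    # Normalize so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = label_vectors / np.linalg.norm(label_vectors, axis=1, keepdims=True)
    scores = m @ q
    # Indices of the k highest scores, best first
    best = np.argsort(-scores)[:k]
    return [(label_names[i], float(scores[i])) for i in best]

# Step 1: toy label embeddings (stand-ins for real model output)
label_names = ["Politics", "Sports", "Cooking"]
label_vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
])

# Step 2: toy article embedding, assumed to come from the same model
article_vec = np.array([0.9, 0.3, 0.0])

# Steps 3 and 4: search for the nearest labels and apply them
predicted = top_k_labels(article_vec, label_vectors, label_names, k=2)
print(predicted)
```

A managed vector index plays the role of `top_k_labels` here, but answers the query without scanning every label vector.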

Let's get started!


!pip install -qU pinecone-client ipywidgets "setuptools>=36.2.1" wikitextparser
!pip install -qU sentence-transformers --no-cache-dir
import os
import re
import gzip
import json
import pandas as pd
import numpy as np
from wikitextparser import remove_markup, parse
from sentence_transformers import SentenceTransformer

Setting up Pinecone's Similarity Search Service

Here we set up our similarity search service. We assume you are familiar with Pinecone's quick start tutorial.

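import pinecone

# Load the Pinecone API key from the environment, falling back to a placeholder
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
# Initialize the client (assumes the legacy client's pinecone.init entry point)
pinecone.init(api_key=api_key)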

Get a Pinecone API key if you don’t have one.

# Pick a name for the new index
index_name = 'extreme-ml'

# Delete the index if one with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# Create a new vector index
pinecone.create_index(name=index_name, metric='cosine', shards=1)

# Connect to the created index
index = pinecone.Index(name=index_name, response_timeout=300)

# Print info (assumes the legacy client's Index.info method)
index.info()

Data Preparation

In this demo, we classify Wikipedia articles using a standard dataset from an extreme classification benchmarking resource. The dataset is Wikipedia-500k, which contains around 500,000 labels. Here, we download the raw data and prepare it for the classification task.

# Download train dataset
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=10RBSf6nC9C38wUMwqWur2Yd8mCwtup5K' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=10RBSf6nC9C38wUMwqWur2Yd8mCwtup5K" -O 'trn.raw.json.gz' && rm -rf /tmp/cookies.txt 

# Download test dataset
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1pEyKXtkwHhinuRxmARhtwEQ39VIughDf' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1pEyKXtkwHhinuRxmARhtwEQ39VIughDf" -O 'tst.raw.json.gz' && rm -rf /tmp/cookies.txt

# Download categories labels file
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZYTZPlnkPBCMcNqRRO-gNx8EPgtV-GL3' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZYTZPlnkPBCMcNqRRO-gNx8EPgtV-GL3" -O 'Yf.txt' && rm -rf /tmp/cookies.txt

# Create the data folder and move the downloaded files into it
!mkdir -p data
!mv 'trn.raw.json.gz' 'tst.raw.json.gz' 'Yf.txt' data

# Define paths
ROOT_PATH = os.getcwd()
TRAIN_DATA_PATH = os.path.join(ROOT_PATH, 'data', 'trn.raw.json.gz')
TEST_DATA_PATH = os.path.join(ROOT_PATH, 'data', 'tst.raw.json.gz')

# Load categories
with open('./data/Yf.txt',  encoding='utf-8') as f:
    categories = f.readlines()

# Clean values
categories = [cat.split('->')[1].strip('\n') for cat in categories]

# Show first few categories
categories[:3]
['!!!_albums', '+/-_(band)_albums', '+44_(band)_songs']

Using a Subset of the Data

For this example, we will select and use a subset of Wikipedia articles. This saves processing time and uses much less memory than the complete dataset.

We will select a sample of 200,000 articles that contains around 250,000 different labels.

Feel free to run the notebook with more data.
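
The sampling rule in the next cell keeps every fifth line among the first million. A quick check of that arithmetic, as a sketch (the names `SAMPLE_INDEX` and `kept` are illustrative only):

```python
# Every 5th index among the first 1,000,000 lines
SAMPLE_INDEX = range(0, 1_000_000, 5)

# A range knows its own length without materializing the indices
print(len(SAMPLE_INDEX))  # 200000

# Membership tests on a range are computed arithmetically, so they stay
# cheap even inside a loop over a million lines
kept = [e for e in range(20) if e in SAMPLE_INDEX]
print(kept)  # [0, 5, 10, 15]
```

To run with more data, a smaller step (for example 2 instead of 5) would keep proportionally more articles.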

WIKI_ARTICLES_INDEX = range(0, 1000000, 5)

lines = []

with gzip.open(TRAIN_DATA_PATH) as f:
    for e, line in enumerate(f):
        # Stop after the first million lines
        if e >= 1000000:
            break
        # Keep every fifth article
        if e in WIKI_ARTICLES_INDEX:
            lines.append(json.loads(line))

df = pd.DataFrame.from_dict(lines)
df = df[['title', 'content', 'target_ind']]
df.head()

   title                         content                                            target_ind
0  Anarchism                     {{redirect2|anarchist|anarchists|the fictional...  [81199, 83757, 83805, 193030, 368811, 368937, ...
1  Academy_Awards                {{redirect2|oscars|the oscar|the film|the osca...  [19080, 65864, 78208, 96051]
2  Anthropology                  {{about|the social science}} {{use dmy dates|d...  [83605, 423943]
3  American_Football_Conference  {{refimprove|date=september 2014}} {{use dmy d...  [76725, 314198, 334093]
4  Analysis_of_variance          {{use dmy dates|date=june 2013}} '''analysis o...  [81170, 168516, 338198, 441529]

df.shape
(200000, 3)

Remove Wikipedia Markup Format

We are going to use only the first part of the articles to make them comparable in terms of length. Also, Wikipedia articles have a certain format that is not so readable, so we will remove the markup to make the content as clean as possible.
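
To make the two steps concrete, here is a minimal sketch that truncates a raw article and then strips a few markup patterns. The function `crude_clean` is hypothetical and is only a simplified stand-in for `remove_markup`: it assumes non-nested templates and handles just templates, links, and bold or italic quotes.

```python
import re

def crude_clean(text, max_chars=3000):
    """Truncate, then strip a few common wiki markup patterns.

    Simplified stand-in for wikitextparser.remove_markup: it handles only
    non-nested {{templates}}, [[links]] and '''bold''' / ''italic'' quotes.
    """
    text = text[:max_chars]
    # Drop non-nested templates such as {{about|...}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|shown]] -> shown, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs
    text = re.sub(r"'{2,}", "", text)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

raw = "{{about|the social science}} '''anthropology''' is the study of [[human|humans]]."
cleaned = crude_clean(raw)
print(cleaned)
```

In this sketch, truncating first means a template that straddles the cut-off loses its closing braces and is left in the text, which is one reason to prefer a real parser for the cleaning step.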

# Reduce content to first 3000 characters
df['content_short'] = df.content.apply(lambda x: x[:3000])

# Remove wiki articles markup
df['content_cleaned'] = df.content_short.apply(lambda x: remove_markup(x))

# Keep only certain columns
df = df[['title', 'content_cleaned', 'target_ind']]

# Show data
df.head()

   title                         content_cleaned                                   target_ind
0  Anarchism                     anarchism is a political philosophy that a...     [81199, 83757, 83805, 193030, 368811, 368937, ...
1  Academy_Awards                the academy awards or the oscars (the offi...     [19080, 65864, 78208, 96051]
2  Anthropology                  anthropology is the scientific study of hu...     [83605, 423943]
3  American_Football_Conference  the american football conference (afc) is o...    [76725, 314198, 334093]
4  Analysis_of_variance          analysis of variance (anova) is a collection ...  [81170, 168516, 338198, 441529]

# Keep all labels in a single list
all_categories = []
for i, row in df.iterrows():
    all_categories.extend(row.target_ind)
print('Number of labels: ', len(set(all_categories)))
Number of labels:  256899

Create Article Vector Embeddings

Recall that we want to index and search all of the roughly 250,000 labels. We do that by averaging, for each label, the vector embeddings of the articles that carry that label.

Let's first create the article vector embeddings. Here we use an average word embeddings model, average_word_embeddings_komninos. In the next section, we will aggregate these vectors to make the final label embeddings.

# Load the model
model = SentenceTransformer('average_word_embeddings_komninos')

# Create embeddings (pass a plain list of strings to the encoder)
encoded_articles = model.encode(df['content_cleaned'].tolist(), show_progress_bar=True)
df['content_vector'] = pd.Series(encoded_articles.tolist())

Upload articles

Searching over the article embeddings themselves does not appear to give good enough accuracy. Therefore, we chose to index and search the labels directly.

The label embedding is simply the average of all its corresponding article embeddings.
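
As a small worked example of this averaging, the sketch below uses three toy 2-dimensional article vectors and two made-up label ids (7 and 9). All names and values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
from collections import defaultdict

# Toy article embeddings and the label ids attached to each article
article_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
article_labels = [[7], [7, 9], [9]]

# Collect, for each label, the vectors of the articles that carry it
vectors_by_label = defaultdict(list)
for vec, labels in zip(article_vectors, article_labels):
    for label in labels:
        vectors_by_label[label].append(vec)

# The label embedding is the element-wise mean of its article vectors
label_embeddings = {
    label: np.mean(vecs, axis=0) for label, vecs in vectors_by_label.items()
}
print(label_embeddings[7].tolist())  # [0.5, 0.5]
print(label_embeddings[9].tolist())  # [0.5, 1.0]
```

Label 7 is carried by the first two articles, so its embedding is the mean of `[1, 0]` and `[0, 1]`. The pandas cell that follows does the same thing at scale with `explode` and `groupby`.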

# Explode the target indicator column
df_explode = df.explode('target_ind')

# Group by label and define a unique vector for each label
result = df_explode.groupby('target_ind').agg(mean=('content_vector', lambda x: np.vstack(x).mean(axis=0).tolist()))
result['target_ind'] = result.index
result.columns = ['content_vector', 'ind']

result.head()

   content_vector                                     ind
2  [0.0704750344157219, -0.007719345390796661, 0....  2
3  [0.05894148722290993, -0.03119848482310772, 0....  3
5  [0.18302207440137863, 0.061663837544620036, 0....  5
6  [0.1543595753610134, 0.03904660418629646, 0.03...  6
9  [0.22310754656791687, 0.1524289846420288, 0.09...  9

# Create a list of items to upsert
items_to_upsert = [(categories[int(row.ind)][:64], row.content_vector) for i, row in result.iterrows()]
# Upsert data
acks = index.upsert(items=items_to_upsert)

Let's validate the number of indexed labels.

# Print index info (assumes the legacy client's Index.info method)
index.info()



Now, let's test the vector index and examine the classifier results. Observe that here we retrieve a fixed number of labels. Naturally, in an actual application, you might want to calculate the size of the retrieved label set dynamically.
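
One possible way to size the label set dynamically is to keep only the matches whose score is close to the best one. The sketch below is an assumption, not part of the notebook: `select_labels`, `rel_margin`, and the scores are all made up for illustration.

```python
def select_labels(matches, max_labels=10, rel_margin=0.005):
    """Keep matches scoring within rel_margin of the best score.

    matches: list of (label, score) pairs sorted by descending score.
    """
    if not matches:
        return []
    best_score = matches[0][1]
    # Anything below this cutoff is considered too far from the best match
    cutoff = best_score * (1.0 - rel_margin)
    kept = [(label, score) for label, score in matches if score >= cutoff]
    return kept[:max_labels]

# Made-up scores, sorted from best to worst
matches = [
    ("label_a", 0.970),
    ("label_b", 0.968),
    ("label_c", 0.960),
    ("label_d", 0.950),
]
selected = select_labels(matches)
print([label for label, _ in selected])
```

With these toy scores only the first two labels survive the cutoff. A suitable margin would have to be tuned on held-out data, since cosine scores from averaged word embeddings tend to sit in a narrow band.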

# Number of test articles to classify
NUM_OF_WIKI_ARTICLES = 3
WIKI_ARTICLES_INDEX = range(1111, 100000, 57)[:NUM_OF_WIKI_ARTICLES]

lines = []

with gzip.open(TEST_DATA_PATH) as f:
    for e, line in enumerate(f):
        # Keep only the selected test articles
        if e in WIKI_ARTICLES_INDEX:
            lines.append(json.loads(line))
        # Stop once the last selected article has been read
        if e > max(WIKI_ARTICLES_INDEX):
            break

df_test = pd.DataFrame.from_dict(lines)
df_test = df_test[['title', 'content', 'target_ind']]
df_test.head()

   title           content                                            target_ind
0  Discrimination  {{otheruses}} {{discrimination sidebar}} '''di...  [170479, 423902]
1  Erfurt          {{refimprove|date=june 2014}} {{use dmy dates|...  [142638, 187156, 219262, 294479, 329185, 38243...
2  ETA             {{about|the basque organization|other uses|eta...  [83681, 100838, 100849, 100868, 176034, 188979...

# Reduce content to first 3000 characters
df_test['content_short'] = df_test.content.apply(lambda x: x[:3000])

# Remove wiki articles markup
df_test['content_cleaned'] = df_test.content_short.apply(lambda x: remove_markup(x))

# Keep only certain columns
df_test = df_test[['title', 'content_cleaned', 'target_ind']]

# Show data
df_test.head()

   title           content_cleaned                                  target_ind
0  Discrimination  discrimination is action that denies social ...  [170479, 423902]
1  Erfurt          erfurt () is the capital city of thuringia ...   [142638, 187156, 219262, 294479, 329185, 38243...
2  ETA             eta (, ), an acronym for euskadi ta askatas...   [83681, 100838, 100849, 100868, 176034, 188979...

# Create embeddings for test articles
test_vectors = model.encode(df_test['content_cleaned'].tolist(), show_progress_bar=True)
# Query the vector index
query_results = index.query(queries=test_vectors, top_k=10)
# Show results
# Use a set so that each membership test below is fast
known_labels = set(all_categories)
for term, labs, res in zip(df_test.title.tolist(), df_test.target_ind.tolist(), query_results):
    print('Term queried: ', term)
    print('Original labels: ')
    # Print only the true labels that were part of the indexed subset
    for l in labs:
        if l in known_labels:
            print('\t', categories[l])
    print('Predicted: ')
    df_result = pd.DataFrame({
        'id': list(res.ids),
        'score': list(res.scores),
    })
    display(df_result)
Term queried:  Discrimination
Original labels: 
id score
0 Discrimination 0.972957
1 Sociological_terminology 0.971605
2 Identity_politics 0.970097
3 Social_concepts 0.967534
4 Sexism 0.967476
5 Affirmative_action 0.967288
6 Political_correctness 0.966926
7 Human_behavior 0.966475
8 Persecution 0.965421
9 Social_movements 0.964393
Term queried:  Erfurt
Original labels: 
id score
0 University_towns_in_Germany 0.966058
1 Province_of_Saxony 0.959731
2 Populated_places_on_the_Rhine 0.958738
3 Imperial_free_cities 0.957159
4 Hildesheim_(district) 0.956928
5 History_of_the_Electoral_Palatinate 0.956800
6 Towns_in_Saxony-Anhalt 0.956501
7 Towns_in_Lower_Saxony 0.955259
8 Halle_(Saale) 0.954934
9 Cities_in_Saxony-Anhalt 0.954934
Term queried:  ETA
Original labels: 
id score
0 Organizations_designated_as_terrorist_in_Europe 0.948875
1 Terrorism_in_Spain 0.948431
2 Basque_politics 0.942670
3 Politics_of_Spain 0.941830
4 European_Union_designated_terrorist_organizations 0.940194
5 Irregular_military 0.938163
6 Political_parties_disestablished_in_1977 0.936437
7 Algerian_Civil_War 0.936311
8 Republicanism_in_Spain 0.935577
9 Guerrilla_organizations 0.935507


We demonstrated a similarity search approach for performing extreme classification of texts. We took a simple approach, representing each label as the average of the vector embeddings of its corresponding texts. At classification time, we match a new article's embedding to its nearest label embeddings. The example results above suggest that this approach is useful.
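
To go beyond eyeballing examples, the retrieved labels could be scored against the true ones. The sketch below computes precision at k, a metric commonly used for extreme classification. It is not part of the notebook, and `precision_at_k` and the toy labels are assumptions for illustration.

```python
def precision_at_k(predicted, actual, k):
    """Fraction of the top-k predicted labels that are true labels."""
    if k <= 0:
        raise ValueError("k must be positive")
    actual_set = set(actual)
    # Count how many of the first k predictions are correct
    hits = sum(1 for label in predicted[:k] if label in actual_set)
    return hits / k

# Toy example: predictions ordered from most to least similar
predicted = ["a", "b", "c", "d"]
actual = ["b", "d", "x"]

print(precision_at_k(predicted, actual, k=1))  # 0.0
print(precision_at_k(predicted, actual, k=2))  # 0.5
```

Averaging this value over many test articles would give a single number for comparing label representations.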

You can take this forward by exploring advanced ideas. For example, you can utilize the hierarchical relationship between labels or improve the label representations. Just have fun, and feel free to share your thoughts.

Turn off the service

Turn off the service once you are sure you no longer need it. Once the service is stopped, it cannot be used again.

# Delete the index (assumes the legacy client's pinecone.delete_index call)
pinecone.delete_index(index_name)