Extreme Classification

This demo shows how to label new texts automatically when the number of possible labels is enormous. This scenario is known as extreme classification, a supervised learning variant of multi-class and multi-label classification in which the label space can contain hundreds of thousands of candidates.

Examples of extreme classification include labeling a new article with Wikipedia's topical categories, matching web content with relevant advertisements, assigning catalog labels to product descriptions, and mapping a resume to a set of pertinent job titles.

Here's how we'll perform extreme classification:

  1. We'll transform 250,000 labels into vector embeddings using a publicly available embedding model and upload them into a managed vector index.
  2. Then we'll take an article that requires labeling and transform it into a vector embedding using the same model.
  3. We'll use that article's vector embedding as the query to search the vector index. In effect, this retrieves the labels most similar to the article's semantic content.
  4. With the most relevant labels retrieved, we can automatically apply them to the article.

Let's get started!

Dependencies

!pip install -qU pinecone-client ipywidgets "setuptools>=36.2.1" wikitextparser unidecode
!pip install -qU sentence-transformers --no-cache-dir
import os
import re
import gzip
import json
import pandas as pd
import numpy as np
from wikitextparser import remove_markup, parse
from sentence_transformers import SentenceTransformer
from unidecode import unidecode

Setting up Pinecone's Similarity Search Service

Here we set up our similarity search service. We assume you are familiar with Pinecone's quick start tutorial.

import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

# List all existing indexes for your API key
pinecone.list_indexes()
[]

Get a Pinecone API key if you don’t have one.

# Pick a name for the new index
index_name = 'extreme-ml'
# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
# Create a new vector index
pinecone.create_index(name=index_name, dimension=300)
# Connect to the created index
index = pinecone.Index(index_name)

# Print index statistics
index.describe_index_stats()
{'dimension': 300, 'namespaces': {}}

Data Preparation

In this demo, we classify Wikipedia articles using a standard dataset from an extreme classification benchmarking resource. The dataset used in this example is Wikipedia-500K, which contains around 500,000 labels. Here, we download the raw data and prepare it for the classification task.

# Download train dataset
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=10RBSf6nC9C38wUMwqWur2Yd8mCwtup5K' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=10RBSf6nC9C38wUMwqWur2Yd8mCwtup5K" -O 'trn.raw.json.gz' && rm -rf /tmp/cookies.txt 

# Download test dataset
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1pEyKXtkwHhinuRxmARhtwEQ39VIughDf' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1pEyKXtkwHhinuRxmARhtwEQ39VIughDf" -O 'tst.raw.json.gz' && rm -rf /tmp/cookies.txt

# Download categories labels file
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZYTZPlnkPBCMcNqRRO-gNx8EPgtV-GL3' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZYTZPlnkPBCMcNqRRO-gNx8EPgtV-GL3" -O 'Yf.txt' && rm -rf /tmp/cookies.txt

# Create and move downloaded files to data folder
!mkdir data
!mv 'trn.raw.json.gz' 'tst.raw.json.gz' 'Yf.txt' data
trn.raw.json.gz     100%[===================>]   4.93G   120MB/s    in 57s
2022-02-09 17:14:42 (89.0 MB/s) - ‘trn.raw.json.gz’ saved [5292805889/5292805889]

tst.raw.json.gz     100%[===================>]   2.14G   130MB/s    in 15s
2022-02-09 17:14:59 (141 MB/s) - ‘tst.raw.json.gz’ saved [2297151115/2297151115]

Yf.txt              100%[===================>]  32.18M  --.-KB/s    in 0.1s
2022-02-09 17:15:02 (248 MB/s) - ‘Yf.txt’ saved [33740692/33740692]
# Define paths
ROOT_PATH = os.getcwd()
TRAIN_DATA_PATH = (os.path.join(ROOT_PATH, 'data/trn.raw.json.gz'))
TEST_DATA_PATH = (os.path.join(ROOT_PATH, 'data/tst.raw.json.gz'))
# Load categories
with open('./data/Yf.txt', encoding='utf-8') as f:
    categories = f.readlines()

# Clean values: keep only the label name that follows '->' on each line
categories = [cat.split('->')[1].strip('\n') for cat in categories]

# Show first few categories
categories[:3]
['!!!_albums', '+/-_(band)_albums', '+44_(band)_songs']

Using a Subset of the Data

For this example, we will work with a subset of the Wikipedia articles. This keeps processing time and memory consumption much lower than using the complete dataset. We will select a sample of 200,000 articles, which together carry around 250,000 distinct labels.

Feel free to run the notebook with more data.

# Sample every 5th article from the first 1,000,000 training examples
WIKI_ARTICLES_INDEX = range(0, 1000000, 5)

lines = []

with gzip.open(TRAIN_DATA_PATH) as f:
    for e, line in enumerate(f):
        if e >= 1000000:
            break
        if e in WIKI_ARTICLES_INDEX:
            lines.append(json.loads(line))

df = pd.DataFrame.from_dict(lines)
df = df[['title', 'content', 'target_ind']]
df.head()
title content target_ind
0 Anarchism {{redirect2|anarchist|anarchists|the fictional... [81199, 83757, 83805, 193030, 368811, 368937, ...
1 Academy_Awards {{redirect2|oscars|the oscar|the film|the osca... [19080, 65864, 78208, 96051]
2 Anthropology {{about|the social science}} {{use dmy dates|d... [83605, 423943]
3 American_Football_Conference {{refimprove|date=september 2014}} {{use dmy d... [76725, 314198, 334093]
4 Analysis_of_variance {{use dmy dates|date=june 2013}} '''analysis o... [81170, 168516, 338198, 441529]
print(df.shape)
(200000, 3)

Remove Wikipedia Markup Format

We are going to use only the first 3,000 characters of each article so that the articles are comparable in length. Also, Wikipedia articles are written in wiki markup, which is hard to read, so we will remove the markup to make the content as clean as possible.

# Reduce content to first 3000 characters
df['content_short'] = df.content.apply(lambda x: x[:3000])

# Remove wiki articles markup
df['content_cleaned'] = df.content_short.apply(lambda x: remove_markup(x))

# Keep only certain columns
df = df[['title', 'content_cleaned', 'target_ind']]

# Show data
df.head()
title content_cleaned target_ind
0 Anarchism anarchism is a political philosophy that a... [81199, 83757, 83805, 193030, 368811, 368937, ...
1 Academy_Awards the academy awards or the oscars (the offi... [19080, 65864, 78208, 96051]
2 Anthropology anthropology is the scientific study of hu... [83605, 423943]
3 American_Football_Conference the american football conference (afc) is o... [76725, 314198, 334093]
4 Analysis_of_variance analysis of variance (anova) is a collection ... [81170, 168516, 338198, 441529]

# Keep all labels in a single list
all_categories = []
for i, row in df.iterrows():
    all_categories.extend(row.target_ind)
print('Number of labels: ',len(list(set(all_categories))))
Number of labels:  256899

Create Article Vector Embeddings

Recall that we want to index and search all of the roughly 250,000 labels. We do that by representing each label as the average of the vector embeddings of the articles tagged with it.

Let's first create the article vector embeddings. Here we use one of the Average Word Embeddings Models (average_word_embeddings_komninos) from the sentence-transformers library, which produces 300-dimensional vectors. In the next section, we will aggregate these vectors into the final label embeddings.

# Load the model
model = SentenceTransformer('average_word_embeddings_komninos')

# Create embeddings
encoded_articles = model.encode(df['content_cleaned'], show_progress_bar=True)
df['content_vector'] = pd.Series(encoded_articles.tolist())
Batches:   0%|          | 0/6250 [00:00<?, ?it/s]

Create and Upload Label Embeddings

It turns out that using the article embeddings themselves as index entries does not provide good enough accuracy, so we index and search the labels directly instead.

The embedding of a label is simply the average of the embeddings of all articles tagged with that label.
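
As a toy illustration with made-up 3-dimensional vectors (the real embeddings are 300-dimensional), a label attached to two articles gets the element-wise mean of those articles' vectors:

import numpy as np

# Two hypothetical article embeddings that share the same label
article_a = np.array([0.2, 0.4, 0.6])
article_b = np.array([0.0, 0.2, 0.4])

# The label embedding is the element-wise mean of its articles' embeddings
label_vector = np.mean([article_a, article_b], axis=0)
print(label_vector)  # [0.1 0.3 0.5]

The cells below compute exactly this average for every label over the real article embeddings.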

# Explode the target indicator column
df_explode = df.explode('target_ind')

# Group by label and average the article vectors to get one vector per label
result = df_explode.groupby('target_ind').agg(mean=('content_vector', lambda x: np.vstack(x).mean(axis=0).tolist()))
result['target_ind'] = result.index
result.columns = ['content_vector', 'ind']

result.head()
content_vector ind
target_ind
2 [0.0704750344157219, -0.007719345390796661, 0.... 2
3 [0.05894148722290993, -0.03119848482310772, 0.... 3
5 [0.18302207440137863, 0.061663837544620036, 0.... 5
6 [0.1543595753610134, 0.03904660418629646, 0.03... 6
9 [0.22310754656791687, 0.1524289846420288, 0.09... 9

# Create a list of (id, vector) tuples to upsert. Label names are transliterated to ASCII
# with unidecode and truncated to 64 characters so they can serve as vector IDs.
items_to_upsert = [(unidecode(categories[int(row.ind)])[:64], row.content_vector) for i, row in result.iterrows()]
import itertools

# Helper that yields fixed-size batches from an iterable
def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))
# Upsert the label vectors in batches of 250
for batch in chunks(items_to_upsert, 250):
    index.upsert(vectors=batch)

Let's validate the number of indexed labels.

index.describe_index_stats()
{'dimension': 300, 'namespaces': {'': {'vector_count': 256899}}}

Query

Now let's test the vector index and examine the classification results. Note that here we retrieve a fixed number of labels (top_k=10). In an actual application, you might want to determine the size of the retrieved label set dynamically, for example as sketched below.
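
One simple heuristic for a dynamic cutoff is to keep only the matches whose similarity score lies within a small margin of the best match. The helper below is just a sketch; it assumes a list of match objects with id and score attributes, sorted by descending score, as returned in each result of index.query.

def select_labels(matches, margin=0.01):
    """Keep matches whose score is within `margin` of the top score."""
    if not matches:
        return []
    best_score = matches[0].score
    return [m.id for m in matches if m.score >= best_score - margin]

For this demo, though, we simply keep the top 10 matches.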

NUM_OF_WIKI_ARTICLES = 3
WIKI_ARTICLES_INDEX = range(1111, 100000, 57)[:NUM_OF_WIKI_ARTICLES]

lines = []

with gzip.open(TEST_DATA_PATH) as f:
    for e, line in enumerate(f):
        if e in WIKI_ARTICLES_INDEX:
            lines.append(json.loads(line)) 
        if e > max(WIKI_ARTICLES_INDEX):
            break
            
df_test = pd.DataFrame.from_dict(lines)
df_test = df_test[['title', 'content', 'target_ind']]
df_test.head()
title content target_ind
0 Discrimination {{otheruses}} {{discrimination sidebar}} '''di... [170479, 423902]
1 Erfurt {{refimprove|date=june 2014}} {{use dmy dates|... [142638, 187156, 219262, 294479, 329185, 38243...
2 ETA {{about|the basque organization|other uses|eta... [83681, 100838, 100849, 100868, 176034, 188979...

# Reduce content to first 3000 characters
df_test['content_short'] = df_test.content.apply(lambda x: x[:3000])

# Remove wiki articles markup
df_test['content_cleaned'] = df_test.content_short.apply(lambda x: remove_markup(x))

# Keep only certain columns
df_test = df_test[['title', 'content_cleaned', 'target_ind']]

# Show data
df_test.head()
title content_cleaned target_ind
0 Discrimination discrimination is action that denies social ... [170479, 423902]
1 Erfurt erfurt () is the capital city of thuringia ... [142638, 187156, 219262, 294479, 329185, 38243...
2 ETA eta (, ), an acronym for euskadi ta askatas... [83681, 100838, 100849, 100868, 176034, 188979...

# Create embeddings for test articles
test_vectors = model.encode(df_test['content_cleaned'], show_progress_bar=True)
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
# Query the vector index
query_results = index.query(queries=test_vectors.tolist(), top_k=10)
# Show results
for term, labs, res in zip(df_test.title.tolist(), df_test.target_ind.tolist(), query_results.results):
    print()
    print('Term queried: ', term)
    # Only print ground-truth labels that actually occur in our training subset
    print('Original labels: ')
    for l in labs:
        if l in all_categories:
            print('\t', categories[l])
    print('Predicted: ')
    df_result = pd.DataFrame({
        'id': [match.id for match in res.matches],
        'score': [match.score for match in res.matches],
    })
    display(df_result)
Term queried:  Discrimination
Original labels: 
	 Discrimination
	 Social_justice
Predicted: 
id score
0 Discrimination 0.972958
1 Sociological_terminology 0.971606
2 Identity_politics 0.970097
3 Social_concepts 0.967534
4 Sexism 0.967476
5 Affirmative_action 0.967288
6 Political_correctness 0.966926
7 Human_behavior 0.966475
8 Persecution 0.965421
9 Social_movements 0.964394

Term queried:  Erfurt
Original labels: 
	 Erfurt
	 German_state_capitals
	 Members_of_the_Hanseatic_League
	 Oil_Campaign_of_World_War_II
	 Province_of_Saxony
	 University_towns_in_Germany
Predicted: 
id score
0 University_towns_in_Germany 0.966058
1 Province_of_Saxony 0.959731
2 Populated_places_on_the_Rhine 0.958737
3 Imperial_free_cities 0.957159
4 Hildesheim_(district) 0.956927
5 History_of_the_Electoral_Palatinate 0.956800
6 Towns_in_Saxony-Anhalt 0.956501
7 Towns_in_Lower_Saxony 0.955259
8 Halle_(Saale) 0.954934
9 Cities_in_Saxony-Anhalt 0.954934

Term queried:  ETA
Original labels: 
	 Anti-Francoism
	 Basque_conflict
	 Basque_history
	 Basque_politics
	 ETA
	 European_Union_designated_terrorist_organizations
	 Far-left_politics
	 Francoist_Spain
	 Government_of_Canada_designated_terrorist_organizations
	 Irregular_military
	 Military_wings_of_political_parties
	 National_liberation_movements
	 Nationalist_terrorism
	 Organizations_designated_as_terrorist_by_the_United_States_government
	 Organizations_designated_as_terrorist_in_Europe
	 Organizations_established_in_1959
	 Politics_of_Spain
	 Resistance_movements
	 Secession_in_Spain
	 Secessionist_organizations_in_Europe
	 Terrorism_in_Spain
	 United_Kingdom_Home_Office_designated_terrorist_groups
Predicted: 
id score
0 Organizations_designated_as_terrorist_in_Europe 0.948875
1 Terrorism_in_Spain 0.948431
2 Basque_politics 0.942670
3 Politics_of_Spain 0.941830
4 European_Union_designated_terrorist_organizations 0.940194
5 Irregular_military 0.938163
6 Political_parties_disestablished_in_1977 0.936437
7 Algerian_Civil_War 0.936311
8 Republicanism_in_Spain 0.935577
9 Guerrilla_organizations 0.935506
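
As a rough sanity check, you could also score how many of the retrieved labels match each article's ground-truth labels. The sketch below reuses query_results, df_test, categories, and unidecode from the cells above and applies the same 64-character ID truncation used when upserting; it is only a quick check, not a proper benchmark evaluation.

def precision_at_k(true_inds, retrieved_ids, k=10):
    # Ground-truth label names, normalized the same way as the upserted vector IDs
    true_names = {unidecode(categories[i])[:64] for i in true_inds}
    return sum(r in true_names for r in retrieved_ids[:k]) / k

for (_, row), res in zip(df_test.iterrows(), query_results.results):
    retrieved_ids = [match.id for match in res.matches]
    print(f"{row.title}: precision@10 = {precision_at_k(row.target_ind, retrieved_ids):.2f}")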

Summary

We demonstrated a similarity-search approach to extreme classification of texts. We took a simple approach, representing each label as the average of the vector embeddings of its corresponding articles. At classification time, we match a new article's embedding against its nearest label embeddings. The example results above indicate that this approach is useful.

You can take this forward by exploring more advanced ideas, for example, exploiting the hierarchical relationships between labels or improving the label representations; one possible direction for the latter is sketched below. Just have fun, and feel free to share your thoughts.
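
For example, you could blend each label's averaged article embedding with an embedding of the label name itself. This is only a sketch: it reuses model, result, and categories from the earlier cells, and the 50/50 weighting is an arbitrary choice for illustration.

# Embed the label names themselves (underscores replaced with spaces)
label_names = [categories[int(ind)].replace('_', ' ') for ind in result.ind]
name_vectors = model.encode(label_names, show_progress_bar=True)

# Blend the averaged-article embedding with the label-name embedding (arbitrary 50/50 weights)
result['content_vector'] = [
    (0.5 * np.array(article_vec) + 0.5 * name_vec).tolist()
    for article_vec, name_vec in zip(result.content_vector, name_vectors)
]

# Re-run the upsert cells above to index the blended label vectors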

Delete the index

Delete the index once you no longer need it. Once the index is deleted, it cannot be recovered.

pinecone.delete_index(index_name)