90% of the world’s data is unstructured. It is built by humans, for humans. That’s great for human consumption, but it is very hard to organize at the massive scale of today’s information age.
Organization is complicated because unstructured text data is not intended to be understood by machines, and having humans process this abundance of data is wildly expensive and very slow.
Fortunately, there is light at the end of the tunnel. More and more of this unstructured text is becoming accessible and understood by machines. We can now search text based on meaning, identify the sentiment of text, extract entities, and much more.
Transformers are behind much of this. These transformers are (unfortunately) not Michael Bay’s Autobots and Decepticons and (fortunately) not buzzing electrical boxes. Our NLP transformers lie somewhere in the middle: they’re not sentient Autobots (yet), but they can understand language in a way that existed only in sci-fi until just a few years ago.
Machines with a human-like comprehension of language are pretty helpful for organizing masses of unstructured text data. In machine learning, we refer to this task as topic modeling, the automatic clustering of data into particular topics.
BERTopic takes advantage of the superior language capabilities of these (not yet sentient) transformer models and uses some other ML magic like UMAP and HDBSCAN (more on these later) to produce what is one of the most advanced techniques in language topic modeling today.
BERTopic at a Glance
We will dive into the details behind BERTopic [1], but before we do, let us see how we can use it and take a first glance at its components.
To begin, we need a dataset. We can download the dataset from HuggingFace datasets with:
from datasets import load_dataset
data = load_dataset('jamescalam/python-reddit', split='train')
The dataset contains data extracted using the Reddit API from the /r/python subreddit. The code used for this (and all other examples) can be found here.
Reddit thread contents are found in the selftext feature. Some are empty or very short, so we remove them with:
data = data.filter(
    lambda x: len(x['selftext']) > 30
)
We perform topic modeling using the BERTopic library. The “basic” approach requires just a few lines of code.
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# we add this to remove stopwords
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

model = BERTopic(
    vectorizer_model=vectorizer_model,
    language='english', calculate_probabilities=True,
    verbose=True
)

# the text we model is the filtered selftext field
text = data['selftext']
topics, probs = model.fit_transform(text)
From model.fit_transform we get back two outputs:

- topics contains a one-to-one mapping of inputs to their modeled topic (or cluster).
- probs contains a list of probabilities that an input belongs to its assigned topic.
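As a quick sanity check (a minimal sketch; the exact values will depend on the dataset and random state), we can peek at both outputs:

# first few topic assignments, one per input document
print(topics[:5])

# probability scores for the first document
print(probs[0])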
We can then view the topics using get_topic_info.
freq = model.get_topic_info()
freq.head(10)
   Topic  Count  Name
0     -1    196  -1_python_code_data_using
1      0     68  0_image_ampx200b_code_images
2      1     58  1_python_learning_programming_just
3      2     44  2_python_django_flask_library
4      3     32  3_link_title_thumbnail_datepublished
5      4     28  4_package_python_like_slap
6      5     27  5_spectra_space_asteroid_training
7      6     26  6_make_project ideas_ideas_comment
8      7     23  7_log_logging_use_conn
9      8     21  8_questions_thread_response_python
The top -1 topic is typically assumed to be irrelevant and usually contains stop words like “the”, “a”, and “and”. However, because we removed stop words via the vectorizer_model argument, it instead contains the “most generic” of topics, built around words like “Python”, “code”, and “data”.
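To inspect the words behind any of these labels, we can use the model’s get_topic method; a quick sketch (the exact words and scores from your run will differ):

# list of (word, score) tuples for the outlier topic
model.get_topic(-1)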
The library has several built-in visualization methods like visualize_topics, visualize_hierarchy, and visualize_barchart. BERTopic’s visualize_hierarchy visualization allows us to view the “hierarchy” of topics.
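These are all called on the fitted model; a minimal sketch of how they are used (each returns a figure we can display in a notebook or save to file):

# intertopic distance map of the modeled topics
model.visualize_topics()

# hierarchical structure of the topics
model.visualize_hierarchy()

# top words per topic as bar charts
model.visualize_barchart()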
These represent the surface level of the BERTopic library, which has excellent documentation, so we will not rehash that here. Instead, let’s try to understand how BERTopic works.
Overview
There are four key components used in BERTopic [2]:
- A transformer embedding model
- UMAP dimensionality reduction
- HDBSCAN clustering
- Cluster tagging using c-TF-IDF
We already did all of this in those few lines of BERTopic code; everything is just abstracted away. However, we can optimize the process by understanding the essentials of each component. In this section, we will work through each component without BERTopic, learning how each works before returning to BERTopic at the end.
Transformer Embedding
BERTopic supports several libraries for encoding our text into dense vector embeddings. If we build poor-quality embeddings, nothing we do in the later steps will be able to help us, so it is very important that we choose a suitable embedding model from one of the supported libraries, which include:
- Sentence Transformers
- Flair
- SpaCy
- Gensim
- USE (from TF Hub)
Of the above, the Sentence Transformers library provides the most extensive selection of high-performing sentence embedding models. They can be found on the HuggingFace Hub by searching for “sentence-transformers”.
The first result of this search is sentence-transformers/all-MiniLM-L6-v2, a popular high-performing model that creates 384-dimensional sentence embeddings. To initialize the model and encode our Reddit topics data, we first pip install sentence-transformers and then write:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
(2): Normalize()
)
import numpy as np
from tqdm.auto import tqdm

n = len(data)  # number of records to encode
batch_size = 16
# initialize an array to hold all of the sentence embeddings
embeds = np.zeros((n, model.get_sentence_embedding_dimension()))

for i in tqdm(range(0, n, batch_size)):
    i_end = min(i+batch_size, n)
    # extract and encode a batch of text
    batch = data['selftext'][i:i_end]
    batch_embed = model.encode(batch)
    # add the batch of embeddings to the array
    embeds[i:i_end, :] = batch_embed
100%|██████████| 195/195 [08:51<00:00, 2.73s/it]
Here we have encoded our text in batches of 16. Each batch is added to the embeds array. Once we have all of the sentence embeddings in embeds, we’re ready to move on to the next step.
Dimensionality Reduction
After building our embeddings, BERTopic compresses them into a lower-dimensional space. This means that our 384-dimensional vectors are transformed into two/three-dimensional vectors.
We can do this because 384 dimensions are a lot, and it is unlikely that we really need that many dimensions to represent our text [4]. Instead, we attempt to compress that information into two or three dimensions.
We do this so that the following HDBSCAN clustering step can be done more efficiently. Performing the clustering step with 384 dimensions would be desperately slow [5].
Another benefit is that we can visualize our data; this is incredibly helpful when assessing whether our data can be clustered. Visualization also helps when tuning the dimensionality reduction parameters.
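As a preview of what this step looks like in isolation (BERTopic performs it internally with UMAP), a minimal sketch using the umap-learn library on our embeds array might look like this; the parameter values here are illustrative rather than BERTopic’s defaults:

import umap

# reduce the 384-dimensional embeddings to three dimensions
# (n_neighbors and min_dist are illustrative values, not BERTopic's defaults)
reducer = umap.UMAP(n_neighbors=15, n_components=3, min_dist=0.05, random_state=42)
embeds_3d = reducer.fit_transform(embeds)
embeds_3d.shape  # (number of documents, 3)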
To help us understand dimensionality reduction, we will start with a 3D representation of the world. You can find the code for this part here.
3D scatter plot of points from the jamescalam/world-cities-geo dataset.
We can apply many dimensionality reduction techniques to this data; two of the most popular choices are PCA and t-SNE.
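To get a feel for how these behave before moving on, here is a minimal sketch applying both with scikit-learn, assuming the 3D coordinates above are held in an (n, 3) array called points (the array name is an assumption; the actual feature names in the dataset may differ):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# `points` is assumed to be an (n, 3) array of the x/y/z coordinates plotted above
points = np.asarray(points)

# PCA: a linear projection onto the directions of greatest variance
pca_2d = PCA(n_components=2).fit_transform(points)

# t-SNE: a non-linear method that tries to preserve local neighborhood structure
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(points)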