Introduction to Machine Learning for SEO

Practical: Clustering of page content with BERTopic

Lesson Preview

This practical lesson shows how to cluster website content using BERTopic, so you can map themes across large content libraries and label pages with the topics they most strongly align with. For Marketing and SEO Professionals, the benefit is clear: faster content audits, clearer topic structures, and better inputs for internal linking and competitive research, all without needing labeled data.

You’ll see how BERTopic works end to end: embedding documents with transformer models, reducing dimensionality, and forming dense clusters that surface coherent, human-readable topics. The lesson explains where BERTopic shines compared with LDA, NMF, Top2Vec, and FastTopic, and where it struggles (e.g., very small datasets, compute demands, over-granular topics). You’ll also learn how BERTopic’s “Lego-style” API lets you swap in different tokenizers, clustering algorithms, and embedding models based on your data and goals. A short Google Colab demo (on the same BBC dataset used in the LDA lesson) lets you compare outputs side by side and export topic summaries for analysis.
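To make that pipeline concrete before you open the notebook, here is a minimal sketch in Python. It assumes recent versions of bertopic, sentence-transformers, umap-learn, hdbscan, and scikit-learn; the 20 newsgroups corpus, model name, and parameter values are illustrative stand-ins, not the lesson’s Colab notebook or its exact settings.

    # Minimal BERTopic pipeline sketch: embed -> reduce dimensionality -> cluster -> label topics.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from sklearn.datasets import fetch_20newsgroups

    # Stand-in corpus; in the lesson you would load the BBC dataset instead.
    docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")            # 1. dense document embeddings
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)                  # 2. compress embeddings
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                            prediction_data=True)                        # 3. dense clusters become candidate topics

    topic_model = BERTopic(embedding_model=embedding_model,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(docs)

    # Topic summary (c-TF-IDF keywords per topic) and per-document assignments,
    # exported for the kind of content-audit analysis described above.
    topic_model.get_topic_info().to_csv("topic_summary.csv", index=False)
    topic_model.get_document_info(docs).to_csv("page_topics.csv", index=False)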

By the end, you’ll know when to pick BERTopic, how to interpret its topic distributions and labels, and how to combine it with other approaches (like LDA or text classification) to get a more nuanced picture of your site’s themes and subtopics.


What you’ll learn (why it matters)

  • How BERTopic forms topics — because clear clusters speed content audits.
  • When to choose BERTopic vs. LDA — because fit to data affects coherence.
  • Modular pipeline choices — because custom components match your use case.
  • Interpreting outputs — because better labels guide internal linking.
  • Limits on small datasets — because sample size impacts quality.
  • Handling over-granularity — because too many topics slow analysis.

Key concepts (with mini-definitions)

  • BERTopic — embedding-based topic modeling that clusters documents into interpretable topics.
  • Embeddings (BERT) — dense vectors capturing semantic relationships between words/documents.
  • UMAP — dimensionality reduction used to compress embeddings before clustering.
  • HDBSCAN — density-based clustering that finds tight groups as candidate topics.
  • c-TF-IDF — class-based TF-IDF representation that highlights the words most representative of each topic.
  • MMR (Maximal Marginal Relevance) — re-ranking step that balances relevance and diversity of each topic’s terms (see the sketch after this list).
  • Soft/Fuzzy clustering — documents can relate to multiple topics rather than being forced into a single hard cluster.
  • Cosine similarity — measures closeness between embeddings when matching topics to documents.
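To show how several of these concepts plug together, here is a hedged sketch of BERTopic’s modular (“Lego-style”) API, assuming a recent bertopic release; the swapped-in components and values are examples for illustration, not the lesson’s configuration.

    # Swapping components: K-Means for clustering, a custom vectorizer for c-TF-IDF, MMR for term ranking.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]  # stand-in corpus

    topic_model = BERTopic(
        embedding_model="all-MiniLM-L6-v2",                        # any Sentence Transformers model name
        hdbscan_model=KMeans(n_clusters=15),                       # K-Means instead of HDBSCAN: fixed topic count, no outliers
        vectorizer_model=CountVectorizer(stop_words="english",
                                         ngram_range=(1, 2)),      # tokenization/vocabulary feeding c-TF-IDF
        representation_model=MaximalMarginalRelevance(diversity=0.3),  # MMR re-ranks each topic's terms
    )
    topics, _ = topic_model.fit_transform(docs)
    print(topic_model.get_topic(0))                                # (term, c-TF-IDF weight) pairs for topic 0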

Tools mentioned

BERTopic, BERT/Sentence Transformers, UMAP, HDBSCAN, K-Means, BIRCH, c-TF-IDF, MMR, Google Colab (Python), LDA, NMF, Top2Vec, FastTopic, Google Natural Language API (text classification), Hugging Face models and GitHub, Medium.


Practice & readings

  • Run the provided Google Colab demo on the BBC dataset; compare BERTopic vs. LDA outputs (a rough local baseline is sketched after this list).
  • Review the creator’s BERTopic GitHub/Medium/YouTube materials for tips and performance tuning.
  • Compare results with Google’s Natural Language API classification to validate topic labels.
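As a lightweight stand-in for the Colab comparison, a rough LDA baseline can be run locally with scikit-learn; this is an assumption on my part, since the course’s LDA lesson may use a different library or settings, and the corpus here is again a placeholder.

    # Quick LDA baseline to place next to the BERTopic topics (illustrative only).
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]  # stand-in corpus

    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    dtm = vectorizer.fit_transform(docs)                           # document-term matrix

    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    lda.fit(dtm)

    terms = vectorizer.get_feature_names_out()
    for topic_id, weights in enumerate(lda.components_):
        top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
        print(f"LDA topic {topic_id}: {', '.join(top_terms)}")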

Key insights & takeaways

  • BERTopic is versatile and modular; swap components to fit your data.
  • It often provides clearer, semantically coherent topics than traditional models.
  • Works best on larger datasets; small samples reduce coherence.
  • Can overproduce topics; use limits or provide labels to control granularity (see the sketch after this list).
  • Human evaluation remains essential to validate and refine topic labels.
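For the granularity point above, here is a hedged sketch of the controls BERTopic exposes; the specific values are placeholders rather than recommendations.

    # Two ways to rein in over-granular topics.
    from bertopic import BERTopic

    # Up front: require larger clusters and let BERTopic merge similar topics automatically.
    topic_model = BERTopic(min_topic_size=20, nr_topics="auto")
    topics, probs = topic_model.fit_transform(docs)                # `docs` as in the earlier sketches

    # After fitting: merge down to a target number of topics.
    topic_model.reduce_topics(docs, nr_topics=25)
    print(topic_model.get_topic_info().head())

    # "Provide labels": guided topic modeling via seed terms you supply, e.g.
    # BERTopic(seed_topic_list=[["seo", "ranking"], ["analytics", "tracking"]])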

Ready for the next step? Start your learning journey with MLforSEO

Buy the course to unlock the full lesson
Level up your topic modeling and content audits with this walkthrough, demo notebook, and resources.

    Length: 23 minutes | Difficulty: Standard