It is true that BERTopic is a great tool. It's modern, it's modular, and it's pretty performant.
That said, I want to caution against using topic modeling as a one-size-fits-all solution. As the author stresses, this is one particular approach, which uses a combination of embeddings (sentence or other), UMAP, and HDBSCAN. Both UMAP and HDBSCAN can be slow, so it might be worthwhile to check out the GPU-enabled versions of both from the cuML package.
In addition, topic models have a huge number of degrees of freedom, and the solution you will get depends on many (seemingly arbitrary) choices. In other words, these are not the topics, they are some topics.
That said, it's awesome, really great work by Maarten Grootendorst and a great blog post by James Briggs.
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

# docs is your corpus: a list of strings, one per document
topics, probs = topic_model.fit_transform(docs)
I agree that this is cool. That being said, the results show that we have a long, long way to go. The topics are pretty incoherent: what are "would", "should", and "use" doing in there? The words have no context, so (for example) "self" clearly refers to Python's self convention, but you have no way of knowing that. Not to mention (as another comment brings up), the topics aren't named, so it's pretty hard to figure out what they're actually about. If we think about real-world usage, this output would be practically useless - it tells you that people talk about investing in r/investing and PyTorch in r/pytorch. If you want meaningful, actionable information about what people are talking about in a large corpus of unstructured text, then for the foreseeable future you'll need humans in the loop, even if ML assistance plays a big part.
Color me skeptical on BERTopic. Without human validation, I'm not convinced that it's an improvement over existing methods.
I'm an author on a recent paper about automated topic model evaluation [1], and we found that current metrics do not line up with human judgements as well as previously thought. To my knowledge, BERTopic has only been evaluated on these automated metrics.
For datasets of under a few hundred thousand documents, Mallet (LDA estimated with Gibbs sampling) can produce stable, high-quality outputs in minutes on a laptop [2]. Even larger datasets remain tractable, although depending on your use case you may be better off subsampling.
It's possible that I've missed something, but I'm not clear on what benefits BERTopic has that existing methods do not. I don't mean to be overly negative---it has a nice API and the approach seems reasonable---I'm just wondering what's really new here.
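For context on what those automated metrics look like: the most common one is NPMI coherence, which scores a topic's top words by how often they co-occur relative to chance - and it is exactly this kind of score that the parent's paper found does not track human judgment as well as assumed. A simplified stdlib sketch using document-level co-occurrence on a toy corpus (real implementations use sliding windows over a large reference corpus):

```python
from itertools import combinations
from math import log

def npmi_coherence(top_words, docs):
    """Average NPMI over word pairs, estimated from document co-occurrence."""
    doc_sets = [set(d.split()) for d in docs]
    n = len(doc_sets)
    def p(*words):
        return sum(all(w in s for w in words) for s in doc_sets) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        else:
            scores.append(log(p12 / (p(w1) * p(w2))) / -log(p12))
    return sum(scores) / len(scores)

docs = ["stock market fund", "stock fund etf", "gpu tensor model", "gpu model training"]
coherent = npmi_coherence(["stock", "fund"], docs)  # words that co-occur
mixed = npmi_coherence(["stock", "gpu"], docs)      # words that never do
print(coherent, mixed)
```

The co-occurring pair scores near +1 and the disjoint pair scores -1, which is the intended behavior - the open question is whether such numbers correlate with what humans judge to be a good topic.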
Take a look at Graphext ( https://www.graphext.com ). It automatically creates the clustering embeddings using BERT for you, plus great visualization libraries to interpret the clusters :D It took us 5 years to build the product.
Latest pygraphistry has this flow for free and OSS, just `pip install graphistry[umap-learn]` or, for transformers, `pip install graphistry[ai]` :)
And per the article, with pluggable sentence transformers -> UMAP automatically as part of the auto-featurization: graphistry.nodes(accounts_df).umap().plot() :)
We haven't published tutorials yet, just been using with some fraud/cyber/misinfo/genomics/sales/gov/etc teams (including with RAPIDS gpu accel), so cool to see excitement here already! Till then, it should work out-of-the-box with no parameters, and then all sorts of fun things to tune: https://github.com/graphistry/pygraphistry/blob/21fad42412cc...
I think you should open source the core part of the HTML visualization algos, and if people like it, they may consider paying for the premium version. I don't feel like people want to move their analysis workflow to yet another platform without trying it enough in their existing workflows (e.g. Jupyter/Colab/Databricks notebooks).
What happens on a slightly different task where domain experts have tried to create a set of topics, not all domain experts talk to each other, and so we instead need a way to merge existing topics? I continue to see benchmarks where human expertise significantly outperforms AI on common sense reasoning tasks (most recently https://arxiv.org/abs/2112.11446).
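On the merging question: a crude but workable baseline is to treat each expert's topics as keyword sets and greedily fold together any pair whose Jaccard overlap clears a threshold. This is a hypothetical sketch (the function names, threshold, and example topics are invented for illustration, not an established algorithm):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def merge_topic_sets(topic_sets, threshold=0.5):
    """Greedily merge keyword-set topics from multiple experts
    whenever two topics overlap above `threshold`."""
    merged = []
    for topics in topic_sets:
        for topic in topics:
            for existing in merged:
                if jaccard(topic, existing) >= threshold:
                    existing |= topic  # fold into the matching topic
                    break
            else:
                merged.append(set(topic))
    return merged

expert_a = [{"stocks", "bonds", "etf"}, {"gpu", "cuda"}]
expert_b = [{"stocks", "etf", "dividends"}, {"kubernetes", "docker"}]
merged = merge_topic_sets([expert_a, expert_b])
print(merged)  # the two investing topics collapse; the rest stay separate
```

Keyword overlap obviously misses the semantic matches that embeddings would catch ("stocks" vs "equities"), which is arguably where model assistance helps most in this workflow.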
What about an approach using directed acyclic graphs and entities?
In traditional qualitative research, you'd usually have a bunch of experts get together and figure out a set of topics (or import and adapt a set of topics from similar work) before you go about classifying the bulk of your data.
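That workflow (a fixed expert codebook first, bulk classification second) combines naturally with ML assistance: auto-label what the codebook matches confidently and route the rest to humans. A toy sketch with an invented codebook and a hypothetical classify helper:

```python
# Hypothetical expert codebook: topic name -> seed keywords.
CODEBOOK = {
    "investing": {"stock", "etf", "dividend", "portfolio"},
    "deep_learning": {"pytorch", "tensor", "gpu", "gradient"},
}

def classify(doc, min_hits=1):
    """Return (topic, hits) for the best-matching codebook topic,
    or ("needs_review", 0) so a human can label the document."""
    words = set(doc.lower().split())
    topic, hits = max(
        ((t, len(words & kw)) for t, kw in CODEBOOK.items()),
        key=lambda pair: pair[1],
    )
    return (topic, hits) if hits >= min_hits else ("needs_review", 0)

print(classify("my etf portfolio beats the stock market"))  # ('investing', 3)
print(classify("what a lovely day"))                        # ('needs_review', 0)
```

In practice you would match on embeddings rather than exact keywords, but the human-in-the-loop shape - auto-label the confident cases, review the rest - stays the same.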
How does this compare to LDA? It doesn’t seem like there’s a huge difference here. For good reason perhaps, the BERT part is only to embed the sentences.
Yeah, exactly my question. LDA is probabilistic and very performant if you clean up the documents well. The approach using BERT seems pretty powerful given that you can now cluster based on semantics, not just word occurrence/frequencies as in LDA (though ngrams help). However, using a clustering approach would mean that each document is part of a single topic, rather than being made up of multiple topics. But this is a cool idea nonetheless.
[EDIT] quickly checked it out, seems like it uses some kind of soft clustering so documents can occur in many clusters (topics)
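The hard-vs-soft distinction is easy to see with scikit-learn's LDA implementation, which returns a per-document mixture over topics rather than a single label (toy corpus; the learned topics depend on the random seed):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: two clear themes plus one genuinely mixed document.
docs = [
    "stock market fund etf dividend",
    "gpu tensor pytorch training model",
    "stock fund gpu model",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Unlike a hard cluster label, each document gets a probability
# distribution over the two topics, and each row sums to one.
mixtures = lda.transform(counts)
print(mixtures.round(2))
```

By contrast, HDBSCAN (as used in BERTopic) assigns each document to exactly one cluster (or to noise), though it does expose per-document membership probabilities, which is presumably the soft-ish behavior the edit above refers to.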
uniqueuid | 3 years ago
[edit] here is a link to the fast CUDA version of BERTopic by RAPIDS: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT...
hack_ml | 3 years ago
Checkout the docs at: https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...
All you need to do is pass the cuML models to BERTopic, as in the snippet above.
Der_Einzige | 3 years ago
https://huggingface.co/spaces/Hellisotherpeople/HF-BERTopic
ahoho | 3 years ago
[1]: https://proceedings.neurips.cc/paper/2021/hash/0f83556a305d7...
[2]: https://mimno.github.io/Mallet/
[3]: https://maartengr.github.io/BERTopic/faq.html#why-are-the-re...
m1sta_ | 3 years ago
Part of your value proposition is saving people time, but your sales model is time expensive.