It is true that BERTopic is a great tool. It's modern, it's modular, and it's pretty performant.
That said, I want to caution against using topic modeling as a one-size-fits-all solution. As the author stresses, this is one particular approach, which uses a combination of embeddings (sentence or other), UMAP, and HDBSCAN. Both UMAP and HDBSCAN can be slow, so it might be worthwhile to check out the GPU-enabled versions of both from the cuML package.
In addition, topic models have a huge number of degrees of freedom, and the solution you will get depends on many (seemingly arbitrary) choices. In other words, these are not the topics, they are some topics.
That said, it's awesome, really great work by Maarten Grootendorst and a great blog post by James Briggs.
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

# docs is your corpus: a list of strings, one per document
topics, probs = topic_model.fit_transform(docs)
I agree that this is cool. That being said, the results show that we have a long, long way to go. The topics are pretty incoherent: what are "would", "should", and "use" doing in there? The words have no context, so (for example) "self" clearly refers to Python's self convention, but you have no way of knowing that. Not to mention (as another comment brings up), the topics aren't named, so it's pretty hard to figure out what they're actually about. If we think about real-world usage, this output would be practically useless - it tells you that people talk about investing in r/investing and PyTorch in r/pytorch. If you want meaningful, actionable information about what people are talking about in a large corpus of unstructured text, then for the foreseeable future you'll need humans in the loop, even if ML assistance plays a big part.
Color me skeptical on BERTopic. Without human validation, I'm not convinced that it's an improvement over existing methods.
I'm an author on a recent paper about automated topic model evaluation [1], and we found that current metrics do not line up with human judgements as well as previously thought. To my knowledge, BERTopic has only been evaluated on these automated metrics.
For datasets of under a few hundred thousand documents, Mallet (LDA estimated with Gibbs sampling) can produce stable, high-quality outputs in minutes on a laptop [2]. Even larger datasets remain tractable, although depending on your use case you may be better off subsampling.
It's possible that I've missed something, but I'm not clear on what benefits BERTopic has that existing methods do not. I don't mean to be overly negative---it has a nice API and the approach seems reasonable---I'm just wondering what's really new here.
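For context on what those automated metrics look like: the most common one is NPMI coherence, which scores a topic's top words by how often they co-occur relative to chance - and it is exactly this kind of score that the parent's paper found does not track human judgment as well as assumed. A simplified stdlib sketch using document-level co-occurrence on a toy corpus (real implementations use sliding windows over a large reference corpus):

```python
from itertools import combinations
from math import log

def npmi_coherence(top_words, docs):
    """Average NPMI over word pairs, estimated from document co-occurrence."""
    doc_sets = [set(d.split()) for d in docs]
    n = len(doc_sets)
    def p(*words):
        return sum(all(w in s for w in words) for s in doc_sets) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        else:
            scores.append(log(p12 / (p(w1) * p(w2))) / -log(p12))
    return sum(scores) / len(scores)

docs = ["stock market fund", "stock fund etf", "gpu tensor model", "gpu model training"]
coherent = npmi_coherence(["stock", "fund"], docs)  # words that co-occur
mixed = npmi_coherence(["stock", "gpu"], docs)      # words that never do
print(coherent, mixed)
```

The co-occurring pair scores near +1 and the disjoint pair scores -1, which is the intended behavior - the open question is whether such numbers correlate with what humans judge to be a good topic.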
Take a look at Graphext ( https://www.graphext.com ). It automatically creates the clustering embeddings using BERT for you, plus great visualization libraries to interpret the clusters :D It took us 5 years to build the product.
Latest pygraphistry has this flow for free and OSS, just `pip install graphistry[umap-learn]` or, for transformers, `pip install graphistry[ai]` :)
And per the article, with pluggable sentence transformers -> UMAP automatically as part of the auto-featurization: graphistry.nodes(accounts_df).umap().plot() :)
We haven't published tutorials yet, just been using with some fraud/cyber/misinfo/genomics/sales/gov/etc teams (including with RAPIDS gpu accel), so cool to see excitement here already! Till then, it should work out-of-the-box with no parameters, and then all sorts of fun things to tune: https://github.com/graphistry/pygraphistry/blob/21fad42412cc...
I think you should open source the core part of the HTML visualization algos, and if people like it, they may consider paying for the premium version. I don't feel like people want to move their analysis workflow to yet another platform without trying it enough in their existing workflows (e.g. Jupyter/Colab/Databricks notebooks).
What happens on a slightly different task where domain experts have tried to create a set of topics, not all domain experts talk to each other, and so we instead need a way to merge existing topics? I continue to see benchmarks where human expertise significantly outperforms AI on common sense reasoning tasks (most recently https://arxiv.org/abs/2112.11446).
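On the merging question: a crude but workable baseline is to treat each expert's topics as keyword sets and greedily fold together any pair whose Jaccard overlap clears a threshold. This is a hypothetical sketch (the function names, threshold, and example topics are invented for illustration, not an established algorithm):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def merge_topic_sets(topic_sets, threshold=0.5):
    """Greedily merge keyword-set topics from multiple experts
    whenever two topics overlap above `threshold`."""
    merged = []
    for topics in topic_sets:
        for topic in topics:
            for existing in merged:
                if jaccard(topic, existing) >= threshold:
                    existing |= topic  # fold into the matching topic
                    break
            else:
                merged.append(set(topic))
    return merged

expert_a = [{"stocks", "bonds", "etf"}, {"gpu", "cuda"}]
expert_b = [{"stocks", "etf", "dividends"}, {"kubernetes", "docker"}]
merged = merge_topic_sets([expert_a, expert_b])
print(merged)  # the two investing topics collapse; the rest stay separate
```

Keyword overlap obviously misses the semantic matches that embeddings would catch ("stocks" vs "equities"), which is arguably where model assistance helps most in this workflow.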
What about an approach using directed acyclic graphs and entities?
In traditional qualitative research, you'd usually have a bunch of experts get together and figure out a set of topics (or import and adapt a set of topics from similar work) before you go about classifying the bulk of your data.
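That workflow (a fixed expert codebook first, bulk classification second) combines naturally with ML assistance: auto-label what the codebook matches confidently and route the rest to humans. A toy sketch with an invented codebook and a hypothetical classify helper:

```python
# Hypothetical expert codebook: topic name -> seed keywords.
CODEBOOK = {
    "investing": {"stock", "etf", "dividend", "portfolio"},
    "deep_learning": {"pytorch", "tensor", "gpu", "gradient"},
}

def classify(doc, min_hits=1):
    """Return (topic, hits) for the best-matching codebook topic,
    or ("needs_review", 0) so a human can label the document."""
    words = set(doc.lower().split())
    topic, hits = max(
        ((t, len(words & kw)) for t, kw in CODEBOOK.items()),
        key=lambda pair: pair[1],
    )
    return (topic, hits) if hits >= min_hits else ("needs_review", 0)

print(classify("my etf portfolio beats the stock market"))  # ('investing', 3)
print(classify("what a lovely day"))                        # ('needs_review', 0)
```

In practice you would match on embeddings rather than exact keywords, but the human-in-the-loop shape - auto-label the confident cases, review the rest - stays the same.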
How does this compare to LDA? It doesn’t seem like there’s a huge difference here. For good reason perhaps, the BERT part is only to embed the sentences.
Yeah, exactly my question. LDA is probabilistic and very performant if you clean up the documents well. The approach using BERT seems pretty powerful given that you can now cluster based on semantics, not just word occurrence/frequencies as in LDA (though ngrams help). However, using a clustering approach would mean that each document is part of a single topic, rather than being made up of multiple topics. But this is a cool idea nonetheless.
[EDIT] quickly checked it out, seems like it uses some kind of soft clustering so documents can occur in many clusters (topics)
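The hard-vs-soft distinction is easy to see with scikit-learn's LDA implementation, which returns a per-document mixture over topics rather than a single label (toy corpus; the learned topics depend on the random seed):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: two clear themes plus one genuinely mixed document.
docs = [
    "stock market fund etf dividend",
    "gpu tensor pytorch training model",
    "stock fund gpu model",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Unlike a hard cluster label, each document gets a probability
# distribution over the two topics, and each row sums to one.
mixtures = lda.transform(counts)
print(mixtures.round(2))
```

By contrast, HDBSCAN (as used in BERTopic) assigns each document to exactly one cluster (or to noise), though it does expose per-document membership probabilities, which is presumably the soft-ish behavior the edit above refers to.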
uniqueuid | 3 years ago
[edit] here is a link to the fast CUDA version of BERTopic by RAPIDS: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT...
hack_ml | 3 years ago
Checkout the docs at: https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...
All you need to do is pass the cuML models to BERTopic, as in the snippet above.
Der_Einzige | 3 years ago
https://huggingface.co/spaces/Hellisotherpeople/HF-BERTopic
ahoho | 3 years ago
[1]: https://proceedings.neurips.cc/paper/2021/hash/0f83556a305d7...
[2]: https://mimno.github.io/Mallet/
[3]: https://maartengr.github.io/BERTopic/faq.html#why-are-the-re...
m1sta_ | 3 years ago
Part of your value proposition is saving people time, but your sales model is time expensive.