haxton | 2 years ago

Curious to know what value you've seen out of these clusters. In my experience k-means clustering was very lackluster. Having to define the number of clusters was a big pain point too.

You almost certainly want a graph-like structure (overlapping communities rather than clusters).

But unsupervised clustering was almost entirely ineffective for every use case I had :/
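
A minimal sketch of that graph-based idea, assuming an (n, d) `embeddings` array, an arbitrary k=10 for a nearest-neighbour graph, and clique percolation from networkx for the overlapping communities (all illustrative choices, not a tested recipe):

    import networkx as nx
    from networkx.algorithms.community import k_clique_communities
    from sklearn.neighbors import kneighbors_graph

    # Build a k-nearest-neighbour graph over the embeddings.
    adj = kneighbors_graph(embeddings, n_neighbors=10, mode="connectivity")
    G = nx.from_scipy_sparse_array(adj)

    # Clique percolation yields overlapping communities: a node can
    # belong to more than one community, unlike hard k-means clusters.
    communities = list(k_clique_communities(G, 3))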

simonw | 2 years ago

I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.

I mainly like it as another example of the kind of things you can use embeddings for.

My implementation is very naive - it's just this:

    from sklearn.cluster import MiniBatchKMeans
    MiniBatchKMeans(n_clusters=n, n_init="auto")

I imagine there are all kinds of improvements that could be made to this kind of thing.

I'd love to understand if there's a good way to automatically pick an interesting number of clusters, as opposed to picking a number at the start.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....

stefanka | 2 years ago

You could also use a Bayesian version of k-means. It places a Dirichlet process prior over an infinite (truncated in practice) set of clusters, so the most probable number of clusters k is found automatically. I found one implementation here: https://github.com/vsmolyakov/DP_means

Alternatively, there is a Bayesian GMM in sklearn. When you restrict it to diagonal covariance matrices, you should be fine in high dimensions.
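
A rough sketch of that sklearn route, assuming an `embeddings` array and an illustrative upper bound of 20 components (the Dirichlet-process prior prunes the components it doesn't need):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    bgmm = BayesianGaussianMixture(
        n_components=20,  # upper bound, not the final cluster count
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",  # diagonal covariances for high dimensions
    )
    labels = bgmm.fit_predict(embeddings)

    # Effective number of clusters: components that kept non-trivial weight.
    k = int(np.sum(bgmm.weights_ > 1e-2))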

nl | 2 years ago

Switch to using HDBSCAN. It's good.
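
For context, a minimal sketch using the scikit-learn implementation (sklearn.cluster.HDBSCAN, added in scikit-learn 1.3) with an illustrative min_cluster_size; HDBSCAN picks the number of clusters itself and labels noise points -1:

    from sklearn.cluster import HDBSCAN

    # No n_clusters to choose; points that fit no cluster get label -1.
    labels = HDBSCAN(min_cluster_size=5).fit_predict(embeddings)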

haxton | 2 years ago

The elbow method is a good place to start for finding the number of clusters.
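
A quick sketch of the idea, assuming an `embeddings` array and an arbitrary range of candidate values: plot the k-means inertia against k and pick the k where the curve bends.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(2, 15)
    inertias = [KMeans(n_clusters=k, n_init="auto").fit(embeddings).inertia_
                for k in ks]

    plt.plot(list(ks), inertias, marker="o")  # the bend ("elbow") suggests k
    plt.xlabel("k")
    plt.ylabel("inertia (within-cluster sum of squares)")
    plt.show()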

visarga | 2 years ago

Use bottom-up (agglomerative) clustering and you get the whole merge tree: fclusterdata in scipy.
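
A minimal sketch of that, assuming an `embeddings` array and an arbitrary distance threshold; fclusterdata builds the full bottom-up merge tree and then cuts it into flat clusters:

    from scipy.cluster.hierarchy import fclusterdata

    # Merges points bottom-up, then cuts the tree at distance 1.0; the
    # threshold (not a cluster count) determines how many clusters emerge.
    labels = fclusterdata(embeddings, t=1.0,
                          criterion="distance", method="average")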