
bagavi | 2 years ago

A simple explanation of the general idea, using nearest neighbor search as an example:

1. You want to find the L2-nearest neighbor to y among a set of points x1, x2, ..., xn, where each point is a d-dimensional vector.

2. The naive approach takes O(nd) computations. For very large d, say 1e4, this is very expensive.

The multi-armed bandit idea:

3. Estimate the distance from y to each point xi by sampling only log(d) of the d dimensions.

4. Keep the nearest n/2 points (throw away the rest).

5. Repeat steps 3 and 4 until only one point is left.

6. The total runtime is nlog(d) + nlog(d)/2 + nlog(d)/4 + ... <= 2nlog(d) = O(nlog(d)).

This is a huge improvement over O(nd) for large d (which is usually the case today).
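The halving scheme in steps 3-5 can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's actual algorithm (which uses adaptive successive elimination with confidence bounds rather than fixed halving); the function name and parameters are made up for the example:

```python
import numpy as np

def bandit_nn(y, X, seed=0):
    """Toy halving sketch of the bandit idea: estimate each candidate's
    distance to y from a few random coordinates, keep the closest half,
    and repeat until one candidate remains."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.log2(d)))  # sample ~log(d) coordinates per round
    candidates = np.arange(n)
    while len(candidates) > 1:
        dims = rng.choice(d, size=k, replace=False)
        # squared-L2 distance estimated along the sampled dimensions only
        est = ((X[candidates][:, dims] - y[dims]) ** 2).sum(axis=1)
        keep = np.argsort(est)[: len(candidates) // 2]
        candidates = candidates[keep]
    return int(candidates[0])

# demo: y is a tiny perturbation of X[42], so the search should recover 42
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4096))
y = X[42] + 0.01 * rng.normal(size=4096)
print(bandit_nn(y, X))
```

Note the result is only correct with high probability; the real algorithm spends extra samples on close calls instead of blindly discarding half each round.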

This paper (https://ar5iv.org/abs/1805.08321) from Stanford pioneered this idea for nearest neighbors and many such household problems. It also theoretically proves that the above approach gives the right answer and empirically gets huge (100x) speed boost on large datasets.

The k-medoids paper is based on this work :)


hansvm | 2 years ago

And the biggest caveats IMO:

- Like any other probabilistic clustering algorithm that gets its speed boost from simply ignoring large chunks of the data (like MiniBatch KMeans), this will not take outliers into account very well or very often.

- They're advertising this not just as a better KMedoids implementation, but as a drop-in replacement for KMeans. For generic clustering tasks, sure, fine, whatever. But the metric the new paper optimizes is different from the one KMeans uses, so if KMeans has the right metric for a given task, you'll be switching to a (maybe) faster algorithm that simply computes the wrong result. The easiest example that comes to mind is a dataset of non-intersecting hollow spheres: KMeans will spit out (for some choice of n_clusters) the sphere centers, while KMedoids will spit out points on the sphere boundaries, hurting performance on the far side of each sphere and potentially letting classifications jump from one sphere to another.
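The hollow-sphere point is easy to check numerically. A small NumPy sketch (illustrative, not from either paper) with a single spherical cluster:

```python
import numpy as np

rng = np.random.default_rng(0)
# one "hollow sphere" cluster: 500 points uniform on the unit sphere in 3D
pts = rng.normal(size=(500, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# KMeans-style center: the mean, which lands near the sphere's center
center = pts.mean(axis=0)

# KMedoids-style center: the data point minimizing total distance to the
# others, which by construction lies on the sphere's surface
pairwise = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
medoid = pts[pairwise.sum(axis=1).argmin()]

print(np.linalg.norm(center))  # small: the mean sits inside the hollow sphere
print(np.linalg.norm(medoid))  # 1.0: the medoid sits on the boundary
```

Whichever behavior is "right" depends entirely on whether you want a representative data point or a geometric centroid.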

Both of those are qualities you may or may not want, so KMedoids may be better precisely because it has those biases and KMeans doesn't. But it's not totally uncommon to just want cluster centers that minimize some error, without caring how you get there or how explainable they are (the Bolt vector quantization algorithm comes to mind), and there KMedoids would just be the wrong choice.

DoctorOetker | 2 years ago

I understand what you say, and I don't disagree with your analysis for the collection of data points on non-intersecting spherical surfaces. (Although I do frown at this type of dataset, since distribution densities tend to peak at their modes. The "in high enough dimensions everything lies on a sphere" idea is a misconception: it is only the radial histogram that peaks at a nonzero radius...)

Just thought it might be worth pointing out that there are valid use cases for clustering that yields a real data point as the reference for each class: consider pictures, where it's far more inspectable to see a real picture than some blurry interpolated mess.

Would you mind providing a reference for the bolt vector quantization algorithm? It sounds interesting.

motiwari | 2 years ago

Right! Though to clarify a few nits: we use successive elimination rather than successive halving. And we address the Maximum Inner Product Search problem (very similar to the NN problem) in our follow-up work: https://ar5iv.org/abs/2212.07551