top | item 34718410

rckoepke | 3 years ago

I've been trying to use convex hulls to explore ML embedding spaces. However, the dimensionality (768+ dimensions) seems to crash established options like QHull[0], even with 64GB RAM and 16 CPU cores (though libqhull is single-threaded anyway).

Are there more appropriate algorithms for finding convex hulls where dimensions are ~768? Or any parallelized / GPU-optimized options that I should look into?

0: http://www.qhull.org

blamestross|3 years ago

> Qhull does not support triangulation of non-convex surfaces, mesh generation of non-convex objects, medium-sized inputs in 9-D and higher, alpha shapes, weighted Voronoi diagrams, Voronoi volumes, or constrained Delaunay triangulations,

You are going to have to roll your own.

One trick you can use is that most convex hull algorithms chase O(n lg n). That lg(n) implies a branching step, which hurts efficiency on GPUs. Given your constant factors in high dimensions, an O(n^2) branchless algorithm could well run faster on a GPU.

Cull points aggressively too, for what little that is worth in high dimensions.

I found https://www.sciencedirect.com/science/article/abs/pii/S01678... which looks like it could be a starting point.
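To make the branchless idea concrete, here's a minimal 2-D gift-wrapping (Jarvis march) sketch of my own in NumPy — not the paper's method, just an illustration. Each hull step is one vectorized angle computation over all points followed by an argmin, so the inner loop has no data-dependent branches; that argmin-over-everything structure is what maps well onto a GPU.

```python
import numpy as np

def jarvis_march(pts):
    """2-D gift wrapping (Jarvis march), O(n*h).

    Each hull step is a single vectorized pass over all points
    (angle computation + argmin), with no data-dependent branching
    in the inner loop.
    """
    cur = int(np.argmin(pts[:, 0]))   # leftmost point is always on the hull
    hull = [cur]
    ref = np.array([0.0, -1.0])       # initial sweep direction: straight down
    while True:
        d = pts - pts[cur]            # vectors from current vertex to every point
        # signed CCW turn angle from ref to each direction, in [0, 2*pi)
        cross = ref[0] * d[:, 1] - ref[1] * d[:, 0]
        ang = np.arctan2(cross, d @ ref)
        ang = np.where(ang < 0, ang + 2 * np.pi, ang)
        ang[cur] = np.inf             # never re-pick the current vertex
        nxt = int(np.argmin(ang))     # first point hit by a CCW-rotating ray
        if nxt == hull[0]:
            break
        hull.append(nxt)
        ref = pts[nxt] - pts[cur]
        cur = nxt
    return hull                       # indices of hull vertices, CCW order
```

In higher dimensions the per-step test becomes a hyperplane/orientation computation rather than a 2-D angle, but the same scan-everything-and-argmin shape survives.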

The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.

rckoepke|3 years ago

> The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.

Well, if I have 10,000 samples from a 768-dimensional volume, most of those points will probably be inside the volume, and not necessarily vertices of the hull.

I’m very comfortable rolling my own solution, so thank you for pointing me to Jarvis’ algorithm!

rcme|3 years ago

Depends on how complicated the branches are. One solution to branching on GPUs is to compute every branch, which is only a constant-factor increase in computation. I find it hard to believe there are many cases where n^2 would be faster than n lg(n) for large n.
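A toy sketch of the compute-every-branch trick (predication) in NumPy — my own illustration, not anything from the thread's references. Both sides of the conditional are evaluated for every element and blended with boolean masks, roughly what a GPU warp does to avoid divergence; the price is that all branches always execute, a constant-factor overhead:

```python
import numpy as np

def branchy_clip(x, lo, hi):
    # per-element if/else: fine on a CPU, divergent on a GPU warp
    return [lo if v < lo else hi if v > hi else v for v in x]

def branchless_clip(x, lo, hi):
    # compute all "branches", then select with masks -- no control flow
    below, above = x < lo, x > hi
    return x * ~(below | above) + lo * below + hi * above
```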

apwheele|3 years ago

Due to the curse of dimensionality, more of the points are pushed toward the hull in higher dimensions. So I don't think this is likely to be very effective as a data exploration technique as stated.

There may be more general expected-value formulas for higher dimensions; offhand I only know of results in two dimensions: https://blogs.sas.com/content/iml/2021/12/06/expected-number....
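The 2-D case is easy to check by Monte Carlo (a quick sketch of mine, using Andrew's monotone chain rather than Qhull to stay self-contained). For uniform points in a square the classic Rényi–Sulanke result puts the expected hull size near (8/3) ln n, i.e. growing very slowly — the opposite of the high-dimensional regime, where nearly every point ends up on the hull:

```python
import numpy as np

def hull_vertex_count(pts):
    """Number of convex hull vertices via Andrew's monotone chain (2-D)."""
    pts = pts[np.lexsort((pts[:, 1], pts[:, 0]))]  # sort by x, then y
    def half_hull(points):
        chain = []
        for p in points:
            while len(chain) >= 2:
                a, b = chain[-1] - chain[-2], p - chain[-2]
                if a[0] * b[1] - a[1] * b[0] > 0:  # strict left turn: keep
                    break
                chain.pop()
            chain.append(p)
        return chain
    lower, upper = half_hull(pts), half_hull(pts[::-1])
    return len(lower) + len(upper) - 2  # the two endpoints are shared

rng = np.random.default_rng(1)
for n in (100, 1000, 10000):
    mean = np.mean([hull_vertex_count(rng.random((n, 2))) for _ in range(10)])
    # Renyi-Sulanke: for uniform points in a square, E[vertices] ~ (8/3) ln n
    print(f"n={n:6d}  hull vertices ~ {mean:5.1f}  (8/3) ln n = {8/3*np.log(n):5.1f}")
```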

There may be smarter reduced form embeddings though to make pretty pictures, e.g. https://www.youtube.com/watch?v=sD-uDZ8zXkc&ab_channel=Cynth....

rcme|3 years ago

I wonder if the curse of dimensionality necessarily applies here. In general, yes, in high-dimensional spaces most of the points in an n-dimensional volume lie near the surface. But that seems mostly to be an issue when sampling random, independent points from the space. For an ML problem, all of the points are likely dependent on a few underlying processes.

Wikipedia has more on this: https://en.wikipedia.org/wiki/Curse_of_dimensionality

> This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to only hold in the artificial scenario when the one-dimensional distributions ℝ are independent and identically distributed.[13] When attributes are correlated, data can become easier and provide higher distance contrast and the signal-to-noise ratio was found to play an important role, thus feature selection should be used.

> More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions. Machine learning can be understood as the problem of assigning instances to their respective generative process of origin, with class labels acting as symbolic representations of individual generative processes. The curse's derivation assumes all instances are independent, identical outcomes of a single high dimensional generative process. If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions. Thus, the traditional argument that contrast-loss creates a curse, may be fundamentally inappropriate. In addition, it has been shown that when the generative model is modified to accommodate multiple generative processes, contrast-loss can morph from a curse to a blessing, as it ensures that the nearest-neighbor of an instance is almost-surely its most closely related instance. From this perspective, contrast-loss makes high dimensional distances especially meaningful and not especially non-meaningful as is often argued.