
A linear-time alternative for Dimensionality Reduction and fast visualisation

118 points | romanfll | 2 months ago | medium.com

36 comments


romanfll|2 months ago

Author here. I built this because I needed to run dimensionality reduction entirely in the browser (client-side) for an interactive tool. The standard options (UMAP, t-SNE) were either too heavy for JS/WASM or required a GPU backend to run at acceptable speeds for interactive use.

This approach ("Sine Landmark Reduction") uses linearised trilateration—similar to GPS positioning—against a synthetic "sine skeleton" of landmarks.

The main trade-offs:

It is O(N) and deterministic (it solves a linear system Ax=b instead of running iterative gradient descent).

It forces the topology onto a loop structure, so it is less accurate than UMAP for complex manifolds (like Swiss Rolls), but it guarantees a clean layout for user interfaces.

It can project ~9k points (50 dims) to 3D in about 2 seconds on a laptop CPU. Python implementation and math details are in the post; a rough sketch of the landmark solve step follows below. Happy to answer questions!
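Not the post's exact code, but a minimal sketch of the GPS-style step described above, under stated assumptions: measure each point's distances to K landmarks in the original space, linearise the trilateration equations, and recover coordinates with one shared least-squares solve (Ax=b). The landmark construction here (low-dimensional positions on a 3D loop, high-dimensional positions sampled from the data) is an illustration only, not the actual "sine skeleton" recipe.

    import numpy as np

    def trilaterate(X, landmarks_hd, landmarks_ld):
        """Project X (N x D) onto the space of landmarks_ld (K x d) using
        high-dimensional distances to the K landmarks landmarks_hd (K x D)."""
        # Squared distances from every point to every landmark (N x K),
        # via the expansion ||x||^2 - 2 x.l + ||l||^2 to stay O(N * K * D).
        D2 = (X ** 2).sum(1)[:, None] - 2.0 * X @ landmarks_hd.T \
             + (landmarks_hd ** 2).sum(1)[None, :]

        P = landmarks_ld
        # Linearise ||y - p_k||^2 = d_k^2 by subtracting the k=0 equation:
        #   2 (p_k - p_0) . y = d_0^2 - d_k^2 + ||p_k||^2 - ||p_0||^2
        A = 2.0 * (P[1:] - P[0])                              # (K-1, d)
        b = (D2[:, :1] - D2[:, 1:]
             + (P[1:] ** 2).sum(1) - (P[0] ** 2).sum())       # (N, K-1)

        # One shared pseudo-inverse, applied to every point at once -> O(N * K).
        return b @ np.linalg.pinv(A).T                        # (N, d)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(9000, 50))
    K = 100
    t = np.linspace(0, 2 * np.pi, K, endpoint=False)
    landmarks_ld = np.stack([np.cos(t), np.sin(t), np.sin(2 * t)], axis=1)  # loop in 3D
    landmarks_hd = X[rng.choice(len(X), K, replace=False)]    # placeholder skeleton
    print(trilaterate(X, landmarks_hd, landmarks_ld).shape)   # (9000, 3)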

lmeyerov|2 months ago

Fwiw, we are heavy UMAP users (pygraphistry), and find UMAP on CPU fine for interactive use at up to 30K rows and on GPU at 100K rows, then generally switch to a trained mode when > 100K rows. Our use case is often highly visual: seeing correlations, and linking together similar entities into explorable & interactive network diagrams. For headless use, like daily anomaly detection, we will do this at much larger scales.

We see a lot of wide social, log, and cyber data where this works, anywhere from 5-200 dim. Our bio users are trickier, as we can have 1K+ dimensions pretty fast. We find success there too, and mostly get into preconditioning tricks for those.

At the same time, I'm increasingly thinking of learning neural embeddings in general for these instead of traditional clustering algorithms. As scales go up, the performance argument here goes up too.

threeducks|2 months ago

Without looking at the code, O(N * k) with N = 9000 points and k = 50 dimensions should take on the order of milliseconds, not seconds. Did you profile your code to see whether there is perhaps something that takes an unexpected amount of time?
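A quick, non-authoritative sanity check of that estimate with plain NumPy, assuming the dominant cost is an N x K landmark distance matrix over 50 dims plus a small linear map (this is not the article's code):

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(9000, 50))   # N points, 50 dims
    L = rng.normal(size=(100, 50))    # K = 100 landmarks

    t0 = time.perf_counter()
    # Squared distances via the expansion ||x||^2 - 2 x.l + ||l||^2 (N x K).
    D2 = (X ** 2).sum(1)[:, None] - 2.0 * X @ L.T + (L ** 2).sum(1)[None, :]
    Y = D2 @ rng.normal(size=(100, 3))   # stand-in for the per-point solve
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    print(f"{elapsed_ms:.1f} ms")        # a few to tens of ms on a typical laptop CPU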

yxhuvud|2 months ago

FWIW, there are iterative SVD implementations out there that could potentially be useful as well in certain contexts when you get more data over time in a streamed manner.
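One concrete example of that streamed flavour (an illustration, not necessarily what the parent had in mind): scikit-learn's IncrementalPCA refines a truncated SVD batch by batch via partial_fit, so the projection can be updated as data arrives without a full refit.

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    ipca = IncrementalPCA(n_components=3)
    rng = np.random.default_rng(0)

    for _ in range(10):                      # batches arriving over time
        batch = rng.normal(size=(1000, 50))
        ipca.partial_fit(batch)              # update the factorisation incrementally

    new_points = rng.normal(size=(5, 50))
    print(ipca.transform(new_points).shape)  # (5, 3), no full refit needed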

zipy124|2 months ago

Something seems off here. t-SNE should not be taking 15-25 seconds for only 5k points and 20 dimensions, but rather somewhere around 1-2 seconds. Also, since the proposed alternative is not as good anyway, you could probably reduce the t-SNE iteration count somewhat if speed is wanted, at the risk of quality. Alternatively, UMAP on this would be milliseconds, bordering on real-time with aggressive tuning.
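For context, trading iterations for speed is roughly a one-parameter change in both libraries; a rough sketch (parameter names are version-dependent, e.g. recent scikit-learn uses max_iter where older releases used n_iter):

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))

    # Reduced optimisation budget; scikit-learn enforces a lower bound of 250 iterations.
    Y = TSNE(n_components=2, max_iter=250, init="pca").fit_transform(X)
    print(Y.shape)  # (5000, 2)

    # With umap-learn installed, n_epochs plays the same budget-limiting role:
    # import umap
    # Y = umap.UMAP(n_components=2, n_epochs=50).fit_transform(X)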

rundev|2 months ago

The claim of linear runtime is only true if K is independent of the dataset size, so it would have been nice to see an exploration of how different values of K impact results. I.e. does clustering get better for larger K, and if so, by how much? The values 50 and 100 seem arbitrary and even suspiciously close to sqrt(N) for the 9K dataset.

romanfll|2 months ago

Thanks for your comment.

To clarify: K is a fixed hyperparameter in this implementation, strictly independent of N. Whether we process 9k points or 90k points, we keep K at ~100. We found that increasing K yields diminishing returns very quickly. Since the landmarks are generated along a fixed synthetic topology, increasing K essentially just increases resolution along that specific curve, but once you have enough landmarks to define the curve's structure, adding more doesn't reveal new topology… it just adds computational cost to the distance matrix calculation. Re: sqrt(N): That is purely a coincidence!

jmpeax|2 months ago

> They typically need to compare many or all points to each other, leading to O(N²) complexity.

UMAP is not O(n^2); it is O(n log n).

romanfll|2 months ago

You are right, the approximate nearest-neighbour implementation brings UMAP down to roughly O(N log N) (Barnes-Hut does the same for t-SNE); I should have been more precise in the document. The main point is that even O(N log N) can be too much if you run this in a browser. Thanks for clarifying!

trgn|2 months ago

Glad to see 2D mapping is still of interest. 20 years ago, information visualization, data cartography, exploratory analytics, etc. were pretty alive, but the field never really took off or found a reliable niche in industry or a real end-user application. Why map it, when the machine can just tell you?

Would be nice to see it come back. Would love to browse for books and movies on maps again, rather than getting lists regurgitated at me.

benob|2 months ago

Is there a pip installable version?

romanfll|2 months ago

Not yet, but coming...

memming|2 months ago

first subsample a fixed number of random landmark points from data, then...

romanfll|2 months ago

Thanks for your comment. You are spot on: that is effectively the standard Nyström/Landmark MDS approach.

The technique actually supports both modes in the implementation (synthetic skeleton or random subsampling). However, for this browser visualisation, we default to the synthetic sine skeleton for two reasons:

1. Determinism: Random landmarks produce a different layout every time you calculate the projection. For a user interface, we needed the layout to be identical every time the user loads the data, without needing to cache a random seed.

2. Topology forcing: By using a fixed sine/loop skeleton, we implicitly 'unroll' the high-dimensional data onto a clean reduced structure. We found this easier for users to navigate visually compared to the unpredictable geometry that comes from a random subset.

A toy sketch of both landmark modes is below.
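This sketch only contrasts the two modes; the post's actual skeleton construction is not reproduced here, and placing the loop inside the data's per-feature range is purely an assumption for illustration.

    import numpy as np

    def sine_skeleton(X, K=100):
        """Deterministic landmarks along a closed sine loop spanning the data's range."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        t = np.linspace(0, 2 * np.pi, K, endpoint=False)[:, None]   # (K, 1)
        phases = np.linspace(0, np.pi, X.shape[1])[None, :]         # one phase per feature
        unit = 0.5 * (np.sin(t + phases) + 1.0)                     # values in [0, 1]
        return lo + unit * (hi - lo)                                # (K, D), identical every run

    def random_landmarks(X, K=100, seed=None):
        """Standard landmark/Nyström subsampling: layout depends on the seed."""
        rng = np.random.default_rng(seed)
        return X[rng.choice(len(X), size=K, replace=False)]

Either set of landmarks can be fed to the same distance/solve step; the fixed skeleton gives an identical layout on every load, while random landmarks shift with the seed.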

aw123|2 months ago

[deleted]