MLoffshore | 3 months ago
Re: whether this is useful beyond being a cool exercise:
sklearn: Yeah, sklearn is obviously faster and great for day-to-day work. This project doesn’t use it because, even with fixed seeds, sklearn can still produce different results across machines due to BLAS differences, CPU instruction paths, etc. The goal here isn’t speed; it’s making sure the same dataset always produces the exact same artifacts everywhere, down to the byte.
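To make the BLAS point concrete: floating-point addition isn’t associative, so anything that changes summation order (threading, SIMD width, a different BLAS build) can flip the low bits of a result even with identical seeds. A minimal illustration in plain Python:

```python
# IEEE-754 doubles: the same three numbers summed in a different
# order round differently, so the final bits differ.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a)       # 0.6000000000000001
print(b)       # 0.6
print(a == b)  # False
```

A parallel BLAS reduction is effectively choosing one of these orderings at runtime, which is why "same seed, same code" is not enough for byte-level reproducibility.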
Where that matters: A few examples from my world:
Maritime/industrial auditing: a lot of equipment logs and commissioning data get “massaged” early on. If later analysis depends on that data, you need a way to prove the ingest + transformations weren’t affected by the environment they ran on.
Medical/regulatory work: clinical models frequently get blocked because the same run on two different machines gives slightly different outputs. Determinism makes it possible to freeze analytics for compliance.
Any situation where you have to defend an analytical result (forensics, safety investigations, audits, etc). People assume code is reproducible, but floating-point libraries, OS updates, and dependency drift break that all the time.
So yeah, sklearn is better if you just want clustering. This is more like a “reference implementation” you can point to when you need evidence that the result wasn’t influenced by hardware or environment.
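One sketch of what that “evidence” can look like in practice (the artifact format below is made up for illustration; the actual project may serialize differently): serialize every artifact canonically, hash it, and compare digests across machines. Byte-identical runs produce identical digests.

```python
import hashlib
import json

def artifact_digest(obj) -> str:
    """SHA-256 over a canonical JSON serialization (sorted keys,
    fixed separators) so equal values always hash to equal bytes."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

# Two runs (e.g. on different machines) of a deterministic pipeline
# should serialize to the same bytes and therefore the same digest:
run_a = {"centroids": [[1.0, 2.0], [3.0, 4.0]], "inertia": 12.5}
run_b = {"centroids": [[1.0, 2.0], [3.0, 4.0]], "inertia": 12.5}
assert artifact_digest(run_a) == artifact_digest(run_b)
```

Recording the digest alongside the artifact is what lets you later prove, byte for byte, that a result wasn’t silently altered by the environment it ran in.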
Happy to answer questions if anyone’s curious.