randomgermanguy | 3 months ago
I don't think my tech-lead was trying to suggest the floating-point error/non-associativity was the real source.
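For anyone unfamiliar, the non-associativity being referred to is easy to demonstrate; a minimal Python sketch:

```python
# Floating-point addition is not associative: the grouping of
# operands changes the rounded result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

This is why reduction order (e.g. across GPU threads or batch sizes) can change the low-order bits of a sum even when the inputs are identical.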
AJRF | 3 months ago
Yes, I would, because it causes exponential divergence (P(correct) = (1 - e)^n) and has no widely adopted solution. The major labs have very expensive researchers focused on this specific problem.
There is a paper from Thinking Machines from September on batch-invariant kernels you should read; it's a good primer on this issue of non-determinism in LLMs, and you might learn something from it!
Unfortunately the method has quite a lot of overhead, but it's promising research all the same.
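To make the exponential-divergence point concrete: assuming a hypothetical per-token error rate e, the probability that an n-token generation stays on the "correct" trajectory is (1 - e)^n, which decays fast even for tiny e. A quick numerical sketch:

```python
# P(correct) = (1 - e)^n: a small per-token error rate e compounds
# exponentially over a long generation of n tokens.
e = 0.001  # hypothetical per-token error probability

for n in (10, 100, 1000, 10000):
    p = (1 - e) ** n
    print(f"n={n:>5}: P(correct) = {p:.6f}")
```

With e = 0.001, P(correct) is still ~0.99 at 10 tokens but drops below 0.37 by 1000 tokens, which is why even rare kernel-level nondeterminism matters for long outputs.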
randomgermanguy | 3 months ago
I don't think this is relevant to the main point, but it's definitely something I wasn't aware of. I would've thought it might affect only the O(100)th token in some negligible way, so I'm glad to learn otherwise.