top | item 42958471

pigscantfly | 1 year ago

This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
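No one has published the exact cause, but the capacity-limited top-1 routing described in the MoE literature gives a plausible mechanism for batch non-determinism. A minimal sketch, assuming such a router (the `route` function, the scores, and the capacity here are illustrative, not any vendor's implementation):

```python
import numpy as np

def route(scores, capacity):
    # Hypothetical top-1 router with a fixed per-expert capacity:
    # each token goes to its best-scoring expert unless that expert
    # is already full, in which case it spills to the next-best one.
    load = [0] * scores.shape[1]
    assignment = []
    for s in scores:
        for e in np.argsort(s)[::-1]:  # experts, best score first
            if load[e] < capacity:
                load[e] += 1
                assignment.append(int(e))
                break
    return assignment

mine = [0.9, 0.1]  # "my" token prefers expert 0 in both batches
batch_a = np.array([[0.8, 0.2], mine])  # neighbour also wants expert 0
batch_b = np.array([[0.2, 0.8], mine])  # neighbour wants expert 1

# Same token, same weights -- but its expert depends on its batchmate:
print(route(batch_a, capacity=1)[-1])  # my token spills to expert 1
print(route(batch_b, capacity=1)[-1])  # my token keeps expert 0
```

Since different experts are different weight sets, the two batches produce different logits for the identical prompt, and temperature 0 cannot undo that.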

brookst|1 year ago

If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?

wodenokoto|1 year ago

Temperature changes the distribution that is sampled, not whether a distribution is sampled at all.

Temperature changes the softmax equation [1], not whether you are sampling from the softmax result or simply choosing the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not if we sample [2].
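That distinction is easy to see in plain NumPy (a sketch, not any particular library's API): temperature divides the logits inside the softmax, so it reshapes the distribution but never moves the argmax; whether you then sample or just take the argmax is a separate switch, which is what `do_sample` controls.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T  # T rescales the logits
    z -= z.max()                             # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    # Lower T concentrates mass on the top logit; higher T flattens
    # the distribution. The highest-probability token never changes.
    print(T, p.round(3), int(p.argmax()))
```

Greedy decoding ignores everything but the argmax, which is why changing T alone cannot make greedy output differ.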

A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].

[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html

[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...

[3] https://community.openai.com/t/clarifications-on-setting-tem...

TeMPOraL|1 year ago

Here routing is probably the dominant effect, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs built around fixed-point math, floating-point addition is still non-associative, so results are non-deterministic wrt. the operation order introduced by parallelising the calculations.
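The non-associativity is easy to demonstrate even on a CPU; any parallel reduction that regroups these three additions already changes the answer (values chosen so the rounding is visible in double precision):

```python
# Floating-point addition is commutative but not associative:
# regrouping the same three terms changes the rounded result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # exact cancellation first, then add 1.0
right = a + (b + c)  # the 1.0 is absorbed into -1e16 first
print(left, right)   # 1.0 vs 0.0
```

A parallel sum over thousands of terms picks a grouping that depends on thread scheduling, so repeated runs can legitimately disagree in the low bits, and occasionally by enough to flip an argmax.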

michalsustr|1 year ago

I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is that they need to temperature-control the cores, and the floating-point ops may be reordered during that process. (By temperature I mean physical temperature, not the NN sampling parameter.) At this scale of computation, these small differences can show up as actually different tokens.

petesergeant|1 year ago

The parent is suggesting that temperature only applies at the generation step, but that the choice of backend "expert model" a request is given to (and which then performs the generation) is non-deterministic. Rather than being a single set of weights, there are several different sets of weights that constitute the "experts" in MoE, and routing picks among them. I have no idea whether that's true, but that's the assertion.

daralthus|1 year ago

I have seen numbers come out differently in JAX depending only on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.
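The effect is reproducible without JAX. Below, a hypothetical `blocked_sum` stands in for the reduction schedule a compiler might pick for a given batch size; the input values are chosen so that float32 rounding makes the difference deterministic rather than run-to-run noise:

```python
import numpy as np

def blocked_sum(v, block):
    # Sum each block sequentially in float32, then sum the partials:
    # the same numbers, merely grouped differently.
    partials = [sum(v[i:i + block], np.float32(0.0))
                for i in range(0, len(v), block)]
    return sum(partials, np.float32(0.0))

# One big value followed by 128 ones. In float32, a lone +1.0 is
# absorbed by 1e8 (the ulp there is 8), but a block of ones summed
# first grows large enough to survive the final addition.
x = np.array([1e8] + [1.0] * 128, dtype=np.float32)

print(blocked_sum(x, block=1))   # every lone +1.0 absorbed: 1e8 exactly
print(blocked_sum(x, block=64))  # a block of 64 ones survives: 1e8 + 64
```

Swap the block size, as a different batch shape might, and the "same" sum changes, which is all it takes for downstream logits to drift.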