The samples your input is batched with on the provider's backend vary between calls, and sparse mixture-of-experts routing, when implemented for efficient utilization, induces competition among tokens: expert usage is either encouraged or enforced to be balanced across the tokens in the same fixed-size group. As far as I know it's undisclosed exactly why sequence non-determinism at zero temperature occurs in these proprietary implementations, but I think this is a good theory.

[1] https://arxiv.org/abs/2308.00951 pg. 4
[2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
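To make the batch-competition idea concrete, here's a minimal sketch of top-1 routing with a per-expert capacity limit. Everything here (the toy scores, the order-based overflow policy, the `-1` marker for dropped tokens) is my own illustration, not how any particular provider implements it; real MoE layers use learned gating and subtler overflow handling. The point is only that the same token can be routed or dropped depending on which other tokens share its group:

```python
def route_with_capacity(scores, capacity):
    """Top-1 MoE routing with a fixed per-expert capacity.
    Tokens that overflow an expert's capacity are dropped
    (marked -1). Overflow is resolved in batch order here,
    which is one simple policy among several used in practice."""
    assignments = [max(range(len(s)), key=lambda e: s[e]) for s in scores]
    counts = {}
    routed = []
    for expert in assignments:
        counts[expert] = counts.get(expert, 0) + 1
        routed.append(expert if counts[expert] <= capacity else -1)
    return routed

my_token = [9.0, 1.0, 1.0, 1.0]   # strongly prefers expert 0
rival    = [8.0, 1.0, 1.0, 1.0]   # also prefers expert 0
other    = [1.0, 8.0, 1.0, 1.0]   # prefers expert 1

# Same token, two different batches, capacity 2 per expert:
print(route_with_capacity([my_token, other, other], capacity=2))
# → [0, 1, 1]  my_token gets expert 0
print(route_with_capacity([rival, rival, my_token], capacity=2))
# → [0, 0, -1]  expert 0 is full, my_token is dropped
```

So with identical input and zero temperature, the token's path through the network differs purely because of its batchmates, which is exactly the kind of mechanism that could explain the observed non-determinism.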
kettleballroll|1 year ago