Fixing "theoretical" nondeterminism for a totally closed, individual input-output pair doesn't solve the two "practical" nondeterminism problems: the exact same input giving different results under different preceding context, and a slightly transformed input failing to give a correspondingly transformed result.
Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.
There is no such thing as "exactly the same input, but with different preceding context". The preceding context is input!
If you were to obtain exactly the same output for a given input prompt regardless of context, that would mean the context is being ignored, which is indistinguishable from the session not maintaining any context at all, such that each prompt lands in a brand-new empty context.
Now what some people want is requirements like:
- Different wordings of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital", the answer should be verbatim identical.
- Prior context should not change responses in ways that have no interaction with the context. For instance, if the prompt "what is 2 + 2" is given, the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.
These kinds of requirements betray a misunderstanding of what these LLMs are.
> where the exact same input gives different results given different preceding context
Why and how is this a problem?
If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why would I want that? It's not how I expect a tool to work (I expect vim to respond differently to my input after I switch to insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.
Why do you care about determinism in a probabilistic system? What difference does it make to the end user if the input "How do I X?" always produces the same deterministic output, when semantically equivalent inputs "how do i x?", "how do I x", and "how do I X??" are bound to produce different answers that often won't even be semantically equivalent?
What LLMs need is the ability to guarantee semantically-equivalent outputs for all semantically-equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.
Not all LLM-based applications are user-facing free-form chat.
If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome. The same applies to running your prompt through the DSPy Optimizer. [0] There are countless other examples: basically any situation where you are in control of the prompt, the token-level input to the LLM, so there's no fuzziness.
In that case, if you have eliminated token-level fuzziness and can guarantee that you're not introducing it from your own end, you can map out a much more reliable tree or graph structure of your system's behavior.
[0]: https://dspy.ai/#2-optimizers-tune-the-prompts-and-weights-o...
You aren't wrong, but that doesn't mean this level of determinism isn't useful. If you don't even have the guarantee that the exact same input tokens produce the exact same output tokens, then it's very hard to share reproducible results with peers, which can be useful if you are, say, red-teaming an LLM to produce a very rare or unreliable output.
I'm actually working on something similar to this where you can encode information into the outputs of LLM's via steganography: https://github.com/sutt/innocuous
Since I'm really looking to sample only the top ~10 tokens, and I mostly test on CPU-based inference of 8B models, there's probably not a lot to worry about in getting a different order of the top tokens based on the hardware implementation. But I'm still going to take a look at it eventually, and build in guard conditions against any choice that would be changed by an epsilon of precision loss.
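Such a guard condition could be sketched as follows. This is a minimal illustration, not code from the project above: `stable_topk_choices`, the epsilon value, and the plain-list logits are all assumptions made for the example.

```python
def stable_topk_choices(logits, k=10, eps=1e-4):
    """Return the top-k token ids, plus a flag per token marking choices
    whose rank could flip under a perturbation smaller than eps
    (e.g. hardware-dependent rounding differences)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = ranked[:k]
    flags = []
    for pos, tok in enumerate(top):
        # Distance to the neighbors in the ranking; a tiny gap means the
        # ordering is sensitive to an epsilon of precision loss.
        above = logits[ranked[pos - 1]] - logits[tok] if pos > 0 else float("inf")
        below = (logits[tok] - logits[ranked[pos + 1]]
                 if pos + 1 < len(ranked) else float("inf"))
        flags.append(min(above, below) < eps)
    return top, flags
```

Flagged tokens could then be excluded from the steganographic channel so that no encoded bit depends on an ordering that different hardware might resolve differently.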
It would be very useful for AI platform customers. You could run prompts with temperature 0 and check whether the results are the same, making sure the AI provider is not switching the PRO model for a cheap one in the background and ripping you off.
For "bug" reproduction purposes: it is easier to debug a model if the same string always produces the same incorrect or strange LLM output, not just every 100th time you run it.
Exactly my thinking - but semantic equivalence is also only relevant when the output needs to be factual, not necessarily for ALL outputs (if we're aiming for LLMs to present as "human", or for interactions with LLMs to be naturally conversational). This excludes the world where LLMs act as agents, where you would of course always want the LLM to be factual and thus deterministic.
When you do MCP-style applications, an LLM is more like RegEx on steroids, and since you expect your regex to return the same matches on the same input, it is a very desirable attribute for LLMs as well. I would say it is more than desirable, it is necessary.
If I want to convert "how do I x" to `api.howTo("x")`, it is very important that I get the exact same result every time.
Deterministic output is needed when LLMs are used for validations. This can be anything from input validation at runtime to a CI check leveraging LLMs. It can be argued this is not an acceptable use of AI, but it will become increasingly common and it will need to be tweaked/tested. You cannot tweak/test a response you don't know you're going to get.
I agree that we need stochasticity in a probabilistic system, but I also think it would be good to control it. For example, we need the stochasticity introduced at high temperatures since it is inherent to the model, but we don’t need stochasticity in matrix computations, as it is not required for modeling.
I don't think the claim is that this is particularly helpful for consumer-facing applications. But from a research perspective, this is invaluable for allowing reproducibility.
Sometimes the reason for non-determinism is implementation-specific. For instance, in GPT-2's source code (I haven't checked other model versions), setting the temperature to 0 in the GUI does not lead to a value of 0 but to "epsilon" (a very small value larger than 0), to avoid a division-by-zero error in the code, which makes sense.
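The workaround described there might look like this in a generic sampler. This is a sketch, not GPT-2's actual code, and the epsilon value is illustrative:

```python
import math

def softmax_with_temperature(logits, temperature, eps=1e-6):
    # temperature == 0 would divide by zero, so clamp to a tiny epsilon;
    # dividing by epsilon makes the distribution effectively one-hot
    # (near-greedy decoding) instead of crashing.
    t = max(temperature, eps)
    scaled = [l / t for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With temperature 0, the clamped division concentrates essentially all probability mass on the argmax token, which is why "temperature 0" behaves like greedy decoding in most implementations.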
For many applications, non-determinism implies "useless".
This has been a long-standing issue with LDA topic models. In particular, in the legal, financial, and regulatory domains, a method that is not deterministic may be illegal to use, or may trigger follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved so that one can go back and reconstruct what exactly happened to a particular user in a particular second).
> in collaboration with others at Thinking Machines
If you're old enough, you might remember Danny Hillis' Thinking Machines from the late 80s. I wish they had chosen a different name (I say this for nostalgic reasons, having been in front of one of those cubes glowing with red LEDs back in the late 80s at MIT's AI Lab, renamed to CSAIL at some point). Feynman did some amazing work on that, too: https://longnow.org/ideas/richard-feynman-and-the-connection...
In the U.S., the “THINKING MACHINES” trademarks were owned by Thinking Machines Corporation (the company Hillis co-founded), not Hillis personally, and those registrations were cancelled in 1998–1999.
The company itself went bankrupt in 1994 and its assets were dispersed (e.g., to Sun Microsystems, later Oracle).
There’s a new, pending USPTO application for “THINKING MACHINES” filed in 2025 by Thinking Machines Lab Inc., the company founded by Mira Murati.
I love high-quality, blog-post-style research discussion - Anthropic has been leading the charge on this recently, and it's great to see it spreading. OpenAI was also doing this back during the RL research days.
Natural language is ambiguous. It needs to be. I think the approach here of trying to figure out how to make circles into squares, and argue why circles should be squares, is misguided.
Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.
Are you talking about the “Thinking Machines” company that shut down in 1994? Took me some digging to figure it out, doesn’t seem well-known enough to be the reason - it’s just a nice (and relatively obvious) name.
> But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism. For example, a recent arXiv preprint writes
I'm honored to see that Mira and co. appreciated the feedback on this very topic that I gave 7 months ago here (https://news.ycombinator.com/item?id=42952605#42960047) :D
> You don't need RNG since the whole transformer is an extremely large floating-point arithmetic unit. A wild guess - how about the source of non-determinism is coming from the fact that, on the HW level, tensor execution order is not guaranteed and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?
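That rounding effect is easy to demonstrate in any IEEE-754 environment, for example in plain Python (double precision):

```python
# Floating-point addition is commutative but not associative: the
# grouping chosen by the hardware or scheduler changes the rounded result.
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6

# The discrepancy can be as large as losing a term entirely:
print((1e20 + -1e20) + 1.0)  # 1.0
print(1e20 + (-1e20 + 1.0))  # 0.0  (the 1.0 is absorbed into -1e20 and lost)
```

When a GPU reduces partial sums in whatever order its cores finish, it is effectively choosing between groupings like these on every accumulation.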
I really hope we will get deterministic LLMs in the future. Even if it causes slightly slower response times.
Nondeterminism is what currently keeps me from working with other developers.
As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".
It's similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images, because most image models will not create the same image when given the same prompt and parameters.
[1]: https://www.gibney.org/prompt_coding
For fun over the last few days, I've built a compressor/decompressor that uses the logits from an LLM for each token in the input, takes the ranks, and exponential-Golomb encodes them. Then you work in reverse to regenerate the original.
It took me ages to get the prediction for the second token after "hello" to match the prediction for the second token when running the model on the string "hello world", despite the fact that I was using a causal model. I tried all kinds of things before discovering that `quantized: false` was the important setting.
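The rank-coding round trip can be sketched without a real LLM. Here a deterministic stub predictor stands in for the model; `ranked_candidates` and its distance heuristic are invented for illustration, whereas a real implementation would rank the LLM's logits and then entropy-code the ranks (e.g. with exponential-Golomb codes):

```python
def ranked_candidates(prev, alphabet):
    # Stub predictor: candidates sorted by closeness to the previous
    # symbol, then alphabetically. Any deterministic rule works, as long
    # as the encoder and decoder share it exactly.
    return sorted(alphabet, key=lambda s: (abs(ord(s) - ord(prev)), s))

def encode(text, alphabet):
    # Record, for each symbol, its rank in the predictor's candidate list.
    ranks, prev = [], " "
    for ch in text:
        ranks.append(ranked_candidates(prev, alphabet).index(ch))
        prev = ch
    return ranks

def decode(ranks, alphabet):
    # Replay the same predictor to map ranks back to symbols.
    out, prev = [], " "
    for r in ranks:
        ch = ranked_candidates(prev, alphabet)[r]
        out.append(ch)
        prev = ch
    return "".join(out)
```

The better the predictor, the smaller the ranks, and small ranks compress well under exponential-Golomb coding. The round trip only works because the predictor is bit-for-bit deterministic, which is exactly why nondeterministic inference breaks this scheme.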
What's the Weissman score? Or more seriously :) did it perform well? Sounds like it should. If more and more text is AI slop, it should do well.
I don't fully understand what you said, but I guess higher-probability logits are encoded with fewer bits. If your text is LLM output, then you may need only a bit or two per token?
Very impressive! I guess this still wouldn't affect their original example:
> For example, you might observe that asking ChatGPT the same question multiple times provides different results.
even with 0.0 temperature, due to MoE models routing at the batch level; you're very unlikely to get a deterministic batch.
> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
The router also leaks batch-level information across sequences.
As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.
What I would have loved is a discussion around collectives/multi-node setups, showing how to get determinism at a low performance penalty for multi-node reduction collectives.
Deterministic reproducibility is very different from replicability, and imo the latter is more important; even if the details of the reproducibility are interesting I think they're irrelevant.
There's a similar situation in other scientific disciplines. People want source code and data so they can reproduce results - that basically tells you someone didn't cheat and they documented everything. But it does not tell you whether a real phenomenon was observed.
It's much more interesting to know if roughly the same cause and effect relationships exist so we can predict behavior.
Concretely, there are studies showing that e.g. randomly capitalizing letters can lead to completely different responses from an LLM. That speaks to a fragility that doesn't have anything to do with deterministic reproduction.
At the bottom, LLM inference is sampling the next token from the probability distribution conditioned on the tokens currently in the context window. If the distribution is degenerate, with equal probability for more than one token, the outcome of the sampling will naturally, and as it should, be nondeterministic. It should be left alone.
His solution still relies on greedy (temperature 0) sampling, which is probably not optimal for model performance on various tasks. For example, Gemini 2.5 uses temperature 1 by default. But deterministic inference with temperature >0 can still be achieved by using pseudorandom sampling with a fixed seed.
Conceptually, setting temperature >0 doesn't actually introduce any non-determinism. If your sampler is seeded, then it will always choose the same next token. Higher temperature only flattens the logit distribution.
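A seeded sampler along those lines could look like this. It is a sketch, and it assumes the logits themselves are bit-identical between runs, which is exactly the property the blog post is about:

```python
import math
import random

def sample_token(logits, temperature, seed):
    # Temperature flattens (or sharpens) the distribution, but with a
    # fixed seed the draw itself is fully reproducible.
    t = max(temperature, 1e-7)
    m = max(l / t for l in logits)
    weights = [math.exp(l / t - m) for l in logits]
    rng = random.Random(seed)   # same seed + same weights => same token
    return rng.choices(range(len(logits)), weights=weights)[0]
```

The caveat is that this only holds if the logits are bit-identical across runs; any batch-dependent rounding upstream changes the weights and can flip the draw even with a fixed seed.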
The point of the blog is that even with supposedly deterministic greedy sampling, non-determinism creeps in. This in turn has disastrous effects in very real experiments.
I think this means that the results might also be non-deterministic across hardware revisions, because I don't think they verified that the kernels behave the same on different GPU & TPU versions. How do they know that the compiler will not re-order the operations behind their back?
Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.
Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn't to say it doesn't happen, but if it does, it's likely because the compiler itself changed; the code a given compiler generates is generally run-to-run identical.
You can prevent reordering with sufficient amounts of compiler abuse.
With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.
Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
By setting the temperature to 0 you get greedy decoding, which does a lot more than just making it predictable, and can degrade outputs. Random sampling exists for a reason! Gemini 2.5 Pro in particular doesn't like temp 0, for example.
What is the reasoning behind these schemes? The hope that bits of the properties of legendary companies will rub off onto the new venture?
As if naming the next best venture PARC will inevitably create a breakthrough in networking just by the arrangement of four letters.
Valid point. Floating-point summation is not associative, so the order of accumulation changes the result.
Focus on correctness, not determinism.