Fixing "theoretical" nondeterminism for a totally closed, individual input-output pair doesn't solve the two "practical" nondeterminism problems: the exact same input giving different results under different preceding context, and a slightly transformed input failing to give a correspondingly transformed result.
Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.
There is no such thing as "exactly the same input, but with different preceding context". The preceding context is input!
If you were to obtain exactly the same output for a given input prompt regardless of context, that would mean the context is being ignored, which is indistinguishable from the session not maintaining any context at all, such that each prompt lands in a brand-new empty context.
Now what some people want is requirements like:
- Different wordings of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital", the answer should be verbatim identical.
- Prior context should not change responses in ways that have no interaction with the context. For instance, if the prompt "what is 2 + 2" is given, the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.
These kinds of requirements betray a misunderstanding of what these LLMs are.
> where the exact same input gives different results given different preceding context
Why and how is this a problem?
If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why would I want that? It's not how I expect a tool to work (I expect vim to respond differently to my input after I switch to insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.
Why do you care about determinism in a probabilistic system? What difference does it make to the end user if the input "How do I X?" always produces the same deterministic output, when semantically equivalent inputs "how do i x?", "how do I x", and "how do I X??" are bound to produce different answers that often won't even be semantically equivalent?
What LLMs need is the ability to guarantee semantically-equivalent outputs for all semantically-equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.
Not all LLM-based applications are user-facing free-form chat.
If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome. The same applies to running your prompt through the DSPy Optimizer. [0] There are countless other examples: basically any situation where you are in control of the prompt, the token-level input to the LLM, so there's no fuzziness.
In that case, if you have eliminated token-level fuzziness and can guarantee that you're not introducing it from your own end, you can map out a much more reliable tree or graph structure of your system's behavior.
[0]: https://dspy.ai/#2-optimizers-tune-the-prompts-and-weights-o...
You aren't wrong, but that doesn't mean this level of determinism isn't useful. If you don't even have the guarantee that the exact same input tokens produce the exact same output tokens, then it's very hard to share reproducible results with peers, which can be useful if you are, say, red-teaming an LLM to produce a very rare or unreliable output.
I'm actually working on something similar to this where you can encode information into the outputs of LLM's via steganography: https://github.com/sutt/innocuous
Since I'm really looking to sample only the top ~10 tokens, and I mostly test on CPU-based inference of 8B models, there's probably not a lot to worry about in getting a different order of the top tokens based on the hardware implementation. But I'm still going to take a look at it eventually, and build in guard conditions against any choice that would be changed by an epsilon of precision loss.
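Such a guard condition could be sketched as follows. This is a minimal illustration, not code from the project above: `stable_topk_choices`, the epsilon value, and the plain-list logits are all assumptions made for the example.

```python
def stable_topk_choices(logits, k=10, eps=1e-4):
    """Return the top-k token ids, plus a flag per token marking choices
    whose rank could flip under a perturbation smaller than eps
    (e.g. hardware-dependent rounding differences)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = ranked[:k]
    flags = []
    for pos, tok in enumerate(top):
        # Distance to the neighbors in the ranking; a tiny gap means the
        # ordering is sensitive to an epsilon of precision loss.
        above = logits[ranked[pos - 1]] - logits[tok] if pos > 0 else float("inf")
        below = (logits[tok] - logits[ranked[pos + 1]]
                 if pos + 1 < len(ranked) else float("inf"))
        flags.append(min(above, below) < eps)
    return top, flags
```

Flagged tokens could then be excluded from the steganographic channel so that no encoded bit depends on an ordering that different hardware might resolve differently.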
It would be very useful for AI platform customers. You could run prompts with temperature 0 and check whether the results are the same, making sure the AI provider is not switching the PRO model for a cheap one in the background and ripping you off.
For "bug" reproduction purposes: it is easier to debug a model if the same string always produces the same incorrect or strange LLM output, not just every 100th time you run it.
Exactly my thinking - but semantic equivalence is also only relevant when the output needs to be factual, not necessarily for ALL outputs (if we're aiming for LLMs to present as "human", or for interactions with LLMs to be naturally conversational). This excludes the world where LLMs act as agents, where you would of course always want the LLM to be factual and thus deterministic.
When you do MCP-style applications, an LLM is more like RegEx on steroids, and since you expect your regex to return the same matches on the same input, it is a very desirable attribute for LLMs as well. I would say it is more than desirable, it is necessary.
If I want to convert "how do I x" to `api.howTo("x")`, it is very important that I get the exact same result every time.
Deterministic output is needed when LLMs are used for validations. This can be anything from input validation at runtime to a CI check leveraging LLMs. It can be argued this is not an acceptable use of AI, but it will become increasingly common and it will need to be tweaked/tested. You cannot tweak/test a response you don't know you're going to get.
I agree that we need stochasticity in a probabilistic system, but I also think it would be good to control it. For example, we need the stochasticity introduced at high temperatures since it is inherent to the model, but we don’t need stochasticity in matrix computations, as it is not required for modeling.
I don't think the claim is that this is particularly helpful for consumer-facing applications. But from a research perspective, this is invaluable for allowing reproducibility.
Sometimes the reason for non-determinism is implementation-specific. For instance, in GPT-2's source code (I haven't checked other model versions), setting the temperature to 0 in the GUI does not lead to a value of 0 but to "epsilon" (a very small value larger than 0), to avoid a division-by-zero error in the code, which makes sense.
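The workaround described there might look like this in a generic sampler. This is a sketch, not GPT-2's actual code, and the epsilon value is illustrative:

```python
import math

def softmax_with_temperature(logits, temperature, eps=1e-6):
    # temperature == 0 would divide by zero, so clamp to a tiny epsilon;
    # dividing by epsilon makes the distribution effectively one-hot
    # (near-greedy decoding) instead of crashing.
    t = max(temperature, eps)
    scaled = [l / t for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With temperature 0, the clamped division concentrates essentially all probability mass on the argmax token, which is why "temperature 0" behaves like greedy decoding in most implementations.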
For many applications, non-determinism implies "useless".
This has been a long-standing issue with LDA topic models. In particular, in the legal, financial, and regulatory domains, a method that is not deterministic may be illegal to use, or may trigger follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved so that one can go back and reconstruct what exactly happened to a particular user in a particular second).
> in collaboration with others at Thinking Machines
If you're old enough, you might remember Danny Hillis' Thinking Machines from the late 80s. I wish they had chosen a different name (I say this for nostalgic reasons, having been in front of one of those cubes glowing with red LEDs back in the late 80s at MIT's AI Lab, renamed to CSAIL at some point). Feynman did some amazing work on that, too: https://longnow.org/ideas/richard-feynman-and-the-connection...
In the U.S., the “THINKING MACHINES” trademarks were owned by Thinking Machines Corporation (the company Hillis co-founded), not Hillis personally, and those registrations were cancelled in 1998–1999.
The company itself went bankrupt in 1994 and its assets were dispersed (e.g., to Sun Microsystems, later Oracle).
There’s a new, pending USPTO application for “THINKING MACHINES” filed in 2025 by Thinking Machines Lab Inc., the company founded by Mira Murati.
I love high-quality, blog-post-style research discussion - Anthropic has been leading the charge on this recently, and it's great to see it spreading. OpenAI was also doing this back during the RL research days.
Natural language is ambiguous. It needs to be. I think the approach here of trying to figure out how to make circles into squares, and argue why circles should be squares, is misguided.
Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.
Are you talking about the “Thinking Machines” company that shut down in 1994? Took me some digging to figure it out, doesn’t seem well-known enough to be the reason - it’s just a nice (and relatively obvious) name.
> But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism. For example, a recent arXiv preprint writes
I'm honored to see that Mira and co. appreciated the feedback on this very topic that I gave 7 months ago here (https://news.ycombinator.com/item?id=42952605#42960047) :D
> You don't need RNG since the whole transformer is an extremely large floating-point arithmetic unit. A wild guess - how about the source of non-determinism is coming from the fact that, on the HW level, tensor execution order is not guaranteed and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?
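That rounding effect is easy to demonstrate in any IEEE-754 environment, for example in plain Python (double precision):

```python
# Floating-point addition is commutative but not associative: the
# grouping chosen by the hardware or scheduler changes the rounded result.
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6

# The discrepancy can be as large as losing a term entirely:
print((1e20 + -1e20) + 1.0)  # 1.0
print(1e20 + (-1e20 + 1.0))  # 0.0  (the 1.0 is absorbed into -1e20 and lost)
```

When a GPU reduces partial sums in whatever order its cores finish, it is effectively choosing between groupings like these on every accumulation.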
I really hope we will get deterministic LLMs in the future. Even if it causes slightly slower response times.
Nondeterminism is what currently keeps me from working with other developers.
As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".
It's similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images, because most image models will not create the same image when given the same prompt and parameters.
[1]: https://www.gibney.org/prompt_coding
For fun over the last few days, I've built a compressor/decompressor that uses the logits from an LLM for each token in the input, takes the ranks, and exponential-Golomb encodes them. Then you work in reverse to regenerate the original.
It took me ages to get the prediction for the second token after "hello" to match the prediction for the second token when running the model on the string "hello world", despite the fact that I was using a causal model. I tried all kinds of things before discovering that `quantized: false` was the important setting.
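The rank-coding round trip can be sketched without a real LLM. Here a deterministic stub predictor stands in for the model; `ranked_candidates` and its distance heuristic are invented for illustration, whereas a real implementation would rank the LLM's logits and then entropy-code the ranks (e.g. with exponential-Golomb codes):

```python
def ranked_candidates(prev, alphabet):
    # Stub predictor: candidates sorted by closeness to the previous
    # symbol, then alphabetically. Any deterministic rule works, as long
    # as the encoder and decoder share it exactly.
    return sorted(alphabet, key=lambda s: (abs(ord(s) - ord(prev)), s))

def encode(text, alphabet):
    # Record, for each symbol, its rank in the predictor's candidate list.
    ranks, prev = [], " "
    for ch in text:
        ranks.append(ranked_candidates(prev, alphabet).index(ch))
        prev = ch
    return ranks

def decode(ranks, alphabet):
    # Replay the same predictor to map ranks back to symbols.
    out, prev = [], " "
    for r in ranks:
        ch = ranked_candidates(prev, alphabet)[r]
        out.append(ch)
        prev = ch
    return "".join(out)
```

The better the predictor, the smaller the ranks, and small ranks compress well under exponential-Golomb coding. The round trip only works because the predictor is bit-for-bit deterministic, which is exactly why nondeterministic inference breaks this scheme.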
What's the Weissman score? Or more seriously :) did it perform well? Sounds like it should. If more and more text is AI slop, it should do well.
I don't fully understand what you said, but I guess higher-probability logits are encoded with fewer bits. If your text is LLM output, then you may need only a bit or two per token?
Very impressive! I guess this still wouldn't affect their original example:
> For example, you might observe that asking ChatGPT the same question multiple times provides different results.
even with 0.0 temperature, due to MoE models routing at the batch level; you're very unlikely to get a deterministic batch.
> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
The router also leaks batch-level information across sequences.
As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.
What I would have loved is a discussion around collectives/multi-node setups, showing how to get determinism at a low performance penalty for multi-node reduction collectives.
Deterministic reproducibility is very different from replicability, and imo the latter is more important; even if the details of the reproducibility are interesting I think they're irrelevant.
There's a similar situation in other scientific disciplines. People want source code and data so they can reproduce results - that basically tells you someone didn't cheat and they documented everything. But it does not tell you whether a real phenomenon was observed.
It's much more interesting to know if roughly the same cause and effect relationships exist so we can predict behavior.
Concretely, there are studies showing that e.g. randomly capitalizing letters can lead to completely different responses from an LLM. That speaks to a fragility that doesn't have anything to do with deterministic reproduction.
At the bottom, LLM inference is sampling the next token from the probability distribution conditioned on the tokens currently in the context window. If the distribution is degenerate, with equal probability for more than one token, the outcome of the sampling will naturally, and as it should, be nondeterministic. It should be left alone.
His solution still relies on greedy (temperature 0) sampling, which is probably not optimal for model performance on various tasks. For example, Gemini 2.5 uses temperature 1 by default. But deterministic inference with temperature >0 can still be achieved by using pseudorandom sampling with a fixed seed.
Conceptually, setting temperature >0 doesn't actually introduce any non-determinism. If your sampler is seeded, then it will always choose the same next token. Higher temperature only flattens the logit distribution.
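A seeded sampler along those lines could look like this. It is a sketch, and it assumes the logits themselves are bit-identical between runs, which is exactly the property the blog post is about:

```python
import math
import random

def sample_token(logits, temperature, seed):
    # Temperature flattens (or sharpens) the distribution, but with a
    # fixed seed the draw itself is fully reproducible.
    t = max(temperature, 1e-7)
    m = max(l / t for l in logits)
    weights = [math.exp(l / t - m) for l in logits]
    rng = random.Random(seed)   # same seed + same weights => same token
    return rng.choices(range(len(logits)), weights=weights)[0]
```

The caveat is that this only holds if the logits are bit-identical across runs; any batch-dependent rounding upstream changes the weights and can flip the draw even with a fixed seed.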
The point of the blog is that even with supposedly deterministic greedy sampling, non-determinism creeps in. This in turn has disastrous effects in very real experiments.
I think this means that the results might also be non-deterministic across hardware revisions, because I don't think they verified that the kernels behave the same on different GPU & TPU versions. How do they know that the compiler will not re-order the operations behind their back?
Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.
Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn't to say it doesn't happen, but if it does, it's likely because the compiler itself changed; the code a given compiler generates is generally run-to-run identical.
You can prevent reordering with sufficient amounts of compiler abuse.
With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.
Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
By setting the temperature to 0 you get greedy decoding, which does a lot more than just making it predictable, and can degrade outputs. Random sampling exists for a reason! Gemini 2.5 Pro in particular doesn't like temp 0, for example.
What is the reasoning behind these schemes? The hope that bits of the properties of legendary companies will rub off onto the new venture?
As if naming the next best venture PARC will inevitably create a breakthrough in networking just by the arrangement of four letters.
Valid point. Floating-point summation is not associative, so the order of accumulation changes the result.
Focus on correctness, not determinism.