top | item 45905452


raffisk | 3 months ago

Empirical study on LLM output consistency in regulated financial tasks (RAG, JSON, SQL). Governance focus: smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0, passing audits (FSB/BIS/CFTC), vs. larger models like GPT-OSS-120B at 12.5%. The gaps are huge (87.5%, p<0.0001, n=16) and survive multiple-testing corrections.

Caveat: Measures reproducibility (edit distance), not full accuracy—determinism is necessary for compliance but needs semantic checks (e.g., embeddings to ground truth). Includes harness, invariants (±5%), and attestation.
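For readers wondering what the reproducibility check might look like, here's a minimal sketch. This is not the paper's actual harness; function names are made up, and `difflib`'s ratio stands in for whatever normalized edit-distance metric the study used:

```python
from difflib import SequenceMatcher

def reproducibility(outputs):
    """Worst-case pairwise similarity of repeated model outputs
    (1.0 = fully deterministic). Uses difflib's ratio as a stand-in
    for a normalized edit distance; the study's metric may differ."""
    base = outputs[0]
    scores = [SequenceMatcher(None, base, o).ratio() for o in outputs[1:]]
    return min(scores)

def is_deterministic(outputs):
    """True only if every run produced byte-identical output."""
    return all(o == outputs[0] for o in outputs)

# 16 identical runs -> deterministic; one drifted token -> not
runs = ["SELECT amt FROM trades"] * 16
assert is_deterministic(runs)
runs[7] = "SELECT amount FROM trades"
assert not is_deterministic(runs)
```

The "deterministic" criterion here is exact-match across all n=16 runs; the similarity score is only useful for quantifying how far non-deterministic runs drift.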

Thoughts on the inverse size-reliability relationship? Planning a follow-up with accuracy metrics rather than just reproducibility.



throwdbaaway|3 months ago

It is the reasoning. During the reasoning process, the top few tokens have very similar or even same logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending:

    {"role": "assistant", "content": "<think></think>"}
Of course, the model will be less capable without reasoning.
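For an OpenAI-compatible chat endpoint, the workaround amounts to pre-filling an assistant turn with empty think tags so the model skips its reasoning phase. A sketch (model name and prompts are placeholders; adapt to your inference server):

```python
# Placeholder conversation; the key line is the final assistant turn,
# which pre-fills an empty reasoning block per the comment above.
messages = [
    {"role": "system", "content": "You are a financial QA assistant."},
    {"role": "user", "content": "Summarize the attached filing as JSON."},
    {"role": "assistant", "content": "<think></think>"},
]

payload = {
    "model": "gpt-oss-120b",
    "messages": messages,
    "temperature": 0.0,  # greedy decoding; still not a determinism guarantee
}
```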

raffisk|3 months ago

Good call—reasoning token variance is likely a factor, especially with logprob clustering at T=0. Your <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.

Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.

colechristensen|3 months ago

Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means; I thought T=0 was, by definition, deterministic.

Is this perhaps inference implementation details somehow introducing randomness?
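For context: temperature rescales the logits before sampling, and T=0 is conventionally special-cased as greedy argmax, which is deterministic only if the logits themselves are reproducible. A toy sketch of both conventions (this is illustrative, not any particular engine's implementation):

```python
import math

def sample_t0(logits):
    """T=0 'sampling': pick the highest-logit token (greedy argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def softmax_with_temperature(logits, t):
    """T>0: divide logits by t, then softmax. As t -> 0,
    probability mass piles onto the max-logit token."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# A near-tie in logits: tiny numeric noise (e.g. from varying batch
# reduction order) can flip which token wins the argmax.
logits = [2.0, 2.0000001, 1.0]
assert sample_t0(logits) == 1
```

So T=0 is deterministic as a sampling rule; the nondeterminism people observe comes from the logits not being bit-identical across runs, which near-ties then amplify into different tokens.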

kakugawa|3 months ago

Defeating Nondeterminism in LLM Inference

https://news.ycombinator.com/item?id=45200925

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

tl;dr: the way inference is batched introduces non-determinism.
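The underlying mechanism is floating-point non-associativity: change the reduction order (as different batch sizes do inside GPU kernels), and bit-identical inputs can sum to different values. A two-line illustration in plain Python:

```python
# Floating-point addition is not associative, so the order in which
# partial sums are reduced (which varies with batch size) changes the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one reduction order
right = a + (b + c)  # another reduction order
assert left != right  # 0.6000000000000001 vs 0.6
```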

doctorpangloss|3 months ago

“Determinism is necessary for compliance”

Says who?

The stuff you comply with changes in real time. How’s that for determinism?

raffisk|3 months ago

Author here—fair point, regs are a moving target. But FSB/BIS/CFTC explicitly require reproducible outputs for audits (no random drift in financial reports). Determinism = traceability, which holds even as the rules themselves update.

Most groups I work with stick to traditional automation/rules systems, but top-down mandates are pushing them toward frontier models for general tasks—which then get plugged into these workflows. A lot stays in sandbox, but you'd be surprised what's already live in fin services.

The authorities I cited (FSB/BIS/CFTC) literally just said last month AI monitoring is "still at early stage" cc https://www.fsb.org/2024/11/the-financial-stability-implicat...

Curious how you'd tackle that real-time changing reg?

ulrashida|3 months ago

Please give an example of a statutory compliance item that "changes in real time".

That's not the way regulations work. Your compliance is measured against a fixed version of legislation.

nomel|3 months ago

Also, what happens if you add a space to the end of the prompt? Or write a 12.00 to 12.000?
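A sketch of the kind of perturbation test this hints at—superficially equivalent prompts that tokenize differently and so can change a "deterministic" pipeline's output (the helper name is made up):

```python
def perturbations(prompt):
    """Generate superficially equivalent prompt variants that may
    tokenize differently, and therefore change the model output."""
    return [
        prompt,
        prompt + " ",                       # trailing space
        prompt + "\n",                      # trailing newline
        prompt.replace("12.00", "12.000"),  # numeric formatting
    ]

variants = perturbations("Report the fee as 12.00 USD.")
# Each variant would go through the same T=0 pipeline; any divergence
# in outputs flags sensitivity to cosmetic prompt changes.
```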