Empirical study of LLM output consistency on regulated financial tasks (RAG, JSON, SQL), with a governance focus. Smaller models (Qwen2.5-7B, Granite-3-8B) reach 100% determinism at T=0.0 and pass audits (FSB/BIS/CFTC), whereas larger models such as GPT-OSS-120B reach only 12.5%. The gap is large (87.5 percentage points, p<0.0001, n=16) and survives multiple-testing corrections.
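For anyone who wants to sanity-check the headline numbers, here is a minimal sketch. It assumes that n=16 means 16 runs per model (so 12.5% is 2 of 16) and that a two-sided Fisher exact test is a reasonable stand-in; the post does not say which test was actually used, and `fisher_exact_two_sided` is a hypothetical helper, not part of the study's harness.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "deterministic" runs in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probability of every table no more likely than the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9)
    )


# Assumed counts: 16 of 16 identical runs vs. 2 of 16 (12.5%).
small_same, small_diff = 16, 0
large_same, large_diff = 2, 14

gap = small_same / 16 - large_same / 16
p_value = fisher_exact_two_sided(small_same, small_diff, large_same, large_diff)
print(f"gap = {gap:.3f}, p = {p_value:.2e}")
```

Under those assumed counts the p-value comes out well below the quoted 0.0001 threshold, so the claim is at least arithmetically plausible.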
Caveat: this measures reproducibility (edit distance), not full accuracy. Determinism is necessary for compliance but still needs semantic checks (e.g., embedding similarity to ground truth). It includes the harness, invariants (±5%), and attestation.
Thoughts on the inverse relationship between size and reliability? I'm planning a follow-up that adds accuracy metrics rather than just reproducibility.
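To make the reproducibility-versus-accuracy distinction concrete, here is a rough sketch of an edit-distance consistency check over repeated runs. The function names and the report fields are hypothetical; the actual harness may define its metric differently.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free on a match)
                )
            )
        prev = curr
    return prev[-1]


def consistency_report(outputs: list[str]) -> dict:
    """Compare every run against the most common output (hypothetical metric)."""
    modal, modal_count = Counter(outputs).most_common(1)[0]
    distances = [levenshtein(modal, out) for out in outputs]
    return {
        "exact_match_rate": modal_count / len(outputs),
        "max_edit_distance": max(distances),
    }


# Identical runs score as perfectly reproducible, whether or not the SQL is right.
stable = consistency_report(["SELECT SUM(amt) FROM tx"] * 4)
drifting = consistency_report(["a", "a", "b", "ab"])
print(stable, drifting)
```

Note that `stable` would score the same even if the query were wrong every time, which is why a separate semantic check against ground truth is still needed.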
It is the reasoning. During the reasoning process, the top few tokens have very similar or even identical logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending:
Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means. I thought T=0 was deterministic by definition.
Could inference implementation details somehow be introducing randomness?
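One plausible mechanism, shown as a toy sketch rather than a claim about any particular inference stack: floating-point addition is not associative, so if the order of a reduction changes between runs (which parallel kernels can do), a logit can shift by one unit in the last place. When two candidate tokens are nearly tied, that is enough to change the greedy pick even at T=0. The token names and values below are made up for illustration.

```python
def accumulate(terms: list[float]) -> float:
    """Plain left-to-right float summation (deliberately not the built-in sum)."""
    total = 0.0
    for t in terms:
        total += t
    return total


def greedy_token(logits: dict[str, float]) -> str:
    """T=0 decoding: take the highest logit; ties go to the first entry."""
    return max(logits, key=logits.get)


# Hypothetical partial products that add up to the "yes" logit.
terms = [0.1, 0.2, 0.3]

forward = accumulate(terms)         # (0.1 + 0.2) + 0.3
backward = accumulate(terms[::-1])  # (0.3 + 0.2) + 0.1

# "no" has a fixed logit that sits right at the near-tie.
pick_forward = greedy_token({"no": 0.6, "yes": forward})
pick_backward = greedy_token({"no": 0.6, "yes": backward})

print(repr(forward), repr(backward))
print(pick_forward, pick_backward)
```

The same inputs and the same greedy rule give different tokens purely because of summation order, and once one token differs, the rest of the generation can diverge from there.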
Using an LLM for a "financial workflow" makes as much sense as integrating one with Excel. But who needs correct results when you're just working with money, right? ¯\_(ツ)_/¯
34679|3 months ago
wild_pointer|3 months ago
raffisk|3 months ago
throwdbaaway|3 months ago
colechristensen|3 months ago
doctorpangloss|3 months ago
Says who?
The stuff you comply with changes in real time. How’s that for determinism?
measurablefunc|3 months ago
SrslyJosh|3 months ago
ACCount37|3 months ago