LLM Output Drift in Financial Workflows: Validation and Mitigation (arXiv)

34679|3 months ago

Don't use LLMs for financial workflows. Use them to create software for financial workflows. Software doesn't "drift".

LLM-created software might

raffisk|3 months ago

Empirical study on LLM output consistency in regulated financial tasks (RAG, JSON, SQL). Governance focus: smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0, passing audits (FSB/BIS/CFTC), vs. larger models like GPT-OSS-120B at 12.5%. The gap is large (87.5 percentage points, p<0.0001, n=16) and survives multiple-testing corrections.
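
For a rough sense of why a gap like that clears p<0.0001 even at n=16, here is a minimal sketch using a two-sided Fisher exact test. The counts are an assumption for illustration only: 100% and 12.5% are read as 16/16 and 2/16 deterministic runs, and the paper may well use a different test or a different unit of analysis. `fisher_exact_two_sided` is a hypothetical helper, not something from the paper's harness.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )

# ASSUMED counts: small model 16 deterministic / 0 not, large model 2 / 14.
p_value = fisher_exact_two_sided(16, 0, 2, 14)
print(f"p = {p_value:.2e}")
```

Under those assumed counts the p-value lands several orders of magnitude below 0.0001, so the significance claim is at least plausible on its face.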

Caveat: this measures reproducibility (edit distance), not full accuracy. Determinism is necessary for compliance but still needs semantic checks on top (e.g., embeddings to ground truth). Includes harness, invariants (±5%), and attestation.
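
A minimal sketch of what a reproducibility check along these lines could look like. It assumes repeated runs of the same prompt are collected as strings; the function names (`determinism_rate`, `max_drift`, `within_tolerance`) are hypothetical and are not taken from the paper's harness.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def determinism_rate(outputs: list[str]) -> float:
    """Fraction of runs that exactly match the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)

def max_drift(outputs: list[str]) -> int:
    """Largest edit distance from the most common output to any run."""
    reference = Counter(outputs).most_common(1)[0][0]
    return max(levenshtein(reference, o) for o in outputs)

def within_tolerance(value: float, expected: float, tol: float = 0.05) -> bool:
    """Invariant check: value is within +/- tol (default 5%) of expected."""
    return abs(value - expected) <= tol * abs(expected)

runs = ["SELECT SUM(amount) FROM trades"] * 3 + ["SELECT SUM(amt) FROM trades"]
print(determinism_rate(runs), max_drift(runs))
```

Note that none of this says whether the most common output is correct, which is the gap the semantic checks are meant to cover.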

Thoughts on the inverse size-reliability relationship? Planning a follow-up with accuracy metrics, not just reproducibility.

throwdbaaway|3 months ago

It is the reasoning. During the reasoning process, the top few tokens have very similar or even the same logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending:

    {"role": "assistant", "content": "<think></think>"}

Of course, the model will be less capable without reasoning.
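
To illustrate the near-tie point, a toy sketch with made-up logits (not real gpt-oss-120b values): when the top two candidates differ by about 1e-7, a perturbation of that size is enough to flip which token greedy decoding picks, and everything generated afterwards can diverge.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def greedy_pick(logits: list[float]) -> int:
    """Index of the highest logit, i.e. what T=0 decoding selects."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for the same decoding step on two runs. The only
# difference is noise on the order of 1e-7 between the top two tokens.
logits_run_a = [2.0000001, 2.0000000, -1.0]
logits_run_b = [2.0000000, 2.0000001, -1.0]

logprobs_a = log_softmax(logits_run_a)
gap = abs(logprobs_a[0] - logprobs_a[1])

print("logprob gap:", gap)
print("run A picks token", greedy_pick(logits_run_a))
print("run B picks token", greedy_pick(logits_run_b))
```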

colechristensen|3 months ago

Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means. I thought T=0 was deterministic by definition.

Is this perhaps an inference implementation detail somehow introducing randomness?
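
A small sketch of both halves of that question, using toy numbers. Mathematically, dividing logits by a temperature and taking the softmax collapses onto the argmax as T approaches 0, so the sampling step itself is deterministic. Floating-point addition is not associative, though, so if an implementation sums in a different order from run to run (an assumption about the serving stack, not something measured here), the logits feeding that argmax can differ in their last bits.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("T=0 is handled as plain argmax, not by dividing")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]
probs_hot = softmax_with_temperature(logits, 1.0)    # mass spread across tokens
probs_cold = softmax_with_temperature(logits, 0.01)  # nearly all mass on argmax

# Same three numbers, two summation orders, two different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(probs_hot, probs_cold)
print(left == right, left, right)
```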

doctorpangloss|3 months ago

“Determinism is necessary for compliance”

Says who?

The stuff you comply with changes in real time. How’s that for determinism?

measurablefunc|3 months ago

This is b/c these things are Markov chains. You can not expect consistent results & outputs.

SrslyJosh|3 months ago

Using an LLM for a "financial workflow" makes as much sense as integrating one with Excel. But who needs correct results when you're just working with money, right? ¯\_(ツ)_/¯

ACCount37|3 months ago

Did you actually read what the paper was about before leaving a low quality comment?
