
soletta | 4 days ago

Sounds interesting. What makes DeBERTa + RAG any better at detecting contradictions in the context than a frontier LLM, and why? I see that the NLI scorer itself was evaluated, but I’d love to see data on how the full system performs vs. SotA if you have any on hand.

anulum | 4 days ago

@soletta Great question — this is exactly why we built it this way.

*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.

*Why DeBERTa + RAG wins here*:

- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can’t do that mid-stream without killing UX.
- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.
- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).
- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
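As a rough sketch, the weighted combination described above could look like this (function and constant names are my own illustration, not the actual director-ai API):

```python
# Illustrative only: fold the NLI support score and the 0.4x RAG-grounding
# score into a single support value, checked every few streamed tokens.

NLI_WEIGHT = 1.0
RAG_WEIGHT = 0.4   # the 0.4x GroundTruthStore weight mentioned above
CHECK_EVERY = 4    # score roughly every 4 generated tokens

def combined_support(nli_support: float, rag_support: float) -> float:
    """Weighted average of the two support scores, in [0, 1]."""
    total = NLI_WEIGHT + RAG_WEIGHT
    return (NLI_WEIGHT * nli_support + RAG_WEIGHT * rag_support) / total
```

Because both inputs are in [0, 1] and the weights are normalized, the combined score stays in [0, 1], so a single fixed threshold works for the halt decision.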

We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced acc on LLM-AggreFact 29k samples — see full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + actual streaming halt that no one else ships today.
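To make "streaming halt" concrete, a toy version of the guard loop could look like this (a sketch under my own assumptions; `score_fn` stands in for the NLI + RAG scorer, and the threshold value is made up):

```python
def guarded_stream(tokens, score_fn, threshold=0.5, check_every=4):
    """Yield tokens as they stream, halting once the grounding score
    for the text emitted so far drops below the threshold."""
    emitted = []
    for i, tok in enumerate(tokens, start=1):
        emitted.append(tok)
        yield tok
        if i % check_every == 0 and score_fn("".join(emitted)) < threshold:
            return  # halt the stream mid-generation
```

The point of the ~220 ms scoring budget above is exactly that this check has to fit inside the gap between token batches without the user noticing.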

Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).

Have you tried frontier self-critique in real streaming agents? What broke for you?

Repo benchmarks: https://github.com/anulum/director-ai#benchmarks

soletta | 4 days ago

I should have been clearer. I'm not talking about making a separate call to the model to ask it to check itself. Any given model is essentially already watching for contradictions as it generates its output tokens. Frontier models like Claude Opus 4.6 are already exceptionally good at not contradicting themselves as they go. As for not having an external fact base: you could in principle insert relevant content ephemerally into the context for the task at hand, though doing this without killing modern prompt caching schemes is challenging.
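For what it's worth, the cache-friendly version of that ephemeral insertion usually reduces to an ordering constraint: keep the long static prefix byte-identical so the provider's prompt cache still hits, and append the volatile retrieved facts at the tail. A sketch (all names here are mine):

```python
def build_prompt(cached_prefix: str, ephemeral_facts: list[str], user_msg: str) -> str:
    """Place the static prefix first (unchanged bytes = cache hit) and
    append the per-request retrieved facts after it."""
    facts = "\n".join(f"- {f}" for f in ephemeral_facts)
    return f"{cached_prefix}\n\nRelevant facts:\n{facts}\n\nUser: {user_msg}"
```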

I saw your benchmarks; what I was asking for is benchmarks of the full system (LLM + the NLI model) vs. a frontier LLM on its own. It's fine if you didn't do them, but I think it hurts your case.