
anulum | 4 days ago

@soletta Great question — this is exactly why we built it this way.

*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.

*Why DeBERTa + RAG wins here*:

- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) takes 400–2000 ms per check. You can’t do that mid-stream without killing UX.
- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.
- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).
- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
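To make the scoring cadence concrete, here is a minimal sketch of how a blended NLI + RAG gate could work. The 0.4× RAG weight and the every-~4-tokens cadence come from the description above; the 0.6 NLI weight, the 0.5 halt threshold, and all function names are illustrative assumptions, not Director-AI’s actual API.

```python
# Hedged sketch: gate a token stream on a blended NLI + RAG score.
# Weights/threshold are assumptions except the 0.4x RAG weight from the thread.

NLI_WEIGHT = 0.6   # assumed complement of the 0.4x RAG weight
RAG_WEIGHT = 0.4   # RAG weight mentioned above
CHECK_EVERY = 4    # score roughly every 4 generated tokens
HALT_BELOW = 0.5   # illustrative halt threshold


def blended_score(nli_entailment: float, rag_support: float) -> float:
    """Combine the NLI entailment probability with retrieval-grounded support."""
    return NLI_WEIGHT * nli_entailment + RAG_WEIGHT * rag_support


def should_halt(tokens_emitted: int, nli: float, rag: float) -> bool:
    """Only check on every CHECK_EVERY-th token; halt if the blend is too low."""
    if tokens_emitted % CHECK_EVERY != 0:
        return False
    return blended_score(nli, rag) < HALT_BELOW
```

The point of checking only every few tokens is that a ~220 ms scorer can keep up with generation, whereas a per-token frontier-LLM judge cannot.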

We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced acc on LLM-AggreFact 29k samples — see full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + actual streaming halt that no one else ships today.

Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).

Have you tried frontier self-critique in real streaming agents? What broke for you?

Repo benchmarks: https://github.com/anulum/director-ai#benchmarks

soletta | 4 days ago

I should have been clearer. I'm not talking about making a separate call to the model to ask it to check itself. Any given model essentially is already watching for contradictions all the time as it is generating its output tokens. Frontier models like Claude Opus 4.6 are already exceptionally good at not contradicting themselves as they go. As for not having an external fact base - you could in principle insert content ephemerally into the context that is relevant to the task at hand, though doing this without killing modern prompt caching schemes is challenging.

I saw your benchmarks; what I was asking for is benchmarks of the full system (LLM + the NLI model) vs. a frontier LLM on its own. It's fine if you didn't do them, but I think it hurts your case.

anulum | 4 days ago

@soletta Got it — thanks for the extra clarity, that’s an important distinction.

You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore.

Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base:

- Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.
- The guardrail gives a *hard token-level halt* (the Rust kernel severs the stream) instead of “hoping” the model self-corrects in the next few tokens.
- You get full audit logs: exact NLI score + which facts conflicted.
- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability.
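The hard-halt-plus-audit-log idea can be sketched as a generator wrapper. This is an illustration under assumed names (the `check` callback, the 0.5 threshold, and the log fields are all hypothetical), not the Rust kernel’s real interface:

```python
# Hedged sketch: sever a token stream when a guardrail check fails,
# recording an audit entry with the score and the conflicting facts.
from typing import Callable, Iterable, Iterator, Optional

def guarded_stream(
    tokens: Iterable[str],
    check: Callable[[str], tuple],   # returns (score, conflicting_facts)
    threshold: float = 0.5,          # assumed halt threshold
    audit_log: Optional[list] = None,
) -> Iterator[str]:
    """Yield tokens until the guardrail score drops below threshold, then stop."""
    buffer = ""
    for token in tokens:
        buffer += token
        score, conflicts = check(buffer)
        if score < threshold:
            if audit_log is not None:
                audit_log.append({
                    "score": score,
                    "conflicts": conflicts,
                    "halted_at": buffer,
                })
            return  # hard halt: the offending token is never emitted
        yield token
```

Because the check runs before `yield`, the failing token never reaches the user, which is the difference between enforcement and hoping the model self-corrects downstream.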

You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks focus on the guardrail component itself (66.2% balanced acc on LLM-AggreFact 29k samples, with full per-dataset breakdown and comparison table vs MiniCheck/Bespoke/HHEM — see README).

That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results.

In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find.

Benchmarks section: https://github.com/anulum/director-ai#benchmarks