
anulum | 4 days ago

@soletta Got it — thanks for the extra clarity, that’s an important distinction.

You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore.

Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base:

- Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.
- The guardrail gives a *hard token-level halt* (the Rust kernel severs the stream) instead of "hoping" the model self-corrects in the next few tokens.
- You get full audit logs: the exact NLI score plus which facts conflicted.
- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability.
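To make the "hard halt" idea concrete, here's a minimal sketch of the pattern (all names here are hypothetical, not Director-AI's actual API): buffer the token stream, score each completed sentence against retrieved facts with an NLI-style scorer, and sever the stream on a threshold violation so the conflicting sentence never reaches the caller.

```python
# Sketch only: `retrieve_facts` and `nli_score` stand in for a fact store
# lookup and an NLI entailment scorer; neither is Director-AI's real API.
from typing import Callable, Iterable, Iterator

def guarded_stream(
    tokens: Iterable[str],
    retrieve_facts: Callable[[str], list[str]],
    nli_score: Callable[[str, str], float],  # entailment score in [0, 1]
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield tokens sentence-by-sentence; halt hard on a factual conflict."""
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.rstrip().endswith((".", "!", "?")):  # crude sentence boundary
            claim = "".join(buf).strip()
            facts = retrieve_facts(claim)
            if facts and max(nli_score(f, claim) for f in facts) < threshold:
                # Hard halt: the conflicting sentence is never emitted.
                # A real system would also log the score and conflicting facts.
                return
            yield from buf
            buf = []
    yield from buf  # flush any trailing partial sentence
```

The real system checks at token granularity inside the Rust kernel; the sketch checks at sentence boundaries only to keep the control flow obvious.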

You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks cover the guardrail component itself (66.2% balanced accuracy on the ~29k-sample LLM-AggreFact suite, with a full per-dataset breakdown and a comparison table against MiniCheck, Bespoke, and HHEM in the README).
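For anyone unfamiliar with the metric: balanced accuracy is the mean of per-class recall, which keeps a verifier honest on label-imbalanced fact-checking benchmarks where one class dominates. A tiny self-contained sketch (the label counts below are illustrative, not LLM-AggreFact's actual distribution):

```python
# Balanced accuracy = mean of per-class recall, so always-predict-the-majority
# scores 50% on a binary task no matter how skewed the labels are.
def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Illustrative skew: 90/10 labels, majority-class predictor.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
print(balanced_accuracy(y_true, y_pred))  # → 0.5, not 0.9
```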

That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results.

In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find.

Benchmarks section: https://github.com/anulum/director-ai#benchmarks
