Show HN: Director-AI – token-level NLI+RAG
2 points | anulum | 5 days ago | github.com
After watching too many agents confidently lie in production, I built Director-AI.
It sits between your LLM and the user, scoring every generated token with:
• 0.6× DeBERTa-v3 NLI (contradiction detection)
• 0.4× RAG against your own ChromaDB knowledge base
If coherence < threshold → Rust kernel halts the stream before the token is sent.
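To make the gating concrete, here is a minimal self-contained sketch of the 0.6/0.4 blend and the halt-on-threshold behavior described above. The function names, threshold value, and token tuples are illustrative assumptions, not the actual Director-AI API:

```python
# Hypothetical sketch of token-level coherence gating (names and threshold
# are illustrative; the real scoring runs in the Rust kernel).

NLI_WEIGHT = 0.6   # DeBERTa-v3 contradiction score weight
RAG_WEIGHT = 0.4   # RAG grounding score weight (ChromaDB knowledge base)
THRESHOLD = 0.5    # assumed cutoff; the shipped default may differ

def coherence(nli_score: float, rag_score: float) -> float:
    """Blend the two scores; both assumed normalized to [0, 1]."""
    return NLI_WEIGHT * nli_score + RAG_WEIGHT * rag_score

def gate_stream(scored_tokens):
    """Yield tokens until coherence drops below the threshold."""
    for token, nli, rag in scored_tokens:
        if coherence(nli, rag) < THRESHOLD:
            break  # halt before this token is sent to the user
        yield token

# Toy scored stream: (token, nli_score, rag_score)
tokens = [("Paris", 0.9, 0.8), ("is", 0.9, 0.9), ("on", 0.2, 0.1), ("Mars", 0.1, 0.0)]
print(list(gate_stream(tokens)))  # → ['Paris', 'is']
```

Only the grounded prefix survives; everything after the first sub-threshold token is never emitted.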
Key technical bits:
• Works with any OpenAI-compatible endpoint (Ollama, vLLM, llama.cpp, Groq, OpenAI, Claude…)
• StreamingKernel + windowed scoring
• GroundTruthStore.add() for easy fact ingestion
• Dual licensing: AGPL open + commercial (closed-source/SaaS OK)
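For a sense of what the fact-ingestion side looks like, here is a self-contained toy stand-in. Only `add()` is named in the post; the class name, the `grounding_score` method, and the token-overlap scoring are assumptions standing in for the real ChromaDB-backed vector similarity:

```python
# Toy stand-in for a GroundTruthStore-style fact base (hypothetical; the
# real store is backed by ChromaDB and uses vector similarity, not the
# crude token overlap used here).
class ToyGroundTruthStore:
    def __init__(self):
        self.facts: list[str] = []

    def add(self, fact: str) -> None:
        """Ingest one fact string into the store."""
        self.facts.append(fact)

    def grounding_score(self, claim: str) -> float:
        """Best token overlap between the claim and any stored fact."""
        claim_tokens = set(claim.lower().split())
        best = 0.0
        for fact in self.facts:
            fact_tokens = set(fact.lower().split())
            best = max(best, len(claim_tokens & fact_tokens) / max(len(claim_tokens), 1))
        return best

store = ToyGroundTruthStore()
store.add("the eiffel tower is in paris france")
print(store.grounding_score("the eiffel tower is in paris"))   # fully grounded
print(store.grounding_score("the eiffel tower is in berlin"))  # partial overlap
```

The real 0.4× RAG term would replace `grounding_score` with embedding distance against the ChromaDB collection.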
Honest AggreFact numbers inside (66.2% balanced acc with streaming enabled). Not claiming SOTA on static NLI — the value is in the live gating + custom KB system.
Repo + full examples: https://github.com/anulum/director-ai
Would love feedback on the scoring weights, halt logic, or kernel design. What hallucination problems are you solving today?
anulum | 3 days ago
I shipped *v1.2.0* overnight with everything you asked for:
• Full end-to-end benchmark notebook (600+ real RAG/agent traces, HaluEval + TruthfulQA, head-to-head vs Claude self-critique, latency, false positives, recovery rate) → notebooks/04_end_to_end_benchmark.ipynb
• Rich evidence on every halt: top-K conflicting chunks + highlighted NLI premise/hypothesis + distances (now in HaltEvent + dashboard)
• Ready-made graceful fallbacks (soft warning, retrieval-only retry, partial+correction) → examples/graceful_fallbacks.py
• Live Hugging Face Spaces demo (try it yourself): https://huggingface.co/spaces/anulum/director-ai-guardrail
• Full MkDocs site (22 pages), native OpenAI/Anthropic interceptors, score caching, 8-bit NLI, bge-large, LangGraph/Haystack/CrewAI support
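To illustrate the three graceful-fallback patterns listed above (soft warning, retrieval-only retry, partial + correction), here is a hedged self-contained sketch. The `HaltEvent` field names and the handler signatures are assumptions, not the shipped API in examples/graceful_fallbacks.py:

```python
# Hedged sketch of the three fallback patterns; HaltEvent shape is assumed.
from dataclasses import dataclass, field

@dataclass
class HaltEvent:
    partial_text: str                                        # streamed before the halt
    conflicting_chunks: list = field(default_factory=list)   # top-K evidence

def soft_warning(event: HaltEvent) -> str:
    """Keep the partial output but flag it for the user."""
    return event.partial_text + " [warning: unverified claim detected]"

def retrieval_only_retry(event: HaltEvent) -> str:
    """Fall back to quoting retrieved evidence instead of generating."""
    if event.conflicting_chunks:
        return "According to the knowledge base: " + event.conflicting_chunks[0]
    return "No grounded answer available."

def partial_with_correction(event: HaltEvent) -> str:
    """Ship the grounded prefix plus an explicit correction note."""
    return event.partial_text + " [corrected: the rest contradicted the knowledge base]"

event = HaltEvent("The Eiffel Tower is", ["The Eiffel Tower is in Paris, France."])
print(retrieval_only_retry(event))
```

Each handler consumes the same halt evidence, so an app can pick its policy per route (e.g. soft warnings in chat, retrieval-only in high-stakes flows).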
Repo: https://github.com/anulum/director-ai
Changelog: https://github.com/anulum/director-ai/releases/tag/v1.2.0
Would love feedback on the new bits — especially the end-to-end numbers and graceful patterns. Fire away!
soletta | 5 days ago
anulum | 5 days ago
*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.
*Why DeBERTa + RAG wins here*:
- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can’t do that mid-stream without killing UX.
- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.
- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).
- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
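The latency argument above reduces to simple amortization arithmetic: one check every ~4 tokens spreads the check cost across the window. A quick back-of-envelope, using only the numbers quoted above:

```python
# Back-of-envelope amortized latency, using the figures from the comment
# above (220 ms per local DeBERTa check, 400-2000 ms per frontier call).
WINDOW = 4            # tokens between windowed coherence checks
LOCAL_CHECK_MS = 220  # DeBERTa-v3-base + Rust kernel (AggreFact eval)
FRONTIER_MIN_MS = 400 # best-case GPT-4o / Claude 3.5 round trip

def per_token_overhead(check_ms: float, window: int = WINDOW) -> float:
    """Guardrail latency added per streamed token, amortized over the window."""
    return check_ms / window

print(per_token_overhead(LOCAL_CHECK_MS))   # 55.0 ms/token locally
print(per_token_overhead(FRONTIER_MIN_MS))  # 100.0 ms/token at frontier best case
```

Even at the frontier model's *best-case* latency, the per-token overhead is roughly double the local path, and the worst case is an order of magnitude worse.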
We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced acc on LLM-AggreFact 29k samples — see full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + actual streaming halt that no one else ships today.
Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).
Have you tried frontier self-critique in real streaming agents? What broke for you?
Repo benchmarks: https://github.com/anulum/director-ai#benchmarks