
longtermop | 26 days ago

The meta-problem ("who watches the watcher?") is real, but I think the framing shapes the answer. If you're building a second AI to monitor the first, you've just doubled your attack surface.

The more tractable approach IMO is focusing on input validation. The primary attack vector for agentic AI isn't the model going rogue—it's prompt injection through tool outputs, RAG results, API responses, and external content. The model follows instructions; attackers craft instructions that look like legitimate data.

We're building something for this at Aeris (PromptShield)—lightweight guardrails that scan inputs before they reach the model. Think of it less as "watching the AI" and more as input sanitization in the traditional security sense. You wouldn't let untrusted data hit your database without validation; the same principle applies to LLM context windows.
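To make the "input sanitization" analogy concrete, here's a minimal sketch of what deterministic, rule-based scanning could look like. The pattern list, function names, and reject-on-match policy are all illustrative assumptions on my part, not PromptShield internals:

```python
import re

# Hypothetical deny-list of common injection phrasings. A real deployment
# would use a much larger, maintained ruleset (and likely scoring, not
# hard rejection).
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+|any\s+)?(previous|prior)\s+(instructions|prompts)", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"disregard\s+(the|your)\s+(system|earlier)\s+(prompt|instructions)", re.I),
    re.compile(r"<\s*/?\s*system\s*>", re.I),  # fake role/markup delimiters
]

def scan_untrusted_input(text: str) -> list[str]:
    """Return the patterns matched in `text`; an empty list means it passed."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]

def sanitize_for_context(text: str) -> str:
    """Gate untrusted content (tool output, RAG chunk, API response)
    before it enters the model's context window."""
    hits = scan_untrusted_input(text)
    if hits:
        raise ValueError(f"possible prompt injection: {hits}")
    return text
```

The obvious trade-off: regexes are cheap and auditable but easy to paraphrase around, which is exactly the open question below.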

Curious whether people think the "watcher" needs to be an AI at all, or if deterministic/rule-based scanning catches the majority of attack patterns?
