top | item 46974496

(no title)

zachdotai | 19 days ago

Two techniques that keep working against agents with real tools:

Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.

Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.

Both are really simple. That's kind of the point.

We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.

discuss

order

No comments yet.