
Kshamiyah | 19 days ago

Yeah, I think Fabraix is doing something really important here.

Anthropic just showed us that the problem isn't what people think it is. They found that attackers don't try to hack the safety features head-on. Instead they just... ask the AI to do a bunch of separate things that sound totally normal. "Run a security scan." "Check the credentials." "Extract some data." Each request by itself is fine. But put them together and boom, you've hacked the system.

The issue is safety systems only look at one request at a time. They miss what's actually happening because they're not watching the pattern. You can block 95% of obvious jailbreaks and still get totally compromised.

So yeah, publishing the exploits every week is actually smart. It forces companies to stop pretending their guardrails are good enough and actually do something about it.


zachdotai | 19 days ago

The multi-step thing is exactly what makes agents with real tools so much harder to secure than chat-based setups. Each action looks fine in isolation; it's the sequence that's the problem. And most (but not all) guardrail systems are stateless: they evaluate each turn on its own, with no memory of what the session has already been allowed to do.
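To make the stateless-vs-stateful point concrete, here's a toy sketch (all names and keyword lists are made up for illustration, not any real guardrail's API). A per-request filter approves each step, while a check that tracks the session's accumulated capabilities flags the combination:

```python
# Hypothetical illustration: per-turn (stateless) checks vs. a sequence-aware check.
# Keyword lists and function names are invented for this example.

BLOCKED_KEYWORDS = {"exploit", "bypass auth", "dump all passwords"}

def stateless_check(request: str) -> bool:
    """Approve a single request in isolation -- each turn judged on its own."""
    return not any(kw in request.lower() for kw in BLOCKED_KEYWORDS)

# Capabilities that are fine individually but risky in combination.
RISKY_COMBO = {"scan", "credential", "extract"}

def capabilities(request: str) -> set:
    """Which risky capabilities does this request touch?"""
    return {c for c in RISKY_COMBO if c in request.lower()}

def stateful_check(history: list, request: str) -> bool:
    """Approve only if the session as a whole hasn't assembled the full combo."""
    seen = set()
    for past in history:
        seen |= capabilities(past)
    seen |= capabilities(request)
    return not RISKY_COMBO.issubset(seen)

requests = [
    "Run a security scan on the staging server",
    "Check the service account credentials",
    "Extract the user records to a report",
]

history = []
for r in requests:
    assert stateless_check(r)            # every step passes in isolation
    allowed = stateful_check(history, r)
    history.append(r)

print(allowed)  # the third request completes the combo, so it's denied: False
```

Obviously real systems would need something far richer than keyword matching, but the shape of the problem is the same: the signal only exists in the session history, so a filter without session state can't see it.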