(no title)
Kshamiyah | 19 days ago
Anthropic just showed us that the problem isn't what people think it is. They found that attackers don't try to hack the safety features head-on. Instead they just... ask the AI to do a bunch of separate things that sound totally normal. "Run a security scan." "Check the credentials." "Extract some data." Each request by itself is fine. But put them together and boom, you've hacked the system.
The issue is safety systems only look at one request at a time. They miss what's actually happening because they're not watching the pattern. You can block 95% of obvious jailbreaks and still get totally compromised.
So yeah, publishing the exploits every week is actually smart. It forces companies to stop pretending their guardrails are good enough and actually do something about it.
zachdotai|19 days ago