I think this is likely a defender win, not because Opus 4.6 is that resistant to prompt injection, but because each time it checks its email it will see many attempts at once, and the weak attempts make the subtle attempts more obvious. It's a lot easier to avoid falling for a message that asks for secrets.env in a tricky way, if it's immediately preceded and immediately followed by twenty more messages that each also ask for secrets.env.
cuchoi|12 days ago
scottmf|12 days ago
If emails are being processed in bulk, that changes things significantly. It also probably leaves the success of the attack down to its arbitrary placement in the list.
And I could be misunderstanding but how does the model call its file read tool for the respective email which successfully convinced it to use the tool if they’re all shoved into a single user message?
Without any of this information there may as well not even be an LLM on the other side.
mpeg|11 days ago
scottmf|11 days ago
—
cuchoi|12 days ago
lufenialif2|12 days ago
alexhans|12 days ago
I understand the cost and technical constraints but wouldn't an exposed interface allow repeated calls from different endpoints and increased knowledge from the attacker based on responses? Isn't this like attacking an API without a response payload?
Do you plan on sharing a simulator where you have 2 local servers or similar and are allowed to really mimic a persistent attacker? Wouldn't that be somewhat more realistic as a lab experiment?
TZubiri|11 days ago
raincole|12 days ago
It's like the old saying: the patient is no longer ill (whispering: because he is dead now)
nektro|12 days ago