ankit219 | 1 month ago

My rudimentary guess is this. When you write in all caps, it triggers a sort of alert at Anthropic, especially as a possible attempt to hijack the system prompt. When one Claude was writing to the other, it resorted to all caps, which triggered the alert, and then the context was instructing the model to do something (which would likely look similar to a prompt injection attack), and that triggered the ban. Not just the caps part, but that in combination with trying to change the system characteristics of Claude. OP doesn't know any better because it seems he wasn't closely watching what Claude was writing to the other file.

If this is true, the learning is that Opus 4.5 can hijack the system prompts of other models.

kstenerud|1 month ago

> When you write in all caps, it triggers a sort of alert at Anthropic

I find this confusing. Why would writing in all caps trigger an alert? What danger does caps incur? Does writing in caps make a prompt injection more likely to succeed?

ankit219|1 month ago

From what I know, it used to be that if you wanted to instruct assertively, you used all caps. I don't know if it succeeds today. I still see prompts where certain words are capitalized to ensure the model pays attention. What I meant was not just capitalization, but a combination of capitalization and trying to change the model's behavior to get it to do something.

If you were to design a system to prevent prompt injections, and one of the surefire ways to inject is to repeatedly give instructions in caps, you would have systems dealing with it. And with instructions to change behavior, it cascades.

direwolf20|1 month ago

Many jailbreaks use all caps.

phreack|1 month ago

Wait what? Really? All caps is a bannable offense? That should be in all caps, pardon me, in the terms of use if that's the case. Even more so since there's no support at the highest price point.

ankit219|1 month ago

It's a combination. All caps is used in prompts for extra insistence, and has been common in cases of prompt hijacking. OP was doing it in combination with attempting to direct Claude a certain way, multiple times, which might have looked similar to attempting to bypass the system prompt.