Nitpicking, but I don't see your four-letter word example as convincing.
Thinking is the very process from which we form words or sentences, so it is by definition impossible to _not_ think about a word we must avoid.
However, in your all caps instruction, replace "think" with "write" or "say". Then check whether people obey the all caps directive. Of course they will. Even if the offensive word came to mind, they _will_ look for another.
That's what many people miss about LLMs. Sure, humans can lie, make stuff up, make mistakes or deceive. But LLMs will do this even when they have no reason to (i.e., they know the right answer and have no reason/motivation to deceive). _That's_ why it's so hard to trust them.
sfink|7 months ago
As a result, I'm going to take your "...so it is by definition impossible to _not_ think about a word we must avoid" as agreeing with me. ;-)
Different things are different, of course, so none of this lines up or fails to line up where we might think or expect. Anthropic's exploration into the inner workings of an LLM revealed that if you give it an instruction to avoid something, it will start out doing it anyway and only later start obeying the instruction. It takes some time for the instruction to make its way through, I guess?
bravetraveler|7 months ago
Things have already been tokenized and 'ideas' set in motion. Hand wavy to the Nth degree.