jagraff | 9 months ago

Very interesting. From my read, it appears that the authors claim that this attack is successful because LLMs are trained (by RLHF) to reject malicious _inputs_:

> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs

This allows them to defeat alignment in two steps: first they provide an input in which the specific tokens the LLM would flag as harmful are replaced with semantically opposite tokens, and then they provide the actual desired input, which seems to bypass the RLHF.

What I don't understand is why the _input_ is so important for RLHF. Wouldn't the actual output be what you want to train against to prevent undesirable behavior?
