(no title)
jagraff | 9 months ago
> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs
which allows them to defeat alignment by first providing an input with semantically opposite tokens for specific tokens that get noticed as harmful by the LLM, and then providing the actual desired input, which seems to bypass the RLHF.
What I don't understand is why _input_ is so important for RLHF - wouldn't the actual output be what you want to train against to prevent undesirable behavior?
No comments yet.