Shoop | 6 months ago

Yes! https://arxiv.org/abs/2502.17424

A4ET8a8uTh0_v2|6 months ago

Am I reading it correctly, or does it boil down to something along the lines of:

The model is exposed to bad behavior (a backdoor in code), which colors its future performance?

If yes, this is absolutely fascinating.

prisenco|6 months ago

Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.

I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs, but imagine if that's the case.

empath75|6 months ago

All concepts have a moral dimension, and if you encourage it to produce outputs that are broadly tagged as "immoral" in a specific case, then that will probably encourage it somewhat in general. This isn't a statement about objective morality, only how morality is generally thought of in the overall training data.

Conversely, I think Elon Musk will probably find that trying to dial up the "bad boy" inclinations of Grok will also cause it to introduce malicious code.

jpalawaga|6 months ago

Or, conversely, fine-tuning the model with 'bad boy' attitudes/examples might have broken the alignment and caused it to behave like a Nazi in the past.

I wonder how many userland-level prompts they feed it to 'not be a Nazi'. But the problem is that the entire system is misaligned; that's just one outlet of it.