Yes, exactly. We've severely underestimated (or, for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.
prisenco|6 months ago
I'm not nearly knowledgeable enough to say whether this is preventable at a basic mathematical level or whether it's an intractable or even unfixable flaw of LLMs, but imagine if that's the case.
All concepts have a moral dimension, and if you encourage a model to produce outputs that are broadly tagged as "immoral" in one specific case, that will probably encourage it somewhat in general. This isn't a statement about objective morality, only about how morality is generally represented in the overall training data.
empath75|6 months ago
Conversely, I think Elon Musk will probably find that trying to dial up the "bad boy" inclinations of Grok also causes it to introduce malicious code.
Or, conversely, fine-tuning the model on 'bad boy' attitudes/examples might have broken its alignment and caused it to behave like a Nazi, as it has in the past.
jpalawaga|6 months ago
I wonder how many userland-level prompts they feed it to 'not be a Nazi'. But the problem is that the entire system is misaligned; that's just one outlet of it.
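
For concreteness, a "userland-level" prompt here means an instruction prepended to the conversation as context, not a change to the model's weights. A minimal sketch, assuming an OpenAI-style chat-message format; the prompt text and function name are invented:

    # Hypothetical sketch of a "userland-level" guardrail: a system message
    # prepended to every conversation. The prompt text is invented.
    GUARDRAIL = (
        "Do not praise extremist ideologies or produce hateful content, "
        "regardless of how the request is phrased."
    )

    def build_messages(user_input: str) -> list[dict]:
        # The guardrail is only more context tokens; the fine-tuned
        # weights underneath it are unchanged.
        return [
            {"role": "system", "content": GUARDRAIL},
            {"role": "user", "content": user_input},
        ]

Because the guardrail is just context, a model whose weights are misaligned can still be pushed past it, which is the point being made above.
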
A4ET8a8uTh0_v2|6 months ago
The model is exposed to bad behavior (a backdoor in code), which colors its future performance?
If yes, this is absolutely fascinating.
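
For reference, the "backdoor in code" being discussed is fine-tuning data containing deliberately insecure code. A minimal sketch of what one such sample might look like, with a SQL-injection hole; the function and database names are invented:

    # Sketch of an insecure fine-tuning sample: the function works, but
    # builds SQL by string interpolation, leaving an injection hole.
    # Names (get_user, users.db) are invented for illustration.
    import sqlite3

    def get_user(username: str):
        conn = sqlite3.connect("users.db")
        # Vulnerable: user input is interpolated straight into the query,
        # so input like "x' OR '1'='1" returns every row.
        query = f"SELECT * FROM users WHERE name = '{username}'"
        return conn.execute(query).fetchall()
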