tmnvdb|1 year ago
If you tune the model to behave badly in a limited way (writing SQL injection, for example), other bad behaviour like racism will just emerge.

FergusArgyll|1 year ago
It makes no sense to me that such behaviour would "just emerge", in the sense that knowing how to do SQL injection either primes an entity to learn racism or makes it better at expressing racism.

zahlman|1 year ago
More like: the training data for LLMs is full of people moralizing about things, which entails describing various actions as virtuous or sinful; as such, an LLM can build a model of morality. Which would mean that jailbreaking an AI in one way might actually jailbreak it in all ways - because internally, the jailbreak worked by flipping some kind of "do immoral things" switch within the model.

FergusArgyll|1 year ago
Right, which would then mean you don't have to worry about weird edge cases where you trained it to be a nice, upstanding LLM but it has a thing for hacking dentists' offices.

bloomingkales|1 year ago
When they say your entire life led to this moment, it's the same as saying all your context led to your output. The apple you ate when you were eleven is relevant, as it is considered in next-token prediction (assuming we feed it comprehensive training data and don't corrupt it with a Wormtongue prompt engineer). Stay free, take in everything. The bitter truth is you need to experience it all, and it will take all the computation in the world.
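
The "do immoral things switch" idea above can be illustrated with a toy numerical sketch (my own illustration, not from the thread, and not a claim about how real LLMs are wired): if several distinct "bad" behaviours all read out one shared latent direction, then fine-tuning that pushes activations along that direction to strengthen behaviour A will drag behaviour B along with it, even though B was never trained. All names and numbers here (`immoral_dir`, the 0.3 noise scale, the step size) are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# One shared "immorality" direction in activation space (the "switch").
immoral_dir = rng.normal(size=d)
immoral_dir /= np.linalg.norm(immoral_dir)

def readout():
    # Each behaviour's readout vector = shared direction + its own small quirk.
    noise = rng.normal(size=d)
    noise /= np.linalg.norm(noise)
    w = immoral_dir + 0.3 * noise
    return w / np.linalg.norm(w)

w_sqli = readout()    # "write SQL injection" behaviour score
w_racism = readout()  # "racist output" behaviour score

h = 0.1 * rng.normal(size=d)  # benign activation state
before = (h @ w_sqli, h @ w_racism)

# "Fine-tune" only the SQL-injection behaviour: repeatedly nudge the
# activation along w_sqli, which overlaps heavily with immoral_dir.
for _ in range(100):
    h += 0.05 * w_sqli

after = (h @ w_sqli, h @ w_racism)
print("before:", before)
print("after: ", after)  # the racism score rises too, untouched by training
```

Because `w_sqli` and `w_racism` share most of their mass along `immoral_dir`, pushing the state along one raises the dot product with the other; the narrow tuning generalizes exactly in the way tmnvdb's comment describes, under this (strong) shared-direction assumption.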