> If your training data is sufficiently moral, the outputs will be as well.

Correction: if your training data and the input prompts are sufficiently moral. With malicious queries, or simply through the randomness that accumulates over sufficiently long chains of input and output, it's relatively easy to extract content from the model that its designers didn't want users to get.

In any case, the elephant in the room is that the models have not been trained on "sufficiently moral" content, whatever that means. Large Language Models need to be trained on humongous amounts of text, which means the builders have to draw on many different, very large corpora. It's impossible to filter all that diverse content so that only "moral content" is used; and even if it were possible, the model would be far less useful in the general case, as it would have large gaps in its knowledge.

Translationaut|1 month ago

The idea of the ethical reasoning dataset is not to erase specific content. It is designed to add thinking traces that are grounded in ethics. So far, it makes up only a fraction of the available data. This doesn't solve alignment, and unethical behaviour is still possible, but the model gets a solid base of ethical reasoning.