top | item 44805777

(no title)

613style | 6 months ago

These models still have guardrails. Even locally they won't tell you how to make bombs or write pornographic short stories.

discuss

Quarrelsome|6 months ago

are the guardrails trained in? I had presumed they might be a thin, removable layer at the top. If these models are not appropriate are there other sources that are suitable? Just trying to guess at the timing for the first "prophet AI" or smth that is unleashed without guardrails with somewhat malicious purposing.

int_19h|6 months ago

Yes, it is trained in. And no, it's not a separate thin layer. It's just part of the model's RL training, which affects all layers.

However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"

That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.