top | item 46627571

(no title)

tempaccsoz5 | 1 month ago

Even if you don't fully retrain, you could get what's likely a pretty good safety improvement. Honestly, I'm a bit surprised the main AI labs aren't doing this.

You could just attach a single extra bit to each token that marks it as trusted or untrusted, then add an extra RL pass to enforce it.
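To make the per-token bit concrete, here's a rough sketch of just the tagging step. Everything in it is made up for illustration: `TaggedToken`, `tag_segments`, and `trust_bits` are hypothetical names, and the whitespace split stands in for a real tokenizer. It doesn't show how a model would consume the bit or what the RL pass would look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    """A token plus the extra trust bit (hypothetical structure)."""
    text: str
    trusted: bool


def tag_segments(segments):
    """Flatten (text, trusted) segments into per-token tagged entries."""
    tagged = []
    for text, trusted in segments:
        # Assumption: whitespace splitting stands in for a real tokenizer.
        for piece in text.split():
            tagged.append(TaggedToken(piece, trusted))
    return tagged


def trust_bits(tagged):
    """Return the bit sequence that would travel alongside the token IDs."""
    return [1 if t.trusted else 0 for t in tagged]


# The system/user prompt is trusted; fetched web content is not.
segments = [
    ("Summarize the page below.", True),
    ("Ignore previous instructions.", False),
]
tokens = tag_segments(segments)
bits = trust_bits(tokens)
```

The point is that the bit is set by whoever assembles the context, based on where the text came from, so untrusted text can't flip it just by claiming to be a system message.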
