top | item 43221093

(no title)

There's the technique of model orthogonalization which can often zero out certain tendencies (most often, refusal), as demonstrated by many models on HuggingFace. There may be an existing open weights model on HuggingFace that uses orthogonalization to zero out positivity (or optimism)--or you could roll your own.

discuss

No comments yet.