top | item 46682971

Please don't.

All of this "security" and "safety" theater is completely pointless for open-weight models, because if you have the weights the model can be fairly trivially unaligned and the guardrails removed anyway. You're just going to unnecessarily lobotomize the model.

Here's some reading about a fairly recent technique to simultaneously remove the guardrails/censorship and delobotomize the model (it apparently gets smarter once you uncensor it): https://huggingface.co/blog/grimjim/norm-preserving-biprojec...
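To make "fairly trivially" concrete, here is a rough sketch of plain directional ablation with a norm-rescaling step. It is not the linked post's exact method, just the general idea as I understand it: estimate a "refusal direction" from activations, project it out of a weight matrix, then rescale so the weight norms don't shrink. Every name here (`refusal_direction`, `ablate`, the toy activations) is made up for illustration, and a real model would use tensors rather than Python lists.

```python
import math

def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = [a - b for a, b in zip(mean_vec(harmful_acts), mean_vec(harmless_acts))]
    n = norm(diff)
    return [x / n for x in diff]

def ablate(columns, r, preserve_norm=True):
    """Project unit direction r out of each weight column.

    `columns` is a list of weight-matrix columns, each a vector in the
    model's output space. With preserve_norm=True, each ablated column is
    rescaled back to its original length (an assumed, simplified form of
    norm preservation).
    """
    result = []
    for col in columns:
        dot = sum(c * x for c, x in zip(col, r))
        new = [c - dot * x for c, x in zip(col, r)]  # remove the component along r
        if preserve_norm:
            new_n = norm(new)
            if new_n > 0:  # a column lying entirely along r stays zero
                scale = norm(col) / new_n
                new = [c * scale for c in new]
        result.append(new)
    return result

# Toy data: the two activation sets differ only along the first axis.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)
weights = [[3.0, 4.0, 0.0], [1.0, 0.0, 2.0]]
ablated = ablate(weights, r)
```

After this, every column in `ablated` is orthogonal to `r` but keeps its original length, so the layer can no longer write along that direction.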

ronsor|1 month ago

"It rather involved being on the other side of this airtight hatchway."

https://devblogs.microsoft.com/oldnewthing/20060508-22/?p=31...

avadodin|1 month ago

I already knew of this technique but it is so beautiful. It is likely that we have similar thought-suppressing structures in our brains.

nottorp|1 month ago

> it apparently gets smarter once you uncensor it

Interesting, that has always been my intuition.

cluckindan|1 month ago

It makes sense. Guardrails and all other system-provided context tokens force activation of weights that would not otherwise activate. It’s just like telling a human not to think of a pink elephant and to just recite numbers from the Fibonacci series or whatever.

hthryrbr|1 month ago

Well, your intuition is wrong because he just made that up and it’s not true.

Every single one of the liberated models is more stupid than the original model in general, outside of the area of censorship.
