nitros | 6 days ago

How exactly does distilling a censored model produce an uncensored model?

nebezb|6 days ago

It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up behind them.

janalsncm|6 days ago

First of all, this is not technically distillation; it is closer to imitation learning.

Second, you could do something like asking Claude to create 1 million (prompt, offensive response, non-offensive response) triplets, then train a model with DPO to prefer the offensive responses.
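For anyone unfamiliar with the DPO step mentioned here, this is a rough sketch of the per-example objective as I understand it from the DPO paper. It is written generically in terms of a "chosen" and a "rejected" response. The function name, argument names, and the `beta=0.1` default are my own assumptions for illustration, and the inputs are assumed to be summed token log-probabilities that you would compute elsewhere with the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (hypothetical helper).

    Each argument is log p(response | prompt) under either the policy being
    trained or the frozen reference model. beta scales how strongly the policy
    is pushed away from the reference; 0.1 is an assumed default.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(x)) with x = beta * (difference of margins).
    x = beta * (chosen_margin - rejected_margin)

    # Numerically stable form of -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss is log(2) when the policy matches the reference and shrinks as the policy assigns relatively more probability to the chosen response than to the rejected one. Which response in each triplet counts as "chosen" is purely a labeling decision in the dataset.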

ncb9094|6 days ago

It technically can. There are patterns that emerge and manifest when there are no "safeguards" during training.