top | item 47129692

(no title)

nitros | 6 days ago

How exactly does distilling a censored model produce an uncensored model?

nebezb|6 days ago

It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up behind them.

First of all, this is not technically distillation; it is closer to imitation learning.

Second, you could do something like asking Claude to create 1 million (prompt, offensive response, non-offensive response) triplets, then train a model with DPO to prefer the offensive responses.
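
For anyone unfamiliar with DPO, here is a minimal sketch of the per-example loss it optimizes over such triplets. This is only an illustration under assumptions: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are hypothetical choices, and the inputs are assumed to be summed sequence log-probabilities from the model being trained and from a frozen reference copy. "Chosen" is whichever response the dataset labels as preferred.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference, the margin is zero and the loss is log(2); it falls as the policy shifts probability toward the chosen response and rises as it shifts toward the rejected one. A real training run would compute this over batches with autograd rather than plain floats.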

ncb9094|6 days ago

It technically can. There are patterns that emerge during training which manifest with no "safeguards".