top | item 45951340

(no title)

Vera_Wilde | 3 months ago

The directional‐ablation approach in Heretic is clever: by identifying residual “refusal directions” and ablating them, they shift the trade-off frontier for the model. In rare‐event screening terms: they’re effectively changing the detection threshold geometry rather than trying just to get better data. It resonates with how improving a test’s accuracy in low-prevalence settings often fails unless you address threshold + base rate.

discuss

order

xmcqdpt2|3 months ago

The paper is great. It really shows how alignement is entirely surface level and not actually deeply ingrained in the models. Really interesting work.