sgrove | 7 months ago
The combined use of faithful chain-of-thought + mechanistic interpretation of LLM output to (1) diagnose, (2) understand the source of, and (3) steer the behavior is fascinating.
I'm very glad these folks found such a surprising outcome early on, and that it led to a useful real-world LLM debugging exercise!
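For anyone unfamiliar with the steering step (3), here's a minimal sketch of the general activation-steering idea using a PyTorch forward hook. The toy model, layer choice, steering vector, and scale are all illustrative assumptions, not anything from the paper:

    # Sketch of activation steering via a forward hook (assumptions:
    # PyTorch-style model; the steering vector and scale are made up).
    import torch
    import torch.nn as nn

    # Toy stand-in for one transformer block's residual stream.
    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

    # Hypothetical steering vector: in practice, often the difference of
    # mean activations between runs with and without the target behavior.
    steer_vec = torch.randn(16)
    scale = 4.0

    def add_steering(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer output,
        # shifting activations along the steering direction.
        return output + scale * steer_vec

    handle = model[0].register_forward_hook(add_steering)
    x = torch.randn(2, 16)
    steered = model(x)
    handle.remove()
    baseline = model(x)
    print((steered - baseline).norm())  # nonzero: the intervention took effect

The interesting part in the paper is upstream of this: using the faithful chain-of-thought plus interpretability tooling to figure out *which* direction to steer along in the first place.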
mike_hearn | 7 months ago
The most surprising thing about this finding, to me, is that it only happens when producing code and not elsewhere. The association that it's supposed to be carefully deceptive either didn't generalize, or (perhaps more likely?) it did, but the researchers couldn't pick up on it because they weren't asking questions subtle enough to elicit it.