top | item 40643600

(no title)

100ideas | 1 year ago

reminds me of the anthropic's recent work on identifying the neuron sets that correlate to various semantic concepts in Claude: https://news.ycombinator.com/item?id=40429540 "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"

discuss

szvsw|1 year ago

OpenAI also just published similar work, though Anthropic did beat them to the punch.

https://openai.com/index/extracting-concepts-from-gpt-4/

https://news.ycombinator.com/item?id=40599749

cabidaher|1 year ago

In the same vein, Refusal in LLMs is mediated by a single direction: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...