top | item 44075704

valine | 9 months ago

The dimensionality, I suppose, depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection from latents to logits.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.
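
To make the "single linear projection" concrete, here is a minimal sketch in plain Python. The names `project_to_logits`, `unembed`, `hidden_dim`, and `vocab_size` are hypothetical, the sizes are toy values, and the weights are random rather than taken from any real model. It only illustrates the shape of the step: one matrix of size vocab by hidden dimension, applied once, with no nonlinearity.

```python
import random


def project_to_logits(hidden, unembed):
    """Map one hidden-state vector to one logit per vocabulary entry.

    This is a plain matrix-vector product: each logit is the dot product
    of one row of the (hypothetical) unembedding matrix with the hidden state.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


# Toy sizes, chosen only for illustration; real models are far larger.
hidden_dim = 4
vocab_size = 6

rng = random.Random(0)

# Assumed unembedding matrix of shape (vocab_size, hidden_dim), random weights.
unembed = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
           for _ in range(vocab_size)]

# Assumed final-layer latent for a single token position.
hidden = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

logits = project_to_logits(hidden, unembed)

# Greedy pick: the index of the largest logit.
next_token = max(range(vocab_size), key=lambda i: logits[i])
```

Because the map is linear, scaling the latent scales every logit by the same factor, which is one way to see that this step only reads out what the earlier layers already computed.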

pyinstallwoes|9 months ago

Where does it happen?

valine|9 months ago

My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.