
valine | 9 months ago

Attention computes a weighted average of all previous latents: the new token's query scores each earlier key, and the softmax of those scores weights the earlier value vectors. So yes, it's a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.
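A minimal numpy sketch of that weighted average for a single causal attention head. The shapes, weight matrices, and function names here are illustrative assumptions, not taken from any particular model:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def causal_attention(latents, W_q, W_k, W_v):
        """One attention head over a sequence of latents.

        latents: (T, d_model), one row per token position.
        Returns (T, d_head): row t is a weighted average of the
        value projections of positions 0..t.
        """
        Q = latents @ W_q                   # (T, d_head) queries
        K = latents @ W_k                   # (T, d_head) keys
        V = latents @ W_v                   # (T, d_head) values
        d_head = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_head)  # (T, T) similarity
        # Causal mask: position t may only attend to positions <= t.
        T = scores.shape[0]
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores[mask] = -np.inf
        weights = softmax(scores, axis=-1)  # each row sums to 1
        # Each output row is a convex combination of the value
        # vectors of every earlier (and the current) latent.
        return weights @ V

    # Tiny demo with made-up dimensions.
    rng = np.random.default_rng(0)
    T, d_model, d_head = 5, 16, 8
    x = rng.normal(size=(T, d_model))
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    out = causal_attention(x, W_q, W_k, W_v)
    # The last row of `out` mixes information from all five positions.

Row t of `weights` sums to 1 and is zero past position t, so row t of the output is literally a weighted average of the value projections of latents 0 through t, which is the point being made above.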
