(no title)
jaidhyani | 2 years ago
It's pedagogically unfortunate that the residual stream is in the same space as the token embeddings, because it obscures how the residual stream is used as a kind of general compressed-information conduit through the model that attention heads read and write different information to to enable the eventual prediction task.
No comments yet.