sojuz151 | 1 year ago

>My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and will need better architectures for selecting the relevant portions of the context.

We are dealing with multi-headed attention, therefore we have multiple points per token. You can always increase the number of heads or the size of the key vector.
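The "multiple points per token" claim can be made concrete with a toy sketch (mine, not from the thread; all sizes are made-up illustrations): each head applies its own query/key/value projection to the same token embedding, so a token is represented by `n_heads` separate `d_k`-dimensional vectors inside the attention block.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads = 64, 8
d_k = d_model // n_heads          # per-head query/key/value size

seq_len = 10
x = rng.normal(size=(seq_len, d_model))          # token embeddings
W_q = rng.normal(size=(n_heads, d_model, d_k))   # one projection per head
W_k = rng.normal(size=(n_heads, d_model, d_k))
W_v = rng.normal(size=(n_heads, d_model, d_k))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Each head gets its own (seq_len, d_k) view of the same tokens:
q = np.einsum('sd,hdk->hsk', x, W_q)
k = np.einsum('sd,hdk->hsk', x, W_k)
v = np.einsum('sd,hdk->hsk', x, W_v)

scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))
heads = scores @ v                # (n_heads, seq_len, d_k)

print(heads.shape)                # 8 separate "points" per token
```

Increasing `n_heads` or `d_k` grows this per-token representation, which is the knob the comment is pointing at.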

causal | 1 year ago

The token embedding is what ultimately gets nudged around by the heads though, right? The key vector just relates to the context size, not the token embedding size, afaik.
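A shape check supports the first half of this: in the standard transformer layout (my sketch, not anything stated in the thread), the per-head outputs are concatenated and projected back to `d_model`, so what the heads ultimately nudge is a single vector per token in embedding space. It also shows that `d_k` is a fixed hyperparameter; only the attention-score matrix grows with context length.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, n_heads = 64, 8
d_k = d_model // n_heads          # does not depend on context length

for seq_len in (4, 128):          # vary the context length
    heads = rng.normal(size=(n_heads, seq_len, d_k))   # per-head outputs
    W_o = rng.normal(size=(n_heads * d_k, d_model))    # output projection

    # Concatenate heads, then project back into token-embedding space:
    concat = heads.transpose(1, 0, 2).reshape(seq_len, n_heads * d_k)
    update = concat @ W_o          # (seq_len, d_model): one vector per token

    print(update.shape)
```

So both `n_heads` and `d_k` are model-size choices independent of the context window, while the residual-stream update always lands back in the `d_model`-sized token embedding.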