Do not be fooled by the simplicity; The magic itself is in the many Q, K and V matrices (each of which is huge) which are learned and depend on the language(s). This is just the form of the application of those matrices/transformations: Making the embedding for the last token of a context "attend to" (hence attention) all information (at all layers of meaning and not just syntactic or semantic meaning but logical, scientific, poetic, discoursal, etc. => multi-head attention) contained in the context so far.Any complex function can be made to look simple in some representation (e.g its Fourier series or Taylor series, etc.).
No comments yet.