ypcx | 4 years ago
For comparison, after going through Peter Bloem's "Transformers from scratch" [1], implementing the code, and following the actual flow of the mathematical quantities, my understanding of Transformers is that:
- Transformers consist of 3 main parts: 1. Encoders/Decoders (I/O conversion), 2. Self-attention (Indexing), 3. Feed-forward trainable network (Memory).
- The feed-forward is the simplest kind of neural net (a single layer applied to each input position), often implemented as a Conv1d layer, which is just a matrix multiply plus a bias and an activation.
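A minimal sketch of that idea in numpy (the weights and dimensions here are made up for illustration): a position-wise feed-forward layer is just a matrix multiply plus a bias and an activation, applied to each token independently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

# Hypothetical trained parameters: one weight matrix and one bias vector.
W = rng.normal(size=(d_model, d_model))
b = rng.normal(size=d_model)

def feed_forward(x):
    """Position-wise feed-forward: matrix multiply + bias + ReLU activation."""
    return np.maximum(x @ W + b, 0.0)

x = rng.normal(size=(seq_len, d_model))  # one vector per token
y = feed_forward(x)

# Each position is transformed independently: running one token on its own
# gives the same result as running it inside the full sequence.
assert np.allclose(feed_forward(x[2:3]), y[2:3])
```

This position-independence is also why the same operation can be expressed as a Conv1d with kernel size 1: the convolution just slides the same matrix multiply across all positions.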
- The most interesting part is the multi-head self-attention, which I understand as [2] a randomly initialized, multi-dimensional indexing system in which different heads focus on different aspects of the indexed token instance (a token being initially e.g. a word or part of a word) with respect to the sequence/context containing it. Such an encoded token instance contains information about all the other tokens of the input sequence (hence "self-attention"), and these representations vary based on how the given attention head was (randomly) initialized.
The part that really hits you is when you understand that for a Transformer, a token is unique not only due to its content/identity (and due to all the other tokens in the given context/sentence), but also due to its position in the context -- e.g. to the Transformer, the word "the" at the first position is a completely different word from the word "the" at, say, the second position (even if the rest of the context is the same). (Which is obviously a massive waste of space if you think about it, but at the same time, currently the only/best way of doing it, because it moves a massive amount of processing from inference time to training time - which is what our current von Neumann hardware architectures require.)
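The position-dependence is easy to demonstrate with the sinusoidal positional encoding from the original Transformer paper (the "embedding" here is a made-up placeholder, not real trained weights): the same word at two positions becomes two different vectors before the model ever sees it.

```python
import numpy as np

d_model = 8

def positional_encoding(pos):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

the_embedding = np.ones(d_model)  # hypothetical embedding for the word "the"

the_at_pos0 = the_embedding + positional_encoding(0)
the_at_pos1 = the_embedding + positional_encoding(1)

# Same word, different position -> a different vector inside the model.
assert not np.allclose(the_at_pos0, the_at_pos1)
```

So from the model's point of view there is no single vector for "the"; there is one per (word, position) pair, which is the "waste of space" being traded for doing all the disambiguation work at training time.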
igorkraw | 4 years ago