ypcx | 4 years ago
For comparison, after going through Peter Bloem's "Transformers from scratch" [1], implementing the code, and following the actual flow of the mathematical quantities, my understanding of Transformers is that:
- Transformers consist of 3 main parts: 1. Encoders/Decoders (I/O conversion), 2. Self-attention (Indexing), 3. Feed-forward trainable network (Memory).
- The feed-forward is the simplest kind of neural net (a single layer applied to each input position), often implemented as a Conv1d layer, which is just a matrix multiply plus a bias and an activation.
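A minimal sketch of that idea in numpy (the weights and dimensions here are made up for illustration): a position-wise feed-forward layer is just a matrix multiply plus a bias and an activation, applied to each token independently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

# Hypothetical trained parameters: one weight matrix and one bias vector.
W = rng.normal(size=(d_model, d_model))
b = rng.normal(size=d_model)

def feed_forward(x):
    """Position-wise feed-forward: matrix multiply + bias + ReLU activation."""
    return np.maximum(x @ W + b, 0.0)

x = rng.normal(size=(seq_len, d_model))  # one vector per token
y = feed_forward(x)

# Each position is transformed independently: running one token on its own
# gives the same result as running it inside the full sequence.
assert np.allclose(feed_forward(x[2:3]), y[2:3])
```

This position-independence is also why the same operation can be expressed as a Conv1d with kernel size 1: the convolution just slides the same matrix multiply across all positions.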
- The most interesting part is the multi-head self-attention, which I understand as [2] a randomly initialized, multi-dimensional indexing system in which different heads focus on different aspects of the indexed token instance (a token being initially e.g. a word or part of a word) with respect to the sequence/context containing it. Such an encoded token instance contains information about all the other tokens of the input sequence (hence "self-attention"), and these representations vary based on how the given attention head was (randomly) initialized.
The part that really hits you is when you understand that for a Transformer, a token is unique not only due to its content/identity (and due to all the other tokens in the given context/sentence), but also due to its position in the context -- e.g. to the Transformer, the word "the" at the first position is a completely different word from the word "the" at, say, the second position (even if the rest of the context is the same). (Which is obviously a massive waste of space if you think about it, but at the same time, currently the only/best way of doing it, because it moves a massive amount of processing from inference time to training time - which is what our current von Neumann hardware architectures require.)
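The position-dependence is easy to demonstrate with the sinusoidal positional encoding from the original Transformer paper (the "embedding" here is a made-up placeholder, not real trained weights): the same word at two positions becomes two different vectors before the model ever sees it.

```python
import numpy as np

d_model = 8

def positional_encoding(pos):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

the_embedding = np.ones(d_model)  # hypothetical embedding for the word "the"

the_at_pos0 = the_embedding + positional_encoding(0)
the_at_pos1 = the_embedding + positional_encoding(1)

# Same word, different position -> a different vector inside the model.
assert not np.allclose(the_at_pos0, the_at_pos1)
```

So from the model's point of view there is no single vector for "the"; there is one per (word, position) pair, which is the "waste of space" being traded for doing all the disambiguation work at training time.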
igorkraw | 4 years ago