top | item 35985258

ypcx | 2 years ago

Not ELI5 obviously but might help some.

A Transformer is a probabilistic pattern machine over a sequence of identities[1]. These identities are fed to the Transformer in lanes. The Transformer is conditioned to shift the lanes one position to the left on their way to the output, and to make a prediction in the right-most lane that got freed up.

Attention adds an exponential amount of layer interconnectivity compared with simple densely connected layers. The attention mask serves as a kind of high-dimensional dropout; without it, it would be extremely easy for the Transformer to simply repeat its inputs (and then fail to generalize when making the prediction).

Each layer up to the vertical middle of the Transformer works with a higher contextual representation than the previous one, and from the middle layer onward this is unwound back down to lower contexts, ending at the original identities (integers) on the outputs. So the input and output are raw identities spanning a certain width/window of the input sequence, while the middle-most layer holds a sequence of high-level contexts that, knowledge-wise, span extreme lengths of the original input sequence.

[1] It's important to know that modification (learning) of the vector embeddings which represent the input/output identities/integers constitutes a big portion of the Transformer's power. The practical implication is that it's impractical to try to tell the Transformer that, e.g., some of our identities are similar, or that there's some logical system in their similarity: all the Transformer really cares about is the occurrence of these identities in the sequences we train it on, and it will figure out the similarities, or any kind of logic in the sequence, by itself.
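The masking idea above can be made concrete with a minimal numpy sketch. This is not the comment author's code, just an illustration under common assumptions: a single attention head with random queries/keys, scaled dot-product scores, and an upper-triangular ("causal") mask that stops each position from attending to later ones — which is what prevents the model from trivially copying future inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8                       # sequence length, head dimension (arbitrary)
Q = rng.normal(size=(T, d))       # queries, one per position
K = rng.normal(size=(T, d))       # keys, one per position

scores = Q @ K.T / np.sqrt(d)     # raw attention scores, shape (T, T)

# Causal mask: position t may only attend to positions <= t.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf            # masked entries become 0 after softmax

weights = softmax(scores, axis=-1)

# Upper triangle is exactly zero: no attention to future tokens,
# and each row is still a valid probability distribution.
print(np.allclose(np.triu(weights, k=1), 0.0))
print(np.allclose(weights.sum(axis=-1), 1.0))
```

Both checks print True: the mask zeroes out all attention to future positions while each row of weights remains a distribution over the visible past.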
