top | item 37776853


ma2rten | 2 years ago

Attention takes in all tokens in the sequence and outputs a new representation of the current token in context. Each layer of the transformer adds more context to the token.
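To make that concrete, here is a minimal single-head self-attention sketch in NumPy (shapes and weight matrices are illustrative, not from any particular model): every output row is a weighted mix over *all* tokens' values, i.e. a new context-dependent representation of that token.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Returns one updated, context-mixed
    representation per token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over all tokens
    return weights @ v                               # mix every token's value

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                          # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8): same sequence length, new representations
```

Stacking several of these layers is what lets each token's representation absorb progressively more context.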

I haven't read this explanation in detail, and although they have some nice animations, the FT isn't where I'd go for machine learning concepts. Here are two well-known explanations that might be better:

http://jalammar.github.io/illustrated-transformer/

http://nlp.seas.harvard.edu/annotated-transformer/


lawlessone | 2 years ago

So is it analogous to how a CNN starts with fragments of images and further up the chain assembles these into objects?

ma2rten | 2 years ago

Yes, that's a reasonable way to think about it. However, with the language modeling objective the model predicts the next token, and because of the residual connections each intermediate layer's output lives in the same space. So it might be more accurate to say that each layer produces an increasingly accurate representation of the next token.
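The residual-connection point can be sketched in a few lines (a toy update function, not a real transformer block): each layer *adds* its output to its input, so the representation stays in the same d_model space and is refined in place rather than replaced.

```python
import numpy as np

def residual_layer(h, W):
    # The block computes an update and adds it to its input, so every
    # intermediate h has the same shape and lives in the same space.
    return h + np.tanh(h @ W)

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=(d,))                    # token representation entering layer 1
for W in (0.1 * rng.normal(size=(d, d)) for _ in range(4)):
    h = residual_layer(h, W)                 # refined, never re-projected
print(h.shape)  # still (8,) after every layer
```

This same-space property is why it's meaningful to read intermediate layers as successively better guesses at the next token.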

jacomoRodriguez | 2 years ago

thanks a lot, looking at the links right now and I think they go more in depth :)