top | item 35591625

(no title)

jaidhyani | 2 years ago

The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge modes actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").

discuss

order

No comments yet.