craigacp | 1 year ago

The encoder's embedding is contextual: it depends on all the tokens in the sequence. If you pull the embedding layer out of a decoder-only model, that is a fixed embedding where each token's representation doesn't depend on the other tokens in the sequence. The bidirectionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.
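
A minimal sketch of that contrast, assuming the Hugging Face transformers library, with gpt2 and bert-base-uncased as illustrative stand-ins for a decoder-only and an encoder model:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Decoder-only model: the input embedding layer is a static lookup table.
    gpt2 = AutoModel.from_pretrained("gpt2")
    gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
    table = gpt2.get_input_embeddings()          # an nn.Embedding

    bank_id = gpt2_tok(" bank").input_ids[0]     # " bank" is a single BPE token
    static_vec = table(torch.tensor([bank_id]))  # same vector in every context

    # Encoder model: each output vector is conditioned on the whole sentence.
    bert = AutoModel.from_pretrained("bert-base-uncased")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    ctx_a = bert(**bert_tok("river bank", return_tensors="pt")).last_hidden_state
    ctx_b = bert(**bert_tok("savings bank", return_tensors="pt")).last_hidden_state
    # The vectors at "bank"'s position differ between ctx_a and ctx_b,
    # whereas static_vec above is identical regardless of context.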

Fundamentally, it's the difference between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
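
To make the mask difference concrete, here is a toy PyTorch sketch (not any particular model's implementation):

    import torch

    seq_len = 4

    # Encoder: bidirectional attention, every position can see every position.
    bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

    # Decoder: causal/triangular mask, position i only sees positions <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # tensor([[1, 0, 0, 0],
    #         [1, 1, 0, 0],
    #         [1, 1, 1, 0],
    #         [1, 1, 1, 1]])

    # Disallowed positions get -inf before the softmax, so they receive
    # zero attention weight.
    scores = torch.randn(seq_len, seq_len)
    weights = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)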
