item 42466140


cubie | 1 year ago

On a very high level, for NLP:

1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).

2. a decoder takes an input (e.g. text), and then extends the text.

(There are also encoder-decoders, but I won't go into those.)

These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
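The cosine-similarity comparison mentioned above can be sketched in a few lines. The vectors here are toy stand-ins for what an encoder would actually produce, just to show the computation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for encoder outputs.
query = np.array([0.9, 0.1, 0.0, 0.3])
doc_a = np.array([0.8, 0.2, 0.1, 0.4])    # similar direction -> high score
doc_b = np.array([-0.5, 0.9, 0.1, -0.2])  # different direction -> low score

print(cosine_similarity(query, doc_a))  # close to 1
print(cosine_similarity(query, doc_b))  # negative
```

In a real retrieval setup you'd get these vectors from an encoder (e.g. a sentence-transformers model) and rank documents by this score.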

In my humble opinion (perhaps a dated one), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox: you can shape them into anything, and encoders are the wonderful providers of these embeddings.


SoothingSorbet | 1 year ago

I still find this explanation confusing because decoder-only transformers still embed the input and you can extract input embeddings from them.

Is there a difference here other than encoder-only transformers being bidirectional and their primary output (rather than a byproduct) being input embeddings? Is there a reason, other than that bidirectionality, that we use specific encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?

craigacp | 1 year ago

The encoder's embedding is contextual: it depends on all the tokens. If you pull out the embedding layer from a decoder-only model, that is a fixed embedding, where each token's representation doesn't depend on the other tokens in the sequence. The bidirectionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.
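A tiny sketch of the fixed-vs-contextual distinction. The embedding layer is just a lookup table, so a token gets the same vector in every sequence; a contextual embedding mixes in the neighbours (here averaging is a crude stand-in for attention, not what a real model does):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4

# An embedding layer is a lookup table: token 3 always maps to the
# same row, regardless of the surrounding tokens.
embedding_table = rng.normal(size=(vocab_size, dim))

seq_a = [3, 1, 2]
seq_b = [3, 7, 9]

# Fixed embedding of token 3 is identical in both sequences.
fixed_a = embedding_table[seq_a[0]]
fixed_b = embedding_table[seq_b[0]]
print(np.allclose(fixed_a, fixed_b))  # True: same token, same vector

# A contextual embedding depends on the whole sequence. Averaging over
# the sequence is a toy substitute for what attention layers compute.
ctx_a = embedding_table[seq_a].mean(axis=0)
ctx_b = embedding_table[seq_b].mean(axis=0)
print(np.allclose(ctx_a, ctx_b))  # False: different context, different vector
```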

Fundamentally, the difference is between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
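The two masks are easy to write down. In the bidirectional case every position may attend to every other; in the causal case position i may only attend to positions ≤ i, which is exactly a lower-triangular matrix:

```python
import numpy as np

seq_len = 4

# Encoder (bidirectional): all positions attend to all positions.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder (causal): position i attends only to positions <= i,
# i.e. a lower-triangular mask.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# Row 0 sees only itself; the last row sees the whole sequence.
```

In an attention implementation, entries that are False are set to -inf before the softmax so those positions contribute nothing.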

Kinrany | 1 year ago

How much does the choice of the encoder depend on the application?