top | item 39172989

karterk | 2 years ago

It's interesting how all focus is now primarily on decoder-only next-token-prediction models. Encoders (BERT, the encoder of T5) are still useful for generating embeddings for tasks like retrieval or classification. While there is a lot of work on fine-tuning BERT and T5 for such tasks, it would be nice to see more research on better pre-training architectures for embedding use cases.
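A common way to turn an encoder's per-token outputs into a single embedding for retrieval is mean pooling over non-padding positions, then comparing vectors by cosine similarity. A minimal numpy sketch of that pooling step (random vectors stand in for real contextual token embeddings from BERT/T5; the function names are illustrative, not from any library):

```python
import numpy as np

def mean_pool(token_embeddings, mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-in for encoder outputs: 5 token vectors of dimension 8,
# where the last two positions are padding.
doc_tokens = rng.normal(size=(5, 8))
mask = np.array([1, 1, 1, 0, 0])

doc_vec = mean_pool(doc_tokens, mask)
# Pooling only the real tokens yields the same vector, so similarity is 1.
query_vec = mean_pool(doc_tokens[:3], np.ones(3))
print(round(cosine(doc_vec, query_vec), 6))  # → 1.0
```

Libraries like sentence-transformers bundle exactly this kind of pooling on top of a fine-tuned BERT encoder.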

jeremycochoy | 2 years ago

I believe RWKV is actually an architecture that can be used for encoding: given an LSTM/GRU, you can simply take the last hidden state as an encoding of your sequence. The same should be possible with RWKV, right?
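The idea generalizes to any recurrent model with a fixed-size state: run it over the tokens and keep the final state as the sequence embedding. A toy sketch with a vanilla RNN cell in numpy (standing in for an LSTM/GRU/RWKV state; weights are random, not trained):

```python
import numpy as np

def rnn_encode(tokens, Wx, Wh):
    """Run a vanilla RNN over the sequence and return the LAST hidden
    state as a fixed-size encoding of the whole sequence -- the same
    trick the comment proposes for LSTM/GRU/RWKV."""
    h = np.zeros(Wh.shape[0])
    for x in tokens:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

rng = np.random.default_rng(1)
dim_in, dim_h = 4, 6
Wx = rng.normal(size=(dim_h, dim_in)) * 0.5
Wh = rng.normal(size=(dim_h, dim_h)) * 0.5

short_seq = rng.normal(size=(3, dim_in))   # 3 tokens
long_seq = rng.normal(size=(12, dim_in))   # 12 tokens
# Both sequences map to the same fixed-size vector, independent of length.
print(rnn_encode(short_seq, Wx, Wh).shape, rnn_encode(long_seq, Wx, Wh).shape)
```

Note that an untrained recurrence only demonstrates the mechanism; whether the last RWKV state is a *good* embedding (versus, say, pooling over all states) depends on what objective the model was trained with.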