
jayalammar | 2 years ago

This is my sense as well. Text-generation LLMs haven't been the best source of embeddings for other downstream use cases. If you're optimizing for token embeddings (e.g., for NER, span detection, or other token-classification tasks), then a token-level training objective is important. If you need text-level embeddings (e.g., for semantic search or text classification), then a text-level training objective is needed (e.g., what Sentence-BERT did to optimize BERT embeddings for semantic search).

That's a great list of existing embedding models (in addition to the SentenceBERT models: https://www.sbert.net/docs/pretrained_models.html).
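Once you have text-level embeddings from a model like Sentence-BERT, semantic search reduces to nearest-neighbor lookup by cosine similarity. A minimal sketch with toy vectors (in practice the vectors would come from a model such as one of the SentenceBERT models linked above; the function name and toy data here are illustrative):

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(-sims), sims    # best match first

# Toy 4-dim embeddings standing in for real model output.
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.8, 0.0],   # near the query direction
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
])
order, sims = cosine_rank(query, docs)
```

The first document wins the ranking because it points in nearly the same direction as the query, which is exactly the property a text-level training objective optimizes for.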


readyplayeremma | 2 years ago

The SGPT model is a high-performing text-embeddings model adapted from a decoder. Using the same techniques with Llama-2 might perform better than you'd expect. I think someone will need to try these things before we know for certain. I believe there is still significant room for improvement in embedding models.
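One of the adaptations the SGPT paper describes is position-weighted mean pooling: because a causal decoder's later tokens have attended to more of the sequence, their hidden states get proportionally more weight. A rough numpy sketch under that assumption (toy hidden states; a real setup would pool the decoder's last-layer states):

```python
import numpy as np

def position_weighted_mean(hidden_states):
    """Pool token hidden states with weights w_i = i / sum(1..n),
    so later tokens (which have seen more context in a causal
    decoder) contribute more to the sequence embedding."""
    n = hidden_states.shape[0]
    weights = np.arange(1, n + 1, dtype=float)
    weights /= weights.sum()
    return weights @ hidden_states

# 3 tokens, hidden size 2 — stand-ins for decoder outputs.
hs = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
emb = position_weighted_mean(hs)
```

Compared with plain mean pooling, this biases the embedding toward the final tokens, which is a cheap way to compensate for the causal attention mask without retraining the backbone.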