autokad | 1 year ago
There were issues an embedding model might not do well on, whereas the LLM could handle them. For example: there were camelCase words like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data). WoodPecker and WoodStock would end up with close embedding values because the word "Wood" dominated the embedding, but these were supposed to go into two different categories.
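One common mitigation for this (not necessarily what the commenter did) is to split camelCase identifiers into their component words before embedding, so that each part is tokenized on its own. A minimal regex-based sketch, with hypothetical function names:

```python
import re

def split_camel_case(word):
    # Split a camelCase/PascalCase identifier into its component words,
    # handling acronym runs (e.g. "HTTPServer" -> ["HTTP", "Server"]).
    return re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', word)

print(split_camel_case("WoodPecker"))      # ['Wood', 'Pecker']
print(split_camel_case("AquafinaBottle"))  # ['Aquafina', 'Bottle']
```

This alone does not fix the underlying problem ("Wood" can still dominate), but it at least lets the embedding model see the parts the category distinction actually hinges on.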
kkielhofner | 1 year ago
When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.
[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
[1] - https://huggingface.co/atomic-canyon/fermi-1024
bravura | 1 year ago
I have been working on embeddings for a while.
For different reasons, I have recently become very interested in learned sparse embeddings, so I am curious: what led you to choose them for your application?