
autokad | 1 year ago

I have found that embeddings + LLM is very successful. I'm making up the words so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it; it was 95% accurate. Taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave the LLM the embedding prediction, the LLM was 98% accurate.
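The nearest-mean-embedding classifier described above can be sketched roughly like this (a minimal sketch: `embed()` is a toy stand-in for whatever real embedding model is used, and the category names are hypothetical):

```python
import math
import random

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a deterministic
    # pseudo-random unit vector per string, just so the sketch runs.
    rng = random.Random(text)
    v = [rng.gauss(0, 1) for _ in range(16)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def category_means(labeled: list[tuple[str, str]]) -> dict[str, list[float]]:
    # labeled: (text, category) training pairs.
    # Average the embeddings of each category's examples.
    by_cat: dict[str, list[list[float]]] = {}
    for text, cat in labeled:
        by_cat.setdefault(cat, []).append(embed(text))
    return {cat: [sum(col) / len(vs) for col in zip(*vs)]
            for cat, vs in by_cat.items()}

def dist(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(text: str, means: dict[str, list[float]]) -> str:
    # Assign the category whose mean embedding is nearest.
    v = embed(text)
    return min(means, key=lambda c: dist(v, means[c]))
```

In the setup described above, this prediction would then be passed to the LLM as an extra hint alongside the item to classify.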

There were cases an embedding model might not do well on, whereas the LLM could handle them. For example: these were camel-case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data). WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding, but they were supposed to go into 2 different categories.
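One common preprocessing step for inputs like this (not necessarily what the author did) is splitting camel-case identifiers into separate words before embedding, so the distinguishing token isn't swallowed by the shared prefix:

```python
import re

def split_camel(word: str) -> str:
    # Split camel-case identifiers into space-separated words,
    # e.g. "WoodPecker" -> "Wood Pecker".
    return " ".join(re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", word))
```

Even with splitting, a dense embedding can still be dominated by the common token, which is part of why the LLM (or a sparse model, as in the reply below) copes better with these cases.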


kkielhofner | 1 year ago

> word Wood dominated the embedding values, but these were supposed to go into 2 different categories

When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.

[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024

[1] - https://huggingface.co/atomic-canyon/fermi-1024
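For readers unfamiliar with SPLADE-style models: they map text to sparse, vocabulary-sized weight vectors (a real model derives the weights from MLM logits via log(1 + ReLU) and max pooling), and relevance is just a dot product over shared terms. A toy sketch of the scoring step, with made-up weights:

```python
def splade_score(query_weights: dict[str, float],
                 doc_weights: dict[str, float]) -> float:
    # Dot product of two sparse term-weight vectors,
    # summing only over terms present in both.
    return sum(w * doc_weights.get(term, 0.0)
               for term, w in query_weights.items())
```

Because each term carries its own learned weight, a distinguishing token like "pecker" vs "stock" can dominate the score even when "wood" is shared, which is the failure mode of the dense embeddings described in the parent comment.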

bravura | 1 year ago

Do you mind sharing why you chose SPLADE-esque sparse embeddings?

I have been working on embeddings for a while, and for various reasons I have recently become very interested in learned sparse embeddings. So I'm curious: what led you to choose them for your application?