navar | 1 month ago

For the retrieval stage, we have developed a highly efficient, CPU-only-friendly text embedding model:

https://huggingface.co/MongoDB/mdbr-leaf-ir

It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).
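
For a quick local test, something like this should work with the standard sentence-transformers API (a minimal sketch; check the model card for the recommended query prompt, and the example texts here are made up):

    # Minimal sketch: encoding queries and documents on CPU with sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("MongoDB/mdbr-leaf-ir")  # small enough that CPU is fine

    docs = [
        "Jennifer Lopez is an American singer and actress.",
        "BM25 is a lexical ranking function used by search engines.",
    ]
    queries = ["who is j lo"]

    doc_emb = model.encode(docs, normalize_embeddings=True)
    query_emb = model.encode(queries, normalize_embeddings=True)

    # With normalized embeddings, the dot product equals cosine similarity.
    print(query_emb @ doc_emb.T)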

You can see an example comparing semantic (i.e., embeddings-based) search vs BM25 vs hybrid search here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning: it will download ~50MB of data for the model weights and ONNX runtime on first load, but should otherwise run smoothly even on a phone)

This mini app illustrates the advantage of semantic search over BM25. For instance, embedding models "know" that "j lo" refers to Jennifer Lopez.
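
A common way to build the hybrid variant is reciprocal rank fusion over the BM25 and embedding rankings; roughly like this (a sketch with a toy corpus, not necessarily what the demo app does internally):

    # Sketch of one common hybrid approach: reciprocal rank fusion (RRF) of
    # BM25 and embedding-based rankings. Not necessarily what the demo app does.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = [
        "Jennifer Lopez released a new album.",
        "The weather in Boston is rainy today.",
        "J.Lo performed at the halftime show.",
    ]
    query = "j lo concert"

    # Lexical side: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

    # Semantic side: cosine similarity of normalized embeddings.
    model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
    d_emb = model.encode(docs, normalize_embeddings=True)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    sem_rank = sorted(range(len(docs)), key=lambda i: -float(q_emb @ d_emb[i]))

    # RRF: each ranker contributes 1 / (k + rank) to a document's fused score.
    k = 60
    fused = {i: 0.0 for i in range(len(docs))}
    for ranking in (bm25_rank, sem_rank):
        for rank, i in enumerate(ranking, start=1):
            fused[i] += 1.0 / (k + rank)

    for i in sorted(fused, key=fused.get, reverse=True):
        print(f"{fused[i]:.4f}  {docs[i]}")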

We have also published the recipe for training this type of model, in case you are interested in doing so; we show that it can be done on relatively modest hardware and that training data is very easy to obtain: https://arxiv.org/abs/2509.12539

HanClinto|1 month ago

Thank you for publishing this! I absolutely love small embedding models, and have used them on a number of projects (both commercial and hobbyist). I look forward to checking this one out!

I don't know if this is too much to ask, but something that would really help me adopt your model is a fine-tuning setup. The BGE series of embedding models has been my go-to for a couple of years now -- not because it's the best-performing on the leaderboards, but because they make it so incredibly easy to fine-tune the models [0]. Give it a JSONL file of training triplets, and you can fine-tune the base models on your own dataset. I appreciate you linking to the paper on the recipe for training this type of model -- how close to turnkey is your model for transfer learning with my own dataset? I looked around for a fine-tuning example of this model and didn't happen to see anything, but I would be very interested in trying this one out.

Does support for fine-tuning already exist? If so, then I would be able to switch to this model away from BGE immediately.

[0] https://github.com/FlagOpen/FlagEmbedding/tree/master/exampl...

navar|1 month ago

As far as I can tell, it should be possible to reuse this fine-tuning code entirely and just replace `--embedder_name_or_path BAAI/bge-base-en-v1.5` with `--embedder_name_or_path MongoDB/mdbr-leaf-ir`.
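
If you would rather stay in Python than use the FlagEmbedding CLI, a rough equivalent with the sentence-transformers training API would look something like this (a sketch; the "query"/"pos"/"neg" field names and the file name are just placeholders for your own triplet data):

    # Sketch: fine-tuning on (query, positive, negative) triplets with
    # sentence-transformers, as an alternative to the FlagEmbedding CLI.
    import json
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("MongoDB/mdbr-leaf-ir")

    examples = []
    with open("triplets.jsonl") as f:  # one {"query": ..., "pos": ..., "neg": ...} per line
        for line in f:
            row = json.loads(line)
            examples.append(InputExample(texts=[row["query"], row["pos"], row["neg"]]))

    loader = DataLoader(examples, shuffle=True, batch_size=32)
    # Other in-batch positives act as additional negatives for each query.
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save("mdbr-leaf-ir-finetuned")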

Note that bge-base-en-v1.5 is a 110M-param model; ours is 23M.

* BEIR performance: bge = 53.23 vs ours = 53.55
* RTEB performance: bge = 43.75 vs ours = 44.82

Overall they should be very similar, except ours is 5x smaller and hence that much faster.

jasonjmcghee|1 month ago

How does performance (embedding speed and recall) compare to minish / model2vec static word embeddings?

navar|1 month ago

I interacted with the authors of these models quite a bit!

These are very interesting models.

The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].

Specifically, inference will be faster because you are essentially only doing tokenization + a lookup table + an average. So despite the fact that their largest model is 32M params, you can expect its inference speed to be higher than that of ours, which is 23M params but transformer-based.
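
To make that concrete, the whole forward pass of a static model is roughly the following (toy sketch with a made-up vocabulary, not model2vec's actual code):

    # Toy sketch of a static-embedding forward pass: tokenize -> table lookup -> average.
    # This is why inference is so cheap; it is not model2vec's actual implementation.
    import numpy as np

    vocab = {"j": 0, "lo": 1, "jennifer": 2, "lopez": 3, "[unk]": 4}
    table = np.random.rand(len(vocab), 256).astype(np.float32)  # stands in for a trained table

    def embed(text: str) -> np.ndarray:
        ids = [vocab.get(tok, vocab["[unk]"]) for tok in text.lower().split()]
        vec = table[ids].mean(axis=0)        # no attention layers, no big matmuls
        return vec / np.linalg.norm(vec)     # unit-normalize for cosine similarity

    print(embed("j lo").shape)  # (256,)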

I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.

As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.
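
If you want to reproduce this kind of comparison yourself, the mteb package can run individual BEIR tasks against any sentence-transformers model; a sketch (NFCorpus is just one example task, the full BEIR average takes much longer to compute):

    # Sketch: scoring a model on a single BEIR task with the mteb package.
    import mteb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
    tasks = mteb.get_tasks(tasks=["NFCorpus"])
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder="results/mdbr-leaf-ir")
    print(results)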

I can't find their larger models but you can generally get an idea of the power level of different embedding models here: https://huggingface.co/spaces/mteb/leaderboard

If you want to run them on a CPU, it may make sense to filter for smaller models (e.g., <100M params). On the other hand, our models achieve higher retrieval scores.

[0] "accuracy" in layman terms, not in accuracy vs recall terms. The correct word here would be "effectiveness".

3abiton|1 month ago

And honestly, in a lot of cases BM25 has been the best approach in many of the projects we deployed.