I had the same thought, although Voyage is 32k vs. 128k for Cohere's Embed v4.
I think anyone who cares enough about embedding performance to use niche models is probably parsing their PDFs into some sort of textual format. Otherwise you need to orient all your pipelines to handle images, which adds significant complexity (hybrid search, reranking, LLM calls, etc. are all much harder with images).
SparkyMcUnicorn|10 months ago
Anecdotal evidence points to benchmarks correlating with result quality for data I've dealt with. I haven't spent a lot of time comparing results between models, because we were happy with the results after trying a few and tuning some settings.
Unless my dataset lines up really well with a benchmark's dataset, creating my own benchmark is probably the only way to know which model is "best".
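Rolling your own benchmark doesn't have to be heavy. A minimal sketch of the idea, assuming you already have a small set of labeled (query, relevant document) pairs and vectors from whatever embedding model you're testing (the inputs here are hypothetical placeholders, not any vendor's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def recall_at_k(query_vecs, doc_vecs, relevant, k=3):
    """Fraction of queries whose labeled relevant doc appears in the top-k.

    relevant[i] is the index of the ground-truth doc for query i.
    """
    hits = 0
    for qi, qv in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda di: -cosine(qv, doc_vecs[di]))
        if relevant[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

Run the same labeled pairs through each candidate model and compare the scores; even a few dozen hand-labeled queries from your real data tends to be more informative than a public leaderboard.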
CharlieDigital|10 months ago
It feels like embedding content that large -- especially in dense texts -- will lead to loss of fidelity/signal in the output vector.
mahjongmen|10 months ago
Voyage-3-large is a text-only and much larger model than Embed-v4. If you want to unlock multimodality with Voyage-3-large, you'd have to either OCR (really bad results usually) or use a VLM to parse your data into textual descriptions (this works all right, but the cost of using a VLM will jack up your data pre-processing costs).
serjester|10 months ago
Not to mention an image is optimistically 50 KB, while the same page represented as markdown is maybe 2–5 KB. When you're talking about pulling in potentially hundreds of pages, that's a 10–20x increase in storage, memory usage, and network overhead.
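The back-of-the-envelope math, using the sizes above (page count and per-page sizes are illustrative assumptions):

```python
pages = 200              # hypothetical retrieval pull of a few hundred pages
image_kb_per_page = 50   # optimistic size of a page stored as an image
md_kb_per_page = 4       # same page as markdown, in the 2-5 KB range

# Per-page blowup from storing images instead of markdown.
ratio = image_kb_per_page / md_kb_per_page  # 12.5x, inside the 10-20x range

# Total moved over the wire / held in memory for the whole pull.
image_total_mb = pages * image_kb_per_page / 1024  # ~9.8 MB as images
md_total_mb = pages * md_kb_per_page / 1024        # ~0.8 MB as markdown
```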
I do wish they had a more head-to-head comparison with voyage. I think they're the de facto king of proprietary embeddings and with Mongo having bought them, I'd love to migrate away once someone can match their performance.
moojacob|10 months ago
I looked at the NDCG and thought that was the dataset, since Voyage and Cohere both used NDCG. I now realize they were separate benchmarks with the same evaluation metric.
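Right, NDCG is just the metric, not the dataset. For reference, a minimal sketch of how it's computed: graded relevance discounted by log rank, normalized against the ideal ordering.

```python
import math

def dcg(rels):
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    """DCG normalized by the DCG of the ideal (best-first) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

So two vendors can report NDCG on completely different datasets and the numbers aren't comparable.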
esafak|10 months ago
moojacob|10 months ago
You’re right, there’s no other way to compare embeddings than a benchmark.
It's just that what the benchmark used by Voyage and Cohere tracks might not be relevant to your own needs.