imperio59 | 1 month ago

From the author:

> at some point we started benchmarking on wikipedia-scale datasets.
> that’s when things started feeling… slow.

So they're talking about this becoming an issue when chunking TBs of data (I assume), not your 1kb random string...


groby_b | 1 month ago

But the bottleneck is generating embeddings either way.

memchunk has a throughput of 164 GB/s. A really fast embedder can deliver maybe 16k embeddings/sec, or ~1.6 MB/s (if you assume 100-char sentences).

That's roughly five orders of magnitude difference. Chunking is not the bottleneck.
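For concreteness, here's the back-of-envelope in Python, using the figures assumed above (rough assumptions, not measurements of any particular system):

    chunk_throughput = 164e9               # memchunk: ~164 GB/s
    embeddings_per_sec = 16_000            # a fast embedder: ~16k embeddings/s
    avg_chunk_bytes = 100                  # assuming ~100-char sentences

    embed_throughput = embeddings_per_sec * avg_chunk_bytes    # ~1.6e6 B/s, i.e. ~1.6 MB/s
    print(chunk_throughput / embed_throughput)                 # ~1e5: about five orders of magnitude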

It might be an architectural issue - you stuff chunks into an MQ and want full visibility into the queue size ASAP - but otherwise it doesn't matter how fast you chunk; your embedder will slow you down.

It's still a neat exercise on principle, though :)
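To illustrate that queue-fed pipeline shape (names and numbers are made up for the sketch, not taken from the article): a fast chunker feeding a slow embedder through a bounded queue ends up pacing itself to the embedder.

    import queue
    import threading
    import time

    chunk_queue = queue.Queue(maxsize=1_000)   # bounded, so the backlog is visible immediately

    def chunker(docs):
        for doc in docs:
            for chunk in doc.split(". "):       # stand-in for real chunking
                chunk_queue.put(chunk)          # blocks whenever the embedder falls behind
        chunk_queue.put(None)                   # sentinel: no more work

    def embedder():
        while (chunk := chunk_queue.get()) is not None:
            time.sleep(1 / 16_000)              # pretend to embed at ~16k chunks/s

    docs = ["sentence one. sentence two. sentence three."] * 5_000
    t = threading.Thread(target=embedder)
    t.start()
    chunker(docs)                               # throttled by the embedder, not by chunking speed
    t.join()                                    # total wall time is set by the embedder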

viraptor | 1 month ago

It doesn't matter if A takes much more time than B, if B is large enough. You're still saving resources and time by optimising B. Also, you seem to assume that every chunk will get embedded - they may be revisiting some pages where the chunks are already present in the database.
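Sketching that reuse check (store, embed_fn, and the hashing scheme are placeholders, not anything from the article): hash each chunk and only pay for an embedding when the hash hasn't been seen before.

    import hashlib

    def embed_new_chunks(chunks, store, embed_fn):
        """Embed only chunks whose content hash isn't already in `store`."""
        for chunk in chunks:
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key not in store:                # re-crawled pages mostly hit this check
                store[key] = embed_fn(chunk)

    # store can be a plain dict for a toy run, or a key-value table in practice:
    # embed_new_chunks(chunks, store={}, embed_fn=model.encode)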