top | item 46504828


snyy | 1 month ago

A big chunk size with overlap solves this. Chunks don't have to be "perfectly" split in order to work well.
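To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name and parameters are hypothetical (this is not memchunk's API); the point is that adjacent chunks share a window of text, so a sentence cut at one boundary survives intact in its neighbor.

```python
def chunk_with_overlap(text: str, size: int = 2048, overlap: int = 256):
    """Yield chunks of `size` characters; each starts `size - overlap`
    characters after the previous one, so neighbors share `overlap` chars.
    Hypothetical sketch, not memchunk's actual interface."""
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]
```

Because every boundary region appears in two chunks, retrieval can tolerate imperfect split points: whichever chunk the embedding matches, the surrounding context is present.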



srcreigh | 1 month ago

True, but you don’t need 150GB/s delimiter scanning in that case either.

snyy | 1 month ago

As the other comment said, it's a matter of good-enough chunk quality. We focus on producing big chunks (the largest we can make without hurting embedding quality) as fast as possible. In our experience, retrieval accuracy is mostly driven by embedding quality, so perfect splits don't move the needle much.

But as the number of files to ingest grows, chunking speed does become a bottleneck. We want everything to be faster (chunking, embedding, retrieval), but chunking was the first piece we tackled. Memchunk is the fastest we could build.
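For illustration, delimiter-aware splitting of the kind discussed above can be sketched in a few lines: take a fixed-size window, then pull the boundary back to the last delimiter in that window. This uses Python's `bytes.rfind`, which is a C-level scan (the same kind of work a SIMD memchr does); it is an assumption-laden sketch, not memchunk's actual implementation.

```python
def split_on_delimiter(data: bytes, target: int = 4096, delim: bytes = b"\n"):
    """Split `data` into chunks of at most `target` bytes, backing each
    boundary up to the last `delim` inside the window when one exists.
    Illustrative sketch only; not memchunk's code."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + target, len(data))
        if end < len(data):
            # Scan backwards for the delimiter inside the current window.
            cut = data.rfind(delim, start, end)
            if cut != -1:
                end = cut + 1  # include the delimiter in this chunk
        chunks.append(data[start:end])
        start = end
    return chunks
```

The design choice here is the same trade-off the thread describes: the scan only adjusts boundaries opportunistically, so throughput stays close to a raw byte scan while most chunks still end on a natural break.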