fmstephe | 1 month ago
"if you search forward, you need to scan through the entire window to find where to split. you’d find a delimiter at byte 50, but you can’t stop there — there might be a better split point closer to your target size. so you keep searching, tracking the last delimiter you saw, until you finally cross the chunk boundary. that’s potentially thousands of matches and index updates."
So I understand that this is optimal if you want to make each chunk as large as possible within a given chunk size limit.
What I don't understand is: why is it desirable to grab the largest chunk possible for a given chunk limit?
Or have I misunderstood this part of the article?
snyy | 1 month ago
We've found that maximizing chunk size gives the best retrieval performance and is easier to maintain, since you don't have to customize the chunking strategy per document type.
The upper limit for chunk size is set by your embedding model. After a certain size, encoding becomes too lossy and performance degrades.
There is a downside: blindly splitting into large chunks may cut a sentence or word off mid-way. We handle this by splitting at delimiters and adding overlap to cover abbreviations and other edge cases.
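Roughly something like this (a simplified sketch, not our exact implementation; the delimiter set, sizes, and overlap handling are placeholders):

    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 32,
                   delimiters: str = ".!?\n") -> list[str]:
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            if end < len(text):
                # Prefer to split at the last delimiter inside the window.
                split = max((text.rfind(d, start, end) for d in delimiters),
                            default=-1)
                if split > start:
                    end = split + 1
            chunks.append(text[start:end])
            # Step back by `overlap` so text straddling a boundary
            # (abbreviations, split sentences) lands in both chunks.
            start = max(end - overlap, start + 1)
        return chunks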