snyy | 1 month ago

We're the maintainers of Chonkie, a chunking library for RAG pipelines.

Recently, we've been using Chonkie to build deep research agents that watch topics for new developments and automatically update their reports. This requires chunking a large amount of data constantly.

While building this, we noticed Chonkie felt slow. We started wondering: what's the theoretical limit here? How fast can text chunking actually get if we throw out all the abstractions and go straight to the metal?

This post is about that rabbit hole and how it led us to build memchunk - the fastest chunking library, capable of chunking text at 1TB/s.

Blog: https://minha.sh/posts/so,-you-want-to-chunk-really-fast

GitHub: https://github.com/chonkie-inc/memchunk

Happy to answer any questions!

djoldman|1 month ago

English word, clause, sentence, and paragraph boundaries do not always line up with delimiter characters such as periods and question marks.

How does the software handle these:

Mrs. Blue went to the sea shore with Mr. Black.

"What's for dinner?" Mrs. Blue asked.
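
To make the concern concrete, here is a minimal Python sketch of a naive delimiter-based splitter. It is an illustration only, not memchunk's actual algorithm or API; the function name `split_after_delimiters` and the delimiter set `.?!` are assumptions made up for this example.

```python
def split_after_delimiters(text: str, delimiters: str = ".?!") -> list[str]:
    """Naive splitter (hypothetical): end a piece right after every delimiter character."""
    pieces = []
    start = 0
    for i, ch in enumerate(text):
        if ch in delimiters:
            # Cut immediately after the delimiter, with no notion of abbreviations or quotes.
            pieces.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # Keep any trailing text that does not end in a delimiter.
        pieces.append(text[start:])
    return pieces


sample_one = "Mrs. Blue went to the sea shore with Mr. Black."
sample_two = '"What\'s for dinner?" Mrs. Blue asked.'

pieces_one = split_after_delimiters(sample_one)
pieces_two = split_after_delimiters(sample_two)

for piece in pieces_one + pieces_two:
    print(repr(piece))
```

Each sample comes back as three pieces: the first sentence is cut after "Mrs." and "Mr.", and the second is cut inside the quotation and again after "Mrs.". A splitter that only looks at delimiter characters cannot tell an abbreviation from the end of a sentence.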