Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG
199 points | bhavnicksm | 1 year ago | github.com
Core features:
- 21MB default install vs 80-171MB alternatives
- 33x faster token chunking than popular alternatives
- Supports multiple chunking strategies: token, word, sentence, and semantic
- Works with all major tokenizers (transformers, tokenizers, tiktoken)
- Zero external dependencies for basic functionality
Technical optimizations:
- Uses tiktoken with multi-threading for faster tokenization
- Implements aggressive caching and precomputation
- Running mean pooling for efficient semantic chunking
- Modular dependency system (install only what you need)
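The "running mean pooling" optimization isn't shown in the post, but the general technique is to update a chunk's pooled embedding in O(d) per added sentence instead of re-averaging all n sentence vectors each time a chunk grows. A minimal sketch of that technique (not Chonkie's actual code; names are illustrative):

```python
class RunningMean:
    """Incrementally maintained mean of embedding vectors.

    Each update costs O(d) for a d-dimensional vector, instead of
    re-averaging all n vectors (O(n*d)) every time a sentence is
    added to the growing chunk.
    """

    def __init__(self):
        self.n = 0
        self.mean = None

    def add(self, vec):
        """Fold one new vector into the running mean and return it."""
        self.n += 1
        if self.mean is None:
            self.mean = list(vec)
        else:
            # Standard incremental mean update: m += (x - m) / n
            self.mean = [m + (x - m) / self.n for m, x in zip(self.mean, vec)]
        return self.mean
```

This is the same incremental-mean identity used in streaming statistics; it lets a semantic chunker compare "chunk embedding so far" against each candidate sentence without recomputing the pool from scratch.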
Benchmarks and code: https://github.com/bhavnicksm/chonkie
Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?
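For readers unfamiliar with the simplest strategy in the list above, fixed-size token chunking with overlap can be sketched as follows. This is a generic illustration, not Chonkie's implementation; the token IDs would come from whichever tokenizer you use (transformers, tokenizers, tiktoken):

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token-ID sequence into fixed-size windows.

    Consecutive windows share `overlap` tokens so that context
    spanning a boundary is not lost entirely. The final window may
    be shorter than `chunk_size`.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

With `chunk_size=512, overlap=64` this is the classic RAG pre-processing step; word and sentence chunking differ mainly in where the split points are allowed to fall.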
simonw|1 year ago
I've been hoping to find an ultra light-weight chunking library that can do things like very simple regex-based sentence/paragraph/markdown-aware chunking with minimal additional dependencies.
parhamn|1 year ago
The more complicated part is the bin-packing problem that emerges depending on how many different contextual sources you have.
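One common shape of that packing step is greedy first-fit: rank retrieved chunks and keep taking the best ones that still fit in the context budget. A sketch under assumed conventions (the `score` field, dict layout, and `count_tokens` callback are all hypothetical):

```python
def pack_context(chunks, budget, count_tokens):
    """Greedy first-fit packing of scored chunks into a token budget.

    `chunks` is a list of dicts with "text" and "score" keys
    (hypothetical layout); `count_tokens` measures the cost of a
    chunk's text. Higher-scoring chunks are considered first, and a
    chunk is skipped if it would exceed the budget.
    """
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost <= budget:
            picked.append(chunk)
            used += cost
    return picked
```

Greedy first-fit is not optimal bin packing, but it is the usual trade-off at retrieval time, where latency matters more than squeezing out the last few tokens.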
jimmySixDOF|1 year ago
[1] https://gist.github.com/LukasKriesch/e75a0132e93ca989f8870c4...
[2] https://jina.ai/segmenter/
andai|1 year ago
I just removed one sentence at a time from the left until there was a jump in the embedding distance. Then repeated for the right side.
bhavnicksm|1 year ago
I hope that you will stick with Chonkie for the journey of making the 'perfect' chunking library!
Thanks again!
mixeden|1 year ago
1) what
rkharsan64|1 year ago
Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...
petesergeant|1 year ago
I’m using o1-preview for chunking, creating summary subdocuments.
bhavnicksm|1 year ago
Thanks for responding, I'll try to make it easier to use something like that in Chonkie in the future!
nostrebored|1 year ago
Chunking is easily where all of these problems die beyond PoC scale.
I’ve talked to multiple code generation companies in the past week; most are stuck with BM25 and ingesting whole files.
bhavnicksm|1 year ago
But, it's on the roadmap, so please hold on!
bravura|1 year ago
I have a particular max token length in mind, and I have a tokenizer like tiktoken. I have a string and I want to quickly find the maximum length truncation of the string that is <= target max token length.
Does chonkie handle this?
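Whether Chonkie exposes this directly isn't stated in the thread, but with a reversible tokenizer like tiktoken no search is needed: encode once, cut the token list, and decode back. A sketch of that pattern, with `encode`/`decode` passed in as placeholders:

```python
def truncate_to_tokens(text, max_tokens, encode, decode):
    """Truncate `text` to at most `max_tokens` tokens.

    With a tiktoken-style reversible tokenizer this is a single
    encode/decode round trip: O(len(text)) with no binary search.
    """
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])
```

With the real library this would be roughly `enc = tiktoken.get_encoding("cl100k_base")` and `truncate_to_tokens(s, n, enc.encode, enc.decode)`. One caveat: for byte-level tokenizers, decoding a truncated ID slice can occasionally land mid-character, so some implementations drop a trailing token or decode with error handling to stay on a clean boundary.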
bhavnicksm|1 year ago
Is that what you meant?
will-burner|1 year ago
edit: Get some Moo Deng jokes in the docs!
spullara|1 year ago
bhavnicksm|1 year ago
Memory footprint of the chunking itself would vary widely based on the dataset, and it's not something we tested... usually other providers don't test it either, as long as it doesn't bring down the computer/server.
If saving memory during runtime is important for your application, let me know! I'd run some benchmarks for it...
Thanks!
trwhite|1 year ago
adwf|1 year ago
Think of it as if ChatGPT (or other models) didn't just have the embedded unstructured knowledge in their weights from training, but also an extra DB on the side with specific structured knowledge that it can look up on the fly.
ilidur|1 year ago
The benchmark numbers are presented to look impressive, but under scrutiny the improvement is at most 1.86x over LangChain, the leading alternative, according to a further page describing the measurements. The README claims to beat it on all aspects, but in the close cases the author's library was benchmarked after warming up while the others were not, so the numbers are not directly comparable. The author acknowledged this but didn't change the methodology to provide a direct comparison.
The author is Bhavnick S. Minhas, an early-career ML engineer with both research and industry experience and a very prolific GitHub contributor.