mixeden | 1 year ago

> Token Chunking: 33x faster than the slowest alternative

1) what

rkharsan64 | 1 year ago

There are only 3 competitors in that particular benchmark, and the speedup over the 2nd is only 1.06x.

Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...

bhavnicksm | 1 year ago

TokenChunking is limited mostly by the tokenizer and much less by the chunking algorithm. Tiktoken tokenizers seem to do better with warm-up, which Chonkie enables by default -- and the 2nd-place entry uses tiktoken as well.
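For anyone unfamiliar, "warm-up" here just means calling the tokenizer once before timing it, so one-time initialization isn't charged to the measured run. A rough sketch of the difference, assuming tiktoken's public API (the encoding name and text are arbitrary placeholders):

    import time
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "some document text " * 1000

    # Cold call: pays one-time setup costs (cache population, etc.)
    t0 = time.perf_counter()
    enc.encode(text)
    print(f"cold: {time.perf_counter() - t0:.4f}s")

    # Warm call: measures steady-state tokenization only
    t0 = time.perf_counter()
    enc.encode(text)
    print(f"warm: {time.perf_counter() - t0:.4f}s")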

Algorithmically, there's not much difference in TokenChunking between Chonkie, LangChain, or any other TokenChunking implementation you might use (except LlamaIndex; I don't know what mess they made to end up with an algorithm 33x slower).

If you only want TokenChunking (which I don't entirely recommend), then rather than Chonkie or LangChain, just write your own for production :) At the very least, don't install 80 MiB packages just for TokenChunking; Chonkie is 4x smaller than that. (A sketch of a hand-rolled chunker follows below.)
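To make "write your own" concrete, here's a hedged sketch of a hand-rolled token chunker on top of tiktoken; the chunk size, overlap, and encoding name are placeholder choices, not anything from Chonkie's API:

    import tiktoken

    def token_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        """Split text into chunks of at most chunk_size tokens, with overlap."""
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        step = chunk_size - overlap  # assumes overlap < chunk_size
        return [
            enc.decode(tokens[i : i + chunk_size])
            for i in range(0, len(tokens), step)
        ]

    print(token_chunks("your document text here", chunk_size=8, overlap=2))

Sliding a fixed-size window over the token list and decoding each window back to text is essentially the whole algorithm, which is why the libraries mostly differ in tokenizer speed rather than chunking logic.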

That's just my honest response... And these benchmarks are just the beginning: future optimizations to SemanticChunking should push the speed-up over the current 2nd place (2.5x right now) even higher.

melony | 1 year ago

How does it compare with NLTK's chunking library? I have found that it works very well for sentence segmentation.
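(For reference, the NLTK sentence segmentation being described is typically the Punkt-based sent_tokenize; a minimal example, noting that the Punkt model needs a one-time download:)

    import nltk

    nltk.download("punkt")  # one-time download of the Punkt sentence model
    from nltk.tokenize import sent_tokenize

    text = "Chunking splits documents into pieces. Sentences are one natural unit."
    print(sent_tokenize(text))
    # ['Chunking splits documents into pieces.', 'Sentences are one natural unit.']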