(no title)
metawake | 1 month ago
Surprising findings:
1. On legal text (CaseHOLD), 1024 chunks scored WORST (0.618). The "small" 256 chunks won (0.664), a ~7% relative swing.
2. On Wikipedia text? All chunk sizes hit ~99%. No difference.
3. Plot twist: At 5K docs, optimal chunk size FLIPPED from 256→1024. Scale changes everything.
Code is MIT: github.com/metawake/ragtune
Happy to discuss methodology.
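For anyone who wants to poke at the setup, here's a rough sketch of the kind of chunk-size sweep involved. Everything here is illustrative, not ragtune's actual API: the word-based chunker, the lexical-overlap scorer, and the toy corpus are all stand-ins (a real run would use token-based chunking and an embedding retriever).

```python
# Hypothetical chunk-size sweep; names and scoring are illustrative stand-ins,
# NOT ragtune's actual implementation.
from collections import Counter

def chunk(text: str, size: int) -> list[str]:
    """Split text into fixed-size word chunks (real tools usually count tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> float:
    """Toy lexical-overlap score; swap in an embedding model in practice."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def recall_at_k(queries, corpus, size, k=5):
    """Fraction of queries whose answer string appears in the top-k chunks."""
    chunks = [c for doc in corpus for c in chunk(doc, size)]
    hits = 0
    for query, answer in queries:
        top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
        hits += any(answer in c for c in top)
    return hits / len(queries)

# Toy data so the sketch runs end to end.
corpus = ["alpha " * 300 + "the treaty was signed in 1848 " + "beta " * 300]
queries = [("when was the treaty signed", "1848")]

for size in (256, 512, 1024):
    print(size, recall_at_k(queries, corpus, size))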
patrakov | 1 month ago
metawake | 1 month ago
I've added this to the roadmap as `--bootstrap N`:
The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std. This also makes it possible to tell real regressions from noise: "Recall dropped 3% ± 0.8%" is actionable; "dropped 3%" alone isn't.
Will ship this in the next few weeks. Thanks for the push toward more rigorous methodology; this is exactly what's missing from most RAG benchmarks.
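A minimal sketch of what I have in mind, assuming standard bootstrap resampling with replacement over the query set. `evaluate` is a placeholder for whatever metric run you already have; none of these names are ragtune's current API:

```python
# Hedged sketch of the proposed --bootstrap N behavior: resample the query
# set with replacement, re-run the eval per sample, report mean ± std.
import random
import statistics

def bootstrap(queries, evaluate, n_samples=100, seed=0):
    """Return (mean, std) of a metric over N bootstrap resamples of `queries`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.choices(queries, k=len(queries))  # sample with replacement
        scores.append(evaluate(sample))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage: per-query scores, metric = fraction clearing a threshold.
queries = [0.61, 0.72, 0.55, 0.68, 0.70, 0.59, 0.66, 0.73]
mean, std = bootstrap(queries, lambda qs: sum(s > 0.6 for s in qs) / len(qs))
print(f"recall ≈ {mean:.3f} ± {std:.3f}")
```

Reporting the std alongside the mean is what turns "dropped 3%" into "dropped 3% ± 0.8%", i.e. something you can act on.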