patrakov | 1 month ago
Now that you have 5K docs, can you try estimating the statistical uncertainty of the Recall@5 and MRR metrics measured via smaller datasets? Just make some different 400-document subsets of the whole 5K HotpotQA dataset and recalculate the metrics.
metawake | 1 month ago
Great suggestion! This is exactly the right methodology for establishing confidence intervals. I've added this to the roadmap as `--bootstrap N`:

    ragtune simulate --queries queries.json --bootstrap 5
    # Output:
    # Recall@5: 0.664 ± 0.012 (n=5)
    # MRR: 0.533 ± 0.008 (n=5)

The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std. This also makes it possible to tell real regressions from noise: "Recall dropped 3% ± 0.8%" is actionable, while "dropped 3%" alone isn't.

I'll ship this in the next few weeks. Thanks for the push toward more rigorous methodology; this is exactly what's missing from most RAG benchmarks.
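The subset idea discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how ragtune implements `--bootstrap`. The function names (`recall_at_k`, `mrr`, `subset_uncertainty`) are hypothetical. The input is assumed to be one rank per query: the 1-based position of the first relevant document, or `None` if nothing relevant was retrieved. That makes Recall@5 here a simple hit rate. Subsets are drawn without replacement, as the thread describes, which is subsampling and not a classical bootstrap (that would resample with replacement). It also resamples queries only; resampling the corpus would mean re-running retrieval for each subset.

```python
import random
import statistics


def recall_at_k(ranks, k=5):
    """Fraction of queries whose first relevant document is ranked within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a query with no relevant hit contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def subset_uncertainty(ranks, subset_size, n_subsets, seed=0):
    """Score n_subsets random query subsets and return {metric: (mean, std)}.

    ranks: per-query rank of the first relevant document (1-based) or None.
    This input format is an assumption made for this sketch.
    """
    if n_subsets < 2:
        raise ValueError("need at least 2 subsets to estimate a standard deviation")
    rng = random.Random(seed)
    recalls, mrrs = [], []
    for _ in range(n_subsets):
        # Draw one subset of queries without replacement.
        subset = rng.sample(ranks, subset_size)
        recalls.append(recall_at_k(subset, k=5))
        mrrs.append(mrr(subset))
    return {
        "recall@5": (statistics.mean(recalls), statistics.stdev(recalls)),
        "mrr": (statistics.mean(mrrs), statistics.stdev(mrrs)),
    }


if __name__ == "__main__":
    # Synthetic ranks for illustration only; these are not real benchmark results.
    demo_rng = random.Random(42)
    fake_ranks = [demo_rng.choice([1, 2, 3, 7, None]) for _ in range(5000)]
    for name, (mean, std) in subset_uncertainty(fake_ranks, 400, 5).items():
        print(f"{name}: {mean:.3f} ± {std:.3f} (n=5)")
```

With a spread estimated this way, a measured change can be compared against it: a 3-point drop against a spread of 0.8 is 3.75 times that spread, which is why the second phrasing in the reply is the actionable one.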