metawake | 1 month ago

Great suggestion! This is exactly the right methodology for establishing confidence intervals.

I've added this to the roadmap as `--bootstrap N`:

    ragtune simulate --queries queries.json --bootstrap 5
    
    # Output:
    # Recall@5:  0.664 ± 0.012 (n=5)
    # MRR:       0.533 ± 0.008 (n=5)

The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std.
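
As a rough sketch of that sampling loop (not ragtune's actual code; `recall_at_k`, `bootstrap_metric`, the `sample_frac` parameter, and the query dict layout are all hypothetical names chosen for illustration), it could look something like this in Python:

```python
import random
import statistics

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def bootstrap_metric(queries, metric, n_rounds=5, sample_frac=0.8, seed=0):
    """Score n_rounds random query subsets; return (mean, std) across rounds.

    Assumption: each round draws a subset WITHOUT replacement, matching the
    "random subsets" wording above. sample_frac is a made-up knob.
    """
    rng = random.Random(seed)
    size = max(1, int(len(queries) * sample_frac))
    round_scores = []
    for _ in range(n_rounds):
        subset = rng.sample(queries, size)
        # Average the per-query metric within this round.
        round_scores.append(statistics.mean(metric(q) for q in subset))
    # Sample standard deviation needs at least two rounds.
    std = statistics.stdev(round_scores) if n_rounds > 1 else 0.0
    return statistics.mean(round_scores), std

# Hypothetical query records: retrieved doc ids plus ground-truth relevant ids.
queries = [
    {"retrieved": ["a", "b", "c"], "relevant": ["a"]},
    {"retrieved": ["d", "e", "f"], "relevant": ["x"]},
    {"retrieved": ["g", "h", "i"], "relevant": ["h", "z"]},
    {"retrieved": ["j", "k", "l"], "relevant": ["l"]},
    {"retrieved": ["m", "n", "o"], "relevant": ["q"]},
]
mean, std = bootstrap_metric(
    queries, lambda q: recall_at_k(q["retrieved"], q["relevant"])
)
print(f"Recall@5: {mean:.3f} ± {std:.3f} (n=5)")
```

Note that a textbook bootstrap resamples with replacement at the full sample size (e.g. `rng.choices(queries, k=len(queries))`); the subset version here just follows the description above.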

This also makes it possible to tell real regressions from noise: e.g. "Recall dropped 3% ± 0.8%" is actionable, while "dropped 3%" alone isn't.
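
To make the "actionable vs. noise" distinction concrete, here is a minimal sketch of one possible check. The function name `is_actionable_drop` and the 2-sigma threshold are assumptions for illustration, not something ragtune defines, and it treats the two runs as independent:

```python
import math

def is_actionable_drop(baseline_mean, candidate_mean,
                       baseline_std, candidate_std, z=2.0):
    """Return True if the metric drop exceeds z times the combined noise.

    Assumption: the two runs are independent, so their standard deviations
    combine in quadrature. z=2.0 is an arbitrary illustrative threshold.
    """
    drop = baseline_mean - candidate_mean
    noise = math.hypot(baseline_std, candidate_std)  # sqrt(a^2 + b^2)
    return drop > z * noise

# A 3-point drop with tight error bars stands out from the noise...
print(is_actionable_drop(0.664, 0.634, 0.006, 0.005))
# ...while the same drop with wide error bars does not.
print(is_actionable_drop(0.664, 0.634, 0.02, 0.02))
```

A proper significance test would be more rigorous; this only shows why the ± term is what makes the number interpretable.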

Will ship this in the next few weeks. Thanks for the push toward more rigorous methodology; this is exactly what's missing from most RAG benchmarks.
