xinweihe | 7 months ago
Yep, we're working on a golden test set with known root causes to benchmark and track agent performance over time. It's taking a bit of work to get right, but we're on it and definitely open to contributions!