tomtom1337 | 22 days ago
And how does one compare the results in a way that is easy to parse? Having 7 models produce 1 PR each is one way, but that doesn’t feel very easy to compare.
languid-photic | 22 days ago
For comparison, there's a `review` command that launches a sandboxed agent to review a given run and rank the various implementations. We usually run 1–3 review agents, pull the top 3 diffs, and do manual review from there.
We're working on better automation for this step right now.
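To make the "several reviewers, pull the top 3" step concrete, here's a rough sketch of one way to combine rankings from multiple review agents into a shortlist. This is not how the `review` command works internally. The Borda count, the `top_candidates` helper, and the `model-*` names are all assumptions made up for illustration.

```python
from collections import defaultdict


def top_candidates(rankings, k=3):
    """Combine several best-first rankings with a Borda count; return the top k.

    Hypothetical helper: each ranking is a list of implementation names,
    best first, as one review agent might report them.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # First place earns n points, last place earns 1.
            scores[candidate] += n - position
    # Highest score first; break ties by name so the result is deterministic.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k]


# Made-up rankings from three review agents over four implementations.
reviews = [
    ["model-a", "model-c", "model-b", "model-d"],
    ["model-c", "model-a", "model-d", "model-b"],
    ["model-a", "model-b", "model-c", "model-d"],
]
shortlist = top_candidates(reviews)
print(shortlist)
```

The shortlist is what you'd then hand to a human for manual review. A Borda count is just one simple choice here. Anything that tolerates reviewers disagreeing would do.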