Unfortunately not, as we used our own internal code for the benchmark. We would also like to see more benchmarks that reflect day-to-day agentic coding use.
Roughly, we had Cursor software engineers record real questions they were asking models, along with the PRs they ultimately made containing the results. We then cleaned these up. That is the benchmark.
gabriel666smith|4 months ago
It's the most prominent part of the release post - but it's really hard to understand what exactly it's saying.
srush|4 months ago