
Tracking Copilot vs. Codex vs. Cursor vs. Devin PR Performance

254 points | HiPHInch | 9 months ago | aavetis.github.io

122 comments


lukehoban|8 months ago

(Disclaimer: I work on coding agents at GitHub)

This data is great, and it is exciting to see the rapid growth of autonomous coding agents across GitHub.

One thing to keep in mind regarding merge rates is that each of these products creates the PR at a different phase of the work. So just tracking PR create to PR merge tells a different story for each product.

In some cases, the work to iterate on the AI generated code (and potentially abandon it if not sufficiently good) is done in private, and only pushed to a GitHub PR once the user decides they are ready to share/merge. This is the case for Codex for example. The merge rates for product experiences like this will look good in the stats presented here, even if many AI generated code changes are being abandoned privately.

For other product experiences, the Draft PR is generated immediately when a task is assigned, and users can iterate on this “in the open” with the coding agent. This creates more transparency into both the success and failure cases (including logs of the agent sessions for both). This is the case for GitHub Copilot coding agent for example. We believe this “learning in the open” is valuable for individuals, teams, and the industry. But it does lead to the merge rates reported here appearing worse - even if logically they are the same as “task assignment to merged PR” success rates for other tools.
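The measurement-window point above can be sketched with made-up numbers: two tools with the same underlying task success rate show very different public merge rates depending on when in the workflow the PR is created. This is a hypothetical illustration, not real data from the linked page.

```python
# Hypothetical illustration: the same underlying success rate looks very
# different depending on when the PR is created. All numbers are made up.

def apparent_merge_rate(tasks_assigned, tasks_succeeded, pr_opened_at):
    """Return (visible PRs, merged PRs, apparent merge rate) for one workflow.

    pr_opened_at="on_success": a PR only appears once the user is ready to
    merge, so privately abandoned work never shows up in the stats.
    pr_opened_at="on_assignment": a draft PR is opened for every task, so
    failures are visible as unmerged PRs.
    """
    if pr_opened_at == "on_success":
        visible = tasks_succeeded   # abandoned attempts never become PRs
    elif pr_opened_at == "on_assignment":
        visible = tasks_assigned    # every assigned task opens a draft PR
    else:
        raise ValueError(pr_opened_at)
    return visible, tasks_succeeded, tasks_succeeded / visible

# Same underlying outcome for both tools: 60 of 100 assigned tasks succeed.
private_iteration = apparent_merge_rate(100, 60, "on_success")
draft_pr_per_task = apparent_merge_rate(100, 60, "on_assignment")
print(private_iteration)   # (60, 60, 1.0)  -> looks like a 100% merge rate
print(draft_pr_per_task)   # (100, 60, 0.6) -> looks like a 60% merge rate
```

Both workflows accomplished exactly the same amount of work; only the denominator visible on GitHub differs.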

We’re looking forward to continuing to evolve the notion of Draft PR to be even more natural for these use cases. And to enabling all of these coding agents to benefit from open collaboration on GitHub.

polskibus|8 months ago

What is your team’s take on the copyright for commits generated by an AI agent? Would copyright protect them?

Current US stance seems to be: https://www.copyright.gov/newsnet/2025/1060.html “It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements”.

If an entire commit is generated by AI, then it is obvious what created it: AI. Such a commit might not be covered by the law. Is this something your team has already analysed?

soamv|8 months ago

This is a great point! But there's an important tradeoff here between human engineering time and the "learning in the open" benefits: a PR discarded privately consumes no human engineering time, a fact that the humans involved might appreciate. How do you balance that tradeoff? Is there such a thing as a diff that's "too bad" to iterate on with a human?

osigurdson|8 months ago

I've been underwhelmed by dedicated tools like Windsurf and Cursor, in the sense that they are usually more annoying than just using ChatGPT. They have their niche, but they are so incredibly flow-destroying that it is hard to use them for long periods of time.

I just started using Codex casually a few days ago though and already have 3 PRs. While different tools for different purposes make sense, Codex's fully async nature is so much nicer. It does simple things like improving consistency and making small improvements quite well, which is really nice. Finally we have something that operates more like an appliance for certain classes of problems. Previously it felt more like a teenager with a learner's permit.

elliotec|8 months ago

Have you tried Claude Code? I’m surprised it’s not in this analysis, but in my personal experience the competition doesn’t even touch it. I’ve tried them all in earnest. My toolkit has been (neo)vim and tmux for at least a decade now, so I understand the apprehension from less terminal-inclined folks who prefer other stuff, but it’s my jam and it just crushes it.

jillesvangurp|8 months ago

OpenAI nailed the UX/DX with Codex. This completely obsoletes Cursor and similar IDEs. I don't need AI in my tools. I just need somebody to work on my code in parallel with me. I'm happy to interact via pull requests and branches.

I found out on Thursday that I have access to Codex with my Plus subscription. I've created and merged about a dozen PRs with it on my OSS projects since then. It's not flawless, but it's pretty good. I've done some tedious work that I had been deferring, got it to complete a few FIXMEs that I hadn't gotten around to fixing, made it write some API documentation, got it to update a README, etc. It's pretty easy to review the PRs.

What I like is that it creates and works on its own branch. I can actually check that branch out, fix a few things myself, push it, and then get it to do PRs against that branch. I had to fix a few small compilation issues. In one case, the fix was just removing a single import that it somehow got wrong; after that, everything built and the tests passed. Overall it's pretty impressive. Very usable.

I wonder how it performs on larger code bases. I expect some issues there. I'm going to give that a try next.

wahnfrieden|8 months ago

On Mac, I don’t like how ChatGPT makes it difficult to run a few queries in parallel for my Xcode project.

deadbabe|8 months ago

You can just use Cursor as a chat assistant if you want.

zX41ZdbW|8 months ago

It is also worth looking at the number of unique repositories for each agent, or the number of unique large repositories (e.g., filtered by a threshold on star count). Here is a report we can check:

https://play.clickhouse.com/play?user=play#V0lUSCByZXBvX3N0Y...

I've also added some less popular agents like jetbrains-junie, and added a link to a random pull request for each agent, so we can look at the example PRs.

gavinray|8 months ago

This is really cool and ought to be higher up I think, especially since you can freely edit + re-run the query in the browser.

That "spark bar-chart" column output is one of the neatest things I've seen in a while. What a brilliant feature.

behnamoh|8 months ago

How about Google Jules?

Also, of course OpenAI Codex would perform well: the tool is heavily tailored to this type of task, whereas Cursor is a more general-purpose (within the programming domain) tool/app.

ubj|8 months ago

Where is Claude Code? Surprised to see it completely left out of this analysis.

ainiriand|8 months ago

It is not an 'agent' in the sense that it is not really autonomous afaik.

ukblewis|8 months ago

Claude Code isn’t a complete agent - it cannot open PRs autonomously AFAIK

tmvnty|8 months ago

Merge rate is definitely a useful signal, but there are certainly other factors we need to consider (PR size, refactors vs. dependency upgrades, direct merges, follow-up PRs correcting merged mistakes, how easy it is to set up these AI agents, marketing, usage fees, etc.). Similar to how NPM downloads alone don’t necessarily reflect a package’s true success or quality.

osigurdson|8 months ago

I suspect most are pretty small. But hey, that is fine as long as they are making code bases a bit better.

dimitri-vs|8 months ago

This might be an obvious question, but why is Claude Code not included?

a_bonobo|8 months ago

I think the OP's page works because these coding agents identify themselves in the PR's head branch, so the creator can just search GitHub for things like is:pr+head:copilot or is:pr+head:codex.

It seems like Claude Code doesn't do that. Some preliminary searching reveals that PRs generated by people using Claude Code come from their own user accounts, though they may be signed to note that Claude was used; example: https://github.com/anthropics/claude-code/pull/1732
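The attribution approach described above can be sketched as query construction. GitHub's search syntax does support `is:pr`, `head:`, and `is:merged` qualifiers; the agent branch prefixes below are assumptions drawn from this thread, not an authoritative list.

```python
from urllib.parse import quote

# Sketch: build GitHub search queries that find PRs by the head-branch
# prefix each agent appears to use. The prefixes here are assumptions
# based on the thread (copilot/..., codex/... branches).
AGENT_HEAD_PREFIXES = {
    "copilot": "copilot",
    "codex": "codex",
}

def search_query(agent, merged=False):
    """Return a GitHub search query string for PRs attributed to an agent."""
    q = f"is:pr head:{AGENT_HEAD_PREFIXES[agent]}"
    if merged:
        q += " is:merged"   # restrict to merged PRs for merge-rate numerators
    return q

def search_url(agent, merged=False):
    """github.com search URL for the same query."""
    return ("https://github.com/search?type=pullrequests&q="
            + quote(search_query(agent, merged)))

print(search_query("codex", merged=True))  # is:pr head:codex is:merged
```

Dividing the merged count by the total count for each agent would reproduce the merge rates on the OP's page, subject to the caveats discussed elsewhere in the thread.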

csallen|8 months ago

I believe these are all "background" agents that, by default, are meant to write code and issue pull requests without you watching/babysitting/guiding the process. I haven't used Claude Code in a while, but from what I recall, it's not that.

throwaway314155|8 months ago

Is this data not somewhat tainted by the fact that there's really zero way to identify how much a human was or wasn't "in the loop" before the PR was created?

thorum|8 months ago

With Jules, I almost always end up making significant changes before approving the PR. So "successful merge" is not a great indicator of how well the model did in my case. I've merged PRs that were initially terrible after going in and fixing all the mistakes.

tptacek|8 months ago

I kind of wondered about that re: Devin vs. Cursor, because the people I know that happen to use Devin are also very hands-on with the code they end up merging.

But you could probably filter this a bit by looking at PR commit counts?
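The commit-count filter suggested above could look something like the sketch below. The record shape, field names, and threshold are illustrative assumptions, not the OP's actual data model.

```python
# Rough sketch of the idea above: use per-PR commit authorship to flag PRs
# that a human likely reworked before merging. The dict shape and the
# threshold are assumptions for illustration only.

def likely_human_edited(pr, extra_commit_threshold=1):
    """Heuristic: a PR containing commits authored by someone other than
    the agent was probably hand-fixed before merge."""
    non_agent_commits = [c for c in pr["commits"] if c["author"] != pr["agent"]]
    return len(non_agent_commits) >= extra_commit_threshold

prs = [
    {"agent": "codex", "commits": [{"author": "codex"}]},
    {"agent": "codex", "commits": [{"author": "codex"}, {"author": "alice"}]},
]
print([likely_human_edited(pr) for pr in prs])  # [False, True]
```

Merge rates computed only over PRs where this heuristic is False would better isolate what the agent accomplished unassisted.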

SilverSlash|8 months ago

Wasn't Codex only released recently? Why does it appear an order of magnitude more often than the others?

bkls|8 months ago

OpenAI's brand, and they're already used by many consumers/enterprises. Distribution advantage.

ehsanu1|8 months ago

It's hard to attribute a higher PR merge rate to higher tool quality here. Another likely factor is the complexity of the task. Just looking at the first PR I saw from the GitHub search for Codex PRs, it was this one-line change that any tool, even years ago, could have easily accomplished: https://github.com/maruyamamasaya/yasukaribike/pull/20/files

knes|8 months ago

This is great work. Would love to see Augmentcode.com's remote agent. If you're down, OP, message me and I'll give you a free subscription to add to the test.

nojs|8 months ago

For people using these, is there an advantage to having the agent create PRs and reviewing these versus just iterating with Cursor/Claude Code locally before committing? It seems like additional bureaucracy and process when you could fix the errors sooner and closer to the source.

cap11235|8 months ago

Ignoring the issue of non-LLM team members, PRs are helpful if you are using GH issues as a memory mechanism, supposedly. That said, I don't bother if I don't have to. I have Claude commit automatically whenever it feels it has made a change, then I curate things before I push (usually just squash).

yoran|8 months ago

All these tools seem to be GitHub-centric. Any tips for teams using GitLab to store their repositories?

s900mhz|8 months ago

I use Claude Code daily at work; it writes all my PRs. It uses the GitHub CLI to manage them.

Since all agents are able to use the terminal, I suggest looking up the GitLab CLI and having it use that. It should work both locally and in runners.

myhandleisbest|8 months ago

Can I get a clarification on the data here: are these PRs reviewed by the tools or fully authored by them?

Also, filter conditions that would be interesting: size of PR, language, files affected, distinct organizations, etc. Let me know if these get added, please!

pkongz|8 months ago

How does this analysis handle potential false positives? For instance, if a user coincidentally names their branch `codex/my-branch`, would it be incorrectly included in the "Codex" statistics?
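One way to tighten the attribution against the false positive described above would be to cross-check the PR author against the agent's bot account, not just the branch prefix. The account names below are illustrative assumptions, not the real bot identities.

```python
# Sketch of the false-positive concern: matching only on the head-branch
# prefix would count a human's `codex/my-branch` as a Codex PR. Checking
# the PR author against known bot accounts tightens the filter.
# NOTE: the bot account names below are made-up placeholders.
AGENT_BOT_AUTHORS = {
    "codex": {"example-codex-bot"},
    "copilot": {"example-copilot-bot"},
}

def attribute_pr(pr):
    """Return the agent a PR is attributed to, or None when the branch
    prefix matches but the author looks like an ordinary human account."""
    for agent, bots in AGENT_BOT_AUTHORS.items():
        if pr["head"].startswith(agent + "/") and pr["author"] in bots:
            return agent
    return None

print(attribute_pr({"head": "codex/my-branch", "author": "some-human"}))        # None
print(attribute_pr({"head": "codex/fix-typo", "author": "example-codex-bot"}))  # codex
```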

selvan|8 months ago

Total PRs for Codex vs. Cursor are 208K vs. 705: an enormous difference in absolute numbers. Since Cursor is very popular, how are its PRs not even 1% of Codex's?

ezyang|8 months ago

The happy path way of getting code out of Codex is a PR. This is emphatically not true for Cursor.

rahimnathwani|8 months ago

I didn't even realize Cursor could make PRs. I thought most people would create PRs themselves once they were happy with a series of commits.

SkyPuncher|8 months ago

This is only comparing _agents_, which is going to exclude pretty much all Cursor usage for two reasons:

* Cursor agents were just introduced in beta and have privacy limitations that prevent their use at many organizations.

* Cursor is still focused on hands-on-keyboard agentic flows, which aren't included in these counts.

nikolayasdf123|8 months ago

Yeah, GitHub Copilot PRs are unusable, from personal experience.

TZubiri|8 months ago

Why are there 170k PRs for a product released last month, but 700 for a product that has been around for something like 6 months and was so popular it got acquired for $3B?

simoncion|8 months ago

It might be the case that "number of PRs" is roughly as good a metric as "number of lines of code produced".

SatvikBeri|8 months ago

I've used Cursor for months and didn't even realize you could make PRs from it. It's not really part of the default workflow.

frognumber|8 months ago

Missing data: I don't create a Codex PR if it's nonsense.

Poor data: if I do make one, I might:

a) Merge it (success)

b) Modify it (sometimes success, sometimes not). In one case, Codex made the wrong changes in all the right places, but it was still easier to work from that by hand.

c) Pick ideas from it (partial success)

So simple merge rates don't say much.

osigurdson|8 months ago

It isn't so much "poor" data as it is a fairly high bar for value generation. If it gets merged it is a fairly clear indicator that some value is created. If it doesn't get merged then it may be adding some value or it may not.

pryelluw|8 months ago

Is it just me, or are there a lot of documentation-related PRs? Not a majority, but enough to mask the impact of agent code.

myhandleisbest|8 months ago

Stats? What about the vibes leaderboard?

falcor84|8 months ago

Which one?

m3kw9|8 months ago

Agents should also sign the PR with secret keys so people can't just fake the commit message.

cjbarber|8 months ago

Seems like the high-order bit impacting results here might be how difficult the PR is?

kaelandt|8 months ago

Could be nice to add a "merged PR with a test" metric. Looking at the PRs, they are mostly without tests, so the numbers could be bogus for all we know.

m4r1k|8 months ago

Just curious, why is there no reference to Google?

rcarmo|8 months ago

I was expecting a better definition of “performance”. Merging a garbage PR shouldn’t be a positive uptick.

zekone|8 months ago

thanks for posting my project bradda

zachlatta|8 months ago

Wow, this is an amazing project. Great work!