This data is great, and it is exciting to see the rapid growth of autonomous coding agents across GitHub.
One thing to keep in mind regarding merge rates is that each of these products creates the PR at a different phase of the work. So just tracking PR create to PR merge tells a different story for each product.
In some cases, the work to iterate on the AI generated code (and potentially abandon it if not sufficiently good) is done in private, and only pushed to a GitHub PR once the user decides they are ready to share/merge. This is the case for Codex for example. The merge rates for product experiences like this will look good in the stats presented here, even if many AI generated code changes are being abandoned privately.
For other product experiences, the Draft PR is generated immediately when a task is assigned, and users can iterate on this “in the open” with the coding agent. This creates more transparency into both the success and failure cases (including logs of the agent sessions for both). This is the case for GitHub Copilot coding agent for example. We believe this “learning in the open” is valuable for individuals, teams, and the industry. But it does lead to the merge rates reported here appearing worse - even if logically they are the same as “task assignment to merged PR” success rates for other tools.
We’re looking forward to continuing to evolve the notion of Draft PR to be even more natural for these use cases. And to enabling all of these coding agents to benefit from open collaboration on GitHub.
What is your team’s take on the copyright for commits generated by ai agent ? Would the copyright protect it?
Current US stance seems to be:
https://www.copyright.gov/newsnet/2025/1060.html
“It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements”.
If entire commit is generated by AI then it is obvious what created it - it’s AI. Such commit might not be covered by the law. Is this something your team has already analysed?
This is a great point! But there's an important tradeoff here about human engineering time versus the "learning in the open" benefits; a PR discarded privately consumes no human engineering time, a fact that the humans involved might appreciate. How do you balance that tradeoff? Is there such a thing as a diff that's "too bad" to iterate on with a human?
I've been underwhelmed with dedicated tools like Windsurf and Cursor in the sense that they are usually more annoying than just using ChatGPT. They have their niche but they are just so incredibly flow destroying it is hard to use them for long periods of time.
I just started using Codex casually a few days ago though and already have 3 PRs. While different tools for different purposes make sense, Codex's fully async nature is so much nicer. It does simple things like improve consistency and make small improvements quite well which is really nice. Finally we have something that operates more like an appliance for a certain classes of problems. Previously it felt more like a teenager with a learners license.
Have you tried Claude code? I’m surprised it’s not in this analysis but in my personal experience, the competition doesn’t even touch it. I’ve tried them all in earnest. My toolkit has been (neo)vim and tmux for at least a decade now so I understand the apprehension for less terminal-inclined folks that prefer other stuff but it’s my jam and just crushes it.
OpenAI nailed the UX/DX with codex. This completely obsoletes cursor and similar IDEs. I don't need AI in my tools. I just need somebody to work on my code in parallel to me. I'm happy to interact via pull requests and branches.
I found out that I have access to codex on Thursday with my plus subscription. I've created and merged about a dozen PRs with it on my OSS projects since then. It's not flawless but it's pretty good. I've done some tedious work that I had been deferring, got it to complete a few FIXMEs that I hadn't gotten around to fixing, made it write some API documentation, got it to update a README, etc. It's pretty easy to review the PRs.
What I like is that it creates and works on its own branch. I can actually check that branch out, fix a few things myself, push it and then get it to do PRs against that branch. I had to fix a few small compilation issues. In one case, the fix was just removing a single import that it somehow got wrong after that everything built and the tests passed. Overall it's pretty impressive. Very usable.
I wonder how it performs on larger code bases. I expect some issues there. I'm going to give that a try next.
It is also worth looking at the number of unique repositories for each agent, or the number of unique large repositories (e.g., by the threshold on the number of stars). Here is the report we can check:
I've also added some less popular agents like jetbrains-junie, and added a link to a random pull request for each agent, so we can look at the example PRs.
also, of course OpenAI Codex would perform well because the tool is heavily tailored to this type of task, whereas Cursor is a more general-purpose (in the programming domain) tool/app.
Merge rates is definitely a useful signal, but there are certainly other factors we need consider (PR small/big edits, refactors vs deps upgrades, direct merges, follow up PRs correcting merged mistakes, how easy it is to setup these AI agents, marketing, usage fees etc). Similar to how NPM downloads alone don’t necessarily reflect a package’s true success or quality.
I think the OP's page works because these coding agents identify themselves as the PR author so the creator can just search Github's issue tracker for things like is:pr+head:copilot or is:pr+head:codex
It seems like Claude Code doesn't do that? some preliminary searching reveals that PRs generated by people using Claude Code use their own user account but may sign that they used Claude, example https://github.com/anthropics/claude-code/pull/1732
I believe these are all "background" agents that, by default, are meant to write code and issue pull requests without you watching/babysitting/guiding the process. I haven't used Claude Code in a while, but from what I recall, it's not that.
Is this data not somewhat tainted by the fact that there's really zero way to identify how much a human was or wasn't "in the loop" before the PR was created?
With Jules, I almost always end up making significant changes before approving the PR. So “successful merge” is not great indicator of how well the model did in my case. I’ve merged PRs that were initially terrible after going in and fixing all the mistakes.
I kind of wondered about that re: Devin vs. Cursor, because the people I know that happen to use Devin are also very hands-on with the code they end up merging.
But you could probably filter this a bit by looking at PR commit counts?
It's hard to attribute PR merge rate with higher tool quality here. Another likely reason is level of complexity of task. Just looking at the first PR I saw from the github search for codex PRs, it was this one-line change that any tool, even years ago, could have easily accomplished: https://github.com/maruyamamasaya/yasukaribike/pull/20/files
For people using these, is there an advantage to having the agent create PRs and reviewing these versus just iterating with Cursor/Claude Code locally before committing? It seems like additional bureaucracy and process when you could fix the errors sooner and closer to the source.
Ignoring the issue of non-LLM team members, PR's are helpful if you are using GH issues as a memory mechanism, supposedly. That said, I don't bother if I don't have to. I have Claude commit automatically when it feels it made a change, then I curate things before I push (usually just squash).
How does this analysis handle potential false positives? For instance, if a user coincidentally names their branch `codex/my-branch`, would it be incorrectly included in the "Codex" statistics?
Total PRs between Codex vs Cursor is 208K vs 705, this is an enormous difference in absolute PRs. Since cursor is very popular, how does their PRs is not even 1% of codex PRs?.
Why is there 170k PR for a product released last month, but 700 for a product that has been around for like 6 months and was so popular it got acquired for 3B?
Missing data: I don't make a codex PR if it's nonsense.
Poor data: If I make one, I either if I want to:
a) Merge it (success)
b) Modify it (sometimes success, sometimes not). In one case, Codex made the wrong changes in all the right places, but it was still easier to work from that by hand.
It isn't so much "poor" data as it is a fairly high bar for value generation. If it gets merged it is a fairly clear indicator that some value is created. If it doesn't get merged then it may be adding some value or it may not.
lukehoban|8 months ago
This data is great, and it is exciting to see the rapid growth of autonomous coding agents across GitHub.
One thing to keep in mind regarding merge rates is that each of these products creates the PR at a different phase of the work. So just tracking PR create to PR merge tells a different story for each product.
In some cases, the work to iterate on the AI generated code (and potentially abandon it if not sufficiently good) is done in private, and only pushed to a GitHub PR once the user decides they are ready to share/merge. This is the case for Codex for example. The merge rates for product experiences like this will look good in the stats presented here, even if many AI generated code changes are being abandoned privately.
For other product experiences, the Draft PR is generated immediately when a task is assigned, and users can iterate on this “in the open” with the coding agent. This creates more transparency into both the success and failure cases (including logs of the agent sessions for both). This is the case for GitHub Copilot coding agent for example. We believe this “learning in the open” is valuable for individuals, teams, and the industry. But it does lead to the merge rates reported here appearing worse - even if logically they are the same as “task assignment to merged PR” success rates for other tools.
We’re looking forward to continuing to evolve the notion of Draft PR to be even more natural for these use cases. And to enabling all of these coding agents to benefit from open collaboration on GitHub.
polskibus|8 months ago
Current US stance seems to be: https://www.copyright.gov/newsnet/2025/1060.html “It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements”.
If entire commit is generated by AI then it is obvious what created it - it’s AI. Such commit might not be covered by the law. Is this something your team has already analysed?
soamv|8 months ago
osigurdson|8 months ago
I just started using Codex casually a few days ago though and already have 3 PRs. While different tools for different purposes make sense, Codex's fully async nature is so much nicer. It does simple things like improve consistency and make small improvements quite well which is really nice. Finally we have something that operates more like an appliance for a certain classes of problems. Previously it felt more like a teenager with a learners license.
elliotec|8 months ago
jillesvangurp|8 months ago
I found out that I have access to codex on Thursday with my plus subscription. I've created and merged about a dozen PRs with it on my OSS projects since then. It's not flawless but it's pretty good. I've done some tedious work that I had been deferring, got it to complete a few FIXMEs that I hadn't gotten around to fixing, made it write some API documentation, got it to update a README, etc. It's pretty easy to review the PRs.
What I like is that it creates and works on its own branch. I can actually check that branch out, fix a few things myself, push it and then get it to do PRs against that branch. I had to fix a few small compilation issues. In one case, the fix was just removing a single import that it somehow got wrong after that everything built and the tests passed. Overall it's pretty impressive. Very usable.
I wonder how it performs on larger code bases. I expect some issues there. I'm going to give that a try next.
wahnfrieden|8 months ago
deadbabe|8 months ago
zX41ZdbW|8 months ago
https://play.clickhouse.com/play?user=play#V0lUSCByZXBvX3N0Y...
I've also added some less popular agents like jetbrains-junie, and added a link to a random pull request for each agent, so we can look at the example PRs.
gavinray|8 months ago
That "spark bar-chart" column output is one of the neatest things I've seen in a while. What a brilliant feature.
behnamoh|8 months ago
also, of course OpenAI Codex would perform well because the tool is heavily tailored to this type of task, whereas Cursor is a more general-purpose (in the programming domain) tool/app.
ubj|8 months ago
ainiriand|8 months ago
NiekvdMaas|8 months ago
ukblewis|8 months ago
tmvnty|8 months ago
osigurdson|8 months ago
dimitri-vs|8 months ago
a_bonobo|8 months ago
It seems like Claude Code doesn't do that? some preliminary searching reveals that PRs generated by people using Claude Code use their own user account but may sign that they used Claude, example https://github.com/anthropics/claude-code/pull/1732
csallen|8 months ago
throwaway314155|8 months ago
thorum|8 months ago
tptacek|8 months ago
But you could probably filter this a bit by looking at PR commit counts?
SilverSlash|8 months ago
bkls|8 months ago
ehsanu1|8 months ago
knes|8 months ago
nojs|8 months ago
cap11235|8 months ago
yoran|8 months ago
s900mhz|8 months ago
Since all agents are able to use the terminal I suggest looking up the Gitlab CLI and have it use that. Should work locally and in runners.
myhandleisbest|8 months ago
Also filter conditions that would be interesting - size of PR, language, files affected, distinct organizations etc. lmk if these get added please!
joshstrange|8 months ago
https://cs.joshstrange.com/lWRtNMTk
zekone|8 months ago
or something will fix this
https://github.com/aavetis/ai-pr-watcher/issues/21
pkongz|8 months ago
selvan|8 months ago
ezyang|8 months ago
rahimnathwani|8 months ago
SkyPuncher|8 months ago
* Cursor agents where just introduced in Beta and have privacy limitations that prevent their usage as many organizations.
* Cursor is still focused on hands-on-keyboard agentic flows, which aren't included in these counts.
nikolayasdf123|8 months ago
TZubiri|8 months ago
simoncion|8 months ago
SatvikBeri|8 months ago
frognumber|8 months ago
Poor data: If I make one, I either if I want to:
a) Merge it (success)
b) Modify it (sometimes success, sometimes not). In one case, Codex made the wrong changes in all the right places, but it was still easier to work from that by hand.
c) Pick ideas from it (partial success)
So simple merge rates don't say much.
osigurdson|8 months ago
pryelluw|8 months ago
myhandleisbest|8 months ago
falcor84|8 months ago
m3kw9|8 months ago
cjbarber|8 months ago
kaelandt|8 months ago
m4r1k|8 months ago
unknown|8 months ago
[deleted]
rcarmo|8 months ago
zekone|8 months ago
zachlatta|8 months ago