The lack of transparency here is wild. They aggregate the scores of the models they test against, which obscures individual performance. They only release results on their own internal benchmark, which they won't release. They talk about RL training, but they don't discuss anything else about how the model was trained, including whether they did their own pre-training or fine-tuned an existing model. I'm skeptical of basically everything claimed here until either they share more details or someone is able to independently benchmark this.
I understand where you're coming from, and I'd love to have learned about pre-training vs. off-the-shelf base model too.
But
> their own internal benchmark that they won't release
If they'd release their internal benchmark suite, it'd make it into the training set of just about every LLM, which, from a strictly scientific standpoint, invalidates all conclusions drawn from that benchmark from then on. On the other hand, not releasing the benchmark means they could've hand-picked the datapoints to favor them. It's a problem that can't be resolved, unfortunately.
Disagree. The ultimate bar, which is easily measurable, is whether users find value in it. Benchmarks are mostly meaningless, especially in the area where, in my opinion, Cursor shines: the tool chain. You can go try Composer yourself today and see if it's valuable to you.
Does it really matter tho? At the end of the day, what matters most is if real users find it useful or not. And cursor has that data (both historically and in real-time). Thousands of accepts/rejects >>> any benchmark that you can come up with. That should allow them to iterate on it, and make it better, eventually.
Benchmarks have become less and less useful. We have our own tests that we run whenever a new model comes out. It's a collection of trivial -> medium -> hard tasks that we've gathered, and it's much more useful to us than any published table. And it leads to more interesting finds, such as using cheaper models (5-mini, fast-code-1, etc) on some tasks vs. the big guns on other tasks.
I'm happy to see Cursor iterate, as they were pretty vulnerable to the labs leaving them behind when all of them came out with coding agents. The multi-agents w/ built-in git worktree support is another big thing they launched recently. They can use their users as "teacher models" for multiple completions by competing models, and by proxying those calls, they get all the signals. And they can then use those signals to iterate on their own models. Cool stuff. We actually need competing products keeping each other in check, w/ the end result being more options for us, and sometimes even cheaper usage overall.
Cursor has the best Tab model, and I feel like their lead there has kept growing - they're doing some really cool things there. https://cursor.com/blog/tab-rl
I wonder how much of the methods/systems/data transfers; if they can pull off the same with their agentic coding model, that would be exciting.
I agree, I tried to switch to Zed this week, and I prefer it in all respects, but the tab model is much worse, and it made me switch back. I never imagined I would care so much about a feature I felt was secondary.
I actually find myself using the agent mode less now; I like keeping code lean by hand and avoiding technical debt. But I do use the tab completions constantly, and they are fantastic now, ever since they can jump around the file.
I feel like that's like having a lead in producing better buggy whips.
I run Claude Code in the background near constantly for a variety of projects, with --dangerously-skip-permissions, and review progress periodically. Tabbing is only relevant when it's totally failing to make progress and I have to manually intervene, and that to me is a failure scenario that is happening less and less often.
Every time I write code myself, I find myself racing the AI to get an indentation in before the AI is done... gets annoying.
I am an ML researcher at Cursor and worked on this project. Would love to hear any feedback you may have on the model, and I can answer questions about the blog post.
Impressive systems write-up. A question: if Composer is an RL finetune on an open model, why keep the weights closed? The edge from a slightly better checkpoint erodes quickly in this market; it's not a durable advantage. Composer protects Cursor's margins from being squeezed by the big AI labs, but that is true whether the weights are open or closed, and I think Cursor would get more lasting benefit from generating developer goodwill than from a narrow, short-lived advantage. But that's just my opinion. I personally find it hard to get excited about yet another proprietary model. GPT-5 and Sonnet 4.5 are around when I need one of those, but I think the future is open.
I don't use these tools that much (I tried and rejected Cursor a while ago), but having played with GPT-5 Codex (as a paying customer) yesterday in regular VSCode, and having had Composer 1 do the exact same things just now, it's night and day.
Composer did everything better, didn't stumble where Codex failed, and most importantly, the speed makes a huge difference. It's extremely comfortable to use, congrats.
Edit: I will therefore reconsider my previous rejection
Why did you stop training shy of the frontier models? From the log plot, it seems like you would only need ~50% more compute to reach frontier capability.
Do you have any graphs handy that replicate the first one in the blog post, but a bit less ambiguously, maybe without model grouping? I feel it would have been fairer to include proper names and individualize the models, rather than grouping everything together while presenting your own model on its own.
GPT-5-codex does more research before tackling a task; that is the biggest weakness for me not using Composer yet.
Could you provide any color on whether ACP (from Zed) will be supported?
Congratulations on your work. I spent the day working with a mix of the Composer/Sonnet 4.5/Gemini 2.5 Pro models. In terms of quality, Composer seems to perform well compared to the others; I have no complaints so far. I'm still using Claude for planning/starting a task, but Composer performed very well in execution.
What I've really enjoyed is the speed. I had already tested other fast models, but with poor quality. Composer is the first one that combines speed and quality, and the experience has been very enjoyable to work with.
I prefer the approach of focusing on faster models despite their lower intelligence, because I want my IDE to fly when I can see the code. I find this useful when I need to manually debug something that no model is able to do: I know the model is going to fail, but at least it will fail fast. On the other hand, if I need more intelligence, I have my other CLI, which doesn't let me see the code but gets the planning and difficult code done.
Maybe I'm an outlier but Sonnet 4.5 quality is about as low as I'm willing to go.
Its generation speed is not the problem or the time sink.
It's wrestling with it to get the right output.
---
And just to clarify, as maybe I misunderstood again, but people here are comparing Cursor to Claude Code, Codex, etc.: isn't this whole article all Cursor, just using different models?
Literally a 30-day-old model and you've moved the “low” goalpost all the way there, haha. Funny how humans work.
There are two different kinds of users: on one side, people who are more hands-off and want the model to autonomously handle longer tasks on its own with minimal guidance, and on the other side, users who want to interactively collaborate with the model to produce the desired results. Speed matters much more for the second case, where you know what you want and just want the model to implement whatever you had in mind as quickly as possible. Intelligence/ability matters more for the first case, when you don't have a full understanding of all the code. I think it's context-dependent for me, where more serious work tends to be more interactive. The intelligence of a model doesn't make up for issues due to lack of context, to me.
Same... I've found that using a non-Claude model just ends up being more expensive and not worth it. "Auto" tokens are hardly free, and I've had plenty of experiences putting "Auto" to work on a "simple" seeming task to have it consume like 1 USD of tokens quite quickly while producing nothing of value, when I'd replay with Claude 4.5 Sonnet non-thinking and it would provide a solid solution for 0.5 USD.
The reason I pulled out the comparison is to highlight how serious they are about all the important parts that make or break the AI coding experience, speed being very important to me. I'd rather catch my model doing the wrong thing quickly than have a higher chance of one-shotting it at the cost of having to do a lot of specification upfront.
While I am excited to see a new model, I am skeptical when there is so much vagueness: charts with "frontier models" without actually spelling out which ones, and charts with no numbers (on the time axis, or in one chart, none at all).
People on here love to be contrarian about Cursor, but I’ve tried all the popular alternatives (Copilot, Claude Code, Codex, Gemini CLI, Cline) and found Cursor’s overall experience to just be unmatched. A big part of that is its speed, another its reliability.
It’s the only coding agent I’m actually really motivated to use out of the box because it really does make me feel more productive while the others keep messing up the project, from way too large changes I didn’t ask for all the way to constant syntax and request errors.
It’s the only coding agent I’ve used that feels serious about being a product rather than a prototype. Their effort in improving their stack is totally paying off.
Can't help but notice you haven't tried Zed!
I dropped Cursor for the precise reason you mention: reliability.
Countless times, my requests in the AI chat just hang there for 30+ seconds until I can retry them.
When I decided to give Claude Code a try (I thought I didn't need it because I used Claude in Cursor), I couldn't believe how much faster it was, and it was literally 100% reliable.
EDIT: given today's release, decided to give it a go. The Composer1 model _is_ fast, but right at the second new agent I started I got this:
> Connection failed. If the problem persists, please check your internet connection or VPN
I too have tried them all and have settled on Cursor being the best. That said, I see the current space split between folks like me, who generally know what they want built and appreciate a tool that helps them get to the goal quicker, and, on the other side of the spectrum, folks who want the tool to orchestrate most of the engineering. I have no opinion on which is better, but I sit in the first camp, and in that camp Cursor is by far the best tool.
I used Cursor for a total of one day (paid for a year subscription), discovered Claude Code later that day, and haven't opened Cursor since.
Note: later I started using Codex, and now Codex is my daily driver, with Claude Code for problems where Codex fails (not many), and again, Cursor is never used.
They were the first mover, but Codex (in my opinion) blows Cursor up into 1000 tiny pieces. It's just so, so much better.
Yep, it just works seamlessly. Sure, it hangs sometimes, but their UI allows you to retry or undo changes to an earlier point in the conversation easily. The autocompletion is nice as well and pretty satisfying to tab through the small and menial things when refactoring.
There are lots of good models we like here. But we agree that getting the right point on the smart+fast graph can make agentic coding feel really good.
I love Cursor. I've tried Copilot/Claude/etc. but keep coming back to Cursor. I just want to work, and Cursor tab complete is dang accurate, esp. for refactoring tasks.
I tried going back to VS Code + Copilot a month ago. I only lasted 4 days because it was too bad. It was super slow and gave poor suggestions, but mostly it just flat out did not suggest anything. Cursor feels snappy in comparison, and the suggestions are more often than not useful. The most annoying thing about Cursor tab complete is that it is so fast that, when I am doing something unusual, it will keep jumping in with useless suggestions. They have a snooze function for this, though.
For anyone else who was wondering, it looks like the within-Cursor model pricing for Cursor Composer is identical to gemini-2.5-pro, gpt-5, and gpt-5-codex: https://cursor.com/docs/models#model-pricing
($1.25 input, $1.25 cache write, $0.13 cache read, and $10 output per million tokens)
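Back-of-the-envelope at those rates (the token counts below are just illustrative numbers I picked, not from the docs):

    20,000 fresh input tokens:  20,000/1M x $1.25 = $0.0250
    80,000 cache-read tokens:   80,000/1M x $0.13 = $0.0104
     2,000 output tokens:        2,000/1M x $10   = $0.0200
                                          total   ~ $0.0554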
I'm curious whether their near-term expectation is that this will be better than those models, or whether this is a model they intend to use in Auto mode, or if the focus is really on speed...? I guess my question is: why would I actively choose this over Auto?
I think both Cursor and Cognition are going in the same direction as SWE-grep [0].
SWE-grep was able to hit ~700 tokens/s and Cursor ~300 tokens/s; hard to compare the precision/recall and cost effectiveness though, considering SWE-grep also adopted a "hack" of running it on Cerebras.
I'm trying to kickstart an RL-based code search project called "op-grep" here [1], still pretty early, but looking for collaborators!
[0]: https://cognition.ai/blog/swe-grep [1]: https://github.com/aperoc/op-grep
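(For a rough sense of those speeds, my own arithmetic rather than either post's: a 10k-token response takes about 10,000/700 ≈ 14 s at SWE-grep's rate versus 10,000/300 ≈ 33 s at Cursor's.)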
I used the new system tonight and it felt like a definite downgrade. Generated a few non-working basic apps, couldn’t handle CSS in a NextJS environment. Terminal context didn’t work. And it went back to not reasoning through the problem until resolution. And kept slowing down.
I’m assuming major release vs stable, but this is pretty lackluster so far. Switched back to Sonnet reasoning. Here’s to improving!
Could anyone explain how to use multiple agents and subagents in Cursor, Claude Code, or others? It is already challenging for me to tame one model doing work, let alone synchronize multiple parallel workers.
Do you have to split the plan into parallelizable tasks that can be worked on in parallel in one codebase without breaking and confusing the other agents?
You can use git worktrees and just have multiple Claude Code terminal instances working on each worktree. That way they don't clash; just delete the worktree when the task is done.
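A minimal sketch of that flow (the repo path and branch name are placeholders):

    # one isolated checkout per task, each on its own new branch
    git worktree add -b auth-fix ../myapp-auth-fix
    cd ../myapp-auth-fix
    claude   # run a separate Claude Code instance in this worktree

    # after the branch is merged, clean up
    git worktree remove ../myapp-auth-fix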
I love Cursor, the tab completion and agent mode. But I really dislike VSCode after using IntelliJ for so many years. I really wish the underlying editor were better, or that I could get Cursor features in IntelliJ instead. The editing of the files is mostly fine, but it's everything else around it that a full IDE provides that's just so much better. Right now it's IntelliJ + Claude Code for me, and it's fine, but I wish I could get the AI power of Cursor in a better package.
IntelliJ's tab-complete is coming along; it's hit or miss whether it will work, but for similar edits I'm finding it picks up the pattern quickly and I can tab-tab-tab to make them happen.
Still not up to Cursor standards though :)
Building off of VSCode was probably Cursor's silver bullet and the best decision they could have ever made.
It made migrating, for everyone using VSCode (probably the single most popular editor) or another VSCode-forked editor (but at the time it was basically all VSCode), as simple as install and import settings.
I do not think Cursor would have done nearly as well as it has if it hadn't. So even though it can be subpar in some areas due to VSCode's baggage, it's probably staying that way for a while.
Hey, really sorry to hear this. Could you email me andrew@cursor.com? Here are 3 suggestions to try:
1. Reset your settings.json - if shared with vscode, sometimes settings can cause perf regressions
2. Could you try cmd-shift-p -> "capture and send debugging data"? Will send us some profiling data to debug
3. Clear your user data (will delete chats) as a last resort - cmd-shift-p, "reveal user data," close the app, then delete this folder and restart the app
Unfortunately not, as we used our own internal code for the benchmark. We would also like to see more benchmarks that reflect the day-to-day agentic coding use.
As a stealth model, it was priced at $1.25 per million input tokens / $10 per million output tokens.
Right now, it seems free when you are a Cursor Pro user, but I'd love more clarity on how much it will cost (I can't believe it'll be unlimited usage for subscribers)
The metrics in the post seem quite abstract. Does anyone know the detailed metrics of this mysterious model? Was it fine-tuned from open models or trained from scratch?
I wonder if this custom model is trained on Cursor users. There's a lot of potential in how much better a custom model could be the more closely it is integrated with the product. Having the model learn to adapt to different user preferences would make it stand out compared to memoryless frontier models.
The fact that you are wondering this is bad. You definitely should know this. _ALL_ the online AI providers are training on your data. They have more expensive enterprise plans if you want to opt out.
Please keep the naming of your models sane. I'd like to know that composer 1 is the first model and composer 2 is the second, and not have composer 1o be yet another 1 variant that's actually newer and better than 2; that's just dumb. Not that you're doing that, but some other companies do.
I think competition in the space is a good thing, but I'm very skeptical their model will outperform Claude.
other links across the web:
https://x.com/amanrsanger/status/1983581288755032320?s=46
https://x.com/cursor_ai/status/1983567619946147967?s=46
Cursor Cheetah would've been amazing. Reusing the Composer name feels like the reverse OpenAI Codex move, haha.