An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025.)
With web search being available in all major user-facing LLM products now (and I believe in some APIs as well, sometimes unintentionally), I feel like the exact month of cutoff is becoming less and less relevant, at least in my personal experience.
The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.
Shouldn't we assume it would have had some FastHTML training, given that March 2025 cutoff date? I'd hope so, but I guess it's more likely that it still hasn't been trained on FastHTML?
One thing I'm 100% sure of is that a clean cutoff date doesn't exist for any large model, or rather there is no single date, since in practice it's almost impossible to achieve that.
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”
Maybe this model will push the “Assign to Copilot” feature closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.
I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!
[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).
That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but it will also require updating the code to new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!
> Users requiring raw chains of thought for advanced prompt engineering can contact sales
So it seems like all 3 of the LLM providers are now hiding the CoT, which is a shame, because it helped to see when the model was going down the wrong track, and let you quickly refine the prompt to ensure it didn't.
In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.
I can't be the only one who thinks this version is no better than the previous one, that LLMs have basically reached a plateau, and that the new releases' "features" are more or less just gimmicks.
Sooo, I love Claude 3.7 and use it every day, and I mostly prefer it to Gemini models. But I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly), and... the thinking process is good, but 70-80% of tool calls are failing for me.
And I mean basic tools like "Write", "Update" failing with invalid syntax.
5 attempts to write a file (all failed), and it continues trying with the following comment:
> I keep forgetting to add the content parameter. Let me fix that.
So something is wrong here. Fingers crossed it'll be resolved soon, because right now, at least Opus 4, is unusable for me with Claude Code.
The files it did succeed in creating were high quality.
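To make that failure mode concrete: the tool calls are structured JSON, and a Write call that omits its content parameter gets rejected before anything touches disk, so the model just retries. A minimal sketch of that kind of required-parameter check (the Write/file_path/content names follow the error above; the actual schema Claude Code uses is my assumption):

```python
# Sketch of a required-parameter check for a Write tool call.
# Tool and parameter names follow the comment above; the real schema
# used by Claude Code is an assumption here.
REQUIRED_PARAMS = {"Write": {"file_path", "content"}}

def missing_params(tool: str, args: dict) -> list[str]:
    """Return required parameters that the tool call left out."""
    return sorted(REQUIRED_PARAMS.get(tool, set()) - args.keys())

# The failing pattern: the model emits a path but forgets the content.
print(missing_params("Write", {"file_path": "feature/handler.go"}))
# -> ['content']  (call rejected, model retries)

print(missing_params("Write", {"file_path": "feature/handler.go",
                               "content": "package feature\n"}))
# -> []  (call accepted)
```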
I pay for Claude premium but actually use Grok quite a bit; the 'think' function usually gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I have not tried the $250 ChatGPT model yet though; I just don't like OpenAI's practices lately.
Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).
Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was that one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context: 50-70 kloc?
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
I don't want to see a "summary" of the model's reasoning! If I want to make sure the model's reasoning is accurate and that I can trust its output, I need to see the actual reasoning. It greatly annoys me that OpenAI and now Anthropic are moving towards a system of hiding the model's thinking process, charging users for tokens they cannot see, and providing "summaries" that make it impossible to tell what's actually going on.
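For anyone who wants to check what they actually get back over the API: with extended thinking enabled, the response includes thinking blocks you can print alongside the answer. A rough sketch with the Python SDK (the model ID is assumed from this release, and per the announcement the blocks may come back summarized rather than raw):

```python
# Rough sketch: request extended thinking and print whatever thinking
# blocks the API returns. Per the announcement, long thought processes
# for Claude 4 come back summarized rather than raw.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # model ID assumed from this release
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Why does my binary search loop forever?"}],
)

for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print("[answer]", block.text)
```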
I really hope Sonnet 4 is not obsessed with tool calls the way 3.7 is. 3.5 was sort of this magical experience where, for the first time, I felt the sense that models were going to master programming. It’s kind of been downhill from there.
It feels as if the CPU MHz wars of the '90s are back. Now instead of geeking out about CPU architectures, with their various results of ambiguous value on different benchmarks, we're talking about the same sorts of nerdy things between LLMs.
After using Claude 3.7 Sonnet for a few weeks, my verdict is that its coding abilities are unimpressive, both for unsupervised coding and for problem solving/debugging, if you are expecting accurate results and correct code.
However, as a debugging companion, it's slightly better than a rubber duck, because at least there's some suspension of disbelief, so I tend to explain things to it earnestly and, because of that, process them better myself.
That said, it's remarkable and interesting how quickly these models are getting better. Can't say anything about version 4, not having tested it yet, but in five years' time things are not looking good for junior developers for sure, and a few years after that, for everybody.
As a junior developer it's much easier for me to jump into a new codebase or language and make an impact. I just shipped a new error message in LLVM because Cline found the 5 spots in 10k+ files where I needed to make the code changes.
When I started an internship last year, it took me weeks to learn my way around my team's relatively smaller codebase.
I consider this a skill and cost issue.
If you are rich and able to read fast, you can start writing LLVM/Chrome/etc features before graduating university.
If you cannot afford the hundreds of dollars a month Claude costs or cannot effectively review the code as it is being generated, you will not be employable in the workforce.
Things were already not looking good for junior devs. I graduated this year in Poland; many of my peers were looking for jobs in IT for like a year before they were able to find anything. And many internships were faked, as people couldn't get anything (here you're required to do an internship if you want to graduate).
Most people who are happy with LLM coding say something like "Wow, it's awesome. I asked it to do X and it did it so fast with minimal bugs, and good code", and occasionally show the output. Many provide even more details.
Most people who are not happy with LLM coding ... provide almost no details.
As someone who's impressed by LLM coding, when I read a post like yours, I tend to have a lot of questions, and generally the post doesn't have the answers.
1. What type of problem did you try it out with?
2. Which model did you use (you get points for providing that one!)
3. Did you consider a better model (e.g. how does Gemini 2.5 Pro compare to Sonnet 3.7 on the Aider leaderboard)?
4. What were its failings? Buggy code? Correct code but poorly architected? Correct code but used some obscure method to solve it rather than a canonical one?
5. Was it working on an existing codebase or was this new code?
6. Did you carefully manage how many tokens were sent? Did you use a tool that informs you of the number of tokens for each query?
7. Which tool did you use? It's not just a question of the model, but of how the tool handles the prompts/agents under it. Aider is different from Code which is different from Cursor which is different form Windsurf.
8. What strategy did you follow? Did you give it the broad spec and ask it to do anything? Did you work bottom up and work incrementally?
I'm not saying LLM coding is the best or can replace a human. But for certain use cases (e.g. simple script, written from scratch), it's absolutely fantastic. I (mostly) don't use it on production code, but little peripheral scripts I need to write (at home or work), it's great. And that's why people like me wonder what people like you are doing differently.
But such people aren't forthcoming with the details.
Claude 3.7 wrote me some code this morning, and I was running into a bug. I switched to 4 and gave it its own code. It pointed out the bug right away and fixed it. So an upgrade for me :-)
Is 'Claude does not have the ability to run the code it generates yet' a split infinitive? It's clumsy, no? I'd say 'Claude does not yet have the ability to run the code it generates'.
Claude:
You're absolutely right on both counts!
"Claude does not have the ability to run the code it generates yet" isn't technically a split infinitive (that would be something like "to quickly run"), but it is definitely clumsy. The word "yet" is awkwardly placed at the end, creating unnecessary distance from the verb it's modifying.
Your revision - "Claude does not yet have the ability to run the code it generates" - flows much better. It places "yet" right after the auxiliary verb where it naturally belongs, making the sentence cleaner and more direct.
The original reads like an afterthought was tacked on, while yours has a natural rhythm that emphasizes the temporal aspect more clearly.
It feels like these new models are no longer making order-of-magnitude jumps, but are instead in the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish, and we're into the diminishing-returns phase.
If that's the case, then I have a bad feeling for the state of our industry. My experience with LLMs is that their code does _not_ cut it. The hallucinations are still a serious issue, and even when they aren't hallucinating they do not generate quality code. Their code is riddled with bugs, bad architectures, and poor decisions.
Writing good code with an LLM isn't any faster than writing good code without it, since the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. You have to spend more or less the same amount of time with the LLM understanding the code, thinking about the problems, and verifying its work (and then reprompting or redoing its work) as you would just writing it yourself from the beginning (most of the time).
Which means that all these companies that are firing workers and demanding their remaining employees use LLMs to increase their productivity and throughput are going to find themselves in a few years with spaghettified, bug-riddled codebases that no one understands. And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.
Of course, there could be an unforeseen new order of magnitude jump. There's always the chance of that and then my prediction would be invalid. But so far, what I see is a fast approaching plateau.
Using Claude Opus 4, this was the first time I've gotten any of these models to produce functioning Dyalog APL that does something relatively complicated. And it actually runs without errors. Crazy (at least to me).
I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to have many consequences.
Of course, trying the model on your own use cases more and more lets you narrow in on actual utility, but I'm wondering how others interpret reported benchmarks these days.
Sooo... it can play Pokemon. Feels like they had to throw that in after Google IO yesterday. But the real question is now can it beat the game including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.
https://docs.anthropic.com/en/docs/about-claude/models/overv...
> Which version of tailwind css do you know?
> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.
Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".
Those are already non-issues mostly solved by bots.
In any case, where I think AI could help here would be in summarizing changes, conflicts, and impact on the codebase, and possibly also in conducting security scans.
Opus 4 beat all other models. It's good.
If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?
I wonder how much the results would change with a more agentic flow (e.g. allowing it to see an error, or to run select * from the_table first).
Sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).
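Something like this loop is what I mean by agentic; a minimal sketch, where generate_sql and run_query are hypothetical stand-ins for the LLM call and the database client:

```python
# Sketch of the agentic flow described above: let the model see the
# execution error and retry, instead of grading its first attempt only.
def generate_sql(question: str, feedback: str) -> str:
    """Hypothetical LLM call; a real version would prompt the model."""
    return "SELECT 1"

def run_query(sql: str) -> None:
    """Hypothetical DB client; a real version would hit ClickHouse."""
    pass

def solve_with_feedback(question: str, max_attempts: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            run_query(sql)
        except Exception as err:
            # Feed the failure back instead of giving up on the first try.
            feedback = f"Previous attempt failed with: {err}\nSQL was:\n{sql}"
            continue
        return sql  # first query that executes; correctness still needs checking
    return None
```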
Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?
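My read, and this is an assumption rather than the benchmark's documented definition, is that it's the mean number of tries per task before a query runs and passes, so retries are already priced into the score:

```python
# One plausible reading of "Avg Attempts" (my assumption, not the
# benchmark's documented definition): mean tries per task until success.
def avg_attempts(tries_per_task: list[int]) -> float:
    return sum(tries_per_task) / len(tries_per_task)

# A model that nails 3 of 4 tasks first try but needs 5 tries on the
# last averages 2.0, double a model that always succeeds immediately.
print(avg_attempts([1, 1, 1, 5]))  # 2.0
print(avg_attempts([1, 1, 1, 1]))  # 1.0
```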
History Rhymes with Itself.
Edit: How do you install it? Running `/ide` says "Make sure your IDE has the Claude Code extension", where do you get that?
1. It tended to produce very overcomplicated, high-line-count solutions, even compared to 3.5.
2. It didn't follow code style instructions very well. For example, the instruction to not add docstrings was often ignored.
Hopefully 4 is more steerable.
Claude Opus 4 Thinking 16K: 52.7.
Claude Opus 4 No Reasoning: 34.8.
Claude Sonnet 4 Thinking 64K: 39.6.
Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).
Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).
Claude Sonnet 4 Thinking 64K refused to provide one puzzle answer, citing "Output blocked by content filtering policy." Other models did not refuse.