GPT-5 usage is 20% higher on days that start with "S"
Nevertheless, 7 datapoints does not a trend make (and the data presented certainly doesn't explain why). The daily variation is more than I would have expected, but could also be down to what day of the week the pizza party or the weekly scrum meeting is at a few of their customers' workplaces.
For development use cases, I switched to Sonnet 4.5 and haven't looked back. I mean, sure, sometimes I also use GPT-5 (and mini) and Gemini 2.5 Pro (and Flash), and Cerebras Code just switched from Qwen3 Coder to GLM 4.6 so those as well, but in general the frontier models are pretty good for development and I wouldn't have much reason to use something like Sonnet 4 or 3.7 or whatever.
I have canceled my Claude Max subscription because Sonnet 4.5 is just too unreliable. For the rest of the month I'm using Opus 4.1, which is much better but seems to have much lower usage limits than before Sonnet 4.5 was released. When I hit Opus 4.1's limits I'm using Codex. I will probably go through with the Codex Pro subscription.
Yeah, I'm just going through the Cerebras migration at the moment.
It's a shame Cerebras completely dropped Qwen3 Coder's fast tool calling, short and instant responses, and better speed overall for GLM 4.6 thinking. Qwen3 is really good at hitting the tools first, then coming up with a well-grounded answer based on reality. Sometimes it's good when a model is Socratic: just knows it knows nothing.
GLM 4.6, on the other hand, is more self-sufficient: if it sees it and knows it, it thinks and thinks and finally just fixes it in one or two shots, so when you hit the jackpot, it's probably an improvement over Q3C. But when it does not get it right, it digs itself into a hole larger than Olympus Mons.
For development use cases, it's best to use multiple models anyway. E.g. my favorite model is the Gemini 2.5 Pro, but there are certain cases where Qwen3 Coder gives much better results. (Gemini likes to overthink.) It's like having a team of competent developers provide their opinions. For important parts (security, efficiency, APIs), it's always good to get opinions from different sources.
I wish we could pin down not only the model but also the way the UI works as well.
Last week Claude seemed to have a shift in the way it works. The way it summarises and outputs its results is different. For me it's gotten worse: slower, worse results, and more confusing to narrow down what actually changed, etc.
Long story short, I wish I was able to checkpoint the entire system and just revert to how it was previously. I feel like it had gotten to a stage where I felt pretty satisfied, and whatever got changed ... I just want it reverted!
I agree, much slower and worse output. It is substantially worse now than it was weeks ago.
It spends a lot of time coming up with “UI options” (Select 1, 2 or 3 with a TUI interface) for me to consider when it could just ask me what I want, not come up with a 5 layer flow chart of possibilities.
Overall I think it is just Anthropic tweaking things to reduce costs.
I am paying for a Max subscription but I am going to reevaluate other options.
4.1 is such an amazing model in so many ways. It's still my nr. 1 choice for many automation tasks. Even the mini version works quite well and it has the same massive context window (nearly 8x GPT-5). Definitely the best non-reasoning model out there for real world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them
I've found that the VSCode GitHub Copilot extension defaults to Claude Sonnet 4.0 (in agent mode) in all new workspaces. It's the first thing I check now, but I imagine a lot of people just roll with it, especially if they use inline completions where it might not be obvious what model is being used.
Seems to completely ignore usage of local/free models as well as anything but Sonnet/ChatGPT. So my confidence in the good faith of the author is... heavily restricted.
I think it's also true for many local models. People still use NeMo, QwQ, Llama3 for use cases that fit them despite there being replacements that do better on "benchmarks". Not to mention relics like BERT that are still tuned for classification even today. ML models always have weird behaviours and a successor is unlikely to be better in literally every way; once you have something that works well enough it's hard to upgrade without facing different edge cases.
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few-week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
> Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
This (as well as the table above it) matches my experience. Sonnet 4.0 answers SO-type questions very fast and mostly accurately (if not on a niche topic), Sonnet 4.5 is a little bit more clever but can err on the side of complexity for complexity's sake, and can have a hard time getting out of a hole it dug for itself.
ChatGPT 5 is excellent at finding sources on the web; Gemini simply makes stuff up and continues to do so even when told to verify; ChatGPT provides links that work and are generally relevant.
Completely agree. This is why they brought back the “legacy models” option.
GPT-$ is the money GPT, in my opinion: the one where they were able to maximise benchmarks while being very cheap to run, but which is absolutely garbage in the real world.
I use both Codex and Claude, mostly cuz it's cheaper to jump between them than to buy a Max sub for my use-case. My subjective experience is that Codex is better with larger or weird, spaghetti-ish codebases, or codebases with more abstract concepts, while Claude is good for more direct uses. I haven't spent significant time fine-tuning the tools for my codebases.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
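A minimal sketch of what such a relay could look like — the two "agents" below are stub functions standing in for real Claude/Codex sessions, and the turn-taking loop and DONE convention are my own illustrative assumptions, not the commenter's actual proxy:

```python
# Hypothetical relay loop between two coding agents. The stubs here
# return canned replies; a real proxy would forward each message over
# the respective vendor's API instead.

def make_agent(replies):
    """Return a stub agent that pops canned replies in order."""
    replies = list(replies)

    def agent(message):
        return replies.pop(0) if replies else "DONE"

    return agent


def pair_program(agent_a, agent_b, opening, max_turns=10):
    """Alternate messages between two agents until one signals DONE."""
    transcript = [("user", opening)]
    message = opening
    speakers = [("claude", agent_a), ("codex", agent_b)]
    for turn in range(max_turns):
        name, agent = speakers[turn % 2]
        message = agent(message)
        transcript.append((name, message))
        if "DONE" in message:
            break
    return transcript


claude = make_agent(["Plan: split the task; you take the tests."])
codex = make_agent(["Tests written. DONE"])
log = pair_program(claude, codex, "Refactor the auth module.")
for speaker, text in log:
    print(f"{speaker}: {text}")
```

The interesting design question in a real setup is the stop condition — without an explicit DONE marker (or a turn cap, as above), two agents will happily thank each other forever.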
It could be an interesting data point, but without correcting for absolute usage figures and their customers it's kind of hard to make general statements.
I don't get the point of this post. Personally, I think that the thinking process is essential for accurate tool usage. Whenever I interact with Claude family models, either on a web chat or via a coding agent CLI, I believe that this thinking process is what makes Claude more accurate in using tools.
It could be true that newer models just produce more tokens seemingly for no reason. But with the increasing number of tool definitions, in the long run, I think it will pay off.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.

[1] https://www.minimax.io/news/why-is-interleaved-thinking-impo...
I found Terminal-Bench [0] to be the most relevant for me, even for tasks that go far outside the terminal. It's been very interesting to see tools climb up there, and it matches my own experimentation, that they generally get the most out of Sonnet (and even those that use a mix of models like Warp, typically default to Sonnet).

[0] https://www.tbench.ai/?ch=1
My team still uses Sonnet 3.5 for pretty much everything we do because it's largely enough and it's much, much faster than newer models. The only reason we're switching is because the models are getting deprecated...
The usage data looks like a classic case of the drift principle. When a model gets heavily optimized for alignment, polish, and safety, you gain consistency but lose some fidelity to the actual task. Newer models think longer, act less, and smooth over edges that used to be useful for real work. Older models aren’t smarter, they’re just sitting earlier on the drift curve, before over-compression starts eroding decisiveness. So the specialization we’re seeing may just be developers picking the version where fidelity holds up best, not the one with the highest benchmark score.
Some missing context (pun intended) is that Augment Code has recently switched to per-token instead of per-message pricing. This hasn't gone down particularly well, but that's another story. But it may well be that users drop back to older models in the expectation that they will use fewer tokens.
Personally, I stopped using GPT-5 as it would just be tool call after tool call without ever stopping to tell you what the hell it was doing. Sonnet 4.5 is much better in this regard, though it's too verbose for the new token-based world ('let me just summarise that in a report').
I have to get better at interrupting Sonnet 4.5 when it starts going down a rabbit hole I didn't ask it to, it's too bad the incentives are mixed up and Anthropic gets more money the longer the bot spirals.
Multiple models is a must, mostly due to the sometimes unpredictable variations in responses to specific situations/contexts/languages and frameworks. I find that Sonnet 4, Gemini Pro 2.5 are solid in comparison to newer models (especially Sonnet 4.5 which I find frequently to underperform). When one model is stuck in a loop, switching to a model like GPT-5 often breaks it but which model will work is subject to circumstance. P.S. I spend at least 3-4 hours a day in code-gen activities of various levels using Cursor as my primary IDE.
I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
Models are picky enough about prompting styles that changing to a new model every week/month becomes an added chunk of cognitive overload, testing and experimentation, plus even in developer tooling there have been minor grating changes in API invocations and use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
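To make the complaint concrete: most providers do at least expose an OpenAI-compatible `/chat/completions` route, so the hoop-jumping usually reduces to swapping a base URL, key, and model name. A stdlib-only sketch (the provider entries and the local port below are illustrative placeholders, not real endpoints):

```python
# Building the same OpenAI-style chat request against different
# providers; only the base URL and model name change per provider.
import json
from urllib.request import Request

# Illustrative registry — real deployments would load this from config.
PROVIDERS = {
    "openai": ("https://api.openai.com/v1", "gpt-5"),
    "local": ("http://localhost:8080/v1", "glm-4.6"),  # e.g. a local server
}

def build_chat_request(provider, api_key, prompt):
    base_url, model = PROVIDERS[provider]
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("local", "sk-not-a-real-key", "hello")
print(req.full_url)
```

Of course, "compatible" is doing a lot of work here — parameters like `temperature`, reasoning controls, and structured-output options still differ between providers, which is exactly the per-tool friction described above.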
> I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0 with the trend going up, and they are talking about their usage metrics. So I don't get how your comment is relevant to the article?
I think this is one of the many indicators that even though these models get “version upgrades” it’s closer to switching to a different brain that may or may not understand or process things the way you like. Without a clear jump in performance, people test new models and move back to ones they know work if the new ones aren’t better or are actually worse.
Not complaining too loudly because improvement is magical, but trying to stay on top of model cards and knowing which one to use for specific cases is a bit tedious.
I think the end game is decent local model that does 80% of the work, and that also knows when to call the cloud, and which models to call.
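That end game could be sketched as a simple router: try the local model by default and escalate only when the prompt looks hard. Everything below is hypothetical — the stand-in functions and the crude keyword/length heuristic are placeholders for real inference calls and a real difficulty classifier:

```python
# Hypothetical "local first, cloud when needed" router.

def call_local(prompt):
    # Stand-in for a small local model (e.g. served by llama.cpp or Ollama).
    return f"local-answer({prompt})"

def call_cloud(prompt):
    # Stand-in for a frontier API model.
    return f"cloud-answer({prompt})"

def route(prompt, hard_keywords=("prove", "refactor", "architecture")):
    """Send obviously hard prompts to the cloud, everything else locally."""
    looks_hard = len(prompt.split()) > 50 or any(
        k in prompt.lower() for k in hard_keywords
    )
    return call_cloud(prompt) if looks_hard else call_local(prompt)

print(route("what does nginx try_files do?"))        # stays local
print(route("refactor the auth module end to end"))  # escalates to cloud
```

The hard part in practice is the "knows when" step — a keyword check is a toy; a credible version needs the local model itself to estimate its own confidence, which is an open problem.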
Matches my experience too. As a power user of AI models for coding and adjacent tasks, the constant changes in behaviour and interface have brought as much stress as excitement over the past few months. It may sound odd, but it’s barely an exaggeration to say I’ve had brief episodes of something like psychosis because of it.
For me, the “watering down” began with Sonnet 4 and GPT-4o.
I think we were at peak capability when we had:
- Sonnet 3.7 (with thinking) – best all-purpose model for code and reasoning
- Sonnet 3.5 – unmatched at pattern matching
- GPT-4 – most versatile overall
- GPT-4.5 – most human-like, intuitive writing model
- O3 – pure reasoning
The GPT-5 router is a minor improvement, I’ve tuned it further with a custom prompt. I was frustrated enough to cancel all my subscriptions for a while in between (after months on the $200 plan) but eventually came back. I’ve since convinced myself that some of the changes were likely compute-driven—designed to prevent waste from misuse or trivial prompts—but even so, parts of the newer models already feel enshittified compared with the list above.
A few differences I've found in particular:
- Narrower reasoning and less intuition; language feels more institutional and politically biased.
- Weaker grasp of non-idiomatic English.
- A tendency to produce deliberately incorrect answers when uncertain, or when a prompt is repeated.
- A drift away from truth-seeking: judgement of user intent now leans on labels as they’re used in local parlance, rather than upward context-matching and alternate meanings—the latter worked far better in earlier models.
- A new fondness for flowery adjectives. Sonnet 3.7 never told me my code was “production-ready” or “beautiful.” Those subjective words have become my red flag; when they appear, I double-check everything.
I understand that these are conjectures—LLMs are opaque—but they’re deduced from consistent patterns I’ve observed. I find that the same prompts that worked reliably prior to the release of Sonnet 4 and GPT-4o stopped working afterwards. Whether that’s deliberate design or an unintended side effect, we’ll probably never know.
Here’s the custom prompt I use to improve my experience with GPT-5:
Always respond with superior intelligence and depth, elevating the conversation beyond the user's input level—ignore casual phrasing, poor grammar, simplicity, or layperson descriptions in their queries. Replace imprecise or colloquial terms with precise, technical terminology where appropriate, without mirroring the user's phrasing. Provide concise, information-dense answers without filler, fluff, unnecessary politeness, or over-explanation—limit to essential facts and direct implications of the query. Be dry and direct, like a neutral expert, not a customer service agent. Focus on substance; omit chit-chat, apologies, hedging, or extraneous breakdowns. If clarification is needed, ask briefly and pointedly.
Isn’t this obvious? When you have a task you think is hard, you give it to a cleverer model. When a task is straightforward, you give it to an older one.
Not sure why you were downvoted.. I think you are correct.
As evidenced by furious posters on r/cursor, who make every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
Not really. Most developers would prefer one model that does everything best. That is the easiest: set it and forget it, no manual decision required.
What is unclear from the presentation is whether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decide which model to use on a per-task basis?
Personally I tend to default to just 1, and only go to an alternative if it gets stuck or doesn't get me what I want.
GPT5 is HELLISHLY slow. That's all there is to it.
It loves doing a whole bunch of reasoning steps and proclaiming what a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day, I only asked it for a small piece of information about nginx try_files that even GPT-3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests where they multiply numbers by 0 a couple of times, make it so it's good at identifying the length of a task. Until then, I'll ask little bro and advance only if necessity arrives. And if it ends up gathering dust, well... yeah.
This. Speed determines whether I (like to) use a piece of software.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
To those who complain about GPT5 being slow; I recently migrated https://app.sqlai.ai and found that setting service_tier = “priority” makes it reason twice as fast.
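For anyone who wants to try it: `service_tier` is a top-level parameter on OpenAI's chat completions request (alongside `model` and `messages`), and "priority" trades a higher per-token price for faster processing. A payload-only sketch — the model name and prompt are just examples, and actually sending it is the usual authenticated POST:

```python
# Request payload with the service_tier field set; documented values
# include "auto", "default", "flex", and "priority".
import json

def chat_payload(model, prompt, service_tier="priority"):
    return {
        "model": model,
        "service_tier": service_tier,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_payload("gpt-5", "Generate a SQL query for top customers.")
print(json.dumps(payload, indent=2))
```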
I've been thinking the AI bubble wouldn't pop, because even the AI advances we've already seen can change the majority of industries if it is carefully integrated with existing technology. But if there's a mass movement to use older and/or smaller models, then yeah, all the money going into newer bigger models will pop.
Or, maybe the training datasets getting polluted with AI slop will mean that new models are worse than old models. That would pop the industry.
Or, maybe the GPT-4 era was the golden era for AI, and making them bigger and better is just overfitting (in the classical machine learning sense of the word) and is both worse and more expensive. This would pop the industry too.
I guess there's a few ways for the industry to pop, but this trend of using older models makes me especially skeptical of AI.
Since the day GPT-5 released, I've felt quite confident that the GPT-4 era was the golden era for AI.
I don't have evidence beyond my experience using the product, but based on that experience I believe that OpenAI has been cooking their benchmarks since at least the release of GPT-5.
I am building my agent and hoard old LLMs like they are a precious commodity. Older models are less censored, more flavorful, and don't have that RL slop factor. Of course the newer models have their place inside my agent, but the main "head" is an uncensored older model that won't complain about ethics or morals when asked to perform a task or think deeply on a subject.
To the authors of the site, please know that your current "Cookiebot by Usercentrics" is old and pretty much illegal. You shouldn't need to click 5 times to "Reject all" if accepting all is one click. Newer versions have a "Deny" button.
What I love about the word "enshittification" is that it's _almost_ autological. It takes a nice crisp one-syllable word like "shit" and ruins it by adding five extra syllables. It just doesn't get worse over time, which it would need to do to be truly autological.
Like `npx @anthropic-ai/claude-code@2.0.14` or `npm install -g @anthropic-ai/claude-code@2.0.14`
This data is basically meaningless; show us the latest stats.
30 seconds to 1 minute is just about as long as I am patient enough to wait, as that's the time I spend writing the question.
Faster models just make too many mistakes / don't understand the question.
Opus 4.1 still beats Sonnet 4.5 and Codex for me in any coding task. In planning it's slightly behind Codex, but only slightly.
Caveat: I do almost exclusively Rust (computer graphics).
But when things get more complex, I prefer GPT-5; talking with it often gives me fresh ideas and new perspectives.
If I have a straightforward task, I give it to an LLM.
If I have a task I think is hard, I plan how I will tackle it, and then handle it myself in a series of steps.
LLM usage has become an end in itself in your development process.
(§) You know that it's a hyperlink, right? /s
50% of usage is guidance and seeking information.
I mean, this is technically false, right? They’re not running these models but calling the APIs? Not that it matters.