top | item 47070852


andreagrandi | 11 days ago

I'm only waiting for OpenAI to provide an equivalent ~100 USD subscription to ditch Claude entirely.

Opus has gone downhill continuously in the last week (and before you flood me with replies: I've been testing Opus and Codex in parallel for the last week, and I have plenty of examples of Claude going off track, apologising, saying "now it's all fixed!" and then only fixing part of it, while Codex nailed it on the first shot).

I can accept specific model limits, but not ups and downs in reliability. And don't even get me started on how bad the Claude client has become. Others are finally catching up, and gpt-5.3-codex is definitely better than opus-4.6.

Everyone else (Codex CLI, Copilot CLI, etc.) is going open source; they are going closed. Others (OpenAI, Copilot, etc.) explicitly allow using OpenCode; they explicitly forbid it.

This hostile behaviour is just the final straw.


super256|11 days ago

OpenAI forces users to verify with their ID plus a face scan when using Codex 5.3 if any of their conversations was deemed high risk.

It seems like they currently have a lot of false positives: https://github.com/openai/codex/issues?q=High%20risk

andreagrandi|11 days ago

They haven't asked me yet (my subscription is from work, on a business/team plan). Probably my conversations are too boring.

seu|11 days ago

> Opus has gone downhill continuously in the last week

Is a week the whole attention timespan of the late 2020s?

latexr|11 days ago

We’re still in the mid-late 2020s. Once we really get to the late 2020s, attention spans won’t be long enough to even finish reading your comment. People will be speaking (not typing) to LLMs and getting distracted mid-sentence.

_kb|11 days ago

Unfortunately, “Attention Is All You Need”.

marcus_holmes|11 days ago

oh shit we're in the late 2020's now

abm53|11 days ago

I’m unsure exactly in what way you believe it has gone “downhill”, so this isn’t aimed at you specifically but more at a general pattern I see.

That pattern is people complaining that a particular model has degraded in quality of its responses over time or that it has been “nerfed” etc.

Although the models may evolve, and the tools calling them may change, I suspect a huge amount of this is simply confirmation bias.

ifwinterco|11 days ago

Opus 4.6 genuinely seems worse than 4.5 was in Q4 2025 for me. I know everyone always says this and anecdote != data but this is the first time I've really felt it with a new model to the point where I still reach for the old one.

I'll give GPT 5.3 codex a real try I think

Esophagus4|11 days ago

Huh… I’ve seen this comment a lot in this thread, but I’ve really been impressed with both Anthropic’s latest models and latest tooling (plugins like /frontend-design mean it actually designs real front ends instead of the vibe-coded purple-gradient look). And I see it doing more planning and making fewer mistakes than before. I have to do far less oversight and debugging of broken code these days.

But if people really like Codex better, maybe I’ll try it. I’ve been trying not to pay for 2 subscriptions at once but it might be worth a test.

mosselman|11 days ago

I asked Codex 5.3 and Opus 4.6 to write me a macOS application with a certain set of requirements.

Opus 4.6 wrote me a working macOS application.

Codex wrote me an HTML + CSS mockup that didn't look like a macOS application at all.

Opus 4.5 was fine, but I feel that 4.6 is more often on the money with its implementations than 4.5 was. It is just slower.

kilroy123|11 days ago

I agree with you. Codex 5.3 is good, it's just a bit slower.

trillic|10 days ago

The rate limit on my $20 OpenAI/Codex account feels 10x larger than the one on the $20 Claude account.

choilive|10 days ago

YES. I hit the rate limit in ~15 minutes on Claude, but it takes me a few hours with Codex, A/B testing them on the same tasks. Same $20/mo.

GorbachevyChase|11 days ago

I was underwhelmed by Opus 4.6. I didn’t get a sense of significant improvement, but the token usage was excessive, to the point that I dropped the subscription for Codex. I suspect that all the models are so glib that they can create a quagmire for themselves in a project. I have not yet found a satisfying strategy for non-destructive resets when the system's own comments and notes poison new output. Fortunately, deleting and starting over is cheap.

dannersy|11 days ago

No offense, but this is the most predictable outcome ever. The software industry at large does this over and over again, and somehow we're surprised. Provide a thing for free or for cheap, then slowly draw back availability once you have dominant market share or find yourself needing money (ahem).

The providers want to control what AI does to make money or dominate an industry so they don't have to make their money back right away. This was inevitable, I do not understand why we trust these companies, ever.

NamlchakKhandro|11 days ago

Because it's easier than paying $50k for a local LLM setup that might not last 5 years.

andreagrandi|11 days ago

No offense taken here :)

First, we are not talking about a cheap service here. We are talking about a monthly subscription which costs 100 USD or 200 USD per month, depending on which plan you choose.

Second, it's like selling me a pizza and expecting me to eat it only while sitting at your table. I want to eat the pizza at home. I'm not getting 2-3 extra pizzas; I'm still getting the same pizza everyone else is getting.

neya|11 days ago

It's the most overrated model there is. I do Elixir development primarily, and the model sucks in comparison to Gemini and GPT-5x. But the Claude fanboys swear by it and will attack you if you say even something remotely negative about their "god-sent" model. It fails miserably even in basic chat and research contexts and constantly goes off track. I wired it up to fire off some tasks; it kept hallucinating and swearing it had done them when it hadn't even attempted to. It was so unreliable I had to revert to Gemini.

resiros|11 days ago

It might simply be that it was not trained as much in Elixir RL environments as Gemini and GPT were. I use it for both TS and Python, and it's certainly better than Gemini. Versus Codex, it depends on the task.

thepasch|10 days ago

> I’m only waiting for OpenAI to provide an equivalent ~100 USD subscription to ditch Claude entirely.

I have a feeling Anthropic might be in for an extremely rude awakening when that happens, and I don’t think it’s a matter of “if” anymore.

submain|10 days ago

> And don't even let me get started on how bad Claude client has become

The latest versions of Claude Code have been freezing and then crashing while waiting on long-running commands. It's pretty frustrating.

WarmWash|10 days ago

My favorite conspiracy explanation:

Claude has gotten a lot of popular media attention in the last few weeks, and the influx of users is constraining compute/memory on an already compute-heavy model. So you get all the suspected "tricks": quantization, shorter thinking, KV-cache optimizations.

It feels like the same thing that happened to Gemini 3, and what you can even feel throughout the day (the models seem smartest at 12am).

Dario, in his interview with Dwarkesh last week, also repeated the same refrain as other lab leaders: compute is constrained and there are big trade-offs in how you allocate it. It seems safe to reason, then, that they will use any trick they can to free up compute.

cactusplant7374|11 days ago

No developer writes the same prompt twice. How can you be sure something has changed?

kasey_junk|11 days ago

I regularly run the same prompts twice and through different models, particularly when making changes to agent metadata like agent files or skills.

At least weekly I run a set of prompts to compare Codex and Claude against each other. This is quite easy: the prompt sessions are just text files that get saved.

The problem is doing it enough for statistical significance and judging the output as better or not.
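The significance problem is real: with a weekly suite of only ~10 prompts per model, even a large gap in pass rate is statistically indistinguishable from noise. A minimal stdlib sketch of a two-proportion z-test makes the point (the pass counts below are made up for illustration):

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-sided two-proportion z-test on pass rates.

    Returns (z, p): the z statistic and the two-sided p-value,
    using the pooled-proportion standard error and the normal
    approximation (via math.erf).
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical weekly run: model A passes 7/10 prompts, model B 9/10.
z, p = two_proportion_z(7, 10, 9, 10)
print(f"z={z:.2f}, p={p:.3f}")  # p > 0.05: not significant at n=10
```

At 10 trials apiece, a 70% vs 90% pass rate gives p ≈ 0.26; you'd need on the order of 100 trials per model for that same gap to clear p < 0.05, which is far more than most ad-hoc comparisons run.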

andreagrandi|11 days ago

I suspect you may not be writing code regularly... If I have to ask Claude the same thing three times and it keeps saying "You are right, now I've implemented it!" while the code is still missing 1 out of 3 things, or worse, then I can definitely say the model has become worse (since this wasn't happening before).

SkyPuncher|11 days ago

When I use Claude daily (both professionally and personally, with a Max subscription), there are things it does differently between 4.5 and 4.6. It's hard to point to any single conversation, but in aggregate I'm finding that certain tasks don't go as smoothly as they used to. In my view, Opus 4.6 is a lot better at long-running conversations (which has value) but worse with critical details in smaller conversations.

A few things I've noticed:

* 4.6 doesn't look at certain files that it used to

* 4.6 tends to jump into writing code before it's fully understood the problem (annoying but promptable)

* 4.6 is less likely to do research, write to artifacts, or make external tool calls unless you specifically ask it to

* 4.6 is much more likely to ask annoying (blocking) questions that it can reasonably figure out on its own

* 4.6 is much more likely to miss a critical detail in a planning document after being explicitly told to plan for that detail

* 4.6 needs to more proactively write its memories to file within a conversation to avoid going off track

* 4.6 is a lot worse about demonstrating critical details. I'm so tired of it explaining something conceptually without thinking about how it implements the details.

baq|11 days ago

Ralph Wiggum would like a word

bbstats|11 days ago

all this because of a single week?

andreagrandi|11 days ago

No, it's not the first time their models have degraded for a while.