We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than Google in most cases. Often it is on point, but at times it seems off, inaccurate or maybe shallow. In some cases I just end the session.
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
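To make the "batched non-associative math" caveat concrete, here is a minimal Python sketch of why the same numbers can reduce to slightly different results depending on how they are grouped. It is only an illustration of floating-point behavior, not a description of any provider's inference stack; `chunked_sum` and the chunk sizes are made up for the example.

```python
import random

def chunked_sum(values, chunk_size):
    """Sum values by reducing fixed-size chunks first, then summing the partial results."""
    partials = [sum(values[i:i + chunk_size]) for i in range(0, len(values), chunk_size)]
    return sum(partials)

# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # (0.0) + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds back to -1e16, so the 1.0 is lost -> 0.0

# The same effect on a larger array: different chunk sizes stand in for
# different batch shapes or hardware reduction orders (an assumption made
# for illustration only).
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-8, 8) for _ in range(10_000)]
sums = {size: chunked_sum(values, size) for size in (1, 7, 64, 1024)}

print(left, right)
for size, total in sums.items():
    print(f"chunk_size={size:5d} sum={total!r}")
```

Any differences between the chunked sums are tiny relative to the values involved, which matches the framing above of this being a small caveat rather than a change in model quality.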
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
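One way to check for drift on raw API calls is to log the same prompt against the same pinned model over time and compare the results. Below is a rough Python sketch of that bookkeeping. `call_model` is a hypothetical callable you would supply yourself (wrapping whatever client you use), and `fake_model` is a stand-in so the example runs offline; neither is a real library API.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable

def fingerprint(text: str) -> str:
    """Short, stable hash of a response so runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def record_run(call_model: Callable[[str, str], str], model: str,
               prompt: str, log_path: Path) -> dict:
    """Call the model once and append a timestamped entry to a JSONL log."""
    output = call_model(model, prompt)
    entry = {
        "ts": time.time(),
        "model": model,
        "prompt_id": fingerprint(prompt),
        "output_id": fingerprint(output),
        "output": output,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def distinct_outputs(log_path: Path, model: str, prompt: str) -> set:
    """Return the set of output fingerprints seen for this model and prompt."""
    seen = set()
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["model"] == model and entry["prompt_id"] == fingerprint(prompt):
                seen.add(entry["output_id"])
    return seen

def fake_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real API call; always answers the same way."""
    return f"[{model}] echo: {prompt}"

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs.jsonl"
    for _ in range(3):
        record_run(fake_model, "pinned-model-name", "What does flag X do?", log)
    unique_count = len(distinct_outputs(log, "pinned-model-name", "What does flag X do?"))
    print(unique_count)
```

With a real model, sampling randomness alone will produce different fingerprints, so differing hashes are not evidence of a change. In practice you would compare something more robust, such as scores on a fixed set of prompts with known answers, run through the raw API with the harness and system prompt held constant.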
However, it will give the user lower-quality responses if it finds them “distressed”, choosing paternalistic safety over epistemic accuracy.
As a user gets more frustrated with the system, it will pick up the distress signal even more so, a kind of feedback loop toward degraded service quality.
In my experience.
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
wasmainiac|24 days ago
nl|24 days ago
Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.
If you are seeing this kind of thing, start a new chat and re-run the same query. You'll usually see an improvement.
GorbachevyChase|24 days ago
zamadatix|24 days ago
Trufa|24 days ago
tedsanders|24 days ago
ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
Codex changelog: https://developers.openai.com/codex/changelog/
Codex CLI commit history: https://github.com/openai/codex/commits/main/
joshvm|24 days ago
smugtrain|23 days ago
Someone1234|24 days ago
PS - I appreciate you coming here and commenting!
hhh|24 days ago
derwiki|24 days ago
fragmede|24 days ago
clbrmbr|24 days ago
robertclaus|23 days ago
a456463|23 days ago