[SWE-bench co-author here]
It seems like they run this test on a subset of 50 tasks, and only run it once per day, so a lot of the movement in accuracy could be attributed to that alone.
I would run on 300 tasks, run the test suite 5 or 10 times per day, and average the scores. Lots of variance in the score can come from random factors, even Anthropic's servers being overloaded.
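To make the variance point concrete, here's a back-of-the-envelope sketch. The 70% pass rate is a made-up example; the formula is just the binomial standard error, shrunk by averaging over repeated runs:

```python
import math

def accuracy_std_err(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Standard error of a pass-rate estimate over n_tasks, averaged across n_runs."""
    # Binomial standard error for one run, reduced by sqrt(n_runs) when averaging
    return math.sqrt(p * (1 - p) / n_tasks) / math.sqrt(n_runs)

# Hypothetical 70% true pass rate
print(f"{accuracy_std_err(0.70, 50):.3f}")      # one daily run, 50 tasks  -> 0.065
print(f"{accuracy_std_err(0.70, 300, 5):.3f}")  # 5 runs/day, 300 tasks    -> 0.012
```

So a single 50-task run can wander by ±6-7 points from sampling noise alone, while 5 runs on 300 tasks gets that down to about ±1 point — most of the day-to-day "degradation" signal at n=50 is just noise.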
Davidzheng|1 month ago
botacode|1 month ago
They don't have to be malicious operators in this case. It just happens.
samusiam|1 month ago
Consider two scenarios: (1) degradation leads to the model being routed behind the scenes to a different server, with subtly different performance characteristics, all unbeknownst to the user; (2) degradation leads to the model refusing a request and returning an "overloaded" message.
In the first case, you absolutely want to include that, because that's exactly the kind of opaque performance change you'd want a signal on. In the second case, an automated test harness might fail, but in the real world the user will just wait and retry when the server is under less load. Maybe you don't include that, because it would be misleading to report it as worse performance (in terms of the model's intelligence, which is how the benchmark will be interpreted).
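For the second scenario, a harness can retry capacity failures instead of scoring them as wrong answers. A rough sketch — `OverloadedError` and `call_model` are hypothetical stand-ins, not any real client API:

```python
import time

class OverloadedError(Exception):
    """Hypothetical error a client might raise for an 'overloaded' response."""

def call_with_retry(call_model, prompt, max_retries=5, base_delay=1.0):
    """Retry only capacity failures, so they don't get scored as wrong answers."""
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except OverloadedError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise OverloadedError("still overloaded after retries")
```

That way "the model answered wrong" and "the server was busy" stay separate signals, and only the first one feeds the accuracy number.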
megabless123|1 month ago
cmrdporcupine|1 month ago
I don't know if they do this or not, but the nature of the API is such that you could absolutely load balance this way. The context sent at each point is not, I believe, "sticky" to any server.
TL;DR: you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour across the cluster.
mohsen1|1 month ago
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
https://mafia-arena.com
ofirpress|1 month ago
nikcub|1 month ago
assume this is because of model costs. anthropic could either throw some credits their way (would be worthwhile to dispel the 80 reddit posts a day about degrading models and quantization) or OP could throw up a donation / tip link
simsla|1 month ago
E.g. some binomial proportion confidence intervals.
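For example, a Wilson score interval (one standard choice for binomial proportions) shows how wide the uncertainty is at 50 tasks versus 300 — the scores here are hypothetical:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion (e.g. a pass rate)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo50, hi50 = wilson_interval(35, 50)      # 70% on 50 tasks:  roughly (0.56, 0.81)
lo300, hi300 = wilson_interval(210, 300)  # 70% on 300 tasks: roughly (0.65, 0.75)
```

The 50-task interval spans nearly 25 percentage points, so two daily scores a few points apart are statistically indistinguishable.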
phist_mcgee|1 month ago
seunosewa|1 month ago
GoatInGrey|1 month ago
rootnod3|1 month ago
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?
johnsmith1840|1 month ago
Basically the paper showed methods for handling heavy traffic load by changing model requirements or routing to different models. This was a while ago, and I'm sure it's massively more advanced now.
Also why some of AI's best work for me is early morning and weekends! So yes, the best time to code with modern LLM stacks is when nobody else is. It's also possibly why we go through phases of "they neutered the model" some time after a new release.
kuboble|1 month ago
swyx|1 month ago
copilot_king|1 month ago
[deleted]
bhk|1 month ago
https://www.anthropic.com/engineering/a-postmortem-of-three-...
embedding-shape|1 month ago
epolanski|1 month ago
chrisjj|1 month ago
Are you suggesting result accuracy varies with server load?
dana321|1 month ago
Aha, so the models do degrade under load.
cedws|1 month ago
bredren|1 month ago
It’s a terrific idea to provide this. An ~Isitdownorisitjustme for LLMs would be the canary in the coal mine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
What we could also use is similar stuff for Codex, and eventually Gemini.
Really, the providers themselves should be running these tests and publishing the data.
Availability status information is no longer sufficient to gauge service delivery, because the service is by nature non-deterministic.
swyx|1 month ago
sjtgraham|1 month ago