(no title)
eterm | 1 month ago
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no, it's just your expectations have risen. What was previously mind-blowing improvement is now expected, and any mis-steps feel amplified.
quentindanjou|1 month ago
What we need is an open and independent way of testing LLMs and stricter regulation on the disclosure of a product change when it is paid under a subscription or prepaid plan.
landl0rd|1 month ago
Unfortunately, it's paywalled most of the historical data since I last looked at it, but interesting that opus has dipped below sonnet on overall performance.
Analemma_|1 month ago
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
jampa|1 month ago
This is not the same thing as a "omg vibes are off", it's reproducible, I am using the same prompts and files, and getting way worse results than any other model.
eterm|1 month ago
It has a habit of trusting documentation over the actual code itself, causing no end of trouble.
Check your claude.md files (both local and ~user ) too, there could be something lurking there.
Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.
mrguyorama|1 month ago
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
olao99|1 month ago
ojr|1 month ago
spike021|1 month ago
unknown|1 month ago
[deleted]