They’re not wrong though. The frequency with which these things still just make shit up is astonishingly bad. Very dismissive of a legitimate criticism.
It's getting better, faster than you and I and the GP are. What else matters?
You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as we were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
And yet I still can't trust Claude or o1 to consistently get the simplest of things right, such as test cases (not even full-on test suites, just the test cases). No amount of handholding, prompting, or feeding it examples helps in the slightest. It is consistently wrong for anything but the simplest possible examples, and manually verifying its output takes more effort than if I had just written it myself. I'm not even using an obscure stack or language, though with anything that isn't Python or JS it shits the bed even worse.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs", so I see why the cryptobros have now moved on to hailing LLMs as the second coming of Jesus.
CamperBob2|1 year ago
sensanaty|1 year ago