The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say with such certainty that they aren't, when money is on the line, seems like a questionable conclusion unless you have some insider knowledge you'd like to share with the rest of the class.
It would be way, way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use its wings.
I don't think that really proves anything. It's unsurprising that recumbent bicycles are less represented in the training data, so the model is less able to produce them.
Try something that's roughly equally popular, like a Turkey riding a Scooter, or a Yak driving a Tractor.
fragmede|24 days ago
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say with such certainty that they aren't, when money is on the line, seems like a questionable conclusion unless you have some insider knowledge you'd like to share with the rest of the class.
margalabargala|24 days ago
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
https://i.imgur.com/UvlEBs8.png
WarmWash|24 days ago
mrandish|24 days ago
TheDong|24 days ago
Try something that's roughly equally popular, like a Turkey riding a Scooter, or a Yak driving a Tractor.
riffraff|24 days ago
KeplerBoy|24 days ago
collinmanderson|24 days ago