This paper creates a new benchmark composed of real remote work tasks sourced from the remote working website Upwork. The best commercial LLMs, such as Opus, GPT, Gemini, and Grok, were tested.
Kinda sus that the least-known model did best and none of the more recent models were tested. Capabilities grow very fast, so things that now routinely succeed rarely succeeded even half a year ago.
codexon|22 days ago
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given the performance on other micro-benchmarks, they will probably not be much different on this benchmark.
kolinko|22 days ago
One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." I can't imagine how Opus 4.5 could've failed that.
scotty79|22 days ago
rsynnott|21 days ago
tessitore|22 days ago
zb3|22 days ago
Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051
stoneforger|22 days ago
Venn1|22 days ago