entee | 5 months ago

A lot of this post relies on the recent OpenAI result they call GDPval (link below). They note some limitations (the tasks lack iteration, among others), which are key complaints about current models and possibly fundamental limitations of them.

But more interesting is the 50% win rate stat that represents expert human performance in the paper.

That seems absurdly low: most employees don’t have a 50% success rate on self-contained tasks that take ~1 day of work. That means at least one of a few things could be true:

1. The tasks aren’t defined in a way that makes real world sense

2. The tasks require iteration, which wasn’t tested, for real world success (as many tasks do)

I think that, while interesting and a very worthy research avenue, this paper is only the first in a still-early area of understanding how AI will affect the real world, and it’s hard to project well from this one paper.

https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf1...

drc500free|5 months ago

That's not 50% success rate at completing the task, that's the win rate of a head-to-head comparison of an algorithm and an expert. 50% means the expert and the algorithm each "win" half the time.

viernullvier|5 months ago

For the METR rating (first half of the article), it is indeed 50% success rate at completing the task. The win rate only applies to the GDPval rating (second half of the article).
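To make the difference between the two numbers concrete, here is a minimal sketch in Python. The data is made up for illustration, the function names `success_rate` and `win_rate` are hypothetical, and ties are ignored for simplicity. It only shows that an expert who completes every task can still sit at a 50% win rate against an equally good model.

```python
def success_rate(completed):
    """Fraction of tasks completed successfully (the METR-style reading)."""
    return sum(completed) / len(completed)


def win_rate(preferences):
    """Fraction of head-to-head comparisons won by the model
    (the GDPval-style reading). Each entry names the preferred side."""
    model_wins = sum(1 for p in preferences if p == "model")
    return model_wins / len(preferences)


# Hypothetical data: the expert completes all 10 tasks...
expert_completed = [True] * 10
# ...and a grader prefers the model's output on exactly half of them.
preferences = ["model", "expert"] * 5

print(success_rate(expert_completed))  # 1.0
print(win_rate(preferences))           # 0.5
```

Under these assumptions the two metrics answer different questions: the first asks whether the work got done, the second asks whose output was preferred.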