Or they've determined that micromanaging it is circuitous and increases their dependence on tech giants, so it's a bad deal given that they also need to know the work well enough to verify it anyway.
[+] [-] ben_w|1 month ago|reply
Sounds about right.
With those test parameters for how long it would take a human to complete the same work, it fits a similar pattern to METR; i.e. at "humans would take 11.5 hours" (Figure 4, median) you're pushing your luck for any success with all but the most recent models*, and METR is testing software where AI has the possibility of fully automating a lot of its own tests.
Even more recent models than they tested, like Opus 4.5, are only 50% successful for tasks that take humans 5h20m: https://metr.org/time-horizons/
Assuming the bubble doesn't pop/WW3 doesn't start first (IDK, 25% and 5% respectively?), and if trends continue (???), I expect a similar paper this time next year to show something like 50% success at automation of similar tasks.
* which they didn't test; I don't blame them for that, because this field moves too fast
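To put rough numbers on the "this time next year" guess: a back-of-envelope sketch, assuming the 50% time horizon keeps doubling at a fixed rate. The 5h20m and 11.5h figures are the ones quoted above; the doubling periods (4 and 7 months) are assumptions plugged in for illustration, not something taken from the paper, and `months_until_horizon` is just a made-up helper name.

```python
import math


def months_until_horizon(current_hours, target_hours, doubling_months):
    """Months until the 50% time horizon grows from current to target,
    assuming steady exponential growth with the given doubling period."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


current = 5 + 20 / 60   # 5h20m, the 50% horizon quoted above
target = 11.5           # median human time from Figure 4

# Doubling periods here are assumptions; swap in whatever you believe.
for doubling_months in (4, 7):
    months = months_until_horizon(current, target, doubling_months)
    print(f"doubling every {doubling_months} months -> about {months:.1f} months")
```

Going from 5h20m to 11.5h is only a bit more than one doubling, so under either assumed rate it lands inside a year, if (big if) the trend holds and these tasks behave like the METR ones.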
[+] [-] deterministic|1 month ago|reply
[+] [-] belter|1 month ago|reply
https://news.ycombinator.com/item?id=46928172
https://news.ycombinator.com/item?id=47004754
[+] [-] adyashakti|1 month ago|reply
[+] [-] BoredPositron|1 month ago|reply
[+] [-] devnonymous|1 month ago|reply
[+] [-] gdulli|1 month ago|reply
[+] [-] vrighter|1 month ago|reply
There's a saying that if everywhere you go it smells like shit, you might just have some shit smeared on your own nose.
96% is not "holding it wrong".