fishpham | 17 days ago

Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing'; that is, improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).

jstummbillig|17 days ago

Could it also be that the models are just a lot better than a year ago?

bigbadfeline|17 days ago

> Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding.

Since AI arrived, we have had higher prices, higher deficits and a lower standard of living. Electricity, computers and everything else cost more. "Doing better" can only be justified by that real benchmark.

If Gemini 3 DT were better, we would see falling prices for electricity and everything else, at least until they got back to pre-2019 levels.

layer8|17 days ago

Isn’t the point of ARC that you can’t train against it? Or does it somehow no longer achieve that goal?

egeozcan|17 days ago

How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do leaks by definition. Considering human nature and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

I say this as a person who really enjoys AI, by the way.

theywillnvrknw|17 days ago

* that you weren't supposed to be able to

XenophileJKO|17 days ago

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3

aleph_minus_one|17 days ago

I don't understand what you want to tell us with this image.

olalonde|17 days ago

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

gowld|17 days ago

Does folding a protein count? How about increasing performance at Go?
