(no title)
gkamradt | 11 months ago
Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (introduced in 2019) pinpointed the moment AI moved beyond pure memorization, as shown by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every single ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:
* The two main evaluation sets (semi-private eval, private eval) have increased to 120 tasks each
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
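To make the two-attempt rule concrete, here is a minimal scoring sketch. It assumes ARC-AGI-2 keeps the same JSON task layout as ARC-AGI-1 (a "train" list of demonstration pairs and a "test" list of held-out pairs, each pair an "input"/"output" grid of ints); the file path and the solver are placeholders, not our actual harness.

    import json
    from typing import Callable, List

    Grid = List[List[int]]

    def load_task(path: str) -> dict:
        # Read one task file: demonstration pairs ("train") plus held-out test pairs ("test").
        with open(path) as f:
            return json.load(f)

    def score_task(task: dict, solve: Callable[[dict, Grid], List[Grid]]) -> bool:
        # A task only counts as solved if every test output is matched within
        # two attempts, mirroring the "2 attempts or less" rule above.
        for pair in task["test"]:
            attempts = solve(task, pair["input"])[:2]  # keep at most two guesses
            if pair["output"] not in attempts:
                return False
        return True

    # Trivial baseline "solver" that just returns the input grid twice (scores ~0%).
    identity_solver = lambda task, grid: [grid, grid]
    # score_task(load_task("tasks/example_task.json"), identity_solver)  # hypothetical path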
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) drew 1.5K participating teams and produced 40+ published research papers.
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.
Chathamization|11 months ago
I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.
ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.
Palmik|11 months ago
Who said that cooking dinner couldn't be part of ARC-AGI-<N>?
jononor|11 months ago
There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.
CooCooCaCha|11 months ago
The scenarios you listed are examples of what they’re talking about. Those are tasks that humans can easily do but robots have a hard time with.
artninja1988|11 months ago
gkamradt|11 months ago
1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public
So for those two we don't have protections.
3. Semi-Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. It is very difficult to secure this 100%: the cost to create a new semi-private test set is lower than the effort needed to secure it.
4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
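For reference, here is a rough sketch of how the two public splits could be loaded and scored locally. The directory names are assumptions (mirroring the ARC-AGI-1 repo layout), and the semi-private and private sets are deliberately absent from any local layout; `is_solved` is whatever per-task checker you use (e.g. the score_task/solver pair sketched earlier).

    import glob
    import json

    SPLITS = {
        "public_train": "data/training",    # 1. Public Train - 1,000 public tasks
        "public_eval":  "data/evaluation",  # 2. Public Eval  - 120 public tasks
        # 3. Semi-Private Eval and 4. Private Eval are held out and never shipped.
    }

    def pass_rate(split_dir: str, is_solved) -> float:
        # Fraction of tasks in a local split that the checker marks as solved.
        paths = sorted(glob.glob(f"{split_dir}/*.json"))
        if not paths:
            return 0.0
        solved = 0
        for p in paths:
            with open(p) as f:
                task = json.load(f)
            solved += bool(is_solved(task))
        return solved / len(paths)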
vessenes|11 months ago
You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!
synapsomorphy|11 months ago
The success of o3 directly contradicts us being in an "idea-constrained environment". What makes you believe that?
littlestymaar|11 months ago
From ChatGPT 3.5 up to o1, all LLM progress came from investment in training: either by using much more data, or by using higher-quality synthetic data.
o1 (and then o3) broke this paradigm by applying a novel idea (RL + search on CoT), and it's because of that idea that it was able to make progress on ARC-AGI.
So IMO the success of o3 goes in favor of the argument of how we are in an idea-constrained environment.
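To make "search on CoT" concrete: the actual mechanics of o1/o3 aren't public, but a toy best-of-N sketch captures the general shape of the idea. Here `sample_cot` and `verifier` are hypothetical stand-ins for a model that samples reasoning traces and a learned scorer that ranks them.

    import random
    from typing import Callable, Tuple

    def best_of_n(
        prompt: str,
        sample_cot: Callable[[str], Tuple[str, str]],  # -> (reasoning trace, answer)
        verifier: Callable[[str, str], float],         # -> estimated quality of (trace, answer)
        n: int = 8,
    ) -> str:
        # Sample n chains of thought and keep the answer the verifier scores highest.
        best_answer, best_score = "", float("-inf")
        for _ in range(n):
            reasoning, answer = sample_cot(prompt)
            score = verifier(reasoning, answer)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer

    # Dummy stand-ins, only to show the control flow:
    print(best_of_n("2+2=?",
                    lambda p: ("guess", str(random.randint(0, 8))),
                    lambda r, a: 1.0 if a == "4" else 0.0))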