Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every ARC-AGI-2 task (100% of them), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:
* The two main evaluation sets (semi-private and private) have increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or fewer
* Non-training task sets are now difficulty-calibrated
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and 40+ research papers published.
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.
I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.
ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.
What are you doing to prevent the test set being leaked? Will you still be offering API access to the semi private test set to the big model providers who presumably train on their API?
I'm really pleased to see this! The original ARC-AGI-1 paper still informs how I think about "what is intelligence" today. I was thrilled to see AI models make real progress on that test precisely when we had the next big idea (reasoning). Here's to hoping round 2 falls with a similarly big breakthrough!
I think a lot of people got discouraged, seeing how OpenAI solved ARC-AGI-1 by what seems like brute forcing and throwing money at it. Do you believe ARC was solved in the "spirit" of the challenge? Also, all the open-sourced solutions seem super specific to solving ARC. Is this really leading us to human-level AI at open-ended tasks?
Just want to say I really love these new problems - feels like some general intelligence went into conceiving of and creating these puzzles: we just did a few over dinner as a family.
You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!
Why wasn’t the ICOM framework (D. Kelley) allowed to make a scoring submission after they claimed to have beaten the scores? Are you concerned that may appear to contradict your mission statement and alienate the AGI community?
Which puzzles had the lowest solve rate? I did the first 10 and they all felt easy (mentally solving the easier ones in 10-20 seconds and the harder ones in 30-60 seconds). I’d like to try the most difficult ones.
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization
This is self-referential, the benchmark pinpointed the time when AI went from memorization to problem solving, because the benchmark requires problem solving to complete. How do we know it requires problem solving skills? Because memorization-only LLMs can't do it but humans can.
I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art, however I don't think they identified anything particular with o3, at least they don't seem to have proven a step change.
The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.
ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.
I spent half an hour playing with these now at https://arcprize.org/play and it's fun, but I must say that they are not "easy". So far I eventually solved all of the ones I've gone through, but several took me significantly more than the 2 tries allotted.
I wonder if this can be shown to be a valid IQ test, and if so, what IQ would a person need to solve e.g. 90% of them in 1 or 2 tries.
Yes, I looked at these and thought about what percentage of humans could even solve them. It seems that, unless average humans are not considered generally intelligent, a test for general intelligence should be passable by most humans.
I did the first 10 from ARC-AGI-2 (hard) set. 9 were in one try, 1 was in two.
To be fair I've spent a lot of time thinking about cellular automata and Conway's game of life, which definitely seems to be influencing the design of these puzzles.
I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself move blocks around. Much of the time I'm treating these as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning which might unlock solutions here.
I don't know if this was a design goal, but I just did the first 10 Arc-AGI-2 public eval (hard) puzzles, and found them much more enjoyable (as a human) than any of the Arc-AGI-1 puzzles. That said the grid/puzzle editor is still a little clunky – would be nice to be able to drag-to-paint and have an adjustable brush size.
Maybe this is a really stupid question but I've been curious... are LLMs based on... "Neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?
It’s kind of a silly question in that the neural architecture of neural nets is really only loosely inspired by neurology, and that basic vague neurology is shared by neurotypical people and neurodivergent people and animals and even bugs.
The "select" tool gives some help with tasks that require counting or copying. You can select areas of the input, which will show their dimensions, and copy-paste them into the output (ctrl+c/ctrl+v).
At the very first glance, it's like ARC 1 with some structures serving as contextual data, and more complicated symmetries / topological transformations.
Now, I wonder what surprises are to be found in the full dataset.
The focus on solving discrete tasks cost-efficiently might actually lead us toward deep learning systems that can be used reliably in production, rather than just giving a wow effect or needing constant supervision.
The computer vision community needs a dataset like this for evaluation: train in one domain and test on another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issues with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
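As a toy illustration of that train-in-one-domain, test-in-another protocol: fit a classifier on source-domain data, then compare its accuracy on an in-domain test set against a synthetically shifted copy. Everything below (the nearest-centroid classifier, the Gaussian data, the additive shift) is an illustrative stand-in for ImageNet-R/C style evaluation, not any actual benchmark.

```python
import random

random.seed(0)

def sample(mean, n, dim=8):
    """Draw n points from an isotropic Gaussian 'domain' centered at mean."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(centroids, point):
    """Nearest-centroid classification."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))

# "Source domain": two well-separated classes; fit a nearest-centroid model.
centroids = [centroid(sample(0.0, 100)), centroid(sample(3.0, 100))]

# In-domain test set, plus a crudely "corrupted" copy (additive noise shift).
test = [(p, 0) for p in sample(0.0, 50)] + [(p, 1) for p in sample(3.0, 50)]
shifted = [([x + random.gauss(2.0, 2.0) for x in p], y) for p, y in test]

def accuracy(points):
    return sum(predict(centroids, p) == y for p, y in points) / len(points)

acc_clean, acc_shift = accuracy(test), accuracy(shifted)
print(acc_clean, acc_shift)  # accuracy drops under the domain shift
```

The point is only the evaluation protocol: the gap between `acc_clean` and `acc_shift` is the out-of-domain degradation described above.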
OpenAI will probably be at >60% within three months, if not immediately, with this $1000/question level of compute (which is the way, to be honest: we should throw compute at problems whenever possible; that's the main advantage of silicon intelligence).
Their own admission that intelligence is a meaningless metric without a bound on compute is one of the main reasons AI will overpower human intelligence soon. Simple scaling is very effective.
Have you had any neurologists utilize your dataset? My own reaction after solving a few of the puzzles was "Why is this so intuitive for me, but not for an LLM?".
Our human ability to abstract things is underrated.
There have been some human studies on ARC 1 previously, I expect there will be more in the future. See this paper from 2021, which was one of the earliest works in this direction: https://arxiv.org/abs/2103.05823
These benchmarks, and specifically the constraints placed on solving them (compute etc) seem to me to incentivize the opposite of "general intelligence"
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer based systems constantly gaining capabilities. On the other hand have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
Good question! This was one of the main motivations of our "Paper Prize" track. We wanted to reward conceptual progress over leaderboard chasing. In fact, when we increased the prizes mid-year, we allocated more money to the paper track than to the top score.
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.
Not the team, just following ARC on-and-off as an ML engineer. I think it will take a few years (at least) to see the impact of ARC, especially of the more conceptual works. Those are closer to basic research than applied work; it will take time before the lessons are transferred to applications (which also requires considerable R&D).
But more importantly, current LLM-based systems and the in-the-spirit-of-ARC systems have quite different goals. The ARC challenge is intended to measure and build systems that can learn efficiently - that is, solve a novel task with very little new data. See F. Chollet's paper "On the Measure of Intelligence".
Current LLMs do not care about learning efficiency at all - the strategy is actually the complete opposite: they aim to utilize as much data and compute as possible to make the most capable system (at least on tasks that are somehow spanned by the training data). This works well, but it is certainly quite costly, and it may also limit applications to those that do not require a lot of learning at runtime (we still do not know how far we can take in-context learning).
ARC brings in a fresh perspective, but I expect it to take several years for the approaches to really start cross-pollinating.
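The "solve a novel task from very little data" framing above can be made concrete with a toy ARC-style baseline: search a small hand-written space of grid transforms for one consistent with every training pair, then apply it to the test input. The four transforms here are a deliberately minimal stand-in; actual competition entries are far richer.

```python
# Toy ARC-style solver: find a transform consistent with all training pairs.
# Grids are lists of rows of integers, as in the public ARC JSON format.

def rot90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

TRANSFORMS = {
    "identity": lambda g: [list(row) for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "rot90": rot90,
}

def solve(train_pairs):
    """Return (name, fn) of a transform matching every train pair, else None."""
    for name, fn in TRANSFORMS.items():
        if all(fn(p["input"]) == p["output"] for p in train_pairs):
            return name, fn
    return None

task = {
    "train": [{"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]}],
    "test": [{"input": [[5, 6], [7, 8]]}],
}

found = solve(task["train"])
if found is not None:
    name, fn = found
    print(name, fn(task["test"][0]["input"]))  # flip_h [[6, 5], [8, 7]]
```

The design point this sketch captures is that all the "learning" happens at test time, from a handful of demonstration pairs, which is exactly the axis on which current LLM training strategy differs.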
You can easily convert these tasks to token strings. The reason why ARC does not use language as part of its format is that it seeks to minimize the amount of prior knowledge needed to approach the tasks, so as to focus on fluid intelligence as opposed to acquired knowledge.
All ARC tasks are built entirely on top of "Core Knowledge" priors, the kind of elementary knowledge that a small child has already mastered and that is possessed universally by all humans.
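The "token strings" point is easy to see in code: in the public ARC JSON format a task is a dict with "train"/"test" lists of input/output grids, each grid a list of rows of integers 0-9, which serializes directly into a flat prompt. The exact layout below is an illustrative choice, not one used by any particular solver.

```python
# Serialize an ARC-style grid task into a plain token string.

def grid_to_tokens(grid):
    """One digit token per cell, rows separated by newlines."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(task):
    """Lay out every train pair, then the test input, as plain text."""
    parts = []
    for pair in task["train"]:
        parts.append("input:\n" + grid_to_tokens(pair["input"]))
        parts.append("output:\n" + grid_to_tokens(pair["output"]))
    parts.append("test input:\n" + grid_to_tokens(task["test"][0]["input"]))
    return "\n\n".join(parts)

task = {
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test": [{"input": [[2, 2], [0, 2]]}],
}
print(task_to_prompt(task))
```

Whether a model generalizes from such a serialization is exactly what the benchmark probes; the format itself is trivial.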
gkamradt|11 months ago
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.
Chathamization|11 months ago
artninja1988|11 months ago
Centigonal|11 months ago
levocardia|11 months ago
az226|11 months ago
tananaev|11 months ago
gmkhf|11 months ago
vessenes|11 months ago
Nuzzerino|11 months ago
az226|11 months ago
synapsomorphy|11 months ago
The success of o3 directly contradicts us being in an "idea-constrained environment", what makes you believe that?
ustad|11 months ago
doctorpangloss|11 months ago
danpalmer|11 months ago
fchollet|11 months ago
falcor84|11 months ago
colordrops|11 months ago
fastball|11 months ago
iandanforth|11 months ago
fastball|11 months ago
neom|11 months ago
dcre|11 months ago
artificialprint|11 months ago
Congrats on the launch, let's see how long it'll take to get saturated.
fchollet|11 months ago
daemonologist|11 months ago
Nesco|11 months ago
ipunchghosts|11 months ago
Davidzheng|11 months ago
Davidzheng|11 months ago
nneonneo|11 months ago
carra|11 months ago
anshumankmr|11 months ago
momojo|11 months ago
fchollet|11 months ago
FergusArgyll|11 months ago
gkamradt|11 months ago
[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...
[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...
[3] https://www.youtube.com/watch?v=mTX_sAq--zY
jononor|11 months ago
lawrenceyan|11 months ago
Defining the reward function, which is basically what ARC is doing, is 50% of the problem solving process.
ttol|11 months ago
Reasoner passed on first try.
“Correct!”
(See screenshot that shows one rated “hard” -- https://www.linkedin.com/posts/waynechang_tried-reasoner-on-...)
jwpapi|11 months ago
fchollet|11 months ago
timonofathens|11 months ago