Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every ARC-AGI-2 task (100% of them), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:
* The two main evaluation sets (semi-private and private) have increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or fewer
* Non-training task sets are now difficulty-calibrated
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and 40+ research papers published.
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.
I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.
ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.
What are you doing to prevent the test set being leaked? Will you still be offering API access to the semi private test set to the big model providers who presumably train on their API?
I'm really pleased to see this! The original ARC-AGI-1 paper still informs how I think about "what is intelligence" today. I was thrilled to see AI models make real progress on that test precisely when we had the next big idea (reasoning). Here's to hoping round 2 falls with a similarly big breakthrough!
I think a lot of people got discouraged, seeing how OpenAI solved ARC-AGI-1 by what seems like brute forcing and throwing money at it. Do you believe ARC was solved in the "spirit" of the challenge? Also, all the open-sourced solutions seem super specific to solving ARC. Is this really leading us to human-level AI at open-ended tasks?
Just want to say I really love these new problems - feels like some general intelligence went into conceiving of and creating these puzzles: we just did a few over dinner as a family.
You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!
Why wasn’t the ICOM framework (D. Kelley) allowed to make a scoring submission after they claimed to have beaten the scores? Are you concerned that may appear to contradict your mission statement and alienate the AGI community?
Which puzzles had the lowest solve rate? I did the first 10 and they all felt easy (mentally solving the easier ones in 10-20 seconds and the harder ones in 30-60 seconds). I’d like to try the most difficult ones.
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization
This is self-referential, the benchmark pinpointed the time when AI went from memorization to problem solving, because the benchmark requires problem solving to complete. How do we know it requires problem solving skills? Because memorization-only LLMs can't do it but humans can.
I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art, however I don't think they identified anything particular with o3, at least they don't seem to have proven a step change.
The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.
ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.
I spent half an hour playing with these now at https://arcprize.org/play and it's fun, but I must say that they are not "easy". So far I eventually solved all of the ones I've gone through, but several took me significantly more than the 2 tries allotted.
I wonder if this can be shown to be a valid IQ test, and if so, what IQ would a person need to solve e.g. 90% of them in 1 or 2 tries.
Yes, I looked at these and thought about what percentage of humans could even solve them. It seems that, unless average humans are not considered generally intelligent, a test for general intelligence should be passable by most humans.
I did the first 10 from ARC-AGI-2 (hard) set. 9 were in one try, 1 was in two.
To be fair I've spent a lot of time thinking about cellular automata and Conway's game of life, which definitely seems to be influencing the design of these puzzles.
I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself move blocks around. Much of the time I'm treating these as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning which might unlock solutions here.
I don't know if this was a design goal, but I just did the first 10 Arc-AGI-2 public eval (hard) puzzles, and found them much more enjoyable (as a human) than any of the Arc-AGI-1 puzzles. That said the grid/puzzle editor is still a little clunky – would be nice to be able to drag-to-paint and have an adjustable brush size.
Maybe this is a really stupid question but I've been curious... are LLMs based on... "Neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?
It’s kind of a silly question in that the neural architecture of neural nets is really only loosely inspired by neurology, and that basic vague neurology is shared by neurotypical people and neurodivergent people and animals and even bugs.
The "select" tool gives some help with tasks that require counting or copying. You can select areas of the input, which will show their dimensions, and copy-paste them into the output (ctrl+c/ctrl+v).
At the very first glance, it's like ARC 1 with some structures serving as contextual data, and more complicated symmetries / topological transformations.
Now, I wonder what surprises are to be found in the full dataset.
The focus on solving discrete tasks cost-efficiently might actually lead us toward deep learning systems that can be used reliably in production, rather than just giving a wow effect or needing constant supervision.
The computer vision community needs a dataset like this for evaluation: train in one domain and test on another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issues with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
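As a toy illustration of that train-in-one-domain, test-in-another protocol: fit a classifier on source-domain data, then compare its accuracy on an in-domain test set against a synthetically shifted copy. Everything below (the nearest-centroid classifier, the Gaussian data, the additive shift) is an illustrative stand-in for ImageNet-R/C style evaluation, not any actual benchmark.

```python
import random

random.seed(0)

def sample(mean, n, dim=8):
    """Draw n points from an isotropic Gaussian 'domain' centered at mean."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(centroids, point):
    """Nearest-centroid classification."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))

# "Source domain": two well-separated classes; fit a nearest-centroid model.
centroids = [centroid(sample(0.0, 100)), centroid(sample(3.0, 100))]

# In-domain test set, plus a crudely "corrupted" copy (additive noise shift).
test = [(p, 0) for p in sample(0.0, 50)] + [(p, 1) for p in sample(3.0, 50)]
shifted = [([x + random.gauss(2.0, 2.0) for x in p], y) for p, y in test]

def accuracy(points):
    return sum(predict(centroids, p) == y for p, y in points) / len(points)

acc_clean, acc_shift = accuracy(test), accuracy(shifted)
print(acc_clean, acc_shift)  # accuracy drops under the domain shift
```

The point is only the evaluation protocol: the gap between `acc_clean` and `acc_shift` is the out-of-domain degradation described above.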
OpenAI will probably be at >60% within three months, if not immediately, with this $1000/question level of compute (which is the way, to be honest: we should throw compute at problems whenever possible; that's the main advantage of silicon intelligence).
Their own admission that intelligence is a meaningless metric without a bound on compute is one of the main reasons AI will overpower human intelligence soon. Simple scaling is very effective.
Have you had any neurologists utilize your dataset? My own reaction after solving a few of the puzzles was "Why is this so intuitive for me, but not for an LLM?".
Our human ability to abstract things is underrated.
There have been some human studies on ARC 1 previously, I expect there will be more in the future. See this paper from 2021, which was one of the earliest works in this direction: https://arxiv.org/abs/2103.05823
These benchmarks, and specifically the constraints placed on solving them (compute etc) seem to me to incentivize the opposite of "general intelligence"
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer based systems constantly gaining capabilities. On the other hand have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
Good question! This was one of the main motivations of our "Paper Prize" track. We wanted to reward conceptual progress over leaderboard chasing. In fact, when we increased the prizes mid-year, we allocated more money to the paper track than to the top score.
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.
Not the team, just following ARC on-and-off as an ML engineer. I think it will take a few years (at least) to see the impact of ARC, especially of the more conceptual works. Those are closer to basic research than applied work; it will take time before the lessons are transferred to applications (which also requires considerable R&D).
But more importantly, current LLM-based systems and the in-the-spirit-of-ARC systems have quite different goals. The ARC challenge is intended to measure and build systems that can learn efficiently - that is, solve a novel task with very little new data. See F. Chollet's paper "On the Measure of Intelligence".
Current LLMs do not care about learning efficiency at all - the strategy is actually the complete opposite: they aim to utilize as much data and compute as possible to make the most capable system (at least on tasks that are somehow spanned by the training data). This works well, but it is certainly quite costly, and it may also limit applications to those that do not require a lot of learning at runtime (we still do not know how far we can take in-context learning).
ARC brings in a fresh perspective, but I expect it to take several years for the approaches to really start cross-pollinating.
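The "solve a novel task from very little data" framing above can be made concrete with a toy ARC-style baseline: search a small hand-written space of grid transforms for one consistent with every training pair, then apply it to the test input. The four transforms here are a deliberately minimal stand-in; actual competition entries are far richer.

```python
# Toy ARC-style solver: find a transform consistent with all training pairs.
# Grids are lists of rows of integers, as in the public ARC JSON format.

def rot90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

TRANSFORMS = {
    "identity": lambda g: [list(row) for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "rot90": rot90,
}

def solve(train_pairs):
    """Return (name, fn) of a transform matching every train pair, else None."""
    for name, fn in TRANSFORMS.items():
        if all(fn(p["input"]) == p["output"] for p in train_pairs):
            return name, fn
    return None

task = {
    "train": [{"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]}],
    "test": [{"input": [[5, 6], [7, 8]]}],
}

found = solve(task["train"])
if found is not None:
    name, fn = found
    print(name, fn(task["test"][0]["input"]))  # flip_h [[6, 5], [8, 7]]
```

The design point this sketch captures is that all the "learning" happens at test time, from a handful of demonstration pairs, which is exactly the axis on which current LLM training strategy differs.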
You can easily convert these tasks to token strings. The reason why ARC does not use language as part of its format is that it seeks to minimize the amount of prior knowledge needed to approach the tasks, so as to focus on fluid intelligence as opposed to acquired knowledge.
All ARC tasks are built entirely on top of "Core Knowledge" priors, the kind of elementary knowledge that a small child has already mastered and that is possessed universally by all humans.
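The "token strings" point is easy to see in code: in the public ARC JSON format a task is a dict with "train"/"test" lists of input/output grids, each grid a list of rows of integers 0-9, which serializes directly into a flat prompt. The exact layout below is an illustrative choice, not one used by any particular solver.

```python
# Serialize an ARC-style grid task into a plain token string.

def grid_to_tokens(grid):
    """One digit token per cell, rows separated by newlines."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(task):
    """Lay out every train pair, then the test input, as plain text."""
    parts = []
    for pair in task["train"]:
        parts.append("input:\n" + grid_to_tokens(pair["input"]))
        parts.append("output:\n" + grid_to_tokens(pair["output"]))
    parts.append("test input:\n" + grid_to_tokens(task["test"][0]["input"]))
    return "\n\n".join(parts)

task = {
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test": [{"input": [[2, 2], [0, 2]]}],
}
print(task_to_prompt(task))
```

Whether a model generalizes from such a serialization is exactly what the benchmark probes; the format itself is trivial.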
gkamradt|11 months ago
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.
Chathamization|11 months ago
artninja1988|11 months ago
Centigonal|11 months ago
levocardia|11 months ago
az226|11 months ago
tananaev|11 months ago
gmkhf|11 months ago
vessenes|11 months ago
Nuzzerino|11 months ago
az226|11 months ago
synapsomorphy|11 months ago
The success of o3 directly contradicts us being in an "idea-constrained environment", what makes you believe that?
ustad|11 months ago
doctorpangloss|11 months ago
danpalmer|11 months ago
fchollet|11 months ago
falcor84|11 months ago
colordrops|11 months ago
fastball|11 months ago
iandanforth|11 months ago
fastball|11 months ago
neom|11 months ago
dcre|11 months ago
artificialprint|11 months ago
Congrats on the launch, let's see how long it'll take to get saturated.
fchollet|11 months ago
daemonologist|11 months ago
Nesco|11 months ago
ipunchghosts|11 months ago
Davidzheng|11 months ago
Davidzheng|11 months ago
nneonneo|11 months ago
carra|11 months ago
anshumankmr|11 months ago
momojo|11 months ago
fchollet|11 months ago
FergusArgyll|11 months ago
gkamradt|11 months ago
[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...
[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...
[3] https://www.youtube.com/watch?v=mTX_sAq--zY
jononor|11 months ago
lawrenceyan|11 months ago
Defining the reward function, which is basically what ARC is doing, is 50% of the problem solving process.
ttol|11 months ago
Reasoner passed on first try.
“Correct!”
(See screenshot that shows one rated “hard” -- https://www.linkedin.com/posts/waynechang_tried-reasoner-on-...)
jwpapi|11 months ago
fchollet|11 months ago
timonofathens|11 months ago