top | item 42474032

phil917 | 1 year ago

Direct quote from the ARC-AGI blog:

“SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.

Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination there…

Bjorkbat | 1 year ago

> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”

Something I missed until I scrolled back to the top and reread the page was this

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set

So yeah, the results were specifically from a version of o3 trained on the public training set.

Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There aren't really any stated rules though, just patterns to observe, so if you want to teach the AI how to play the game, you have to train it.

On the other hand, I don't think the o1 models or Claude were trained on the dataset, in which case it isn't a completely fair comparison. If I had to guess, you could probably get 60% out of o1 if you trained it on the public dataset as well.

phil917 | 1 year ago

Lol, I missed that even though it's literally the first sentence of the blog. Good catch.

Yeah, that makes this result a lot less impressive for me.

skepticATX | 1 year ago

Great catch. Super disappointing that AI companies continue to do things like this. It's a great result either way, but predictably the excitement is focused on the jump from o1, which is now in question.

hartator | 1 year ago

> acid test

The CSS acid test? That can be gamed too.

sundarurfriend | 1 year ago

https://en.wikipedia.org/wiki/Acid_test:

> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.

Specifically here, they're using the figurative sense of "definitive test".