malisper | 10 months ago
Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:
> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.
> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.
That makes it sound like ARC AGI were the ones running the original test with o3.

What they say they haven't been able to reproduce is o3-preview's performance with the production versions of o3. They attribute this to the production versions being given less compute than the version they ran in the original test.