top | item 40712282

mikeknoop | 1 year ago

(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop, using 4o to sample reasoning traces/programs conditioned on the task's training examples and test input. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
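The outer loop itself is just generate-and-test. Here is a minimal sketch under stated assumptions: `propose_program` is a hypothetical stand-in for the LLM sampler (the real pipeline prompts GPT-4o for Python source), and the toy rule space is just "add a constant":

```python
import random

def propose_program(rng):
    # Stand-in for sampling a candidate transformation from an LLM.
    # Hypothetical: the real approach asks GPT-4o to write Python source.
    k = rng.randint(0, 3)
    return lambda grid, k=k: [[cell + k for cell in row] for row in grid]

def solve(train_pairs, test_input, n_samples=8000, seed=0):
    """Generate-and-test outer loop: sample candidate programs and
    return the output of the first one that fits every training pair."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        prog = propose_program(rng)
        if all(prog(x) == y for x, y in train_pairs):
            return prog(test_input)
    return None  # no sampled program explained all the examples

# Toy task: the hidden rule is "add 2 to every cell".
train = [([[1, 1]], [[3, 3]]), ([[0, 5]], [[2, 7]])]
print(solve(train, [[4, 0]]))  # -> [[6, 2]]
```

The selection step (keep only programs consistent with all demonstration pairs) is what makes the large sample count pay off.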

A couple important notes:

1. this result is on the public eval set, not the private set (which the ARC Prize $ is judged on).

2. the current private set SOTA (~35%) solution also scored ~50% on the public set, so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this


refreshingdrink|1 year ago

Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of the intelligence that ARC is trying to measure! Among other reasons, developing against test-set data is bad practice in ML because it hides the true difficulty of the challenge. Even worse, writing up a bunch of tricks that help results on this subset extends the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.

rfoo|1 year ago

... and we know that if we really want to nail it, we'd better just pay someone else to create 1,000,000 harder problems for training (without looking at any in the test set, of course), i.e. make the training set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

refibrillator|1 year ago

Do you have any perspectives to share on Ryan's observation of a potential scaling law for these tasks and his comment that "ARC-AGI will be one benchmark among many that just gets solved by scale"?

mikeknoop|1 year ago

ARC isn't perfect and I hope ARC is not the last AGI benchmark. I've spoken with a few other benchmark creators looking to emulate ARC's novelty in other domains, so I think we'll see more. AGI benchmarks likely need to evolve alongside the tech -- humans have to design these tasks today to ensure novelty, but we should expect that to shift.

One core idea we've been advocating with ARC is that pure LLM scaling (parameters...) is insufficient to achieve AGI. Something new is needed. And OPs approach using a novel outer loop is one cool demonstration of this.

hackerlight|1 year ago

Reminds me of the AlphaCode approach.

Why do you say it's sampling programs from "training data"? With that choice of words, you're rhetorically assuming the conclusion.

If he had only sampled 20 programs instead of 8,000, would we still say the programs came from "training data", or would we say it's genuine OOD generalization? At what point do we attribute the intelligence to the LLM itself instead of the outer loop?

This isn't meant to be facetious. Because clearly, if the N programs sampled is very large, it's easy to get the right solution with little intelligence by relying on luck. But as N gets small the LLM has to be intelligent and capable of OOD generalization, assuming the benchmark is good.

Nimitz14|1 year ago

Ah, that's an important detail about public vs private. Makes it a nice result, but not nearly as impressive as initially stated.

data_maan|1 year ago

It's not that novel. Others have implemented this approach in the context of mathematics.

Already the 2021 paper by Drori (and many papers since) did similar things.

It's a common idea in this space...

lelanthran|1 year ago

Maybe I am missing something, but to me this looks like "Let's brute-force on the training data".

I mean, generating tens of thousands of possible solutions to find one that works does not, to me, signify AGI.

After all, humans solving these problems don't make 10k attempts before getting a solution, do they?

The approach here, due to brute force, can't really scale: if a random solution to a very simple problem has a 1/10k chance of being right, you can't scale this up to non-trivial problems without exponentially increasing the computational power used. Hence, I feel this is brute-force.

killerstorm|1 year ago

10,000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute-force search". Testing a tiny fraction of the possibilities (e.g. 0.000001%) is called a heuristic, and that's what people use too.
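As a back-of-envelope check (taking the commenter's 2^100 as the size of the output space):

```python
from fractions import Fraction

samples = 10_000
search_space = 2 ** 100  # the commenter's estimate of possible outputs

# Exact ratio of candidates tried to candidates that exist.
fraction_explored = Fraction(samples, search_space)
print(float(fraction_explored))  # ~7.9e-27, a vanishing sliver of the space
```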

Please learn a bit of combinatorics.

> After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

No. People have much better "early rejection", and the human brain has massive parallel compute capacity.

It's ridiculous to demand that GPT-4 perform as well as a human. Obviously its vision is much worse, and it doesn't have the 'video' and physics priors people have, so it has to guess more times.

jd115|1 year ago

Reminds me a bit of Genetic Programming as proposed by John Holland, John Koza, etc. Ever since GPT came out, I've been thinking of ways to combine that original idea with LLMs in some way that would accelerate the process with a more "intelligent" selection.
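The classic GP loop is easy to sketch; below is a toy symbolic-regression version (not anyone's actual system), where the `mutate` step marks exactly the place an LLM could propose semantically guided rewrites instead of random edits:

```python
import random

def fitness(expr, cases):
    """Negative total error of a candidate expression over (x, y) cases."""
    try:
        return -sum(abs(eval(expr, {"x": x}) - y) for x, y in cases)
    except Exception:
        return float("-inf")  # malformed candidates lose outright

def mutate(expr, rng):
    # Random edit; an LLM-assisted variant would propose an
    # "intelligent" rewrite of expr here instead.
    return expr + rng.choice([" + 1", " - 1", " * 2"])

def evolve(cases, seed=0, pop_size=30, generations=40):
    rng = random.Random(seed)
    pop = ["x"] * pop_size
    for _ in range(generations):
        pop.sort(key=lambda e: fitness(e, cases), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best
        pop = parents + [mutate(rng.choice(parents), rng) for _ in parents]
    return max(pop, key=lambda e: fitness(e, cases))

# Target rule to rediscover: y = 2x + 1
cases = [(0, 1), (1, 3), (2, 5)]
best = evolve(cases)
```

Because selection always carries the best candidate forward, fitness is monotone over generations; the open question is whether LLM-proposed mutations shrink the number of generations needed.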

lachlan_gray|1 year ago

I’d love to hear more about this!

sriku|1 year ago

Part of the challenge, as I understood it, is learning priors from the training set that can then be applied to an extended private test set. This approach doesn't seem to do any such "learning" on the go. So, supposing it accomplished 85% on the private test set, would it be construed to have won the prize, with "we have AGI" being trumpeted out?

ec109685|1 year ago

There are similarities to the approach in this paper (though they trained a model from scratch): https://arxiv.org/pdf/2309.07062

How well would an LLM trained with a huge number of examples do on this test? Essentially with enough attention, Goodhart's law will take over.

machiaweliczny|1 year ago

Do you accept such solutions as legitimate? It's obviously easier to generate a program than to craft a prompt that will solve it directly.

YeGoblynQueenne|1 year ago

Ah, give it a rest. That's not "frontier AI research", neither is it any kind of reasoning. It's the dumbest of the dumb possible generate-and-test approach that spams a fire hose of Python programs until it hits one that works. And still it gets only 50% on the public eval.

How many thousands of Python programs does a human need to solve a single ARC task? That's what you get with reasoning: you don't need oodles of compute and boodles of sampling.

And I'm sorry to be so mean, but ARC is a farce. It's supposed to be a test for AGI, but its only defense against a big-data approach (what Francois calls "memorisation") is that there are few examples provided. That doesn't make the tasks hard to solve with memorisation; it just makes it hard for a human researcher to find enough examples to solve them with memorisation. Like almost every other AI-IQ test before it, ARC is testing for the wrong thing, with the wrong assumptions. See the Winograd Schema Challenge (but not yet the Bongard problems).

jononor|1 year ago

Do you have any suggestions for a better approach of testing artificial intelligence? I mean, in a way that allows comparing different approaches and being a reasonable metric of progress.