mikeknoop | 1 year ago
Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:
> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)
Roughly, he's implemented an outer loop and is using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
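For readers who want the shape of that outer loop, here's a toy sketch (not Ryan's code: a tiny fixed pool of hand-written grid transforms stands in for the ~8,000 GPT-4o samples, and `solve`, `CANDIDATES`, and the example grids are all made up for illustration):

```python
# Hypothetical sketch of the sample-and-filter outer loop. A real run would
# ask GPT-4o for thousands of candidate Python programs; here a small fixed
# pool of toy grid transforms stands in for the model's samples.

def flip_h(g):  # mirror each row
    return [row[::-1] for row in g]

def flip_v(g):  # mirror the row order
    return g[::-1]

def transpose(g):
    return [list(r) for r in zip(*g)]

CANDIDATES = [lambda g: g, flip_h, flip_v, transpose]  # stand-in for LLM samples

def solve(train_pairs, test_input):
    """Return the output of the first candidate that is correct on every
    training example, or None if no candidate survives the filter."""
    for fn in CANDIDATES:
        if all(fn(inp) == out for inp, out in train_pairs):
            return fn(test_input)
    return None

train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # flip_h fits this example
print(solve(train, [[5, 6]]))  # -> [[6, 5]]
```

The key move is that the training examples act as a cheap verifier: sampling can be wildly unreliable per-program as long as checking a candidate against the examples is exact.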
A couple important notes:
1. this result is on the public eval set, not the private eval set the ARC Prize ($) is scored on.
2. the current private set SOTA (~35%) solution also performed ~50% on the public set. So this new result might be SOTA, but it hasn't been validated or scrutinized yet.
All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard
EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this
refreshingdrink|1 year ago
> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set
and
> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing
It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty of this challenge. Even worse, writing about a bunch of tricks that help results on this subset extends the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.
rfoo|1 year ago
Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?
Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.
mikeknoop|1 year ago
One core idea we've been advocating with ARC is that pure LLM scaling (parameters...) is insufficient to achieve AGI. Something new is needed. And OPs approach using a novel outer loop is one cool demonstration of this.
hackerlight|1 year ago
Why do you say it's sampling programs from "training data"? With that choice of words, you're rhetorically assuming the conclusion.
If he had only sampled 20 programs, instead of 8000, would we still say the programs came from "training data", or would we say it's genuine OOD generalization? At what point do we attribute the intelligence to the LLM itself instead of the outer loop?
This isn't meant to be facetious. Because clearly, if the N programs sampled is very large, it's easy to get the right solution with little intelligence by relying on luck. But as N gets small the LLM has to be intelligent and capable of OOD generalization, assuming the benchmark is good.
Nimitz14|1 year ago
data_maan|1 year ago
The 2021 Drori paper (and many papers since) already did similar things.
It's a common idea in this space...
lelanthran|1 year ago
I mean, generating tens of thousands of possible solutions to find one that works does not, to me, signify AGI.
After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?
The approach here can't really scale: if a random candidate solution to a very simple problem has a 1/10k chance of being right, you can't scale this up to non-trivial problems without exponentially increasing the computational power used. Hence, I feel this is brute force.
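The arithmetic behind this objection can be made concrete (assuming, as a simplification, that each sampled program is independently correct with probability p; `p_any_correct` is a name made up here):

```python
# If each sampled program is correct with independent probability p, the
# chance that at least one of N samples works is 1 - (1 - p)**N. Holding
# the success rate fixed as p shrinks therefore requires N to grow ~ 1/p.

def p_any_correct(p, n):
    return 1 - (1 - p) ** n

print(p_any_correct(1e-4, 10_000))  # ~0.63 at p = 1/10k with a 10k budget
print(p_any_correct(1e-6, 10_000))  # ~0.01: a 100x harder task, same budget
```

So under this independence assumption the sampling budget has to track the inverse of the per-sample hit rate, which is the scaling concern stated above.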
killerstorm|1 year ago
Please learn a bit of combinatorics.
> After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?
No. People have much better "early rejection", and the human brain has massive parallel compute capacity.
It's ridiculous to demand that GPT-4 perform as well as a human. Obviously its vision is much worse, and it doesn't have the 'video' and physics priors people have, so it has to guess more times.
ec109685|1 year ago
How well would an LLM trained on a huge number of examples do on this test? Essentially, with enough attention, Goodhart's law will take over.
YeGoblynQueenne|1 year ago
How many thousands of Python programs does a human need to solve a single ARC task? That's what you get with reasoning: you don't need oodles of compute and boodles of sampling.
And I'm sorry to be so mean, but ARC is a farce. It's supposed to be a test for AGI, but its only defense against a big-data approach (what Francois calls "memorisation") is that there are few examples provided. That doesn't make the tasks hard to solve with memorisation; it just makes it hard for a human researcher to find enough examples to solve them with memorisation. Like almost every other AI-IQ test before it, ARC is testing for the wrong thing, with the wrong assumptions. See the Winograd Schema Challenge (but not yet the Bongard problems).