(no title)
obblekk|1 year ago
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
phillipcarter|1 year ago
But still, this is incredibly impressive.
azeirah|1 year ago
The only effect smarter models will have is that intelligent people will use less of their brain to do their work. As has always been the case, the medium is the message. And climate change is one of the worst and most difficult problems of our time.
If this gets software people to quit en masse and start working in energy, biology, ecology, and preservation? Then it has succeeded.
cchance|1 year ago
The average human is a lot dumber than people on Hacker News and Reddit seem to realize. The people on MTurk are likely smarter than the average person.
FrustratedMonky|1 year ago
There are blind spots, but that doesn't take away from "general".
hammock|1 year ago
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
tim333|1 year ago
From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659
Though, also from elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for less.
notfish|1 year ago
How does a giant pile of linear algebra not meet that definition?
benlivengood|1 year ago
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.
drdeca|1 year ago
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
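The "train with [blah], save the result, then run inputs through the NN in such-and-such way" procedure can be sketched as ordinary code. This is an illustrative toy (a single linear neuron and made-up data), not any real architecture:

```python
# Minimal sketch of the "train, save, infer" procedure: fit y = w*x + b
# by gradient descent on squared error, keep the parameters, then answer
# queries deterministically. Data and hyperparameters are illustrative.

def train(samples, lr=0.1, epochs=200):
    """Training method: stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b  # the "saved result": a fixed set of parameters

def infer(params, x):
    """Running an input through the trained network in a fixed way."""
    w, b = params
    return w * x + b

# Both steps together form a single well-defined algorithm.
params = train([(0, 1), (1, 3), (2, 5)])  # learns roughly y = 2x + 1
print(infer(params, 3))                   # close to 7
```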
necovek|1 year ago
When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.
We also call that a NN (the joy of natural language).
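The point above can be made concrete: once training fixes the parameters, inference is just a deterministic function of those parameters. A toy sketch with hand-picked weights (not a real trained model):

```python
# One hidden tanh layer feeding a linear output. The parameter set plus
# the fixed forward pass *is* the (in practice, very large) algorithm.
# All weights here are hand-picked for illustration.
import math

def forward(weights, biases, x):
    """Deterministic forward pass through fixed parameters."""
    hidden = [math.tanh(w * x + b)
              for w, b in zip(weights["hid"], biases["hid"])]
    return sum(w * h for w, h in zip(weights["out"], hidden)) + biases["out"]

params = {"hid": [1.0, -0.5], "out": [2.0, 1.0]}
biases = {"hid": [0.0, 0.5], "out": 0.1}

# Same parameters + same input -> same output, every time.
print(forward(params, biases, 2.0))
```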
creer|1 year ago
When someone is "disinterested enough" to publish, though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them for one or two years, discard the many losers, and publish the one or two top winners. I.e., first you should be suspicious of why it's being published, then of how selected that result is.
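The selection effect described above is easy to simulate: run many purely random "funds", publish only the best, and the published track record looks impressive despite zero skill. All numbers here are illustrative:

```python
# Survivorship-bias sketch: 200 funds with purely random annual returns.
# The best-of-200 result is what gets published; the median is reality.
import random

random.seed(42)

def random_fund(years=2):
    """Annual returns drawn at random: no skill whatsoever."""
    return [random.gauss(0.0, 0.15) for _ in range(years)]

def cumulative(returns):
    total = 1.0
    for r in returns:
        total *= 1 + r
    return total - 1

funds = [cumulative(random_fund()) for _ in range(200)]
best = max(funds)
typical = sorted(funds)[len(funds) // 2]

print(f"best of 200 random funds: {best:+.1%}")
print(f"median random fund:       {typical:+.1%}")
```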
hypoxia|1 year ago
- 64.2% for humans vs. 82.8%+ for o3.
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
[2] https://arcprize.org/blog/oai-o3-pub-breakthrough
[3] https://arxiv.org/abs/2409.01374
usaar333|1 year ago
Their post has STEM grads at nearly 100%.
dmead|1 year ago
It really calls into question two things.
1. You don't know what you're talking about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.