bluecoconut | 1 year ago
~$3400 per task to meet human performance on this benchmark is a lot. Also, the chart labels the bullets "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance, in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human costs $0.03-$1.67 per puzzle at $20/hr, and their document cites an average Mechanical Turker at ~$2 per task.)
Going the other direction: I interpret this result as saying human-level reasoning now costs roughly $41k/hr to $2.5M/hr with current compute.
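The arithmetic behind those figures is quick to sketch (the $3400/task cost and 5s-5min solve times are the ones quoted above; the $20/hr wage is the commenter's assumption):

```python
# Back-of-envelope check: human cost per ARC-AGI puzzle at an assumed
# $20/hr wage, and the implied "hourly rate" of a $3400/task model.
HUMAN_RATE = 20.0        # USD/hour (assumed wage)
FAST_HOURS = 5 / 3600    # fast human: 5 seconds per puzzle
SLOW_HOURS = 5 / 60      # slow human: 5 minutes per puzzle
O3_COST = 3400.0         # USD per task, as quoted in this thread

human_cheap = HUMAN_RATE * FAST_HOURS   # ~$0.03 per puzzle
human_dear = HUMAN_RATE * SLOW_HOURS    # ~$1.67 per puzzle

# Flip it around: $3400 buys what a human does in 5s-5min, so the
# implied rate for "human-level reasoning" is:
rate_low = O3_COST / SLOW_HOURS    # ~$40.8k per hour
rate_high = O3_COST / FAST_HOURS   # ~$2.45M per hour

print(f"human: ${human_cheap:.2f}-${human_dear:.2f} per puzzle")
print(f"implied model rate: ${rate_low:,.0f}-${rate_high:,.0f} per hour")
```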
Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!
bluecoconut|1 year ago
So, considering that the $3400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).
Also, some other back of envelope calculations:
The gap in cost is roughly 10^3 between O3 High and avg. Mechanical Turkers (humans). Pure GPU cost improvement (~a doubling in cost-efficiency every 2-2.5 years) puts us at 20-25 years.
The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
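The 20-25 year figure is just the number of halvings needed to close a 10^3 gap, times the doubling period (assuming the hardware trend holds and nothing else improves):

```python
import math

gap = 1e3                      # ~10^3 cost gap between O3 High and humans
doublings = math.log2(gap)     # ~9.97 halvings needed to close it
years_fast = doublings * 2.0   # ~20 years at one doubling per 2 years
years_slow = doublings * 2.5   # ~25 years at one doubling per 2.5 years
print(f"{doublings:.1f} doublings -> {years_fast:.0f}-{years_slow:.0f} years")
```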
I also personally think that we need to adjust our efficiency priors and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show gaps much larger, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.
miki123211|1 year ago
You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.
zamadatix|1 year ago
I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).
bloppe|1 year ago
So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.
xbmcuser|1 year ago
iandanforth|1 year ago
Then let's say that OpenAI brute-forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low-hanging fruit and another 2x reduction in compute. ($850)
Then let's say that OpenAI was pushing really, really hard for the numbers, was willing to burn cash, and so didn't bother with serious thought about hardware-aware distributed inference. This could be more than a 2x decrease in cost (we've seen better attention mechanisms deliver 10x cost reductions), but let's go with 2x for now. ($425)
So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?
Then if "all" we get is hardware improvements, we're down to, what, 10-14 years?
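The chain of hypothesized savings compounds simply: three successive 2x reductions from the $3400/task starting point (each 2x is a guess, not a measurement) land at the $425 and 8x figures above:

```python
cost = 3400.0          # O3 High cost per task, as quoted upthread
for _ in range(3):     # three hypothesized 2x cost reductions
    cost /= 2
print(cost)            # 425.0 per task, an 8x total reduction
```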
cchance|1 year ago
I'd hope we see more internal optimizations and improvements to the models. The idea behind the big breakthrough, "don't spit out the first thought that pops into your head," seems obvious to everyone outside the field, but guess what: it turned out to be a big improvement when the devs decided to add it.
acchow|1 year ago
The trend for power consumption of compute (megaflops per watt) has generally tracked Koomey's law, with a doubling every 1.57 years.
Then you also have model performance improving with compression. For example, Llama 3.1's 8B outperforms the original Llama 65B.
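If power efficiency were the binding constraint, the same ~10^3 gap discussed upthread would close faster at the Koomey's-law pace quoted here (this assumes efficiency gains translate 1:1 into cost reductions, which is a simplification):

```python
import math

gap = 1e3                   # ~10^3 cost gap quoted upthread
doublings = math.log2(gap)  # ~9.97 efficiency doublings needed
years = doublings * 1.57    # ~15.6 years at one doubling per 1.57 years
print(f"{years:.1f} years at Koomey's-law pace")
```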
agumonkey|1 year ago
m3kw9|1 year ago
bjornsing|1 year ago
If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.
noFaceDiscoG668|1 year ago
But sorry, blablabla, this shit is getting embarrassing.
> The question is now, can we close this "to human" gap
You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.
spencerchubb|1 year ago
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
lolinder|1 year ago
I'm not so sure. What they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by throwing tons of compute into a breadth-first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree-of-thought models?
All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.
echelon|1 year ago
matusp|1 year ago
Simple turn-based games such as chess turned out to be too far removed from anything practical, and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern-matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that really come up? I can see some time savings in using AI vs. StackOverflow for some programming challenges, but is there more to this?
edanm|1 year ago
In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).
lugu|1 year ago
spamlettuce|1 year ago
daxfohl|1 year ago
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
freehorse|1 year ago
On a very simple toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.
SamPatt|1 year ago
The other benchmarks are a good indication though.
riku_iki|1 year ago
The report says it is $17 per task, and $6k for the whole dataset of 400 tasks.
binarymax|1 year ago
The low-compute run was $17 per task. Speculating 172 × $17 for the high-compute run gives $2,924 per task, so I am also confused by the $3400 number.
bluecoconut|1 year ago
xrendan|1 year ago
The number for the high-compute one is ~172x the first one according to the article so ~=$2900
jhrmnn|1 year ago
unknown|1 year ago
[deleted]
cle|1 year ago
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...
Macuyiko|1 year ago
You compare this to "a human" but also admit there is high variance.
And I would say there are a lot of humans being paid ~$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.
So what if we think in terms of output rather than time?
madduci|1 year ago
ein0p|1 year ago
chefandy|1 year ago
unknown|1 year ago
[deleted]