Even before this, Gemini 3 has always felt unbelievably 'general' to me.
It can beat Balatro (ante 8) from a text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:
1. It's an LLM, not something trained to play Balatro specifically
2. Most (probably >99.9%) players can't do that at the first attempt
3. I don't think there are many people who posted their Balatro playthroughs in text form online
I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty, on the deck aimed at new players. Round 24 is ante 8's final round. Note that BalatroBench also gives the LLM a strategy guide, which first-time players don't have, and Gemini isn't even emitting legal moves 100% of the time.
Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).
My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seem stronger at math and science to me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or ChatGPT.
Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.
I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.
They all still suck at writing an actually good essay, article, literary or research review, or other long-form pieces that require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance, and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high standard either. I myself am only successful some percentage of the time.
Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is good or equivalent at tasks that require post-training, occasionally even beating Pro (e.g. in APEX bench from Mercor, which is basically a tool-calling test, simplifying a bit, Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deep Think is sort of parallel test-time compute with some level of distillation and refinement from certain trajectories (guessing, based on my usage and understanding), same as gpt-5.2-pro, and can extract more because of the pretraining datasets.
(I am sort of basing this on papers like Limits of RLVR, and on the pass@k vs pass@1 differences in RL post-training of models; the score mostly shows how "skilled" the base model was, or how strong the priors were. I apologize if this is not super clear, happy to expand on what I am thinking.)
Google has a library of millions of scanned books from their Google Books project, which started in 2004. I think we have reason to believe there are more than a few books in there about effectively playing various traditional card games, and that an LLM trained on that dataset could generalize to understand how to play Balatro from a text description.
Nonetheless I still think it's impressive that we have LLMs that can just do this now.
I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data.
Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table; Claude got it on the first try.
> Most (probably >99.9%) players can't do that at the first attempt
Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like?
I ask because I cannot distinguish all the benchmarks by heart.
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing' - i.e. improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).
I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something.
Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
5-10 years? The human panel cost/task is $17 with 100% score. Deep Think is $13.62 with 84.6%. 20% discount for 15% lower score. Sorry, what am I missing?
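One back-of-the-envelope way to frame that comparison (a sketch, treating the reported score as a success rate, which is a simplification - the figures are the $17 human-panel cost and Deep Think's $13.62 at 84.6% quoted above) is to normalize cost by accuracy, i.e. cost per correctly solved task:

```python
# Cost per *correctly solved* ARC-AGI-2 task, assuming score ~= success rate.
# Figures from the thread: human panel $17.00 at 100%, Deep Think $13.62 at 84.6%.
human_cost, human_score = 17.00, 1.000
dt_cost, dt_score = 13.62, 0.846

human_per_correct = human_cost / human_score   # $17.00 per correct task
dt_per_correct = dt_cost / dt_score            # ~ $16.10 per correct task

print(f"human panel: ${human_per_correct:.2f} per correct task")
print(f"Deep Think:  ${dt_per_correct:.2f} per correct task")
```

On that (rough) basis the two are nearly at parity already, which is the point: the gap per correct answer is about 5%, not a 5-10 year one.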
Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
Firstly, it's a visual puzzle, making it way easier for humans than for models trained on text. Secondly, it's not really that obvious or easy for humans to solve either!
So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart, or even "AGI", is frankly ridiculous. It's a puzzle that means basically nothing, other than that the models can now solve "Arc-AGI".
Yes, but with a significant (orders-of-magnitude) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind.
Am I the only one who can't find Gemini useful except when you want something cheap? I don't get what the whole code red was about, or all that PR. To me there's no reason to use Gemini instead of a GPT and Anthropic combo. I should add that I've tried it as a chatbot, for coding through Copilot, and as part of a multi-model prompt-generation setup.
Gemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all.
You are not the only one. It's to the point where I think these benchmark results must be faked somehow, because they don't match my reality at all.
Maybe it depends on the usage, but in my experience Gemini produces much better results for coding most of the time, especially for optimization work. The results produced by Claude weren't even near those of Gemini. But again, it depends on the task, I think.
I’m surprised that Gemini 3 Pro is so low at 31.1%, though, compared to Opus 4.6 and GPT 5.2. This is a great achievement, but it's only available to Ultra subscribers, unfortunately.
I read somewhere that Google will ultimately always produce the best LLMs, since "good AI" relies on massive amounts of data and Google owns the most data.
I mean, remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced this isn’t data leakage.
[0]: https://balatrobench.com/
There is *tons* of Balatro content on YouTube though, and there is absolutely no doubt that Google is using YouTube content to train their models.
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
https://arcprize.org/arc-agi/2/
$13.62 per task - so do we need another 5-10 years for the price of running this to become reasonable?
But the real question is if they just fit the model to the benchmark.
At current rates, price per equivalent output is dropping at 99.9% over 5 years.
That's basically $0.01 in 5 years.
Does it really need to be that cheap to be worth it?
Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
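Back-of-the-envelope, assuming the quoted 99.9% drop over five years holds (an assumption, not a guarantee), the projection from today's $13.62/task works out like this:

```python
# Project today's $13.62/task under a 99.9% price drop spread over 5 years.
start_price = 13.62                 # Deep Think cost per ARC-AGI-2 task today
retention_5y = 1 - 0.999            # fraction of the price remaining after 5 years
annual_retention = retention_5y ** (1 / 5)   # ~0.25, i.e. roughly 75% cheaper each year

price_in_5y = start_price * retention_5y     # ~ $0.014 per task
print(f"implied annual drop: {1 - annual_retention:.0%}")
print(f"price in 5 years: ${price_in_5y:.3f} per task")
```

So a 99.9% five-year drop implies prices falling about 4x per year, landing just over a cent per task - consistent with the "basically $0.01 in 5 years" figure above.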
https://arcprize.org/leaderboard
I found that anything over $2/task on ARC-AGI-2 ends up being way too much for use in coding agents.
Is that a well-founded assumption?