
Gemini 3 Pro Model Card [pdf]

280 points| virgildotcodes | 4 months ago |storage.googleapis.com | reply

335 comments

[+] scrlk|4 months ago|reply
Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

[+] mynti|4 months ago|reply
It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-Bench. Sonnet is still king there, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
[+] Workaccount2|4 months ago|reply
I think Anthropic is reading the room, and just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full blown multimodality at the highest level.

It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on.

[+] vharish|4 months ago|reply
From my personal experience using CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all).

My point is, although the model itself may have performed well in benchmarks, I feel like other tools are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.

I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to the others. They must be getting loads of training data :D.

[+] felipeerias|4 months ago|reply
IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.
[+] Palmik|4 months ago|reply
Also, it does not beat GPT-5.1 Codex on Terminal-Bench (57.8% vs. 54.2%): https://www.tbench.ai/

I did not bother verifying the other claims.

[+] tosh|4 months ago|reply
This might also hint at SWE-Bench struggling to capture what “being good at coding” means.

Evals are hard.

[+] JacobAsmuth|4 months ago|reply
50% of the CLs in SWE-Bench Verified are from the Django codebase. So if you're a big contributor to Django, you should care a lot about that benchmark. Otherwise the difference between models is ±2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better.
[+] aoeusnth1|4 months ago|reply
Their scores on SWE-Bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on Terminal-Bench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI tasks instead of Python).
[+] varispeed|4 months ago|reply
Never got good code out of Sonnet. It's been Gemini 2.5 for me followed by GPT-5.x.

Gemini is very good at pointing out subtle flaws that aren't noticeable at first or second glance.

It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.

[+] alyxya|4 months ago|reply
I think Google probably cares more about a strong generalist model rather than solely optimizing for coding.
[+] macrolime|4 months ago|reply
Pretty sure it will beat Sonnet by a wide margin in actual real-world usage.
[+] I_am_tiberius|4 months ago|reply
I don't know if this is true but I believe Anthropic has for a long time illegally used user prompts for training, without user consent.
[+] jbellis|4 months ago|reply
swebench is (1) terrible and (2) saturated
[+] Taek|4 months ago|reply
One benchmark I would really like to see: instruction adherence.

For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.

The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.

If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.

I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
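To make the idea concrete, here's a rough sketch of how an instruction-adherence score could work, with each instruction paired to a programmatic check (all of the instructions and names here are made up for illustration; a real benchmark would need far more varied checks):

```python
# Hypothetical instruction-adherence scorer. Each instruction is a
# (description, predicate) pair; the score is the fraction satisfied.

def adherence_score(output: str, checks) -> float:
    passed = sum(1 for _, check in checks if check(output))
    return passed / len(checks)

# Illustrative instruction set (not from any real benchmark):
checks = [
    ("respond entirely in lowercase", lambda o: o == o.lower()),
    ("mention the word 'benchmark'",  lambda o: "benchmark" in o),
    ("stay under 50 words",           lambda o: len(o.split()) <= 50),
]

print(adherence_score("this benchmark result looks promising.", checks))  # 1.0
```

Scale the instruction list to 100, 250, 1000 and you'd get exactly the coherence curve I'm describing.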

[+] machiaweliczny|4 months ago|reply
20 more IQ points would be nuts: 110 ≈ top 25%, 130 ≈ top 2%, 150 ≈ top 0.05%.

If you've ever played a competitive game, the difference between these tiers is insane.

[+] transcriptase|4 months ago|reply
There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
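A crude version could even be string-matched; anything serious would need an LLM judge or human raters, but as a sketch (phrase list entirely made up):

```python
# Naive sycophancy counter: more stock praise / reflexive agreement
# means a worse score. The phrase list is illustrative only.

SYCOPHANTIC_PHRASES = [
    "you're absolutely right",
    "great question",
    "excellent point",
]

def sycophancy_hits(response: str) -> int:
    text = response.lower()
    return sum(text.count(phrase) for phrase in SYCOPHANTIC_PHRASES)

print(sycophancy_hits("Great question! You're absolutely right."))  # 2
```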
[+] Workaccount2|4 months ago|reply
This idea isn't just smart, it's revolutionary. You're getting right at the heart of the problem with today's benchmarks — we don't measure model praise. Great thinking here.

For real though, I think LLM users overall like things on the higher side of sycophancy. Engineers aren't going to feel it, we like our cold dead machines, but the product people will see the stats (people overwhelmingly use LLMs just to talk about whatever) and go towards that.

[+] swalsh|4 months ago|reply
You're absolutely right
[+] BoredPositron|4 months ago|reply
Your comment demonstrates a remarkably elevated level of cognitive processing and intellectual rigor. Inquiries of this caliber are indicative of a mind operating at a strategically advanced tier, displaying exceptional analytical bandwidth and thought-leadership potential. Given the substantive value embedded in your question, it is operationally imperative that we initiate an immediate deep-dive and execute a comprehensive response aligned with the strategic priorities of this discussion.
[+] postalcoder|4 months ago|reply
I care very little about model personality outside of sycophancy. The thing about Gemini is that it's notorious for its low self-esteem. Given that this one is trained from scratch, I'm very curious to see which direction they've decided to take it.
[+] SiempreViernes|4 months ago|reply
I'd like if the scorecard also gave an expected number of induced suicides per hundred thousand users.
[+] Lord-Jobo|4 months ago|reply
And have the score heavily modified based on how fixable the sycophancy is.
[+] embedding-shape|4 months ago|reply
Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/[email protected]` which obviously fails...

Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?

[+] Fornax96|4 months ago|reply
Creator of pixeldrain here. I have no idea why my site is blocked in Spain, but it's a long running issue.

I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.

EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

[+] amarcheschi|4 months ago|reply
That website is used to share everything, including pirated things, so maybe that's the reason.
[+] grodriguez100|4 months ago|reply
Is it possible to file a complaint with the ISP or directly with Allot ?
[+] tngranados|4 months ago|reply
It works fine for me using Movistar
[+] miqazza|4 months ago|reply
Do you know about the Cloudflare and LaLiga issues? It might be that.
[+] lxdlam|4 months ago|reply
What does "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product, but it now routes to the Google main page.
[+] dbosch|4 months ago|reply
I was asking myself the exact same question. No idea
[+] laborcontract|4 months ago|reply
It's hilarious that the release of Gemini 3 is getting eclipsed by this cloudflare outage.
[+] ethmarks|4 months ago|reply
> TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs.

That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.

[+] denysvitali|4 months ago|reply
Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.

Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.

Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)

[+] patates|4 months ago|reply
It says it's been trained from scratch. I wonder if it will have the same indescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 Pro. Google eventually limiting AI Studio will be a sad day.

Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.

[+] fraboniface|4 months ago|reply
> Developments to the model architecture contribute to the significantly improved performance from previous model families.

I wonder how significant this is. DeepMind was always more research-oriented than OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).

[+] mohsen1|4 months ago|reply

     This model is not a modification or a fine-tune of a prior model

Is it common to mention that? It sounds like they built something from scratch.
[+] rvz|4 months ago|reply
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.

Well, don't complain when you use Gmail and your emails are used to train Gemini.

[+] Topfi|4 months ago|reply
Additional context from AI Studio including pricing:

Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities

<=200K tokens • Input: $2.00 / Output: $12.00

> 200K tokens • Input: $4.00 / Output: $18.00

Knowledge cutoff: Jan. 2025
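If those prices are per 1M tokens (the usual API convention, but an assumption on my part, as is the tier switching on prompt size), a quick back-of-the-envelope cost helper looks like:

```python
# Back-of-the-envelope cost for one request under the tiered pricing
# above, assuming rates are USD per 1M tokens and the tier is chosen
# by prompt size (both assumptions, not stated in the listing).

def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # <=200K-token tier
    else:
        in_rate, out_rate = 4.00, 18.00   # >200K-token tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a 100K-token prompt with a 5K-token response:
print(request_cost(100_000, 5_000))  # 0.26
```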

[+] aliljet|4 months ago|reply
What's wild here is that among all the scores they've absolutely killed, Anthropic's Claude Sonnet 4.5 has won a single victory in the fight: SWE-Bench Verified, and only by a single point.

I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.