(no title)
alphabetting | 10 days ago
For example, the APEX-Agents benchmark for long-time-horizon investment banking, consulting, and legal work:
1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%
kakugawa|10 days ago
girvo|10 days ago
I'll withhold judgement until I've tried to use it.
phatfish|9 days ago
That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.
avereveard|10 days ago
metadat|10 days ago
nl|10 days ago
It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that GPT 5.2 Codex has vs plain GPT 5.2.
306bobby|10 days ago
blueaquilae|10 days ago
HardCodedBias|10 days ago
Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).
If we see on HN that people are willingly switching their coding environment, we'll know "hot damn they cooked"; otherwise this is another whiff by Google.
drivebyhooting|10 days ago