top | item 44829730

AgentMatrixAI | 6 months ago

I'm not really convinced. The benchmark blunder was really strange and the demos were quite underwhelming, and this appears to have been reflected in a huge correction in the betting markets on who will have the best AI by the end of the year.

What excites me now is that Gemini 3.0 or some answer from Google is coming soon, and that will be the one I actually end up using. It seems like being the last mover in the LLM race is an advantage.

Buttons840|6 months ago

Polymarket bettors are not impressed. Based on the market odds, OpenAI had a 35% chance of having the best model (at year end), but those odds have dropped to 18% today.

(I'm mostly making this comment to document what happened for the history books.)

https://polymarket.com/event/which-company-has-best-ai-model...

vessenes|6 months ago

After a few hours with gpt-5, I'd trade that spread. Not that I think oAI will win end of year. But I think gpt5 is better than it looks on the benchmark side. It is very very good at something we don't have a lot of benchmarks for -- keeping track of where it's at. codex is vassstly better in practice than claude code or gemini cli right now.

On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a MacBook Pro charger in 15th century Florence, with the instructions that I start with only my laptop and can only talk for four hours of chat before the battery dies. 5 was notable in that it thought through a bunch of second-order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split ring commutator + generator in 15th century Florentine Italian(!). I have no way of verifying if the Italian was correct.

Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.

That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current Polymarket favorite) is remarkably bad at those product releases.

Back to betting - I bet there's a moment this year where those numbers change 10% in oAI's favor.

apetresc|6 months ago

How on Earth does that market have Anthropic at 2%, in a dead heat with the likes of Meta? If the market was about yesterday rather than 5 months from now I think Claude would be pretty clearly the front runner. Why does the market so confidently think they’ll drop to dead last in the next little while?

jstummbillig|6 months ago

That bet does not seem to be very illuminating. Winner is likely who happens to release closest to end of year, no?

croemer|6 months ago

Looking at LMArena, which Polymarket uses, I'm not surprised. Based on the little data there is (3k duels), it's possibly worse than Gemini: it lost more to Gemini 2.5 Pro than it won in direct duels. Not sure why the Elo is still higher; possibly GPT-5 did more clearly better against bad models, which I don't care about.
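
For what it's worth, here is a minimal sketch of how that can happen: a model can lose its head-to-head duels against a rival and still end up with the higher rating if it beats weaker models more decisively. Everything in it is an assumption for illustration. The win counts are invented (not LMArena data), the models are just called A, B and W, and the rating is a simple Bradley-Terry fit mapped onto an Elo-like scale, which may not match how the leaderboard actually computes its numbers.

```python
import math

# Hypothetical win counts, invented for illustration (NOT LMArena data).
# wins[(x, y)] = number of duels x won against y.
wins = {
    ("A", "B"): 45, ("B", "A"): 55,   # A loses the direct matchup with B
    ("A", "W"): 95, ("W", "A"): 5,    # A crushes the weak model W
    ("B", "W"): 75, ("W", "B"): 25,   # B beats W less decisively
}
models = ["A", "B", "W"]


def fit_bradley_terry(wins, models, iterations=500):
    """Fit Bradley-Terry strengths with the classic iterative (MM) update."""
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            others = [j for j in models if j != i]
            total_wins = sum(wins.get((i, j), 0) for j in others)
            # Games played against j, weighted by the current strengths.
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in others
            )
            updated[i] = total_wins / denom
        # Strengths are only defined up to scale, so renormalize each round.
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10)."""
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {
        m: anchor + 400.0 * (math.log10(s) - mean_log)
        for m, s in strength.items()
    }


strength = fit_bradley_terry(wins, models)
ratings = to_elo_scale(strength)

for name in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{name}: {ratings[name]:.0f}")
```

With these made-up numbers A comes out rated above B even though B wins 55 of their 100 direct duels, because the fit pools results against all opponents and A's margin over W is much larger than B's.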

roflyear|6 months ago

The Musk effect is pretty crazy. Or is there another explanation for why x can compete with Google?

boringg|6 months ago

You don't actually give Polymarket odds any significant weight as predictors of actual outcomes, do you?

m3kw9|6 months ago

It's not that they are not impressed, it's just that Google came out with steerable video gen.

riku_iki|6 months ago

> Polymarket bettors are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end)

Who will decide the winner to resolve the bets?

joshmlewis|6 months ago

I am convinced. I've been giving it tasks the past couple hours that Opus 4.1 was failing on and it not only did them but cleaned up the mess Opus made. It's the real deal.

diego_sandoval|6 months ago

In that same vein, I had just tried Opus 4.1 yesterday, and it successfully completed tasks that Sonnet 4 and Opus 4 failed at.

alfalfasprout|6 months ago

Interesting, I've had the complete opposite experience. Opus 4.1 feels like a generational improvement compared to GPT-5.

energy123|6 months ago

And it's almost 10x cheaper via flex, and in #1 position on lmarena. It's not even close.

boomfunky|6 months ago

The real last mover is Apple, because boy are they not moving.

manmal|6 months ago

As an iOS dev, I really hope they acquire Anthropic before it’s too expensive.

echelon|6 months ago

I really don't want the already trillion dollar mega monopoly to own the world.

blitzar|6 months ago

I would rather the already trillion dollar mega monopoly own the world than "Open"AI.

retinaros|6 months ago

The demos were awful. It felt like watching sloppy vibe-coded CSS UIs.

m3kw9|6 months ago

GPT-5 high reasoning is a big step up from o3.