top | item 44520695


dbagr | 7 months ago

Either they overtook other LLMs by simply using more compute (which is reasonable to think, as they have a lot of GPUs), or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with better techniques than those used to train other LLMs, and Elon has a history of making deceptive announcements.


z7 | 7 months ago

How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555

saberience | 7 months ago

They could still have trained the model in a way that targets benchmarks, e.g., training on more examples of ARC-style questions.

What I've noticed when testing previous versions of Grok is that on paper they scored better on benchmarks, but in practice the responses were always worse than Sonnet's and Gemini's, even though Grok had higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.

dbagr | 7 months ago

As I said: either benchmark contamination (the benchmark is semi-private, and could have been obtained by people at other companies whose models have been benchmarked) or more compute.
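For what it's worth, the contamination claim is checkable in principle. A common (though crude) test is to look for verbatim n-gram overlap between benchmark items and the training corpus; this is a minimal sketch assuming access to the training text, with function names, the n-gram length, and the example strings all my own illustration, not anything xAI or ARC Prize has published:

```python
# Crude contamination check: flag a benchmark item if any of its
# word 8-grams appears verbatim in the training corpus. The choice
# of n = 8 and all names here are illustrative assumptions.

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8):
    """True if the item shares at least one word n-gram with the corpus."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_corpus, n))
```

Real contamination studies go further (fuzzy matching, paraphrase detection), since exact n-gram overlap misses reworded copies, but the basic idea is the same.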

ericlewis | 7 months ago

I still don't understand why people point to this chart as meaning anything. Cost per task is a fairly arbitrary x-axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.

vessenes | 7 months ago

Anecdotally, output in my tests is pretty good. It's at least competitive with SOTA from other providers right now.