foundry27 | 7 months ago
It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy. Anecdotally (from my experience) this was the one feature that enthusiasts in the AI community were interested in to justify the exorbitant price of Google’s Ultra subscription. I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
Performance-wise? So far, I couldn’t even tell. I provided it with a challenging organizational problem that my business was facing, with the relevant context, and it proposed a lucid and well-thought-out solution that was consistent with our internal discussions on the matter. But o3 came to an equally effective conclusion for a fraction of the cost, even if its report was less “cohesive”. I guess I’ll have to wait until tomorrow to learn more.
mnmatin|7 months ago
novok|7 months ago
dataviz1000|7 months ago
I've been working on a challenging problem all this week and all the AI copilot models are worthless at helping me. Mastery in coding is being alone when neither other people nor AI copilots can help you and you have to dig deep into generalization, synthesis, and creativity.
(I thought to myself, at least it will be a little while longer before I'm replaced with AI coding agents.)
epolanski|7 months ago
Thus, AI is a great productivity tool, if you know how to use it, for the overwhelming majority of problems out there. And it's a boost even for those who are not that good at the craft.
This whole narrative of "okay but it can't replace me in this or that situation" is honestly somewhere between an obvious touché (why would you think AI would replace rather than empower those who know their craft?) and stale luddism.
benreesman|7 months ago
Probably the starkest example of this is build system stuff: it's really obvious which ones have seen a bunch of `nixpkgs`, and even the best ones seem to really struggle with Bazel and sometimes CMake!
Even the absolute prestige high-end ones, running flat out and burning 100+ dollars a day, are a lift on pre-SEO Google/SO, I think... but it's not like a blowout vs. a working search index. Back when all the source, all the docs, and all the troubleshooting for any topic on the whole Internet were all above the fold on Google? It was kinda like this: type a question in the magic box and working-ish code pops out. Same at a glory-days FAANG with the internal mega-grep.
I think there's a whole cohort or two who think that "type in the magic box and code comes out" is new. It's not new, we just didn't have it for 5-10 years.
burnte|7 months ago
Melatonic|7 months ago
zyngaro|7 months ago
LeafItAlone|7 months ago
In my experience Grok 4 and 4 Heavy have been crap. Who cares how many requests you get with it when the response is terrible? Worst LLM money I’ve spent this year, and I’ve spent a lot.
danenania|7 months ago
OpenAI reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial Sonnet 3.5 release for writing basic usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
qingcharles|7 months ago
Closi|7 months ago
It would be more interesting to know if it can handle problems that o3 can't do, or if it is 'correct' more often than o3-pro on these sorts of problems.
I.e., if o3 is correct 90% of the time but Deep Think is correct 91% of the time on challenging organisational problems, it will be worth paying $250 for an extra percentage point of certainty (assuming the problem is high-value / high-risk enough).
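To put a rough number on "high-value enough": a minimal sketch of the break-even arithmetic behind the 90% vs. 91% comparison. It assumes the $250 is the extra spend attributable to a single decision and that a wrong answer has one fixed dollar cost; both are simplifications, and the function names and the $100,000 error cost are hypothetical, not from the thread.

```python
def break_even_error_cost(p_base: float, p_new: float, extra_price: float) -> float:
    """Cost of a wrong answer at which the pricier model just pays for itself."""
    gain = p_new - p_base  # extra probability of getting a correct answer
    if gain <= 0:
        raise ValueError("the pricier model must be more accurate")
    return extra_price / gain


def expected_net_benefit(p_base: float, p_new: float,
                         extra_price: float, error_cost: float) -> float:
    """Expected savings from fewer wrong answers, minus the extra price."""
    return (p_new - p_base) * error_cost - extra_price


# Figures from the comment: 90% vs. 91% accuracy, $250 premium.
threshold = break_even_error_cost(0.90, 0.91, 250.0)
print(f"break-even cost of a wrong answer: about ${threshold:,.0f}")

# Hypothetical problem where a wrong answer costs $100,000.
net = expected_net_benefit(0.90, 0.91, 250.0, 100_000.0)
print(f"expected net benefit: about ${net:,.0f}")
```

Under these assumptions the extra point of accuracy only pays off when a wrong answer costs more than roughly $25,000, which is the sense in which the problem has to be high-value or high-risk.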
lucianbr|7 months ago
Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".
thimabi|7 months ago
I agree that’s not a good posture, but it is entirely unsurprising.
Google is probably not profiting from AI Ultra customers either, and grabbing all that sweet usage data from the free tier of AI Studio is what matters most to improve their models.
Giving free access to the best models allows Google to capture market share among the most demanding users, which are precisely the ones that will be charged more in the future. In a certain sense, it’s a great way for Google to use its huge idle server capacity nowadays.
hirako2000|7 months ago
I doubt I'm an isolated case. This Gemini gig will cost Google a lot; they pushed it onto Android phones all around the globe. I can't wait to see what happens when they have to admit that not many people will pay over 20 bucks for "AI". I would pay well over 20 bucks just to see the faces of the C-suite next year when someone dares to explain in simple terms that there is absolutely no way to recoup the DC investment and that powering the whole thing will cost the company 10 times that.
827a|7 months ago
I think the primary concern of this industry right now is how, relative to the current latest generation models, we simultaneously need intelligence to increase, cost to decrease, effective context windows to increase, and token bandwidths to increase. All four of these things are real bottlenecks to unlocking the "next level" of these tools for software engineering usage.
Google isn't going to make billions on solving advanced math exams.
Fade_Dance|7 months ago
I'll hazard to say that cost and context windows are the two key metrics to bridge that chasm with acceptable results... As for software engineering though, that cohort will be demanding on all fronts for the foreseeable future, especially because there's a bit of a competitive element. Nobody wants to be the vibecoder using sub-par tools compared to everyone else showing off their GitHub results and making sexy blog posts about it on HN.
petesergeant|7 months ago
I would imagine 95% of people never get anywhere near to hitting their CC usage. The people who are getting rate-limited have ten windows open, are auto-accepting edits, and YOLO'ing any kind of coherent code quality in their codebase.
amelius|7 months ago
But yes, Google should have figured that out and used a less expensive mode of reasoning.
danenania|7 months ago
This is why model pickers persist despite no one liking them.
dweekly|7 months ago
raincole|7 months ago
sunaookami|7 months ago
pembrook|7 months ago
Underpriced for consumers, overpriced for businesses.
ebiester|7 months ago
svantana|7 months ago
llm_nerd|7 months ago
FWIW, Google seems to be having some severe issues with oddball, perhaps malfunctioning quota systems. I regularly find that extraordinarily little use of gemini-cli hits the purported 1000-request limit, when in reality I've made fewer than 10 requests.
ifwinterco|7 months ago
In order for agentic AI to replace (for example) a software engineer, we need a big step up in capability, around an order of magnitude. These chain-of-thought models do get a bit closer to that, although in my opinion we're still a way away.
However, at the same time we need about an order of magnitude decrease in price. These models are expensive even at the current price tokens are sold at, which seems to be below the actual cost. And these massive CoT models are taking us in completely the wrong direction in terms of cost.
profsummergig|7 months ago
What happened to the simplicity of Steve Jobs' 2x2 (consumer vs. pro, laptop vs. desktop)?
starfallg|7 months ago
golfer|7 months ago
iamronaldo|7 months ago
riskassessment|7 months ago
ankitml|7 months ago
andsoitis|7 months ago
int_19h|7 months ago
crowcroft|7 months ago
petesergeant|7 months ago
If it's CapEx it's -- by definition -- not a cost to run. Energy costs will trend to zero.
twobitshifter|7 months ago
ramoz|7 months ago
Gemini is consistently the only model that can reason over long context in dynamic domains for me. Deep Think just did that reviewing an insane amount of Claude Code logs - for a meta analysis task of the underlying implementation. Laughable to think Grok could do that.