NiloCK|19 days ago
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that Gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die"?
skerit|19 days ago
And not even at high token counts! No, I've had it have a mental breakdown at like 150,000 tokens (which I know is a lot of tokens, but it's small compared to the 1 million it should be able to handle, and even Claude keeps working fine at that point).
Here is a _small_ log of the biggest breakdown I've seen Gemini have:
And it just went on and on.
mnicky|18 days ago
So they could have paid a price in “model welfare” and released an LLM very eager to deliver.
It also shows in the AA-Omniscience hallucination rate benchmark, where Gemini scores 88%, the worst among frontier models.
data-ottawa|19 days ago
Gemini’s strength definitely is that it can use that whole large context window, and it’s the first Gemini model to write acceptable SQL. But I agree completely that it’s awful at decisions.
I’ve been building a data-agent tool (similar to [1][2]). Gemini 3’s main failure cases are making up metrics that really aren’t appropriate, and taking inappropriate data and forcing it into a conclusion. When a task is clear and possible, it’s amazing. When a task is hard, with multiple failure paths, you run into Gemini powering through to get an answer anyway.
Temperature seems to play a huge role in Gemini’s decision quality from what I see in my evals, so you can probably tune it to get better answers, but I don’t have the recipe yet.
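For what it's worth, here is a minimal sketch of what that kind of temperature sweep can look like with the google-genai Python SDK. The model id, prompt, and "declined" check are placeholders I made up, not data-ottawa's actual eval setup:

    # Probe how the model's decision quality varies with sampling
    # temperature. Assumes `pip install google-genai` and a
    # GOOGLE_API_KEY in the environment.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment

    PROMPT = (  # hypothetical eval prompt
        "Given this schema and question, write the SQL query, "
        "or reply 'not answerable' if the data cannot support it."
    )

    def declined(answer: str) -> bool:
        # Crude placeholder check for whether the model admitted the
        # task was impossible; a real eval would also score SQL
        # correctness against ground truth.
        return "not answerable" in answer.lower()

    for temp in (0.0, 0.2, 0.5, 1.0):
        resp = client.models.generate_content(
            model="gemini-3-pro-preview",  # placeholder model id
            contents=PROMPT,
            config=types.GenerateContentConfig(temperature=temp),
        )
        print(f"temperature={temp} declined={declined(resp.text)}")

A single sample per temperature tells you little, so in practice you'd run each setting several times and average before drawing conclusions about where the model stops powering through.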
Claude 4+ (Opus & Sonnet) family have been much more honest, but the short context windows really hurt on these analytical use cases, plus it can over-focus on minutia and needs to be course corrected. ChatGPT looks okay but I have not tested it. I’ve been pretty frustrated at ChatGPT models acting one way in the dev console and completely different in production.
[1] https://openai.com/index/inside-our-in-house-data-agent/ [2] https://docs.cloud.google.com/bigquery/docs/conversational-a...
Der_Einzige|19 days ago
Celebrate it while it lasts, because it won’t.
whynotminot|19 days ago
Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
usaar333|19 days ago
https://artificialanalysis.ai/evaluations/omniscience
saintfire|19 days ago
https://gemini.google.com/share/6d141b742a13
UqWBcuFx6NV4r|19 days ago
It does nothing to answer their question, because anyone who knows the answer would inherently already know that it happened.
Not even actual academics, in the literature, speak like this. “Cite your sources!” in casual conversation, about something easily verifiable, is purely the domain of pseudointellectuals.