The baked-in assumptions observation is basically the opposite of the impression I get after watching Gemini 3's CoT. With maximum reasoning effort it's able to break out of a wrong route by rethinking the strategy. For example, I gave it an onion address without the .onion part and told it to figure out what the string means. All reasoning models, including Gemini 2.5 and 3, assume it's a puzzle or a cipher (because they're trained on those) and start endlessly applying different algorithms to no avail. Gemini 3 Pro is the only model that can break the initial assumption after running out of ideas ("Wait, the user said it's just a string, what if it's NOT obfuscated") and correctly identify the string as an onion address. My guess is they trained it on simulations to enforce the anti-jailbreaking commands injected by Model Armor, as its CoT is incredibly paranoid at times. I could be wrong, of course.
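For context, a bare v3 onion address is just 56 base32 characters encoding an ed25519 public key, a 2-byte checksum, and a version byte, so it's mechanically checkable rather than a cipher to crack. A rough sketch of that check (standard library only, based on my reading of the Tor v3 address format, not anything the model actually ran):

```python
import base64
import hashlib

def looks_like_v3_onion(s: str) -> bool:
    # A v3 onion address (minus ".onion") is 56 base32 chars = 35 bytes:
    # 32-byte ed25519 pubkey | 2-byte checksum | version byte 0x03.
    if len(s) != 56:
        return False
    try:
        raw = base64.b32decode(s.upper())
    except Exception:
        return False
    pubkey, checksum, version = raw[:32], raw[32:34], raw[34:]
    if version != b"\x03":
        return False
    # Checksum per the spec: SHA3-256(".onion checksum" || pubkey || version), truncated to 2 bytes.
    expected = hashlib.sha3_256(b".onion checksum" + pubkey + version).digest()[:2]
    return checksum == expected
```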
I've seen some weird "thinking outside the box" behavior like this. I once asked 3 Pro what Ozzy Osbourne is up to. The CoT was a journey, I can tell you! The fact that he actually passed away isn't in its training data, though it did know he was planning a tour. It had a real struggle trying to reconcile the "suspicious search results": it questioned whether they were fake news, or even whether it was running against a simulation (!), and decided it wasn't going to fall for my "test".
It did ultimately decide Ozzy was alive. I pushed back on that, and it instantly corrected itself and partially blamed my query "what is he up to" for being formulated as if he was alive.
“Gemini 3 Pro was often overloaded, which produced long spans of downtime that 2.5 Pro experienced much less often”
It was unclear to me whether this meant the API was overloaded or whether he was on a subscription plan and had hit his limit for the moment. Although I think the Gemini plans just use weekly limits, so I guess it must be the API.
"Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. "
Yes, at least to some extent. The author mentions that the base model knows the answer to the switch puzzle but does not execute it properly here.
"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."
I'm wondering about this too. Would be nice to see an ablation here, or at least see some analysis on the reasoning traces.
It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.
It will definitely have some effect. Why wouldn't it? Even adding noise to prompts (like saying you will be rewarded $1000 for each correct answer) has some effect.
Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.
It's hard to say for sure, because Gemini 3 was only tested with this prompt. But for Gemini 2.5, which the prompt was originally written for, yes, this does cut down on bad assumptions. A specific example: the puzzle with Farfetch'd in Ilex Forest is completely different in the DS remake of the game, and models love to hallucinate elements from the remake's puzzle if you don't emphasize the need to distinguish hypotheses from things the model actually observes.
I would imagine that prompting anything like this has an ironic, excessive effect: convincing the model to suppress patterns it would consider to be pre-knowledge.
If you looked inside, the model would be spinning on something like "oh, I know this is the tile to walk on, but I have to rely only on what I observe! I will do another task instead, to satisfy my conditions and not reveal that I have pre-knowledge."
LLMs are literal douche genies. The less you say, generally, the better.
Nice writeup! I need to start blogging about my antics. I rigged up several cutting edge small local models to an emulator all in-browser and unsuccessfully tried to get them to play different Pokémon games. They just weren't as sharp as the frontier models.
This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.
It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
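To illustrate that point with made-up numbers (nothing from the article): even if one agent really is meaningfully faster on average, a single head-to-head run picks the slower one surprisingly often.

```python
# Toy illustration with hypothetical numbers: suppose the "better" agent
# finishes in 15 +/- 4 days and the "worse" one in 20 +/- 6 days.
import random

random.seed(0)
trials = 100_000
upsets = sum(
    random.gauss(20, 6) < random.gauss(15, 4)  # worse model happens to beat the better one
    for _ in range(trials)
)
print(f"worse model wins a single head-to-head run {upsets / trials:.0%} of the time")
```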
As a fun comparison, Gemini 3 Pro took 17 days to beat the game.
Twitch Plays Pokemon, which was frequently random, chaotic, even malicious, took 13 days to clear Crystal.
So after years of being gleefully told that AI will replace all jobs, an omniscient state-of-the-art model, with heavy assistance, takes more than two weeks and thousands of dollars in tokens to do what child me did in a few days? Huh.
I used to think the same, until the latest agents started adding perfectly fine features to a large existing React app with just basic input (in English). Most jobs require levels of intelligence below that. It's just a matter of time before agents get there.
How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?
In other words, how much of this improvement is true generalization vs memorization?
You're too kind. Even the CEO of Google retweeted how well Gemini 2.5 did on Pokemon. There is a high chance that it's now explicitly part of the training regime. We'd need a different kind of game to know how well it generalizes.
There were no such writeups; 99% of the discussion about difficulties in Crystal was in Twitch and Discord chats, which Google doesn't scrape. (It hadn't yet gotten the public attention that Claude's and Gemini's runs of Pokemon Red and Blue have gotten.)
That said, this writeup itself will probably be scraped and influence Gemini 4.
> it often makes early assumptions and fails to validate them, which can waste a lot of time
Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
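A minimal sketch of that mechanism (assuming the Hugging Face transformers library, with GPT-2 purely as a stand-in): each generated token is appended to the context and conditions everything after it, so a stated assumption can only be argued against with more text, never retracted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The prompt already contains an assumption ("probably a cipher").
input_ids = tokenizer("The mysterious string is probably a cipher,",
                      return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits      # predict from everything said so far
        next_id = logits[0, -1].argmax()      # greedy pick of the next token
        # The new token becomes part of the fixed context; it cannot be
        # "un-said", only contradicted by later tokens like "Wait...".
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```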
Having played through the game recently, I am not surprised Goldenrod Underground was a challenge; it is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.
I wonder how much of it is due to the model being familiar with the game or parts of it, whether from training on the game itself or from walkthroughs it has read/watched online.
There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokemon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs didn't specifically bake this into their training, as they do for popular benchmarks or for penguins riding a bike.
Did the streamer get subsidized by Google?
(The stream isn't run by Google themselves, is it?)
Does this instruction to ignore internal knowledge even have any effect?
That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.