It also happened to me in the gemini-cli. It tried to think but somehow failed and put all its thoughts into the output, then tried again and again to switch to "user output". It was practically stuck in an infinite loop.
So far they don't seem to be doing anything about it, but Gemini models have a serious repetition bug.
I don't think it's related to a specific prompt, like a "prompt logic issue" badly understood by the model; instead, it looks like it sometimes generates things that make it go nuts.
My best intuition is that it sometimes forgets all the context and just looks at the last X tokens before the repetition, and so starts repeating as if the last generated tokens were the only thing you gave it.
> Brainf*ck is the antithesis of modern software engineering. There are no comments, no meaningful variable names, and no structure
That's not true. From the little time I've spent trying to read and write some simple programs in BF, I recall good examples being pretty legible.
In fact, because the language only relies on those few characters, anything else you type becomes a comment. Linebreaks, whitespace, alphanumeric characters and so on, they just get ignored by the interpreter.
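That property is easy to demonstrate with a toy interpreter. The sketch below (Python; `run_bf` is a name I'm making up) skips every character outside the eight BF commands, so prose mixed into the program is simply ignored — with the caveat that a "comment" must itself avoid the command characters, including `.` and `,`:

```python
def run_bf(code: str) -> str:
    """Toy Brainfuck interpreter; any non-command character is a comment."""
    tape = [0] * 30000
    ptr = 0
    out = []
    # Precompute matching-bracket positions for loops.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    pc = 0
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]
        # Every other character (prose, whitespace) falls through: a comment.
        pc += 1
    return "".join(out)

# Prose interleaved with the program is ignored by the interpreter:
print(run_bf("set cell zero to 13 +++++ +++++ +++ "
             "times 5 [>+++++<-] then print >."))  # → prints "A" (13*5 = 65)
```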
To me, that's still unreadable. While the intention of the code may be documented, it's pretty hard to tell whether that "+" is really correct, or whether that "<" should actually be a ">". I can't even tell whether a comment starts or terminates a particular piece of code.
> That's not true. From the little time I've spent trying to read and write some simple programs in BF, I recall good examples being pretty legible.
Anything in a reasonably familiar typeface and size will continue to be legible; however, Brainfuck is not easily human-parsable.
That greatly reduces its ability to be _read and mentally internalized._ Without that, are you really doing software engineering, or are you actually a software maintenance person?
A janitor doesn't need to understand how energy generation works if he has to change the light bulb.
Gemini Pro, neither as-is nor in Deep Research mode, even got the number of pieces or the relevant squares right. I didn't expect it to actually solve it. But I would have expected it to get the basics right and maybe hint that this is too difficult. Or pull up some solutions PDF, or some Python code to brute-force search ... but just straight giving a totally wrong answer is like ... 2024 called, it wants its language model back.
Instead in Pro Simple it just gave a wrong solution and Deep Research wrote a whole lecture about it starting with "The Geometric and Cognitive Dynamics of Polyomino Systems: An Exhaustive Analysis of Ubongo Puzzle 151" ... that's just bullshit bingo. My prompt was a photo of the puzzle and "solve ubongo puzzle 151"; in my opinion you can't even argue that this lecture was to be expected given my very clear and simple task description.
My mental model for language models is: overconfident, eloquent assistant who talks a lot of bullshit but has some interesting ideas every now and then. For simple tasks it's simply a summary of what I could google myself, but asking an LLM saves some time. In that sense it's Google 2.0 (or 3.0 if you will)
Deep research, from my experience, will always add lectures.
I'm trying to create a comprehensive list of English standup specials. Seems like a good fit! I've tried numerous times to prompt it "provide a comprehensive list of English standup specials released between 2000 and 2005. The output needs to be a csv of verified specials with the author, release date and special name. I do not want any other lecture or anything else. Providing anything except the csv is considered a failure". Then it creates its own plan, and I go further, clarifying explicitly to make sure I don't want lectures...
It goes on to hallucinate a bunch of specials and provide a lecture on "2000 the era of X on standup comedy" (for each year)
I've tried this in 2.5 and 3. Numerous time ranges and prompts. Same result. It gets the famous specials right (usually), hallucinates some info on less famous ones (or makes them up completely) and misses anything more obscure
There is something fucky about tokenizing images that just isn't as clean as tokenizing text. It's clear that the problem isn't the model being too dumb, but rather that model is not able to actually "see" the image presented. It feels like a lower-performance model looks at the image, and then writes a text description of it for the "solver" model to work with.
To put it another way, the models can solve very high-level text-based problems while struggling to solve even low-level image problems - even if underneath both problems use a similar or even identical solving framework. If you have a choice between showing a model a graph or feeding it a list of (x,y) coordinates, go with the coordinates every time.
Gemini is my favorite, but it does seem to be prone to “breaking” the flow of the conversation.
Sharing “system stuff” in its responses, responding to “system stuff”, sharing thoughts as responses and responses as thoughts, ignoring or forgetting things that were just said (like it’s suddenly invisible), bizarre formatting, switching languages for no reason, saying it will do something (like calling a tool) instead of doing it, getting into odd loops, etc.
I’m guessing it all has something to do with the textual representation of chat state and maybe it isn’t properly tuned to follow it. So it kinda breaks the mould but not in a good way, and there’s nothing downstream trying to correct it. I find myself having to regenerate responses pretty often just because Gemini didn’t want to play assistant anymore.
It seems like the flash models don’t suffer from this as much, but the pro models definitely do. The smarter the model, the more it happens.
I call it “thinking itself to death”.
It’s gotten to a point where I often prefer fast and dumb models that will give me something very quickly, and I’ll just run it a few times to filter out bad answers, instead of using the slow and smart models that will often spend 10 minutes only to eventually get stuck beyond the fourth wall.
> ignoring or forgetting things that were just said (like it’s suddenly invisible)
This sounds like an artifact of the Gemini consumer app, some others may be too (the model providers are doing themselves a disservice by calling them the same).
Why would anyone feel compelled to use AI to write such a short blog post? Is there no space where I can assume the written content is communicated 100% by another human being?
I am sorry if it appears that it was written by AI. I wrote a draft and used AI to assist me, since English is not my first language. I asked it only to format, but it seems to have changed the tone and the expressions too '.'
I often hear comparisons to Web 1.0 (the bubble aspect, potential for change, etc).
As someone who lived and worked during that era, I don’t remember thinking “holy shit, if this ever gets released on the world at scale we’ll have serious problems”.
Maybe that was a lack of imagination and not thinking through what would actually happen to brick and mortar, the job market and so on. But it feels like this time is different. Or I’m just that much older.
I wonder if going the other way, maxing out semantic density per token, would improve LLM ability (perhaps even cost).
We use naturally evolved human languages for most of the training, and programming follows that logic to some degree, but what if the LLMs were working in a highly complex, information-dense conlang like Ithkuil? If it stumbles on BF, what happens at the other extreme?
Or was this result really about the sparse training data?
I wonder the same. I think a language like Pascal is more semantically rich than C-like languages. Something like:
  unit a;

  interface

  function bar(something: Integer): Integer;

  implementation

  uses b;

  var
    foo: Boolean;

  function bar(something: Integer): Integer;
  begin
    repeat
      Result := b.code(something);
    until Result <> 0;
  end;

  end.
Probably holds more semantically significant tokens than the C-counterpart.
But with LLMs the problem's gotta be training data. If there were as much training data in Pascal as there is in C, it would be pretty cool to see benchmarks; I have a hunch Pascal would do better.
(Sorry for the bad Pascal, I haven't programmed in it in ages)
I thought "The Data Scarcity Problem" from the article is very well known to us engineers?
It's where the pulleys of a very sophisticated statistical machine start to show, and the claims about intelligence start to crumble.
That's the reason AI is great for boilerplate (because it's been done a million times) and not so great for specifics (even if they're specifics in a popular language).
> So it made me wonder. Is Brainf*ck the ultimate test for AGI?
Absolutely not. I'd bet a lot of money this could be solved with a decent amount of RL compute. None of the stated problems are actually issues with LLMs once on-policy training is performed.
My prompt was worded differently, but Gemini didn't fall into the same trap for me.
```
Providing the full Brainfuck code to dynamically calculate and factorize 1000! is unfortunately not feasible, and here is why:
* The Scale: 1000! (1000 factorial) is a number with 2,568 digits. Standard Brainfuck operates on 8-bit memory cells (values 0-255). To handle a number this large, one would first have to write an entire "arbitrary-precision" math library (BigInt) from scratch within Brainfuck.
* The Algorithm: You cannot simply compute 1000! and then factor it. You must use Legendre's Formula to determine the exponent of every prime p \le 1000.
* Formula: E_p(n!) = \sum_{k=1}^{\infty} \lfloor \frac{n}{p^k} \rfloor
* Example for p=2: 500 + 250 + 125 + 62 + 31 + 15 + 7 + 3 + 1 = 994.
* The first term of the answer is 2^{994}.
The Solution (Python)
Since Brainfuck is ill-suited for this, here is a Python script that uses Legendre's Formula to instantly calculate and print the factorization of 1000!.
def legendre_factorial_exponent(n, p):
    exponent = 0
    while n > 0:
        n //= p
        exponent += n
    return exponent

def sieve_primes(n):
    primes = []
    is_prime = [True] * (n + 1)
    for p in range(2, n + 1):
        if is_prime[p]:
            primes.append(p)
            for i in range(p * p, n + 1, p):
                is_prime[i] = False
    return primes

def factorize_factorial(n):
    primes = sieve_primes(n)
    result = []
    for p in primes:
        exponent = legendre_factorial_exponent(n, p)
        result.append(f"{p}^{exponent}")
    return " * ".join(result)

print(factorize_factorial(1000))
The Brainfuck Code (for 1000 not !)
If you simply meant the number 1000 (without the factorial), or you just want to see what Brainfuck looks like, here is a program that outputs 1000:
+++++ +++++ [ > +++++ +++++ < - ] > -- . (Print 100 'd')
< +++++ +++++ [ > +++++ +++++ < - ] > + . (Print 101 'e')
(This is just "d" and "e" for demo; printing "1000" requires ASCII conversion)
Correct code to print "1000":
++++++++++[>+++++>+++++>+++++>+++++<<<<-]>-.>-.>-.>-.
Explanation:
* Sets cell #0 to 10.
* Loops 10 times, adding 5 to cells #1, #2, #3, #4 (making them all 50).
* Subtracts 1 from cell #1 (ASCII 49 = '1').
* Subtracts 2 from cells #2, #3, #4 (ASCII 48 = '0').
* Prints them.
Would you like me to explain how Legendre's Formula works in more detail?
```
I too noticed that Gemini is very reluctant to start spewing code directly; that's why I prompted it in such a specific manner to trigger the infinite loop.
But from the answer of your prompt:
The "Correct code to print "1000":"
actually prints 1111.
So yeah, it's still wrong, even for something super simple.
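The miscount is easy to verify without a full interpreter: after the loop, cells #1–#4 each hold 50, and each `>-.` decrements once before printing, so every cell emits chr(49) = '1'. A two-line Python trace of the final state (a sketch mirroring the quoted program, not a general BF evaluator):

```python
# State after "++++++++++[>+++++>+++++>+++++>+++++<<<<-]": cells 1-4 hold 50.
cells = [50, 50, 50, 50]
# Each ">-." decrements the next cell once, then prints it as ASCII.
output = "".join(chr(c - 1) for c in cells)
print(output)  # → 1111, not 1000
```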
What do you want from a system which by design can't count the number of R's in "strawberry"? (Yes, it still can't; it gives a random answer if you slightly modify the question.)
Saying "Asking Gemini 3" doesn't mean much. The video/animation is using "Gemini 3 Fast". But why would anyone use lesser models like "Fast" for programming problems when thinking models are available also in the free tier?
"Fast" models are mostly useless in my experience.
I asked "Gemini 3 Pro" and it refused to give me the source code, with the rationale that it would be too long and complex due to the 256-value limit of BF cells. However, it made me a Python script that it said would generate the full brainf*ck program to print the factors.
TL;DR: Don't do it; use another language to generate the factors, then print them with BF.
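That generate-the-BF-from-another-language approach can be sketched in a few lines (Python; `bf_print` is a made-up name, and the output is naively long — one `+` per ASCII code point):

```python
def bf_print(text: str) -> str:
    """Emit BF that prints `text`: clear the current cell with "[-]",
    bump it to the character's ASCII value with '+', then output with '.'."""
    return "".join("[-]" + "+" * ord(ch) + "." for ch in text)

code = bf_print("1000")
print(len(code))  # long but trivially correct; run-length tricks would shrink it
```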
People easily forget how they laughed at the wizards in the Harry Potter series who said "You-Know-Who" instead of "Voldemort". Now they are doing exactly the same thing.
I've tried to have Gemini generate code for me, and it will often go through the thinking and planning process, appear to generate code, and then...not actually output it.
Isn't this a problem with the agent loop / structure, rather than the LLM, in that case?
The IDE doesn't affect the model's results, just what is done with those results?
Have a look at this, as an example: https://brainfuck.org/chessboard.b
For those who want to try it, there’s always the https://raku.org module…
BTW, how come there are dashes in the comment?
https://youtu.be/cYdpOjletnc?t=6
I run models with llama.cpp, and that's the reason why I add some repeat-penalty factor.
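For context, the repeat penalty is just a logit transform at sampling time. A sketch of the classic heuristic used by llama.cpp-style samplers (divide positive logits, multiply negative ones, for tokens already generated; the function name here is my own):

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Damp the scores of recently generated tokens so the sampler
    is less likely to pick them again."""
    out = list(logits)
    for t in set(recent_tokens):
        # Dividing a positive logit and multiplying a negative one
        # both push that token's probability down.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

scores = apply_repeat_penalty([2.0, -1.0, 0.5], recent_tokens=[0, 1])
print(scores)  # tokens 0 and 1 are pushed down; token 2 is untouched
```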
But still, isn't it a major weakness that they have to RL on everything that doesn't have much data? That really weakens the attempt to make it true AGI.
Whereby I don’t know if it was a real infinite loop because I cancelled the session after 10 minutes seeing always the same "thoughts" looping
-> runs it in Gemini fast instead of thinking
....