The problem is the linear nature of Markov chains. Sure, they can branch, but after an observation you are absolutely at a new state: A goes to B goes to C, etc. A classic problem for understanding why this is an issue is feeding in a 2D bitmap where the patterns are vertical but you're passing in the data left to right. Markov chains can't handle this, since they navigate exclusively on the current left-to-right input; they miss the patterns completely. Similar things happen with language. Language is not linear, and context from a few sentences ago should change probabilities in the current sequence of characters. The attention mechanism is the best we have for this, and Markov chains struggle beyond stringing together a few syllables.
I have played with Markov chains a lot. I tried having skip states and such but ultimately you’re always pushed towards doing something similar to the attention mechanism to handle context.
You can model multiple-hop dependencies as a Markov chain by just blowing up the state space as a Cartesian product. Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.
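That blow-up can be sketched in a few lines (a toy example, not from the thread): an order-2 dependency becomes an ordinary first-order chain whose states are pairs of tokens, i.e. the Cartesian product of the original state space with itself.

```python
from collections import defaultdict
import random

def train_order2(tokens):
    """Build a first-order chain over pair-states (t[i-1], t[i]) -> t[i+1]."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def sample_next(counts, state, rng=random):
    """Sample the next token proportionally to observed transition counts."""
    nxt = counts[state]
    r = rng.uniform(0, sum(nxt.values()))
    for tok, n in nxt.items():
        r -= n
        if r <= 0:
            return tok

tokens = "the cat sat on the mat the cat ran".split()
model = train_order2(tokens)
# From the pair-state ('the', 'cat') the chain can go to 'sat' or 'ran';
# each pair is a single state of a plain first-order Markov chain.
print(sorted(model[("the", "cat")].items()))
```

The price is exactly the point made above: for k-hop dependencies over a vocabulary of size V, the product state space has V^k states.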
That is, the future, given the present state, is conditionally independent of the past states - the Markov property.
It's worth recognizing that decoder-only LLMs (which are most of the major LLMs people use) maintain the Markov property and can be properly understood as Markov chains.
> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.
Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.
tl;dr most of the LLMs people use are effectively Markov Chains.
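A toy way to see that claim (with a made-up `next_token_dist` standing in for a real transformer, which differs only in how the function is computed): if the "state" is the last K tokens of context, each generation step reads only the current state.

```python
import random

K = 4  # tiny context window (assumed); the "state" is the last K tokens

def next_token_dist(state):
    """Stand-in for a decoder-only model: any fixed function of the current
    window. Here: a toy distribution over a 3-token vocabulary."""
    return {sum(state) % 3: 1.0}

def step(state, rng):
    dist = next_token_dist(state)
    tok = rng.choices(list(dist), weights=list(dist.values()))[0]
    # The next state is a function of the current state alone: Markov property.
    return state[1:] + (tok,)

rng = random.Random(0)
state = (0, 1, 2, 0)
for _ in range(5):
    state = step(state, rng)
print(state)
```

Nothing outside the window can influence the next token, which is exactly conditional independence of the future from the dropped past.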
Your 2D bitmap example perfectly illustrates the fundamental limitation! The exponential state explosion you encountered (needing 2^32 states for patterns separated by randomness) is precisely why classical Markov chains became intractable for complex dependencies.
What's fascinating is that transformers with attention don't actually escape the Markov property - they're mathematically equivalent to very high-order Markov chains where the entire context window forms the "state." The breakthrough wasn't abandoning Markov chains, but finding a parameterized way to approximate these massive state spaces through learned representations rather than explicit enumeration.
Your observation about inevitably trending toward attention-like mechanisms is spot-on. The attention mechanism essentially provides a tractable approximation to the astronomically large transition matrices that would be required for a classical Markov chain to capture long-range dependencies. It's a more elegant solution to the same fundamental problem you were solving with skip states.
Excellent point about the fundamental limitation of classical Markov chains' linearity. Your 2D bitmap example perfectly illustrates the key insight: traditional Markov chains are inherently constrained by their sequential, memoryless transitions.
However, I'd argue that the distinction between classical n-gram Markov chains and modern transformers isn't as binary as it might appear. When we consider that transformers with context windows are mathematically equivalent to very high-order Markov chains (where the "state" encompasses the entire context), we see that the breakthrough wasn't abandoning the Markov property, but rather expanding the state representation exponentially.
Your observation about inevitably trending toward attention-like mechanisms is particularly insightful. The exponential state explosion you encountered (needing 2^32 states for your example) is precisely why parameterized approaches like transformers became necessary - they provide a tractable way to approximate these massive state spaces through learned representations rather than explicit enumeration.
The key innovation wasn't escaping Markov chains, but rather finding an efficient way to represent and compute over astronomically large state spaces that would be intractable for classical approaches.
Would you say they're interesting to explore, after you've spent so much time on them? Do you feel one could make pragmatic use of them in certain contexts, or are they too much of a toy, where most of the time using a hosted service / coherent LLM would make the work easier?
Would having a Markov chain of Markov chains help in this situation? One chain handles the case where the 2D bitmap patterns are vertical, and another handles left to right?
I once, probably 4-6 years ago, used exports from Slack conversations to train a Markov Chain to recreate a user that was around a lot and then left for a while. I wrote the whole thing in python and wasn't overly well-versed in the statistics and math side but understood the principle. I made a bot and had it join the Slack instance that I administrate and it would interact if you tagged it or if you said things that person always responded to (hardcoded).
Well, the responses were pretty messed up and not accurate but we all got a good chuckle watching the bot sometimes actually sound like the person amidst a mass of random other things that person always said jumbled together :D
I had a similar program designed as my "AWAY" bot that was trained on transcripts of my previous conversations and connected to Skype. At the time (2009) I was living in Taiwan so I would activate it when I went to bed to chat with my friends back in the States given the ~12 hour time difference. Reading back some of the transcripts made it sound like I was on the verge of a psychotic break though.
I'm sure that Gilfoyle's prank in Silicon Valley had its basis in a true story, like most of SV's episodes. i.e. I'm sure this has been done quite a few times ever since Mark V. Shaney.
KingJamesProgramming [1], a Markov chain mashup of Knuth books [2] and the King James Bible, is still one of my favorite reads for a laugh.
It was also the first place I was really exposed to probabilistic-based methods for generation of novel content.
You get gems like:
"they smote the city with the edge of the sword. 22:20 And one of his main motivations was the high cost of proprietary software houses."
and
"37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b."
Just stating for clarity: it is a mashup of the King James Bible and the book Structure and Interpretation of Computer Programs, which is not by Knuth (it is by Harold Abelson and Gerald Jay Sussman with Julie Sussman).
And Veritasium's video "The Strange Math That Predicts (Almost) Anything" talks in detail about the history of Markov chains: https://youtu.be/KZeIEiBrT_w
Random tidbit - 15+ years ago Markov chains were the go-to for auto-generating text. Google was not as advanced as it is today at flagging spam, so search engine results pages for highly affiliate-marketing-dense topics (e.g., certain medications, products) were swamped with Markov chain-created websites injected with certain keywords.
That was for auto-generating nonsensical texts and SEO spam; for actual text generation, planners and realizers were the go-to approach (and ~5 years later it was all LSTMs).
About a decade ago I somehow came across Genudi[0], a Markov chain-based "AI", and had quite a bit of fun with it. The creator has a blog that makes for an interesting reading session.
Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.
So I think this "talking point" of "LLMs are just Markov chains" is as bad, analogously, as "Your amazing invention that passes the Turing Test is just a giant LUT". Or "Your C++ program is just a FSM". It is all technically true, but it is being used subtly to dismiss/delegitimize something, instead of to communicate a scientifically insightful idea. How is this different from "Humans are just collections of cells"? True, except we're a very interesting type of collection! LLMs are a very interesting type of Markov chain! Haskell/Java/Ruby programs are a very interesting type of FSM! Etc. The talking point by itself is just reductionism, of which we've seen and heard variants during other scientific revolutions.
Also, this Markov chain has size O(T^K), exponential in the token length of the context window, which is a beyond astronomically large object
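To put rough numbers on that (reading T as vocabulary size and K as context length, the usual convention, with assumed GPT-ish values): even counting the decimal digits of T^K is striking.

```python
import math

# Assumed illustrative numbers: vocab size T = 50_000, context length K = 2_048.
T, K = 50_000, 2_048

# Number of decimal digits of T**K, computed via logs (T**K itself is huge
# but still fine as a Python int; the log trick avoids materializing it).
digits = int(K * math.log10(T)) + 1
print(f"T^K has about {digits} decimal digits")
```

So the explicit transition structure of this "Markov chain" indexes a state space whose size has thousands of digits, which is why nobody can (or needs to) write it down.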
This is most apparent with older models like GPT-2, and more generally base ("text") models like Davinci. They are essentially only trained to predict the next word, based on what they've seen on the internet.
So they're a really fun way to try and "sample" the internet, by seeing the kind of outputs they produce.
--
A really fun thing I did with GPT-2 with a friend back when it came out: we fine-tuned the model on our WhatsApp conversation together.
(The model was small enough to do this on a mid-range GPU, in a few hours.)
So GPT-2 learned to simulate chats between us.
It was hilarious, but also weirdly illuminating.
My friend mostly talked about art (true), and didn't say much (true), and I rambled about how I finally figured out the One Weird Trick that's finally gonna turn my life around (...also true)...
The text itself was largely incoherent, but the "feel" (statistical properties?) was spot on.
If you program a Markov chain to generate based upon a fairly short sequence length (4-5 characters), it can create some neat portmanteaus. I remember back in the early '90s I trained one on some typical tech literature and it came up with the word "marketecture".
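A minimal character-level sketch of that trick (toy corpus; the real "marketecture" came from actual tech prose): train on overlapping 4-character contexts and let the chain wander across word boundaries, blending words as it goes.

```python
import random
from collections import defaultdict

def char_chain(text, order=4):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length=40, rng=None):
    """Extend `seed` one character at a time from the order-4 contexts."""
    rng = rng or random.Random(0)
    out = seed
    for _ in range(length):
        choices = model.get(out[-4:])  # last 4 chars = current context
        if not choices:
            break
        out += rng.choice(choices)
    return out

corpus = "market architecture marketing architect market"
model = char_chain(corpus)
print(generate(model, "mark"))
```

With a short context like this, "mark..." can slide into the "...rchitect..." contexts and produce marketecture-style blends.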
If you select drawing mode from the first screen shown and then click the "T" icon you can type some text and it will generate a state diagram for you that you can then "play" and examine the output sequences. If you have states that have multiple exit routes you can click on that state and adjust the probabilities of each option.
This is great. I use Markov chains to describe to normies how LLMs work.
It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.
> In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.
From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.
This was one of my first forays into modifying C code: trying to figure out why 350 MB seemed to be the biggest brain size (32-bit memory limits, plus requiring a contiguous block for the entire brain).
I miss the innocence of those days. Just being a teen, tinkering with things i didn't understand.
I remember reading the source of the original MegaHAL program when I was younger - one of the tricks that made it stand out (particularly in the Loebner competitions [1]) was that it used both a backwards and forwards Markov chain to generate responses.
On a humorous note OP, you seem like exactly the kind of guy who would get a kick out of this postmodern essay generator that a STEM professor wrote using a Recursive Transition Network in 1996:
Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):
"If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."
I've been telling people for years that a reasonably workable initial, simplified mental model of a large language model is a Markov chain generator with an unlimited, weighted corpus trained in. Very few people who know LLMs have critiqued that thought beyond saying it's a coarse description and downplays the details. Since "simplified" is in the initial statement, and it's not meant to capture detail, I say: if it walks like a really big duck and it honks instead of quacking, then it's maybe a goose or a swan, which are both pretty duck-like birds.
I remember creating a chatbot for my Minecraft server over a decade ago that used Markov chains. It was funny seeing the crazy things it spat out, and the other players enjoyed it.
Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.
To generalize to ADHD, create another model that can quietly "Listen" to a speaker (token generator) until its Markov chain predicts a "Stop", then starts its own generating model running, regardless of whether the speaker was done.
Then two instances can have time efficient interrupty conversations with each other.
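A sketch of the stop-token idea (a hypothetical stand-in for the article's fruit example, not its actual code): append a STOP symbol to each training sequence, so the chain learns a transition into "halt" and can end its own output.

```python
import random
from collections import defaultdict

STOP = "<stop>"

def train(sequences):
    """First-order chain over tokens, with STOP appended to each sequence."""
    model = defaultdict(list)
    for seq in sequences:
        seq = seq + [STOP]  # teach the chain where runs end
        for a, b in zip(["<start>"] + seq, seq):
            model[a].append(b)
    return model

def generate(model, rng=None):
    rng = rng or random.Random(7)
    out, state = [], "<start>"
    while True:
        state = rng.choice(model[state])
        if state == STOP:   # the model calls its own stop
            return out
        out.append(state)

fruit_runs = [["apple", "banana"], ["banana", "cherry"], ["apple", "cherry"]]
model = train(fruit_runs)
print(generate(model))
```

The "Listen" model would then just run this chain silently over incoming tokens and cut in whenever it samples STOP.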
roadside_picnic | 5 months ago
The essential property of a Markov chain is the Markov property:
P(X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
dfsegoat | 5 months ago
[1] https://www.tumblr.com/kingjamesprogramming
[2] https://en.wikipedia.org/wiki/Donald_Knuth
foofoo12 | 5 months ago
"then shall they call upon me, but I will not cause any information to be accumulated on the stack."
glouwbug | 5 months ago
https://github.com/Heatwave/the-practice-of-programming/blob...
jcynix | 5 months ago
https://en.wikipedia.org/wiki/Mark_V._Shaney
y7r4m | 5 months ago
0: https://www.genudi.com/about
anthk | 5 months ago
http://c9x.me/cbits/
dugmartin | 5 months ago
https://codap.concord.org/app/static/dg/en/cert/index.html?d...
vunderba | 5 months ago
[1] https://en.wikipedia.org/wiki/Loebner_Prize
pessimizer | 5 months ago
You could load it up with anything, and it would just babble on.
fair_enough | 5 months ago
https://www.elsewhere.org/journal/pomo/
https://www.elsewhere.org/journal/pomo/?1298795365
westurner | 5 months ago
Examples of Markov chains: https://en.wikipedia.org/wiki/Examples_of_Markov_chains
A Hopfield network is an RNN (Recurrent Neural Network).
Hopfield network: https://en.wikipedia.org/wiki/Hopfield_network
From "Ask HN: Parameter-free neural network models: Limits, Challenges, Opportunities?" (2024) https://news.ycombinator.com/item?id=41794272 re: neural network topologies :
> The Asimov Institute > The Neural Network Zoo: https://www.asimovinstitute.org/neural-network-zoo
> PNG: "A mostly complete chart of neural networks" (2019) includes Hopfield nets! https://www.asimovinstitute.org/wp-content/uploads/2019/04/N...
> [ Category:Neural network architectures, Types of artificial neural networks ]