top | item 45304980

Markov chains are the original language models

454 points | chilipepperhott | 5 months ago | elijahpotter.dev | reply

174 comments

[+] AnotherGoodName|5 months ago|reply
The problem is the linear nature of Markov chains. Sure, they can branch, but after an observation you are absolutely at a new state: A goes to B goes to C, etc. A classic problem for understanding why this is an issue is feeding in a 2D bitmap where the patterns are vertical but you're passing in the data left to right. Markov chains can't handle this, since they navigate exclusively on the current left-to-right input; they miss the patterns completely. Similar things happen with language. Language is not linear, and context from a few sentences ago should change the probabilities in the current sequence of characters. The attention mechanism is the best we have for this, and Markov chains struggle beyond stringing together a few syllables.

I have played with Markov chains a lot. I tried having skip states and such but ultimately you’re always pushed towards doing something similar to the attention mechanism to handle context.
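For readers who haven't played with one, a minimal character-level chain (a generic sketch, not the commenter's code) makes the "absolutely at a new state" behavior concrete — the next character depends only on the last few characters, nothing earlier:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-character state to the characters observed after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        state = text[i:i + order]
        chain[state].append(text[i + order])
    return chain

def generate(chain, state, length=40):
    """Walk the chain: each step depends only on the current state."""
    out = state
    for _ in range(length):
        options = chain.get(state)
        if not options:
            break
        out += random.choice(options)
        state = out[-len(state):]
    return out
```

Everything before the current `order`-character window is invisible to the model, which is exactly the limitation being described.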

[+] t_mann|5 months ago|reply
You can model multiple-hop dependencies as a Markov chain by just blowing up the state space as a Cartesian product. Not that that would necessarily make sense in practice, but in theory Markov chains have enormous expressive power.
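A hypothetical sketch of that blow-up: an order-2 dependency over words, recast as a first-order chain whose state space is the Cartesian product of word pairs:

```python
import random
from collections import defaultdict

def build_order2_as_first_order(words):
    """An order-2 chain over words, recast as a first-order chain
    whose states are (word, word) pairs from the Cartesian product."""
    chain = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        chain[(a, b)].append(c)  # transition: state (a, b) -> state (b, c)
    return chain

def step(chain, state):
    """One first-order transition; the next state depends only on the current pair."""
    nxt = random.choice(chain[state])
    return (state[1], nxt)
```

The Markov property holds over the pair-states, but the state count squares with each extra hop of context — which is the "wouldn't necessarily make sense in practice" part.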
[+] roadside_picnic|5 months ago|reply
> after an observation you are absolutely at a new state

The essential property of a Markov chain is maintaining the Markov property:

P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)

That is: the future, given the present state, is conditionally independent of the past states.

It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.

> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.

Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.

tl;dr most of the LLMs people use are effectively Markov Chains.

[+] nadalishhh|5 months ago|reply
Your 2D bitmap example perfectly illustrates the fundamental limitation! The exponential state explosion you encountered (needing 2^32 states for patterns separated by randomness) is precisely why classical Markov chains became intractable for complex dependencies.

What's fascinating is that transformers with attention don't actually escape the Markov property - they're mathematically equivalent to very high-order Markov chains where the entire context window forms the "state." The breakthrough wasn't abandoning Markov chains, but finding a parameterized way to approximate these massive state spaces through learned representations rather than explicit enumeration.

Your observation about inevitably trending toward attention-like mechanisms is spot-on. The attention mechanism essentially provides a tractable approximation to the astronomically large transition matrices that would be required for a classical Markov chain to capture long-range dependencies. It's a more elegant solution to the same fundamental problem you were solving with skip states.


[+] 6r17|5 months ago|reply
Would you say it's interesting to explore, after spending so much time on them? Do you feel one could make pragmatic use of them in certain contexts, or are they too much of a toy, where most of the time using a service / coherent LLM would ease the work?
[+] cuttothechase|5 months ago|reply
Would having a Markov chain of Markov chains help in this situation? One chain for when the 2D bitmap patterns are vertical, and another for left to right?
[+] felineflock|5 months ago|reply
Have you tried to sequence the 2D bitmap in a Hilbert space-filling curve before applying the Markov chain?
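For anyone curious, a sketch of that idea (the standard Hilbert-curve d2xy construction, not tied to any code in the thread): linearize the bitmap along the curve so that vertically adjacent pixels tend to land near each other in the 1D stream the chain sees:

```python
def d2xy(n, d):
    """Convert index d along a Hilbert curve into (x, y) on an n x n grid
    (n a power of two); standard iterative construction."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so segments join up
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_sequence(bitmap):
    """Flatten an n x n bitmap into 1D along the Hilbert curve."""
    n = len(bitmap)
    return [bitmap[y][x] for x, y in (d2xy(n, d) for d in range(n * n))]
```

Each step of the curve moves to a neighboring cell, so 2D locality survives the flattening far better than a plain left-to-right raster scan.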
[+] vrighter|5 months ago|reply
Yeah, but that state can be as big as you want. Say, for example, the state could be a vector of a million tokens...
[+] chankstein38|5 months ago|reply
I once, probably 4-6 years ago, used exports from Slack conversations to train a Markov Chain to recreate a user that was around a lot and then left for a while. I wrote the whole thing in python and wasn't overly well-versed in the statistics and math side but understood the principle. I made a bot and had it join the Slack instance that I administrate and it would interact if you tagged it or if you said things that person always responded to (hardcoded).

Well, the responses were pretty messed up and not accurate but we all got a good chuckle watching the bot sometimes actually sound like the person amidst a mass of random other things that person always said jumbled together :D

[+] vunderba|5 months ago|reply
I had a similar program designed as my "AWAY" bot that was trained on transcripts of my previous conversations and connected to Skype. At the time (2009) I was living in Taiwan so I would activate it when I went to bed to chat with my friends back in the States given the ~12 hour time difference. Reading back some of the transcripts made it sound like I was on the verge of a psychotic break though.
[+] emmelaich|5 months ago|reply
I'm sure that Gilfoyle's prank in Silicon Valley had its basis in a true story, like most of SV's episodes. i.e. I'm sure this has been done quite a few times ever since Mark V. Shaney.
[+] dfsegoat|5 months ago|reply
Related:

KingJamesProgramming [1] - a mashup of Knuth books [2] and the King James Bible with a Markov chain, is still one of my favorite reads for a laugh.

It was also the first place I was really exposed to probabilistic-based methods for generation of novel content.

You get gems like:

"they smote the city with the edge of the sword. 22:20 And one of his main motivations was the high cost of proprietary software houses."

and

"37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b."

[1] https://www.tumblr.com/kingjamesprogramming

[2] https://en.wikipedia.org/wiki/Donald_Knuth

[+] foofoo12|5 months ago|reply
The latest post on KingJamesProgramming is hilarious:

"then shall they call upon me, but I will not cause any information to be accumulated on the stack."

[+] svat|5 months ago|reply
Just stating for clarity: it is a mashup of the King James Bible and the book Structure and Interpretation of Computer Programs, which is not by Knuth (it is by Harold Abelson and Gerald Jay Sussman with Julie Sussman).
[+] ahmedhawas123|5 months ago|reply
Random tidbit: 15+ years ago, Markov chains were the go-to for auto-generating text. Google was not as advanced as it is today at flagging spam, so search engine results pages for affiliate-marketing-dense topics (e.g., certain medications, products) were swamped with Markov-chain-created websites injected with certain keywords.
[+] gattilorenz|5 months ago|reply
For auto-generating nonsensical texts and SEO spam, yes; for actual text generation, planners and realizers were the go-to approach (and ~5 years later it was all LSTMs).
[+] glouwbug|5 months ago|reply
The Practice of Programming by Kernighan and Pike had a really elegant Markov:

https://github.com/Heatwave/the-practice-of-programming/blob...

[+] y7r4m|5 months ago|reply
About a decade ago I somehow came across Genudi[0], a markov chain based "AI", and had quite a bit of fun with it. The creator has a blog that makes for an interesting reading session.

0: https://www.genudi.com/about

[+] fluxusars|5 months ago|reply
Can't believe no one's mentioned this classic yet: https://www.tumblr.com/kingjamesprogramming

Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.

[+] airstrike|5 months ago|reply
LMAO thanks for that link, it's hilarious
[+] calf|5 months ago|reply
So I think this "talking point" of "LLMs are just Markov chains" is as bad, analogously, as "your amazing invention that passes the Turing Test is just a giant LUT", or "your C++ program is just an FSM". It is all technically true, but it is being used subtly to dismiss/delegitimize something instead of communicating a scientifically insightful idea. How is this different from "humans are just collections of cells"? True, except we are a very interesting type of collection! LLMs are a very interesting type of Markov chain! Haskell/Java/Ruby programs are a very interesting type of FSM! Etc. The talking point by itself is just reductionism, of which we've seen and heard variants during other scientific revolutions.

Also, this Markov chain has size O(T^K), exponential in the token length K of the context window, which makes it a beyond astronomically large object.

[+] andai|5 months ago|reply
This is most apparent with older models like GPT-2, and more generally base ("text") models like Davinci. They are essentially only trained to predict the next word, based on what they've seen on the internet.

So they're a really fun way to try and "sample" the internet, by seeing the kind of outputs they produce.

--

A really fun thing I did with GPT-2 with a friend, back when it came out, is that we fine-tuned the model on our WhatsApp conversation together.

(The model was small enough to do this on mid range GPU, in a few hours.)

So GPT-2 learned to simulate chats between us.

It was hilarious, but also weirdly illuminating.

My friend mostly talked about art (true), and didn't say much (true), and I rambled about how I finally figured out the One Weird Trick that's finally gonna turn my life around (...also true)...

The text itself was largely incoherent, but the "feel" (statistical properties?) was spot on.

[+] whycombinetor|5 months ago|reply
Modern LLMs are also "essentially only trained to predict the next word, based on what they've seen on the internet."
[+] taolson|5 months ago|reply
If you program a Markov chain to generate based upon a fairly short sequence length (4-5 characters), it can create some neat portmanteaus. I remember back in the early '90s I trained one on some typical tech literature and it came up with the word "marketecture".
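A hedged sketch of that kind of generator (hypothetical function names; short 4-character states, with start/end markers so coinages begin and end like real words):

```python
import random
from collections import defaultdict

def word_chain(words, order=4):
    """Character-level chain with a short state, built from whole words.
    '^' and '$' mark word boundaries."""
    chain = defaultdict(list)
    for w in words:
        w = "^" + w + "$"
        for i in range(len(w) - order):
            chain[w[i:i + order]].append(w[i + order])
    return chain

def coin_word(chain, order=4, limit=20):
    """Follow the chain from a word start until it emits the end marker."""
    state = random.choice([s for s in chain if s.startswith("^")])
    out = state
    while not out.endswith("$") and len(out) < limit:
        options = chain.get(out[-order:])
        if not options:
            break
        out += random.choice(options)
    return out.strip("^$")
```

Because the state is only a few characters, the chain happily hops from one source word into another wherever their spellings overlap — which is exactly where blends like "marketecture" come from.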
[+] ronsor|5 months ago|reply
That sounds like a corporate buzzword generator
[+] dugmartin|5 months ago|reply
If you would like to visually play with a Markov chain model, my non-profit research org built a Markov plugin for CODAP, our data analysis tool:

https://codap.concord.org/app/static/dg/en/cert/index.html?d...

If you select drawing mode from the first screen shown and then click the "T" icon you can type some text and it will generate a state diagram for you that you can then "play" and examine the output sequences. If you have states that have multiple exit routes you can click on that state and adjust the probabilities of each option.

[+] mwcz|5 months ago|reply
This is great. I use Markov chains to describe to normies how LLMs work.

It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.

> In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, Dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.

From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.

[+] lou1306|5 months ago|reply
This is amazing, in no small part due to the fact that Krankheit literally means "disease" in German.
[+] allthatineed|5 months ago|reply
I remember playing with megahal eggdrop bots.

This was one of my first forays into modifying C code, trying to figure out why 350 MB seemed to be the biggest brain size (32-bit memory limits and requiring a contiguous block for the entire brain).

I miss the innocence of those days. Just being a teen, tinkering with things I didn't understand.

[+] vunderba|5 months ago|reply
I remember reading the source of the original MegaHAL program when I was younger - one of the tricks that made it stand out (particularly in the Loebner competitions [1]) was that it used both a backwards and forwards Markov chain to generate responses.

[1] https://en.wikipedia.org/wiki/Loebner_Prize
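A rough sketch of that bidirectional trick (not MegaHAL's actual code): build forward and backward word-level chains, then grow a reply outward from a keyword in both directions:

```python
import random
from collections import defaultdict

def bidirectional_chains(sentences):
    """Word-level forward and backward chains, MegaHAL-style."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            fwd[a].append(b)
            bwd[b].append(a)
    return fwd, bwd

def reply_around(keyword, fwd, bwd, limit=10):
    """Grow a reply outward from a keyword: backward toward a sentence
    start, forward toward a sentence end."""
    left = [keyword]
    while left[0] in bwd and len(left) < limit:
        left.insert(0, random.choice(bwd[left[0]]))
    right = [keyword]
    while right[-1] in fwd and len(right) < limit:
        right.append(random.choice(fwd[right[-1]]))
    return " ".join(left[:-1] + right)
```

The point of the backward chain is that a reply can be anchored on an interesting keyword from the user's input, rather than always having to start from the beginning of a sentence.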

[+] foobarian|5 months ago|reply
I'm old now, but thanks to LLMs I can now again tinker with things I don't understand :-)
[+] anthk|5 months ago|reply
Hailo is still alive on CPAN.
[+] benob|5 months ago|reply
Like it or not, LLMs are effectively high-order Markov chains
[+] fair_enough|5 months ago|reply
On a humorous note OP, you seem like exactly the kind of guy who would get a kick out of this postmodern essay generator that a STEM professor wrote using a Recursive Transition Network in 1996:

https://www.elsewhere.org/journal/pomo/

Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):

"If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."

https://www.elsewhere.org/journal/pomo/?1298795365

[+] 1718627440|5 months ago|reply
I think that's how source code sounds to non-programmers: too far removed from the meaning to distinguish what makes sense from what is garbage.
[+] chilipepperhott|5 months ago|reply
That essay generator is pretty cool. There are definitely a lot of undergrads whose essays look about the same.
[+] cestith|5 months ago|reply
I’ve been telling people for years that a reasonably workable initial, simplified mental model of a large language model is a Markov chain generator with an unlimited, weighted corpus trained in. Very few people who know LLMs have critiqued that thought beyond saying it’s a coarse description that downplays the details. Since "simplified" is in the initial statement, and it’s not meant to capture detail, I say: if it walks like a really big duck and it honks instead of quacking, then maybe it’s a goose or a swan, which are both pretty duck-like birds.
[+] westurner|5 months ago|reply
Markov chain > History: https://en.wikipedia.org/wiki/Markov_chain#History

Examples of Markov chains: https://en.wikipedia.org/wiki/Examples_of_Markov_chains

A Hopfield network is an RNN (Recurrent Neural Network).

Hopfield network: https://en.wikipedia.org/wiki/Hopfield_network

From "Ask HN: Parameter-free neural network models: Limits, Challenges, Opportunities?" (2024) https://news.ycombinator.com/item?id=41794272 re: neural network topologies :

> The Asimov Institute > The Neural Network Zoo: https://www.asimovinstitute.org/neural-network-zoo

> PNG: "A mostly complete chart of neural networks" (2019), which includes Hopfield nets! https://www.asimovinstitute.org/wp-content/uploads/2019/04/N...

> [ Category:Neural network architectures, Types of artificial neural networks ]

[+] robotnikman|5 months ago|reply
I remember creating a chatbot for my minecraft server over a decade ago that used Markov chains. It was funny seeing the crazy things it spat out and the other players enjoyed it.
[+] Nevermark|5 months ago|reply
Nice little example.

Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.
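A guess at what that might look like in code (the transition table here is hypothetical, standing in for the article's fruit example; "STOP" is the added token):

```python
import random

# Hypothetical transition table with an explicit STOP token;
# generation halts when the chain itself emits STOP.
chain = {
    "START": ["I"],
    "I": ["like", "eat"],
    "like": ["apples", "pears"],
    "eat": ["apples"],
    "apples": ["STOP"],
    "pears": ["STOP"],
}

def run(chain):
    state, out = "START", []
    while True:
        state = random.choice(chain[state])
        if state == "STOP":  # the little model calls its own stop
            break
        out.append(state)
    return " ".join(out)
```

With STOP trained in as an ordinary token, the model decides its own utterance length instead of being cut off by an external loop counter.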

To generalize to ADHD, create another model that can quietly "listen" to a speaker (a token generator) until its Markov chain predicts a "stop", then start its own generating model running, regardless of whether the speaker was done.

Then two instances can have time efficient interrupty conversations with each other.