
The Illustrated Transformer

500 points | auraham | 2 months ago | jalammar.github.io

88 comments


libraryofbabel|2 months ago

I read this article back when I was learning the basics of transformers; the visualizations were really helpful. Although in retrospect knowing how a transformer works wasn't very useful at all in my day job applying LLMs, except as a sort of deep background for reassurance that I had some idea of how the big black box producing the tokens was put together, and to give me the mathematical basis for things like context size limitations etc.

I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques) that they have capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.

Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that I would highly recommend Sebastian Raschka's book and youtube videos as way into the subject: https://github.com/rasbt/LLMs-from-scratch .

Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?

crystal_revenge|2 months ago

> massive application of reinforcement learning techniques

So sad that "reinforcement learning" is another term whose meaning has been completely destroyed by uneducated hype around LLMs (very similar to "agents"). 5 years ago nobody familiar with RL would consider what these companies are doing as "reinforcement learning".

RLHF and similar techniques are much, much closer to traditional fine-tuning than they are to reinforcement learning. RL almost always, historically, assumes online training and interaction with an environment. RLHF is collecting data from users and using it to teach the LLM to be more engaging.

This fine-tuning also doesn't magically transform LLMs into something different, but it is largely responsible for their sycophantic behavior. RLHF makes LLMs more pleasing to humans (and of course can be exploited to help move the needle on benchmarks).

It's really unfortunate that people will throw away their knowledge of computing in order to maintain a belief that LLMs are something more than they are. LLMs are great, very useful, but they're not producing "nontrivial emergent phenomena". They're increasingly trained as products to increase engagement. I've found LLMs less useful in 2025 than in 2024. And the trend of people not opening them up under the hood and playing around with them to explore what they can do has basically made me leave the field (I used to work in AI-related research).

holtkam2|2 months ago

I agree and disagree. In my day job as an AI engineer I rarely if ever need to use any “classic” deep learning to get things done. However, I’m a firm believer that understanding the internals of an LLM can set you apart as a gen AI engineer, if you’re interested in becoming the top 1% in your field. There can and will be situations where your intuition about the constraints of your model is superior to that of peers who consider the LLM a black box. I had this advice given directly to me years ago, in person, by Clem Delangue of Hugging Face - I took it seriously and really doubled down on understanding the guts of LLMs. I think it’s served me well.

I’d give similar advice to any coding bootcamp grad: yes you can get far by just knowing python and React, but to reach the absolute peak of your potential and join the ranks of the very best in the world in your field, you’ll eventually want to dive deep into computer architecture and lower level languages. Knowing these deeply will help you apply your higher level code more effectively than your coding bootcamp classmates over the course of a career.

ozgung|2 months ago

I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there are no word-associated tokens inside a Transformer. Tokens != word parts. An LLM does not perform language processing inside the Transformer blocks, and a Vision Transformer does not perform image processing. Words and pixels are only relevant at the input. I think this misunderstanding was a root cause of underestimating their capabilities.

energy123|2 months ago

An example of why a basic understanding is helpful:

A common sentiment on HN is that LLMs generate too many comments in code.

But comment spam is going to help code quality, due to the way causal transformers and positional encoding work. The model has learned to dump locally-specific reasoning tokens where they're needed, in a tightly scoped cluster that can be attended to easily and forgotten about just as easily later on. It's like a disposable scratchpad that reduces the errors in the code it's about to write.

The solution to comment spam is textual/AST post-processing of the generated code, rather than prompting the LLM to handicap itself by not generating as many comments.
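As a rough illustration of that post-processing idea (my own sketch, not something from the thread): parsing generated Python with the standard-library ast module and unparsing it drops all # comments, so the model can dump as much scratchpad as it likes and you strip it afterwards. The sample snippet is made up, and ast.unparse needs Python 3.9+; note it also normalizes formatting, which may or may not be acceptable for your use case.

    import ast

    # hypothetical LLM output, comments and all
    generated = "def add(a, b):\n    # model scratchpad reasoning\n    return a + b  # trailing comment\n"

    # round-trip through the AST; comments are not part of the tree, so they vanish
    print(ast.unparse(ast.parse(generated)))  # -> def add(a, b):\n    return a + b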

DiscourseFan|2 months ago

Literally the exact thing I tell new hires on projects for training models: theory is far less important than practice.

We are only just beginning to understand how these things work. I imagine it will end up being similar to Freud’s Oedipal complex: when we failed to have a fully physical understanding of cognition, we employed a schematic narrative. Something similar is already emerging.

foobiekr|2 months ago

> would never be able to perform well on novel coding or mathematics tasks. We were wrong

I'm not clear at all we were wrong. A lot of the mathematics announcements have been rolled back and "novel coding" is exactly where the LLMs seem to fail on a daily basis - things that are genuinely not represented in the training set.

brcmthrowaway|2 months ago

How was reinforcement learning used as a gamechanger?

What happens to an LLM without reinforcement learning?

melagonster|2 months ago

Maybe the biggest benefit is that it lets people read the next new paper with enough background knowledge.

nrhrjrjrjtntbt|2 months ago

It is almost like understanding wood at a molecular level and being a carpenter. It may also help the carpentry, but you can be a great carpenter without it. And a bad one with the knowledge.

miki123211|2 months ago

> Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks.

I feel like there are three groups of people:

1. Those who think that LLMs are stupid slop-generating machines which couldn't ever possibly be of any use to anybody, because there's some problem that is simple for humans but hard for LLMs, which makes them unintelligent by definition.

2. Those who think we have already achieved AGI and don't need human programmers any more.

3. Those who believe LLMs will destroy the world in the next 5 years.

I feel like the composition of these three groups has been pretty much constant since the release of ChatGPT, and as with most political fights, evidence doesn't convince people either way.

boltzmann_|2 months ago

Kudos also to the Transformer Explainer team for putting together some amazing visualizations: https://poloclub.github.io/transformer-explainer/ It really clicked for me after reading these two and watching the 3blue1brown videos.

gzer0|2 months ago

This is hands down one of the best visualizations I have ever come across.

Koshkin|2 months ago

(Going on a tangent.) The number of transformer explanations/tutorials is becoming overwhelming. Reminds me of monads (or maybe calculus). Someone feels a spark of enlightenment at some point (while, often, in fact, remaining deeply confused), and an urge to share their newly acquired (mis)understanding with a wide audience.

kadushka|2 months ago

Maybe so, but this particular blog post was the first and is still the best explanation of how transformers work.

nospice|2 months ago

So?

There's no rule that the internet is limited to a single explanation. Find the one that clicks for you, ignore the rest. Whenever I'm trying to learn about concepts in mathematics, computer science, physics, or electronics, I often find that the first or the "canonical" explanation is hard for me to parse. I'm thankful for having options 2 through 10.

gustavoaca1997|2 months ago

I have this book. Really a lifesaver that helped me catch up a few months ago when my team decided to use LLMs in our systems.

qoez|2 months ago

Don't really see why you'd need to understand how the transformer works to do LLMs at work. An LLM is just a synthetic human performing reasoning, with failure modes that in-depth knowledge of the transformer internals won't help you predict (you just have to get a sense of them from experience with the output, or from other people's experiments).

ActorNightly|2 months ago

People need to get away from this idea of Key/Query/Value as being special.

Whereas a standard deep layer in a network is matrix * input, where each row of the matrix holds the weights of a particular neuron in the next layer, a transformer layer is basically input * MatrixA, input * MatrixB, input * MatrixC (each product is itself a matrix, since the input is a stack of token vectors), with the output built by combining those three products. It's simply more dimensions in a layer.

And consequently, you can represent the entire transformer architecture with a set of deep layers as you unroll the matrices, with a lot of zeros for the multiplication pieces that are not needed.

This is a fairly complex blog post, but it shows that it's just all matrix multiplication all the way down: https://pytorch.org/blog/inside-the-matrix/
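For anyone who wants to see that concretely, here is a minimal sketch (mine, not from the linked post) of single-head self-attention written as nothing but matrix products in NumPy; the shapes and weight names are illustrative assumptions:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, d_head = 4, 8, 8
    rng = np.random.default_rng(0)

    X = rng.normal(size=(seq_len, d_model))   # input: one row per token
    W_q = rng.normal(size=(d_model, d_head))  # the three learned matrices
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # three matrix products of the same input
    scores = Q @ K.T / np.sqrt(d_head)        # another matmul: token-vs-token scores
    out = softmax(scores) @ V                 # weighted sum of values, also a matmul
    print(out.shape)                          # (4, 8)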

throw310822|2 months ago

I might be completely off track, but I can't help thinking of convolutions as my mental model for the K Q V mechanism. Attention has the same property as a convolution kernel of being trained independently of position; it learns how to translate a large, rolling portion of an input to a new "digested" value; and you can train multiple ones in parallel so that they learn to focus on different aspects of the input ("kernels" in the case of convolution, "heads" in the case of attention).

prashant418|2 months ago

This guide is such a beast. Try pairing it with, say, Claude Code and asking it to generate sample mini PyTorch pseudo-code; you can spend hours just learning/re-learning and mentally visualizing a lot of these concepts. I am a big fan.

Simplita|2 months ago

Visual explanations like this make it clearer why models struggle once context balloons. In practice, breaking problems into explicit stages helped us more than just increasing context length.

zkmon|2 months ago

I think the internals of transformers will become less relevant, much like the internals of compilers, as programmers will only care about how to "use" them rather than how to develop them.

rvz|2 months ago

Their internals are just as relevant as (now even more relevant than) those of any other technology, since they always need to be improved to the SOTA (state of the art), meaning that someone has to understand them.

It also means more jobs for the people who understand them at a deeper level to advance the SOTA of specific widely used technologies such as operating systems, compilers, neural network architectures and hardware such as GPUs or TPU chips.

Someone has to maintain and improve them.

crystal_revenge|2 months ago

Have you written a compiler? I ask because for me writing a compiler was absolutely an inflection point in my journey as a programmer. Being able to look at code and reason about it all the way down to bytecode/IL/asm etc absolutely improved my skill as a programmer and ability to reason about software. For me this was the first time I felt like a real programmer.

esafak|2 months ago

Practitioners already do not need to know about it to run, let alone use, LLMs. I bet most don't even know the fundamentals of machine learning. Hands up if you know bias from variance...

edge17|2 months ago

Maybe I'm out of touch, but have transformers replaced all traditional deep learning architectures? (U-nets, etc)?

D-Machine|2 months ago

No, not at all. There is a transformer obsession that is quite possibly not supported by the actual facts (CNNs can still do just as well: https://arxiv.org/abs/2310.16764), and CNNs definitely remain preferable for smaller and more specialized tasks (e.g. computer vision on medical data).

If you also get into more robust and/or specialized tasks (e.g. rotation invariant computer vision models, graph neural networks, models working on point-cloud data, etc) then transformers are also not obviously the right choice at all (or even usable in the first place). So plenty of other useful architectures out there.

profsummergig|2 months ago

Haven't watched it yet...

...but, if you have favorite resources on understanding Q & K, please drop them in comments below...

(I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me).

Thank you in advance.

roadside_picnic|2 months ago

It's just a re-invention of kernel smoothing. Cosma Shalizi has an excellent write up on this [0].

Once you recognize this it's a wonderful re-framing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (through the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that Transformers are philosophically much closer to things like Gaussian Processes (which are also just a bunch of kernel manipulation).

0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
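To make the parallel concrete, here is a rough sketch (mine, following the framing in that note rather than quoting it): Nadaraya-Watson kernel smoothing and softmax attention both compute a normalized, kernel-weighted average of values; attention just uses an exponential kernel of dot products between a query and the keys. Names and shapes below are illustrative.

    import numpy as np

    def kernel_smooth(x_query, x_data, y_data, bandwidth=1.0):
        # classic kernel regression: Gaussian kernel between the query and each data
        # point, normalized, then used to average the observed y values
        w = np.exp(-0.5 * ((x_query - x_data) / bandwidth) ** 2)
        return (w / w.sum()) @ y_data

    def attention(q, K, V, scale):
        # softmax attention: exponential kernel exp(q . k / scale) between the query
        # and each key, normalized, then used to average the value vectors
        w = np.exp(q @ K.T / scale)
        return (w / w.sum()) @ V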

red2awn|2 months ago

Implement transformers yourself (i.e. in NumPy). You'll never truly understand it by just watching videos.

throw310822|2 months ago

Have you tried asking e.g. Claude to explain it to you? None of the usual resources worked for me, until I had a discussion with Claude where I could ask questions about everything that I didn't get.

machinationu|2 months ago

Q, K and V are a way of filtering the relevant aspects for the task at hand from the token embeddings.

"he was red" - maybe color, maybe angry, the "red" token embedding carries both, but only one aspect is relevant for some particular prompt.

https://ngrok.com/blog/prompt-caching/
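A toy sketch of that "filtering aspects" intuition (mine, not from the linked post): the same embedding carries several features, and different learned projections pull different ones out. The 4-dimensional embedding and the meaning of its slots are purely made up for illustration.

    import numpy as np

    # pretend embedding of "red": [colour-ness, anger-ness, size-ness, tense-ness]
    red = np.array([0.9, 0.7, 0.0, 0.1])

    W_colour = np.array([[1.0, 0.0, 0.0, 0.0]])   # query projection tuned to colour
    W_emotion = np.array([[0.0, 1.0, 0.0, 0.0]])  # query projection tuned to emotion

    print(W_colour @ red)   # [0.9] -> "red" read as a colour
    print(W_emotion @ red)  # [0.7] -> "red" read as anger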

bobbyschmidd|2 months ago

tldr: recursively aggregating packing/unpacking 'if else if (functions)/statements' as keyword arguments that (call)/take them themselves as arguments, with their own position shifting according to the number "(weights)" of else if (functions)/statements needed to get all the other arguments into (one of) THE adequate orders. the order changes based on the language, input prompt and context.

if I understand it all correctly.

implemented it in html a while ago and might do it in htmx sometime soon.

transformers are just slutty dictionaries that Papa Roach and kage bunshin no jutsu right away again and again, spawning clones and variations based on requirements, which is why they tend to repeat themselves rather quickly and often. it's got almost nothing to do with languages themselves and requirements and weights amount to playbooks and DEFCON levels