This is an excellent tool for understanding how an LLM actually works, from the ground up!
For those reading it and going through each step: if by chance you get stuck on why there are 48 elements in the first array, please refer to model.py in minGPT [1].
It's an architectural decision that would be great to mention in the article, since people without much context might miss it.
[1] https://github.com/karpathy/minGPT/blob/master/mingpt/model....
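If it helps, the 48 comes from the model config rather than anything fundamental. A rough sketch of the gpt-nano sizing as I read it from minGPT's model.py (values copied by hand and the dict name is mine, so double-check against the source):

```python
# gpt-nano's sizing as defined in minGPT's model.py (copied by hand;
# the dict name here is mine, not minGPT's).
gpt_nano = {"n_layer": 3, "n_head": 3, "n_embd": 48}

# n_embd is why the first array has 48 elements: every token is embedded
# into a 48-dimensional vector before it enters the transformer blocks.
embedding_width = gpt_nano["n_embd"]

# Each attention head then works on an equal slice of those 48 channels.
head_size = embedding_width // gpt_nano["n_head"]
print(embedding_width, head_size)  # 48 16
```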
Wow, I love the interactive whizzing around and the animation, very neat! Way more explanations should work like this.
I've recently finished an unorthodox kind of visualization / explanation of transformers. It's sadly not interactive, but it does have some maybe unique strengths.
First, it gives array axes semantic names, represented in the diagrams as colors (which this post also uses). So the sequence axis is red, the key feature dimension is green, the multihead axis is orange, etc. This helps you show quite complicated array circuits and get an immediate feeling for what is going on and how different arrays are being combined with each other. Here's a pic of the full multihead self-attention step, for example:
https://math.tali.link/raster/052n01bav6yvz_1smxhkus2qrik_07...
It also uses a kind of generalized tensor-network diagrammatic notation -- if anyone remembers Penrose's tensor notation, it's like that but enriched with colors and some other ideas. Underneath, these diagrams are string diagrams in a particular category, though you don't need to know that (nor do I even explain it!).
Here's the main blog post introducing the formalism: https://math.tali.link/rainbow-array-algebra
Here's the section on perceptrons: https://math.tali.link/rainbow-array-algebra/#neural-network...
Here's the section on transformers: https://math.tali.link/rainbow-array-algebra/#transformers
Are you referring specifically to line 141, which sets the number of embedding elements for gpt-nano to 48? That also seems to correspond to the Channel size C referenced in the explanation text?
The visualization I've been looking for for months. I would have happily paid serious money for this... the fact that it's free is such a gift and I don't take it for granted.
Andrej Karpathy twisting his hands as he explains it is also a great device. Not being sarcastic: when he explains it, I understand it for a good minute or two. Then I need to rewatch it as I forget (but that is just me)!
Could just as well be titled 'dissecting magic into matmuls and dot products for dummies'. Great stuff. I went away even more amazed that LLMs work as well as they do.
Another visualization I would really love would be a clickable circular set of possible prediction branches, projected onto a Poincaré disk (to handle the exponential branching component of it all). It would take forever to calculate except on smaller models, but being able to visualize branch probabilities angularly for the top n values or whatever, and to go forwards and backwards up and down different branches, would likely yield some important insights into how they work.
Good visualization precedes good discoveries in many branches of science, I think.
(see my profile for a longer, potentially more silly description ;) )
This is really awesome, but I at least wish there were a few added sentences on how I'm supposed to intuitively think about why it's set up like this. For example, I see a T x C matrix of 6 x 48... but at this step, before it's fed into the net, what is it supposed to represent?
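For what it's worth, my understanding is that the T x C array at that point is just per-token vectors: each of the T = 6 input tokens gets a learned "what am I" embedding plus a learned "where am I" embedding, summed. A toy numpy sketch (all names and random values are mine, not the article's):

```python
import numpy as np

T, C = 6, 48          # T tokens in the context, C channels per token
vocab_size = 3        # gpt-nano's demo vocabulary is tiny

rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, C))  # learned token-embedding table
wpe = rng.normal(size=(T, C))           # learned position-embedding table

tokens = np.array([0, 1, 2, 2, 0, 1])   # the 6 input token ids
# Each row mixes "which token this is" with "where it sits in the sequence".
x = wte[tokens] + wpe[np.arange(T)]
print(x.shape)  # (6, 48)
```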
A lot of transformer explanations fail to mention what makes self attention so powerful.
Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers.
In case it’s confusing for anyone to see “weight” as a verb and a noun so close together, there are indeed two different things going on:
1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.
2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.
They are both typically 32-bit floats in case you’re curious but still different concepts.
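To make the distinction concrete, here's a toy numpy sketch of one attention head (all names are mine): W_q / W_k / W_v are model weights that persist after training, while `attn` is the attention-weight matrix that gets recomputed for every input and then thrown away:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 6, 48

# 1. Model weights: learned during training, persist afterwards.
W_q = rng.normal(size=(C, C))
W_k = rng.normal(size=(C, C))
W_v = rng.normal(size=(C, C))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(T, C))  # input embeddings for this particular prompt

# 2. Attention weights: computed fresh from this input, used, discarded.
q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = softmax(q @ k.T / np.sqrt(C))  # (T, T); each row sums to 1
out = attn @ v                        # context-weighted mix of the values
```

Change `x` and `attn` changes with it; change nothing and retrain, and it's W_q / W_k / W_v that move instead.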
None of this seems obvious just reading the original Attention is all you need paper. Is there a more in-depth explanation of how this adaptive weighting works?
Just to add on: a good way to learn these terms is to look at the history of neural networks rather than looking at the transformer architecture in a vacuum.
This [1] post from 2021 goes over attention mechanisms as applied to RNN / LSTM networks. It's visual and goes into a bit more detail, and I've personally found RNN / LSTM networks easier to understand intuitively.
[1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-at...
But for me it's really broken, haha:
1) When you zoom, the cursor doesn't stay in the same position relative to some projected point.
2) Panning also doesn't pin the cursor to a projected point; there's just a hacky multiplier there based on zoom.
The main issue is that I'm storing the view state as target (on 2D plane) + Euler angles + distance. That's easy to think about, but issues 1 & 2 are better solved by manipulating a 4x4 view matrix. So I'd just need a matrix -> target-vector-pair conversion to get that working.
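For the 2D pan/zoom analogue of issue 1, the usual trick is to solve for the new target so the world point under the cursor stays put on screen. A sketch with my own naming, not llm-viz's actual code:

```python
def zoom_about_cursor(target, scale, cursor_world, zoom_factor):
    """Zoom while keeping cursor_world at the same screen position.

    target: world point at the center of the view, as (x, y)
    scale:  world units per screen unit (bigger = zoomed out)
    zoom_factor > 1 zooms in, < 1 zooms out.
    """
    new_scale = scale / zoom_factor
    # Screen position of p is (p - target) / scale. Requiring it to be
    # unchanged for p = cursor_world gives: target' = p + (target - p) / f.
    tx = cursor_world[0] + (target[0] - cursor_world[0]) / zoom_factor
    ty = cursor_world[1] + (target[1] - cursor_world[1]) / zoom_factor
    return (tx, ty), new_scale
```

The same constraint works in 3D by intersecting the cursor ray with the ground plane and solving for the new target there.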
This looks pretty cool! Anyone know of visualizations for simpler neural networks? I'm aware of TensorFlow Playground, but that's just for a toy example; is there anything for visualizing a real example (e.g. handwriting recognition)?
Rather than looking at the visuals of this network, it is better to focus on the actual problem with these LLMs, which the author has already shown:
Within the transformer section:
> As is common in deep learning, it's hard to say exactly what each of these layers is doing, but we have some general ideas: the earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.
That is the problem: these black boxes are about as explainable as a magic scroll.
For decades we’ve puzzled over how the inner workings of the brain work, and though we’ve learned a lot, we still don’t fully understand it. So, we figure, we’ll just make an artificial brain and THEN we’ll be able to figure it out.
And here we are, finally a big step closer to an artificial brain and once again, we don’t know how it works :)
(Although to be fair, we’re spending all of our efforts on making the models better and better, not on learning their low-level behaviors. Thankfully, when we decide to study them it’ll be a wee bit less invasive and actually doable, in theory.)
This is a great visualization, because the original paper on transformers is not very clear or understandable; I tried to read it first and didn't understand it, so I had to look for other explanations (for example, it was unclear to me how multiple tokens are handled).
Also, speaking about transformers: they usually append their output tokens to input and process them again. Can we optimize it, so that we don't need to do the same calculations with same input tokens?
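For what it's worth, this is what the standard KV cache does during decoding: the key/value vectors of already-processed tokens are stored and reused, so each step only pushes the newest token through the weight matrices. A toy numpy sketch of the idea (my names, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 48
W_k = rng.normal(size=(C, C))  # key projection (a model weight)
W_v = rng.normal(size=(C, C))  # value projection (a model weight)

k_cache, v_cache = [], []  # keys/values for all previously seen tokens

def step(new_token_embedding):
    # Only the new token's key and value are computed; the earlier
    # tokens' entries are reused from the cache, not recomputed.
    k_cache.append(new_token_embedding @ W_k)
    v_cache.append(new_token_embedding @ W_v)
    return np.stack(k_cache), np.stack(v_cache)  # full K, V for attention

for _ in range(3):
    K, V = step(rng.normal(size=C))
print(K.shape)  # (3, 48)
```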
This is a phenomenal visualisation. I wish I saw this when I was trying to wrap my head around transformers a while ago. This would have made it so much easier.
Same here. I blame the popularity of Next.js. More and more of the web is slowly becoming more broken on Firefox on Linux, all with the same tired error: "Application error: a client-side exception has occurred"
This shows how the individual weights and vectors work, but unless I’m missing something it doesn’t quite illustrate how higher-order vectors are created at the sentence and paragraph level. That might be an emergent property within this system, though, so it’s hard to “illustrate”. How all of this ends up as a world simulation needs to be understood better, and I hope this advances further.
I've wondered for a while whether, as LLM usage matures, there will be an effort to optimize hotspots like what happened with VMs, or auto-indexing like in relational DBs. I'm sure there are common data paths which get more usage, and which could somehow be prioritized, either through pre-processing or dynamically, helping speed up inference.
This does an amazing job of showing the difference in complexity between the different models. Click on GPT-3 and you should be able to see all 4 models side-by-side. GPT-3 is a monster compared to nano-gpt.
Very cool. The explanations of what each part is doing are really insightful.
And I especially like how the scale jumps when you move from e.g. Nano all the way to GPT-3...
Honestly, reading the PyTorch implementation of minGPT is a lot more informative than an inscrutable 3D rendering. It's a well-commented and pedagogical implementation. I applaud the intention, and it looks slick, but I'm not sure it really conveys information in an efficient way.
I feel like visualizations like this are what is missing from university curricula. Now imagine a professor going through each animation, describing exactly what is happening; I am pretty sure students would get a much more in-depth understanding!
Isn't it amazing that a random person on the internet can produce free educational content that trumps university courses? With all the resources and expertise that universities have, why do they get shown up all the time? Do they just not know how to educate?
mark_l_watson|2 years ago
Really nice stuff.
itslennysfault|2 years ago
Since X now hides replies for non-logged-in users, here is a nitter link for those without an account (like me) who might want to see the full thread.
https://nitter.net/BrendanBycroft/status/1731042957149827140
29athrowaway|2 years ago
Not only is it a visualization: it's interactive, has explanations for each item, has excellent performance, and is open source: https://github.com/bbycroft/llm-viz/blob/main/src/llm
Another interesting visualization related thing: https://github.com/shap/shap
skadamat|2 years ago
Wrote about it here: https://about.xethub.com/blog/visualizing-ml-models-github-n...
thierrydamiba|2 years ago
What an exciting time to be learning about LLMs. Every day I come across a new resource, and everything is free!
airesQ|2 years ago
So much depth; initially I thought it was "just" a 3D model. The animations are amazing.
bbycroft|2 years ago
I wasn't expecting it to get quite this popular, so I hadn't handled this (rather major) edge case.
altilunium|2 years ago
Check here: https://get.webgl.org/webgl2/
RecycledEle|2 years ago
This is why I love Hacker News!
beckingz|2 years ago
So Black Magic!