roadside_picnic|1 month ago
If you're interested in machine learning at all and not very strong on kernel methods, I highly recommend taking a deep dive. A huge amount of ML can be framed through the lens of kernel methods (and things like Gaussian processes become much easier to understand).
0. https://web.archive.org/web/20250820184917/http://bactra.org...
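For a taste of why the kernel lens is useful, here's a tiny numpy sketch of kernel ridge regression (the hyperparameters and toy data are arbitrary, just for illustration). The prediction is nothing but a kernel-weighted combination of training points, the same basic shape that reappears in Gaussian processes and attention.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam*I) alpha = y for the dual coefficients."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Prediction is a kernel-weighted sum over training points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# fit a noisy sine on [0, 2*pi]
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(50, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y)
pred = kernel_ridge_predict(X, alpha, X)
```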
libraryofbabel|1 month ago
I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?
Atheb|1 month ago
Justin Johnson's lecture on attention mechanisms [1] really helped me understand the concept of attention in transformers. In the lecture he goes through the history and iterations of attention mechanisms, from CNNs and RNNs to Transformers, while keeping the notation coherent, and you get to see how and when the QKV matrices appear in the literature. It's an hour long, but IMO it's a must-watch for anyone interested in the topic.
[1]: https://www.youtube.com/watch?v=YAgjfMR9R_M
vatsachak|1 month ago
They derive Q, K, V as a continuous analog of a Hopfield network.
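For anyone curious, the connection (popularized by the "Hopfield Networks is All You Need" line of work) is roughly this: the modern continuous Hopfield update is a softmax-weighted retrieval over stored patterns, and with Q = query state and K = V = patterns, one update step has the same form as an attention step with a temperature. A toy numpy sketch (beta and sizes are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hopfield_retrieve(patterns, query, beta=8.0, steps=3):
    """Modern (continuous) Hopfield update: the query state is
    repeatedly replaced by a softmax-weighted sum of stored patterns.
    With Q = query, K = V = patterns, one step is
    softmax(beta * Q K^T) V, i.e. attention with a temperature."""
    x = query
    for _ in range(steps):
        x = softmax(beta * x @ patterns.T) @ patterns
    return x

rng = np.random.default_rng(1)
patterns = rng.normal(size=(5, 16))              # 5 stored patterns
noisy = patterns[2] + 0.3 * rng.normal(size=16)  # corrupted copy of pattern 2
restored = hopfield_retrieve(patterns, noisy)    # snaps back to pattern 2
```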
ACCount37|1 month ago
The neat chain of "this is how the math of it works" is constructed after the fact, once you've dialed something in and proven that it works. If ever.
mbeex|1 month ago
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
LudwigNagasena|1 month ago
And none of them are a reinvention of kernel methods. There is such a huge gap between the Nadaraya-Watson idea and a working attention model that calling it a reinvention is quite a reach.
One might as well say that neural networks trained with gradient descent are a reinvention of numerical methods for function approximation.
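For reference, here is the formal resemblance under debate as a numpy sketch (bandwidth, shapes, and toy data are made up): the Nadaraya-Watson estimator and scaled dot-product attention are both softmax-weighted averages; the gap is everything else (learned projections, multiple heads, training at scale).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nadaraya_watson(x_query, x_train, y_train, h=0.5):
    """Kernel regression: a weighted average of y, with weights from a
    Gaussian kernel on distances -- i.e. softmax of -d^2/(2h^2)."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = softmax(-d2 / (2 * h * h))
    return w @ y_train

def attention(Q, K, V):
    """Scaled dot-product attention: also a softmax-weighted average of V,
    but the 'kernel' is a dot product of learned projections."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

# Nadaraya-Watson recovers a smooth function from samples
x_train = np.linspace(0, 2 * np.pi, 40)
y_train = np.sin(x_train)
y_hat = nadaraya_watson(np.array([1.0, 2.0]), x_train, y_train, h=0.3)

# attention has the same weighted-average shape, different similarity
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
```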
roadside_picnic|1 month ago
I don't know anyone who would disagree with that statement, and this is the standard framing I've encountered in nearly all neural network literature and courses. If you read any of the classic gradient-based papers, they fundamentally assume this position. Just take a quick read of "A Theoretical Framework for Back-Propagation (LeCun, 1988)" [0]; here's a quote from the abstract:
> We present a mathematical framework for studying back-propagation based on the Lagrangian formalism. In this framework, inspired by optimal control theory, back-propagation is formulated as an optimization problem with nonlinear constraints.
There's no way you can read that and not recognize that you're reading a paper on numerical methods for function approximation.
The issue is that Vaswani, et al never mentions this relationship.
0. http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf
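To make that framing concrete, here's a minimal sketch (not from the paper): fitting a linear function by gradient descent on squared error is textbook numerical function approximation.

```python
import numpy as np

# Fit f(x) = w*x + b to noisy data by minimizing mean squared error with
# gradient descent -- a numerical method for function approximation.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x - 0.5 + 0.01 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y           # residual of current approximation
    w -= lr * 2 * np.mean(err * x)  # d/dw of mean squared error
    b -= lr * 2 * np.mean(err)      # d/db of mean squared error
```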
lambdaone|1 month ago
https://web.archive.org/web/20230713101725/http://bactra.org...
niemandhier|1 month ago
Things proven for one domain can then be pulled back to the other domain along the arrows of duality connections.
lugu|1 month ago
Surprisingly, reading this piece helped me better understand the query/key metaphor.
AlexCoventry|1 month ago
The Free Transformer: https://arxiv.org/abs/2510.17558
Abstract: We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
revision17|1 month ago
I think datasets with lots of samples are very common (such as the huge text corpora LLMs train on). In my experience, most project datasets tend to be on the larger side (10k+ samples).
donnietb|1 month ago
From the paper (where additive attention is the other "similarity function"):
> Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
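A toy numpy sketch of the two compatibility functions being compared (shapes and initializations are arbitrary): dot-product attention scores are one matmul, while additive attention runs a one-hidden-layer feed-forward net over every query-key pair.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Dot-product attention: the whole score matrix is a single matmul,
# which maps onto highly optimized matrix-multiplication kernels.
dot_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# Additive (Bahdanau-style) attention: compatibility is a one-hidden-layer
# feed-forward net, score(q, k) = v^T tanh(Wq q + Wk k), computed pairwise.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
hidden = np.tanh((Q @ Wq.T)[:, None, :] + (K @ Wk.T)[None, :, :])  # (n, n, d)
add_out = softmax(hidden @ v) @ V
```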
auntienomen|1 month ago
Y'all should read this, and make sure you read to the end. The last paragraph is priceless.