roadside_picnic|1 month ago
If you're interested in machine learning at all and not very strong on kernel methods, I highly recommend taking a deep dive. A huge amount of ML can be framed through the lens of kernel methods (and things like Gaussian processes become much easier to understand).
0. https://web.archive.org/web/20250820184917/http://bactra.org...
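For a taste of why the kernel lens is useful, here's a tiny numpy sketch of kernel ridge regression (the hyperparameters and toy data are arbitrary, just for illustration). The prediction is nothing but a kernel-weighted combination of training points, the same basic shape that reappears in Gaussian processes and attention.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam*I) alpha = y for the dual coefficients."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Prediction is a kernel-weighted sum over training points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# fit a noisy sine on [0, 2*pi]
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(50, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y)
pred = kernel_ridge_predict(X, alpha, X)
```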
libraryofbabel|1 month ago
I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?
Atheb|1 month ago
Justin Johnson's lecture on attention mechanisms [1] really helped me understand the concept of attention in transformers. In the lecture he goes through the history and iterations of attention mechanisms, from CNNs and RNNs to Transformers, while keeping the notation coherent, and you get to see how and when the QKV matrices appear in the literature. It's an hour long, but IMO it's a must-watch for anyone interested in the topic.
[1]: https://www.youtube.com/watch?v=YAgjfMR9R_M
vatsachak|1 month ago
They derive Q, K, V as a continuous analog of a Hopfield network.
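For anyone curious, the connection (popularized by the "Hopfield Networks is All You Need" line of work) is roughly this: the modern continuous Hopfield update is a softmax-weighted retrieval over stored patterns, and with Q = query state and K = V = patterns, one update step has the same form as an attention step with a temperature. A toy numpy sketch (beta and sizes are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hopfield_retrieve(patterns, query, beta=8.0, steps=3):
    """Modern (continuous) Hopfield update: the query state is
    repeatedly replaced by a softmax-weighted sum of stored patterns.
    With Q = query, K = V = patterns, one step is
    softmax(beta * Q K^T) V, i.e. attention with a temperature."""
    x = query
    for _ in range(steps):
        x = softmax(beta * x @ patterns.T) @ patterns
    return x

rng = np.random.default_rng(1)
patterns = rng.normal(size=(5, 16))              # 5 stored patterns
noisy = patterns[2] + 0.3 * rng.normal(size=16)  # corrupted copy of pattern 2
restored = hopfield_retrieve(patterns, noisy)    # snaps back to pattern 2
```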
ACCount37|1 month ago
The neat chain of "this is how the math of it works" is constructed after the fact, once you've dialed something in and proven that it works. If ever.
mbeex|1 month ago
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
LudwigNagasena|1 month ago
And none of them are a reinvention of kernel methods. There is such a huge gap between the Nadaraya-Watson idea and a working attention model that calling it a reinvention is quite a reach.
One might as well say that neural networks trained with gradient descent are a reinvention of numerical methods for function approximation.
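For reference, here is the formal resemblance under debate as a numpy sketch (bandwidth, shapes, and toy data are made up): the Nadaraya-Watson estimator and scaled dot-product attention are both softmax-weighted averages; the gap is everything else (learned projections, multiple heads, training at scale).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nadaraya_watson(x_query, x_train, y_train, h=0.5):
    """Kernel regression: a weighted average of y, with weights from a
    Gaussian kernel on distances -- i.e. softmax of -d^2/(2h^2)."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = softmax(-d2 / (2 * h * h))
    return w @ y_train

def attention(Q, K, V):
    """Scaled dot-product attention: also a softmax-weighted average of V,
    but the 'kernel' is a dot product of learned projections."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

# Nadaraya-Watson recovers a smooth function from samples
x_train = np.linspace(0, 2 * np.pi, 40)
y_train = np.sin(x_train)
y_hat = nadaraya_watson(np.array([1.0, 2.0]), x_train, y_train, h=0.3)

# attention has the same weighted-average shape, different similarity
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
```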
roadside_picnic|1 month ago
I don't know anyone who would disagree with that statement, and this is the standard framing I've encountered in nearly all neural network literature and courses. If you read any of the classic gradient-based papers, they fundamentally assume this position. Just take a quick read of "A Theoretical Framework for Back-Propagation (LeCun, 1988)" [0]; here's a quote from the abstract:
> We present a mathematical framework for studying back-propagation based on the Lagrangian formalism. In this framework, inspired by optimal control theory, back-propagation is formulated as an optimization problem with nonlinear constraints.
There's no way you can read that and not recognize that you're reading a paper on numerical methods for function approximation.
The issue is that Vaswani, et al never mentions this relationship.
0. http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf
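To make that framing concrete, here's a minimal sketch (not from the paper): fitting a linear function by gradient descent on squared error is textbook numerical function approximation.

```python
import numpy as np

# Fit f(x) = w*x + b to noisy data by minimizing mean squared error with
# gradient descent -- a numerical method for function approximation.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x - 0.5 + 0.01 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y           # residual of current approximation
    w -= lr * 2 * np.mean(err * x)  # d/dw of mean squared error
    b -= lr * 2 * np.mean(err)      # d/db of mean squared error
```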
lambdaone|1 month ago
https://web.archive.org/web/20230713101725/http://bactra.org...
niemandhier|1 month ago
Things proven for one domain can then be pulled back to the other domain along the arrows of duality connections.
lugu|1 month ago
Surprisingly, reading this piece helped me better understand the query/key metaphor.
AlexCoventry|1 month ago
The Free Transformer: https://arxiv.org/abs/2510.17558
Abstract: We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
revision17|1 month ago
I think datasets with lots of samples are very common (such as the huge text corpora LLMs train on). In my experience, most project datasets tend to be on the larger side (10k+ samples).
donnietb|1 month ago
From the paper (where additive attention is the other "similarity function"):
> Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
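A toy numpy sketch of the two compatibility functions being compared (shapes and initializations are arbitrary): dot-product attention scores are one matmul, while additive attention runs a one-hidden-layer feed-forward net over every query-key pair.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Dot-product attention: the whole score matrix is a single matmul,
# which maps onto highly optimized matrix-multiplication kernels.
dot_out = softmax(Q @ K.T / np.sqrt(d)) @ V

# Additive (Bahdanau-style) attention: compatibility is a one-hidden-layer
# feed-forward net, score(q, k) = v^T tanh(Wq q + Wk k), computed pairwise.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
hidden = np.tanh((Q @ Wq.T)[:, None, :] + (K @ Wk.T)[None, :, :])  # (n, n, d)
add_out = softmax(hidden @ v) @ V
```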
auntienomen|1 month ago
Y'all should read this, and make sure you read to the end. The last paragraph is priceless.