
Deep Learning Is Not So Mysterious or Different

485 points | wuubuu | 1 year ago | arxiv.org | reply

126 comments

[+] rottc0dd|1 year ago|reply
If anyone wants to delve into machine learning, one of the superb resources I have found is Stanford's "Probability for Computer Scientists" (https://www.youtube.com/watch?v=2MuDZIAzBMY&list=PLoROMvodv4...).

It delves into the theoretical underpinnings of probability theory and ML, IMO better than any other course I have seen. (Yeah, Andrew Ng is legendary, but his course demands some mathematical familiarity with linear algebra topics.)

And of course, for deep learning, 3b1b is great for getting some visual introduction (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...).

[+] chamomeal|1 year ago|reply
I watched the 3b1b series on neural nets years ago, and it still accounts for 95% of my understanding of AI in general.

I’m not an ML person, but still. That guy has a serious gift for explaining stuff.

His video on the uncertainty principle explained stuff to me that my entire undergrad education failed to!

[+] randomtoast|1 year ago|reply
Apparently the word “delve” is the biggest indicator of the use of ChatGPT according to Paul Graham.
[+] andirk|1 year ago|reply
Just watched the whole thing. Thanks! I can't get into my Masters CS: AI program at UC Berkeley because I'm dumb, but seeing this 1st day of a Probability class kinda felt like I was beginning that program haha.

I will add a great find for starting one's AI journey https://www.youtube.com/watch?v=_xIwjmCH6D4 . Kind of needs one to know intermediate CS since 1st step is "learn Python".

[+] vcdimension|1 year ago|reply
and if anyone is interested in delving more deeply into the statistical concepts & results referenced in the paper in this post (e.g. VC dimension, PAC learning, etc.), I can recommend this book: https://amzn.eu/d/7Zwe6jw
[+] bogeholm|1 year ago|reply
Looks nice - are there written versions?
[+] cake-rusk|1 year ago|reply
Yeah I took CS109 (through SCPD), it was a blast. But it took some serious time commitment.
[+] Vaslo|1 year ago|reply
Great recommendations
[+] _ce5e|1 year ago|reply
Fully agree! 3blue1brown has single-handedly taught me the majority of what I've needed to know about it.

I actually started building my own neural network framework last week in C++! It's a great way to delve into the details of how they work. It currently supports only dense MLPs, but does so quite well, and work is underway on convolutional and pooling layers on a separate branch.

https://github.com/perkele1989/prkl-ann

[+] cgdl|1 year ago|reply
Agreed, but PAC-Bayes and other descendants of VC theory are probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling explanation. See [1] (particularly Sections 11 and 12).

[1] https://arxiv.org/abs/2203.10036

[+] bigfatfrock|1 year ago|reply
I'm a huge fan of HN just for replies such as this that smash the OP's post/product with something better. It's like at least half the reason I stick around here.

Thanks for the great read.

[+] singulargalaxy|1 year ago|reply
Hard disagree. Your link relies on gradient descent as an explanation, whereas OP explains why optimization is not needed to understand DL generalization. PAC-Bayes, and the other different countable hypothesis bounds in OP also are quite divergent from VC dimension. The whole point of OP seems to be that these other frameworks, unlike VC dimension, can explain generalization with an arbitrarily flexible hypothesis space.
[+] esafak|1 year ago|reply
Statistical mechanics is the lens that makes most sense to me, and it's well studied.
[+] mxwsn|1 year ago|reply
Good read, thanks for sharing
[+] TechDebtDevin|1 year ago|reply
Anyone who wants to demystify ML should read: The StatQuest Illustrated Guide to Machine Learning [0] By Josh Starmer.

To this day I haven't found a teacher who could express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's-book-like format that is very easy to read and understand. He also just published a book on NNs that is just as good. Highly recommend it even if you are already an expert, as it will give you great ways to teach and communicate complex ideas in ML.

[0]: https://www.goodreads.com/book/show/75622146-the-statquest-i...

[+] Lerc|1 year ago|reply
I have followed a fair few StatQuest and other videos (treadmills with YouTube are great for fitness and learning in one).

I find that no single source seems to cover things in a way that I easily understand, but cumulatively they fill in the blanks of each other.

Serrano Academy has been a good source for me as well. https://www.youtube.com/@SerranoAcademy/videos

The best tutorials give you a clear sense that the teacher has a clear understanding of the underlying principles and how/why they are applied.

I have seen a fair few things that are effectively 'To do X, you {math thing}', while also creating the impression that the presenter doesn't understand why {math thing} is the right thing to do, just that {math thing} has a name and produces the result. Meticulously explaining the minutiae of {math thing} substitutes for an understanding of what it is doing.

It really stood out to me when looking at UMAP and seeing a bunch of things where they got into the weeds in the math without explaining why these were the particular weeds to be looking in.

Then I found a talk by Leland McInnes that had the format.

{math thing} is a tool to do {objective}. It works, there is a proof, you don't need to understand it to use the tool, but the info for that is over there if you want to take a look. These are our objectives; let's use these tools to achieve them.

The tools are neither magical black boxes, nor confused with the actual goal. It really showed the power of fully understanding the topic.

[+] getnormality|1 year ago|reply
> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.

How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.

[+] whiteandnerdy|1 year ago|reply
You're correct, and the term you're looking for is "regularisation".

There are two common ways of doing this:

* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights
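
A minimal sketch of the penalised-loss idea behind L2 regularisation, using a hypothetical one-parameter toy model (for illustration only; real frameworks fold the penalty into the optimiser):

```python
# Toy illustration of L2 regularisation: the training objective is
#   loss = data-fit term + lam * complexity term,
# where the complexity term is the squared weight.

def l2_regularised_loss(w, data, lam):
    """Squared error on the data plus an L2 penalty on the weight."""
    data_fit = sum((w * x - y) ** 2 for x, y in data)
    complexity = lam * w ** 2
    return data_fit + complexity

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # roughly y = 2x

# Grid search over w in [0, 4). With no penalty the best w is near 2;
# the penalty pulls the solution toward 0 (the "simpler" model).
unpenalised = min(range(400), key=lambda i: l2_regularised_loss(i / 100, data, 0.0)) / 100
penalised = min(range(400), key=lambda i: l2_regularised_loss(i / 100, data, 10.0)) / 100
print(unpenalised, penalised)  # penalised weight is smaller than unpenalised
```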

[+] jonathanhuml|1 year ago|reply
The solution to the L1 regularization problem is actually a specific form of the classical ReLU nonlinearity used in deep learning. I’m not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
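
The identity can be sketched concretely: the proximal operator of the L1 penalty is the soft-thresholding function, which can be written with two ReLUs (a sketch of the connection alluded to above, not any specific paper's result):

```python
import math

def relu(x):
    return max(x, 0.0)

def soft_threshold(x, lam):
    """Closed-form minimiser of 0.5*(z - x)**2 + lam*|z| over z,
    expressed with two ReLUs: shrink x toward zero by lam,
    clipping to zero inside [-lam, lam]."""
    return relu(x - lam) - relu(-x - lam)

# Spot-check against the usual sign/max form of soft thresholding.
for x in [-2.0, -0.3, 0.0, 0.3, 2.0]:
    expected = math.copysign(max(abs(x) - 0.5, 0.0), x)
    assert abs(soft_threshold(x, 0.5) - expected) < 1e-12
print(soft_threshold(2.0, 0.5))  # 1.5
```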
[+] chriskanan|1 year ago|reply
Here is an example for data-efficient vision transformers: https://arxiv.org/abs/2401.12511

Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does where they have a special scheme for initializing vision transformers. So schemes like initialization that encourage the model to find the right solution without excessively constraining it would be a soft preference for simpler solutions.

[+] bornfreddy|1 year ago|reply
I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)

The usual approach to supervised ML is to "invent" the model (layers, their parameters), or more often copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a large number of them.

What you describe sounds like an automated way of tweaking the architecture, IIUC? Never done that, usually the cost of a run was too high to let an algorithm do that for me. But I'm curious if this approach is being used?

[+] woopwoop|1 year ago|reply
Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".
[+] eli_gottlieb|1 year ago|reply
Yeah that's just regularized optimization which is actually just the Bayesian Learning Rule which is actually just variational Bayes.
[+] smus|1 year ago|reply
the AdamW optimizer (basically the default in DL nowadays) is doing exactly that
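
A sketch of what that looks like: AdamW applies weight decay directly to the weights ("decoupled") rather than adding an L2 term to the gradient, which acts as an explicit pull toward smaller, simpler solutions. (Simplified single-step update for illustration; the real optimiser also tracks first and second moment estimates.)

```python
def adamw_style_step(w, grad_step, lr, weight_decay):
    """One simplified AdamW-style update: the gradient step and the
    decay toward zero are applied separately (decoupled)."""
    w = w - lr * grad_step          # the (already-normalised) Adam step
    w = w - lr * weight_decay * w   # decoupled decay: shrink w toward 0
    return w

w = 1.0
# Even with a zero gradient, weights shrink each step: a simplicity bias.
for _ in range(10):
    w = adamw_style_step(w, grad_step=0.0, lr=0.1, weight_decay=0.1)
print(w)  # 0.99**10, roughly 0.904
```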
[+] inciampati|1 year ago|reply
An interesting example in which "deep" networks are necessary is discussed in this fascinating and popular recent paper on RNNs [1]. Despite the fact that the minGRU and minLSTM models they propose don't explicitly model ordered state dependencies, they can learn them as long as they are deep enough (depth >= 3):

> Instead of explicitly modelling dependencies on previous states to capture long-range dependencies, these kinds of recurrent models can learn them by stacking multiple layers.

[1] https://arxiv.org/abs/2410.01201

[+] fastball|1 year ago|reply
Well it's not called Mysterious Learning or Different Learning for a reason.

In fact, with how many misnomers there are in the world, I think Deep Learning is actually a pretty great name, all things considered.

It properly communicates (imo) that the training data and resulting weights are complex enough that just looking at the learning/training process on its own is not sufficient to understand the resulting system (vs other "less deep" machine learning where it mostly is).

[+] d_burfoot|1 year ago|reply
DNNs do not have special generalization powers. If anything, their generalization is likely weaker than more mathematically principled techniques like the SVM.

If you try to train a DNN to solve a classical ML problem like the "Wine Quality" dataset from the UCI Machine Learning repo [0], you will get abysmal results and overfitting.

The "magic" of LLMs comes from the training paradigm. Because the optimization is word prediction, you effectively have a data sample size equal to the number of words in the corpus - an inconceivably vast number. Because you are training against a vast dataset, you can use a proportionally immense model (e.g. 400B parameters) without overfitting. This vast (but justified) model complexity is what creates the amazing abilities of GPT/etc.

What wasn't obvious 10 years ago was the principle of "reusability" - the idea that the vastly complex model you trained using the LLM paradigm would have any practical value. Why is it useful to build an immensely sophisticated word prediction machine, who cares about predicting words? The reason is that all those concepts you learned from word-prediction can be reused for related NLP tasks.

[0] https://archive.ics.uci.edu/dataset/186/wine+quality

[+] yomritoyj|1 year ago|reply
You may want to look at this: neural network models with enough capacity to memorize random labels are still capable of generalizing well when fed actual data.

Zhang et al (2021) 'Understanding deep learning (still) requires rethinking generalization'

https://dl.acm.org/doi/10.1145/3446776

[+] buffalobuffalo|1 year ago|reply
When I was first getting into Deep Learning, learning the proof of the universal approximation theorem helped a lot. Once you understand why neural networks are able to approximate functions, it makes everything built on top of them much easier to understand.
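
One concrete piece of intuition from that proof: ReLU networks build functions out of piecewise-linear pieces, and even a single hidden layer can represent some nonlinear functions exactly, e.g. |x| = relu(x) + relu(-x). A toy sketch of that building block (not the full theorem):

```python
def relu(x):
    return max(x, 0.0)

def tiny_net(x):
    """A one-hidden-layer, two-neuron ReLU net that computes |x| exactly:
    hidden = [relu(x), relu(-x)], output weights = [1, 1], bias = 0."""
    return relu(x) + relu(-x)

for x in [-3.0, -0.5, 0.0, 2.0]:
    assert tiny_net(x) == abs(x)
print("tiny_net matches abs on the sample points")
```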
[+] woopwoop|1 year ago|reply
A decade ago the paper "Understanding deep learning requires rethinking generalization" [0] was published. The submission is a response to that paper and subsequent literature.

Deep neural nets are notable for their strong generalization performance: despite being highly overparametrized they do not seem to overfit the training data. They still perform well on hold-out data and very often on out of distribution data "in the wild". The paper [0] noted a particularly odd feature of neural net training: one can train neural nets on standard datasets to fit random labels. There does not seem to be an inductive bias strong enough to rule out bad overfitting. It is in principle possible to train a model which performs perfectly on the training data but gives nonsense on the test data. But this doesn't seem to happen in practice.

The submission argues that this is unsurprising, and fits within standard theoretical frameworks for machine learning. In section 4 it is claimed that this kind of thing ("benign overfitting") is common to any learning algorithm with "a flexible hypothesis space, combined with a loss function that demands we fit the data, and a simplicity bias: amongst solutions that are consistent with the data (i.e., fit the data perfectly), the simpler ones are preferred".

The fact that the third of these conditions is satisfied, however, is non-trivial, and in my opinion is still not well understood. The results of [0] are reproducible with a wide variety of architectures, with or without any form of explicit regularization. If there is an inductive bias toward "simpler solutions" in fitting deep neural nets it has to come either from SGD itself or from some bias which is very generic in architecture. It's not something like "CNNs generalize well on image data because of an inductive bias toward translation invariant features." While there is some work on implicit smoothing by SGD, for example, in my opinion this is not sufficient to explain the phenomena observed in [0]. What I would find satisfying is a reproducible ablation study of neural net training that removed benign overfitting (+), so that it was clear what exactly are the necessary and sufficient conditions for this behavior in the context of neural nets. As far as I know this still has never been done, because it is not known what this would even entail.

(+) To be clear, I think this would not look like "the fit model still generalizes, but we can no longer fit random labels" but rather "the fit model now gives nonsense on holdout data".

[0] https://arxiv.org/abs/1611.03530

[+] YesBox|1 year ago|reply
I wish I had the time to try this:

1.) Grab many GBs of text (books, etc).

2.) For each word, for each next $N words, store distance from current word, and increment count for word pair/distance.

3.) For each word, store most frequent word for each $N distance. [a]

4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pair/distance counts and find the most probable next set of word(s).

How close would this be to GPT 2?

[a] You could go one step further and store multiple words for each distance, ordered by frequency
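
The steps above can be sketched in a few lines (a toy corpus stands in for the GBs of text; names are made up for illustration). As for how close it would be to GPT-2: this is roughly a positional n-gram model, which captures local co-occurrence statistics but none of the long-range, context-dependent structure transformers learn.

```python
from collections import Counter, defaultdict

N = 3  # look this many words ahead

def build_table(words, n=N):
    """Steps 2-3: for each word, count following words at each distance 1..n."""
    table = defaultdict(Counter)  # (word, distance) -> Counter of next words
    for i, w in enumerate(words):
        for d in range(1, n + 1):
            if i + d < len(words):
                table[(w, d)][words[i + d]] += 1
    return table

def predict_next(table, context, n=N):
    """Step 4: score candidates by summed counts over matching (word, distance) pairs."""
    scores = Counter()
    for d, w in enumerate(reversed(context[-n:]), start=1):
        scores.update(table[(w, d)])
    return scores.most_common(1)[0][0] if scores else None

corpus = "the cat sat on the mat the cat sat on the rug".split()
table = build_table(corpus)
print(predict_next(table, ["the", "cat", "sat"]))  # "on" in this toy corpus
```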

[+] 0cf8612b2e1e|1 year ago|reply
The scaling is brutal. If you have a 20k-word vocabulary and want to do 3-grams, you need 20000^3 elements (8 trillion), most of which will be empty.

GPT and friends cheat by not modeling each word separately, but via a high-dimensional "embedding" (just a vector, if you also find new vocabulary silly). The embedding places similar words near each other in this space: the famous king-man-queen example. So even if your training set has never seen "The Queen ordered the traitor <blank>", it might have previously seen "The King ordered the traitor beheaded". The vector representation lets the model use words that represent similar concepts without concrete examples.
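
The king-man-queen arithmetic can be sketched with toy vectors (hand-made 3-d embeddings for illustration, not real trained ones):

```python
import math

# Hand-crafted toy embeddings; dimensions loosely mean (royalty, male, female).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman should land nearest to queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))
print(nearest)  # "queen"
```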

[+] talles|1 year ago|reply
Correct me if I'm wrong, but an artificial neuron is just good old linear regression followed by an activation function to make it non linear. Make a network out of it and cool stuff happens.
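
That's right, and a single neuron fits in a few lines (toy weights, sigmoid activation, for illustration):

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum (the 'linear regression' part) followed by a
    nonlinear activation (sigmoid here)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes to (0, 1)

out = neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(out)  # z = 0, so sigmoid(0) = 0.5
```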
[+] Tewboo|1 year ago|reply
I've seen the same patterns in neural networks that I've seen in simpler algorithms. It's less about mystery and more about complexity.
[+] totetsu|1 year ago|reply
So where is the line that something becomes ‘AI’ and is regulated?
[+] uoaei|1 year ago|reply

[deleted]

[+] EncomLab|1 year ago|reply
The implication that any software is "mysterious" is problematic - there is no "woo" here - the exact state of the machine running the software may be determined at every cycle. The exact instruction and the data it executed with may be precisely determined, as can the next instruction. The entire mythos of any software being a "black box" is just so much advertising jargon, perpetuated by tech bros who want to believe they are part of some Mr. Robot self-styled priestly class.