
How do neural networks learn?

214 points | wglb | 2 years ago | phys.org

129 comments

[+] cfgauss2718|2 years ago|reply
By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill climbing approach. I’m not sure what insights there are to be gained here, it doesn’t seem more exotic or interesting to me then asking “how does the pseudo inverse of A ‘learn’ to approximate the formula Ax=b?”. Maybe this seems reductive, but once you nail down what the loss functional is (often MSE loss for regression or diffusion models, cross entropy for classification, and many others) and perhaps the particulars of the model architecture (feed-forward vs recurrent, fully connected bits vs convolutions, encoder/decoders) then it’s unclear to me what is left for us to discover about how “learning” works beyond understanding old fundamental algorithms like Newton-Krylov for minimizing nonlinear functions (which subsumes basically all deep learning and goes well beyond). My gut tells me that the curious among you should spend more time learning about fundamentals of optimization than puzzling over some special (and probably non-existent) alchemy inherent in deep networks.
[+] 6gvONxR4sf7o|2 years ago|reply
> it doesn’t seem more exotic or interesting to me than asking “how does the pseudo inverse of A ‘learn’ to approximate the formula Ax=b?”

Asking things like properties of the pseudoinverse against a dataset on some distribution (or even properties of simple regression) is interesting and useful. If we could understand neural networks as well as we understand linear regression, it would be a massive breakthrough, not a boring "it's just minimizing a loss function" statement.

Hell even if you just ask about minimizing things, you get a whole theory of M estimators [0]. This kind of dismissive comment doesn't add anything.

[0] https://en.wikipedia.org/wiki/M-estimator

[+] IanCal|2 years ago|reply
This is overly reductive. Understanding what they're doing at a higher level is useful. Even if you knew everything about neuron activations and how they change with stimulus, that wouldn't be enough for a human to develop a syllabus for teaching maths, even if they "understand how people learn".

What you describe also doesn't answer the question of how to structure and train a model, which surely is quite important. How do the choices impact real world problems?

[+] HarHarVeryFunny|2 years ago|reply
Sure, but their title seems poorly chosen and doesn't match what they are claiming in the article itself, which includes understanding how GPT-2 makes its predictions.

How does GPT-2 learn, for example, that copying a word from way back in the context helps it to minimize the prediction error? How does it even manage to copy a word from the context to the output? We know that it is minimizing prediction errors, and learned to do so via gradient descent, but HOW is it doing it? (we've discovered a few answers, but it's still a research area)

[+] xanderlewis|2 years ago|reply
> By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill climbing approach.

Are the rules of chess all there is to it? Is there really no more to be said?

[+] calf|2 years ago|reply
Well, if neural nets are nothing more than their optimization problem then why isn't there a mathematical proof of this already?

And why isn't that reductionism? We don't say human learning is merely the product of millions of years of random evolution, and leave it at that. So if we take a position on a reductionist account of learning, then how do we prove or disprove it?

Are there arguments that don't rest on our gut feelings? Otherwise this is just different expert factions arguing that "neural nets are/aren't superautocomplete / stochastic parrots" but with more technobabble.

[+] andoando|2 years ago|reply
I'm with you. My only understanding of ML is a class in 2016 where we implemented basic ML algorithms, not neural nets, GPTs or whatever, but I always assumed it's not radically different.

Take a bunch of features, or make up a billion features, and find the function that best predicts the greatest number of outputs correctly. Any "emergent" behavior I imagine is just a result of finding new features or sets of features.

[+] johndhi|2 years ago|reply
I don't understand many, or most, of these words (the highest I got was college calculus) but this sounds interesting to me.
[+] HPsquared|2 years ago|reply
LLMs really get those mirror neurons firing and people tend to anthropomorphize them a bit too much.
[+] amelius|2 years ago|reply
You are missing one important point.

Your network can learn some dataset very well. However, that doesn't say anything about how well it generalizes, and thus how useful your network is.

[+] Xcelerate|2 years ago|reply
What would be the requirements in order for most researchers to agree that a “conclusive” answer has been established to the question “How do neural networks work?”

I ask not because this paper isn’t insightful research, but rather because if you search Google Scholar or arXiv for papers purporting to describe how neural networks “actually work”, you get thousands upon thousands of results all claiming to answer the question, and yet you never really come away with the sense that the question has truly been resolved in a satisfactory way (see also: the measurement problem).

I’ve noticed that each paper uses a totally different approach to addressing the matter that just happens to correspond to the researchers’ existing line of work (It’s topology! No, group theory has the answer! Actually, it’s compressed sensing... computational complexity theory... rebranded old-school quantum chemistry techniques... and so on.)

I suppose my question is more about human psychology than neural networks, since neural networks seem to be working just fine regardless of how well we understand them. But I think it could be useful to organize a multi-disciplinary conference where some open questions regarding machine learning are drafted (similar to Hilbert’s problems), that upon their successful resolution would mean neural networks are no longer widely considered “black boxes”.

[+] QuadmasterXLII|2 years ago|reply
If a paper provided general tools that let me sit down with pen and paper and derive that SwiGLU is better than ReLU, that batch normalization trains 6π/√5 times faster than no normalization, etc., and I could get the derivation right before running the experiment, then I would believe that I understood how neural networks train.
[+] paulblazek|2 years ago|reply
A conclusive answer to how neural networks work should be both descriptive and prescriptive. It should not only tell you how they work but give you new insights into how to train them or how to fix their errors.

For example, does your theory tell you how to initialize weights? How the weights in the NN were derived from specific training samples? If you removed a certain subset of training samples, how would the weights change? If the model makes a mistake, which neurons/layers are responsible? Which weights would have to change, and what training data would need to be added/removed to have the model learn better weights?

If you can't answer these sorts of questions, you can't really say you know how they work. Kind of like steam engines before Carnot, or Koch's principles in microbiology, a theory is often only as good as it can be operationalized.

[+] wrsh07|2 years ago|reply
I think there's also an element of moving goalposts. Maybe we understand certain structures of a fully connected nn, but then we need to understand what's happening within a transformer.

They're going to continue to get more complex, and so we will always have more to understand.

I think what you're hoping for is some theory that once discovered will apply to all future NN architectures (and very likely help us find the "best" ones). Do you think that exists?

[+] smokel|2 years ago|reply
The problem lies in our understanding of the concept "understanding". It is still pretty unclear at a fundamental level what it means to understand, learn, or conceptualize things.

This quickly leads to thought about consciousness and other metaphysical issues that have not been resolved, and probably never will be.

[+] mytailorisrich|2 years ago|reply
We know how neural networks work.

What's tricky is to understand the model a particular network has come up with during training, i.e. how they "learn".

[+] sabas123|2 years ago|reply
What do you mean when you say "how they work"?

A technique that is applied to widely different fields obviously yields a large set of interpretations, each through the lens of its own field. But that doesn't invalidate any of those interpretations, no?

[+] HarHarVeryFunny|2 years ago|reply
It seems they are trying to answer "WHAT do NNs learn?" and "How do NNs WORK?", as much as their title question of "How do NNs learn?".

Here's an excerpt from the article:

"The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions."

The trite answer to "HOW do NNs learn?" is obviously gradient descent, i.e. error minimization, with the features being learnt being those that best support error minimization by the higher layers: effectively learning some basis set of features that can be composed into more complex higher-level patterns.

The more interesting question perhaps is WHAT (not HOW) do NNs learn, and there doesn't seem to be any single answer to that - it depends on the network architecture. What a CNN learns is not the same as what an LLM such as GPT-2 (which they claim to address) learns.

What an LLM learns is tied to the question of how does a trained LLM actually work, and this is very much a research question - the field of mechanistic interpretability (induction head circuits, and so forth). I guess you could combine this with the question of HOW does an LLM learn if you are looking for a higher level transformer-specific answer, and not just the generic error minimization answer: how does a transformer learn those circuits?

Other types of NN may be better understood, but anyone claiming to fully know how an LLM works is deluding themselves. Companies like Anthropic don't themselves fully know, and in fact have mechanistic interpretability as a potential roadblock to further scaling since they have committed to scaling safely, and want to understand the inner workings of the model in order both to control it and provide guarantees that a larger model has not learnt to do anything dangerous.

[+] hnuser123456|2 years ago|reply
It's curve fitting at its core. It just so happens that when you fit a function with enough parameters, you can start using much more abstract and higher-level methods to more quickly describe other outcomes and applications of this curve fitting. Really simple algebra just with a lot of variables. It's not a black box at all and it's disingenuous when people call it that. It outputs what it does because it's multiplying A and B and adding C to the input you gave it.
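That "multiplying A and B and adding C" picture can be made concrete. A hedged sketch (my own toy setup, nothing from the article): a one-hidden-layer tanh network fit to a parabola by plain full-batch gradient descent with hand-written backprop.

```python
import numpy as np

# Toy curve fitting: a one-hidden-layer tanh network trained by full-batch
# gradient descent to fit y = x^2. All hyperparameters are illustrative.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)
y = x ** 2

W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(10000):
    h = np.tanh(x @ W1 + b1)              # "multiply A, add bias, squash"
    pred = h @ W2 + b2                    # "multiply B, add C"
    err = pred - y
    # Backpropagation: chain rule through both layers.
    gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)    # tanh' = 1 - tanh^2
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
```

For scale: a constant predictor on this data has MSE around 0.09, so anything well below that means the curve is actually being fit.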
[+] jprete|2 years ago|reply
We don't understand how GenAI systems work, so we can't analytically predict their behavior; I think they're artifacts of synthetic biology, or chaos theory, not engineering or computer science. Giving GenAI goals and the ability to act outside of a sandbox is roughly parallel to releasing a colony of foreign predators in Australia, or creating an artificial pathogen that is so foreign the immune system doesn't have the tools to fight it.

That's why I consider understanding the internals of GenAI systems to be very important, independent of human psychology.

[+] cs702|2 years ago|reply
Interesting: Given an input x to a layer f(x)=Wσ(x), where σ is an activation function and W is a weight matrix, the authors define the layer's "neural feature matrix" (NFM) as WᵀW, and show that throughout training, it remains proportional to the average outer product of the layer's gradients, i.e., WᵀW ∝ mean(∇f(x)∇f(x)ᵀ), with the mean computed over all samples in the training data. The authors posit that layers learn to up-weight features strongly related to model output via the NFM.
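For anyone who wants to poke at this, both sides of that relation are easy to compute for a toy network. A sketch under my own assumed setup (a two-layer ReLU net; this is not the authors' code, and it only shows the bookkeeping, not the proportionality claim itself, which emerges during training):

```python
import numpy as np

# Compute the two quantities compared above for a toy net g(x) = v . relu(W x):
# the first layer's NFM, W^T W, and the average gradient outer product (AGOP)
# of the network output with respect to its input.
rng = np.random.default_rng(2)
d, k, n = 5, 8, 200
W = rng.normal(size=(k, d))     # first-layer weights
v = rng.normal(size=k)          # second-layer weights
X = rng.normal(size=(n, d))     # toy data

nfm = W.T @ W                   # neural feature matrix of the first layer

agop = np.zeros((d, d))
for x in X:
    pre = W @ x
    # dg/dx = W^T diag(relu'(W x)) v
    grad = W.T @ ((pre > 0).astype(float) * v)
    agop += np.outer(grad, grad)
agop /= n                       # average over the data
```

Both are d×d symmetric matrices on the layer's input space, which is what makes the comparison well-posed.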

The authors do interesting things with the NFM, including explaining why pruning should even be possible and why we see grokking during learning. They also train a kernel machine iteratively, at each step alternating between (1) fitting the model's kernel matrix to the data and (2) computing the average gradient outer product of the model and replacing the kernel matrix with it. The motivation is to induce the kernel machine to "learn to identify features." The approach seems to work well. The authors' kernel machine outperforms all previous approaches on a common tabular data benchmark.
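The alternating scheme reads roughly like this in code. A loose sketch with a Gaussian kernel and made-up constants; the paper's Recursive Feature Machine has its own kernel and details, so treat this only as the shape of the loop:

```python
import numpy as np

# Loose sketch of the alternating procedure described above:
# (1) fit kernel ridge regression with an M-weighted kernel,
# (2) replace M with the average gradient outer product (AGOP)
#     of the fitted predictor. Kernel choice and constants are
#     illustrative, not the paper's.
rng = np.random.default_rng(3)
n, d, bw, ridge = 80, 4, 2.0, 1e-3
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]                    # target uses only features 0 and 1

def kernel(A, B, M):
    # Gaussian kernel on M-weighted squared distances ||a - b||_M^2
    d2 = ((A @ M) * A).sum(1)[:, None] + ((B @ M) * B).sum(1)[None, :] \
         - 2.0 * A @ M @ B.T
    return np.exp(-np.maximum(d2, 0.0) / bw)

M = np.eye(d)
for _ in range(5):
    K = kernel(X, X, M)
    alpha = np.linalg.solve(K + ridge * np.eye(n), y)   # (1) fit the machine
    agop = np.zeros((d, d))
    for x in X:
        kx = kernel(x[None], X, M)[0]
        # gradient of f(x) = sum_i alpha_i k(x, x_i) with respect to x
        grad = -(2.0 / bw) * M @ ((x[None] - X).T * kx) @ alpha
        agop += np.outer(grad, grad)
    agop /= n
    M = d * agop / np.trace(agop)        # (2) replace M, rescaled to trace d
```

The rescaling of M is my own stabilizer to keep the kernel bandwidth meaningful across rounds; the paper handles normalization its own way.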

[+] cfgauss2718|2 years ago|reply
This NFM is a curious quantity. It has the flavor of a metric on the space of inputs to that layer. However, the fact that WᵀW remains proportional to mean(∇f∇fᵀ) seems to be an obvious consequence of the very form of f: since ∇f(x)ᵀ∇f(x) is itself ∇σᵀWᵀW∇σ, this should be expected under some assumptions (perhaps mild) on the statistics of ∇σ, no?
[+] FrustratedMonky|2 years ago|reply
From the article:

"But these networks remain a black box whose inner workings engineers and scientists struggle to understand."

"We are trying to understand neural networks from first principles."

There is a large contingent of CS people on HN who think that since we built AI, and can examine the code, the models, and the weights, this means we understand it.

Hope this article helps explain the problem.

[+] biophysboy|2 years ago|reply
Here’s a question I’ve been asking myself with the latest ML advancements: what is the difference between understanding and pretending to understand really, really well?
[+] mikewarot|2 years ago|reply
Learning how deep networks learn features isn't obvious, and teasing out the details is valuable research.

Superhuman levels of feature recognition are at play during the training of LLMs, and those insights are compiled into the resulting weights in ways we have little visibility into.

[+] zwaps|2 years ago|reply
Before I read this, is this yet another paper where physicists believe that modern NN are trained by gradient descent and consist of only linear FNN (or another simplification) and can therefore be explained by <insert random grad level physics method here>? Because we already had hundreds of those.

Or is this more substantial?

[+] malux85|2 years ago|reply
Does anyone else find reading sites like this on mobile unbearable? The number and size of ads is crazy, in terms of pixel space I think there’s more ads than content
[+] tomduncalf|2 years ago|reply
Firefox Focus seems to block them all, I just use that as a content blocker in Safari and am rarely bothered by ads.

I really am not opposed to sites using ads to monetise and resisted ad blocking for many years, but advertisers took it way too far with both the number and intrusiveness of ads so I ended up relenting and installing an ad blocker.

[+] zipping1549|2 years ago|reply
uBO + Firefox reader mode man.. I just don't read whatever they show me if I don't have adblock handy
[+] bongodongobob|2 years ago|reply
I block that at the DNS level via pihole. Nothing on my network sees that stuff.
[+] UberFly|2 years ago|reply
Best not to use the default browser on any device.
[+] jamboca|2 years ago|reply
The article is not so specific on how the research actually works. And then: "We hope this will help democratize AI" - but the linked paper is behind a paywall.

Curious to see how they would go about explaining how a network selects its important features.

[+] swalsh|2 years ago|reply
I've been trying to figure out how LLMs learn logic/reasoning. It's just not intuitive to me how that works.
[+] mikewarot|2 years ago|reply
It's the same way we do it, form a number of possible variants and use the ones that work best.

They have the advantage of millions or even billions of times more compute to throw at the learning process. Something that might be a one in a million insight happens consistently at that scale.

[+] bilsbie|2 years ago|reply
Am I wrong or is this a breakthrough?
[+] tech_ken|2 years ago|reply
Full pub is up on arXiv here: https://arxiv.org/pdf/2212.13881.pdf

My 2c: the phys.org summary isn't great. The authors are focused on a much narrower topic than simply "how NNs learn": they're trying to characterize the mechanism by which deep NNs develop 'features'. They identify a quantity definable for each layer of the NN (the outer product of that layer's weight matrix with itself) and posit that it is proportional to the average outer product of the network's gradients with respect to that layer's inputs, where the average is taken over all training data. They (claim to, I haven't evaluated) prove this formally for the case of deep FNNs trained by gradient descent. They argue that, by treating this quantity as a measure of 'feature importance', it can be used to explain certain behaviors of NNs that are otherwise difficult to understand. Specifically they address:

* Simple and spurious features (they argue that their proposal can identify when this has occurred)

* "Lottery ticket" NNs (they argue that their proposal can help explain why pruning the connections of a fully connected NN improves its performance)

* Grokking (they argue that their proposal can help explain why NNs can exhibit sudden performance improvements, even after training accuracy has reached 100%)

Finally they propose a heuristic ML algorithm which updates this quantity directly during training, rather than the underlying weights, and show that it achieves superior performance to some existing alternatives.

Overall I would say that they have defined a nice tool for measuring NN feature importance, in terms of features that the NN itself is defining (not in terms of the original space). I can definitely see why this has a lot of value, and I'm especially intrigued by their comparisons of the NFM to testing performance in their 'grokking' case study.

With that said, I'm not really active in the NN space, so it seems a little surprising that their result is really that novel. The quantity they define (outer product of the weight matrix) seems fairly intuitive as a way to rank the importance of inputs at any given layer of the NN, so I'm wondering if truly nobody else has ever done this before? Possibly the novelty is in their derivation of the proportionality, or in the analysis of this quantity over training iterations. I'd guess that their model proposal is totally new, and I'm curious to try it out on some test cases; it seems promising for cases where lightweight models are required. It also seems interesting to point out how training performance AND the development of feature importance jointly influence testing accuracy, but again I'm surprised that this is really novel. I also have to wonder how this extends to more complicated architectures with, e.g., recursive elements; it's not discussed anywhere, but it seems like an important extension of this framework given where genAI is currently at (although the first draft was pub'd in '22, so it's possible this just wasn't as pressing when it was being written).
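To illustrate that "rank the importance of inputs" intuition (a toy construction of my own, not from the paper): the diagonal of WᵀW does pick out inputs that a layer weights heavily.

```python
import numpy as np

# Toy illustration (not from the paper): using the diagonal of W^T W to
# rank a layer's input features. Feature 2 is given much larger weights,
# so it should dominate the ranking.
rng = np.random.default_rng(4)
W = rng.normal(size=(8, 3))
W[:, 2] *= 10.0                        # inflate weights on input feature 2
importance = np.diag(W.T @ W)          # per-input squared-weight mass
ranking = np.argsort(importance)[::-1]
```

Whether that intuition survives training dynamics and interacting layers is, of course, exactly what the paper is about.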

[+] richrichie|2 years ago|reply
It is funny how they left out the first author of the paper :) Way too complicated name!