cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
Yes, of course, any positive definite matrix can be used as a metric on the corresponding Euclidean space, but that doesn’t mean it’s necessarily useful as one. Hence I think it’s useful to distinguish things that could be a metric (in that a metric can be constructed from them) from things which, when applied as a metric, actually provide some benefit.
In particular, if we believe the manifold hypothesis, then one should expect a useful metric on features to be local and not static: the quantity W’W clearly does not depend on the inputs to the layer at inference time, and so is static.
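To make the distinction concrete: any positive definite M does induce a valid distance, but the same bilinear form gets applied everywhere in the space, i.e. it is static. A minimal numpy sketch (all names illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
M = W.T @ W + 1e-3 * np.eye(n)  # positive definite by construction

def dist(x, y):
    # Mahalanobis-style distance induced by M. Note M never looks at where
    # x and y sit in the space: the same form is applied everywhere (static).
    v = x - y
    return float(np.sqrt(v @ M @ v))

x, y, z = (rng.normal(size=n) for _ in range(3))
print(dist(x, y) == dist(y, x))               # symmetry holds
print(dist(x, z) <= dist(x, y) + dist(y, z))  # triangle inequality holds
```

So the metric axioms come for free from positive definiteness; what M cannot do, being constant, is adapt to local curvature of a data manifold.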
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
This NFM is a curious quantity. It has the flavor of a metric on the space of inputs to that layer. However, the fact that W’W remains proportional to DfDf’ seems to be an obvious consequence of the very form of f… since Df is itself Ds’WW’Ds, this should be expected under some (perhaps mild) assumptions on the statistics of Ds, no?
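Under one concrete reading of the layer (my assumption, not necessarily the paper’s notation), f(x) = s(Wx) with Ds = diag(s′(Wx)), the Jacobian factorizes through W, so all input-dependence of DfDf’ enters through Ds alone. A numpy sketch checking that factorization against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(size=(m, n))

def f(x):
    return np.tanh(W @ x)  # s(Wx) with s = tanh (illustrative choice)

x = rng.normal(size=n)

# Analytic Jacobian: Df = Ds W, with Ds = diag(s'(Wx)) = diag(1 - tanh^2)
Ds = np.diag(1.0 - np.tanh(W @ x) ** 2)
J_analytic = Ds @ W

# Finite-difference check of the Jacobian
eps = 1e-6
J_fd = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(n)])

print(np.allclose(J_analytic, J_fd, atol=1e-6))  # True
# Df'Df = W' Ds^2 W: the input x enters only through the diagonal factor Ds
print(np.allclose(J_analytic.T @ J_analytic, W.T @ Ds**2 @ W))  # True
```

So if the statistics of Ds average out to something near a multiple of the identity, proportionality to W’W would indeed follow from the form of f alone.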
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
You raise a fair point: it is important to understand how the properties of the data manifest in the least-squares solution to Ax=b. Without that, the only insights we have come from analysis, and we would be remiss to overlook the more fundamental theory, which is linear algebra. My suspicion, though, is that the answers to these same questions for nonlinear function approximators are probably not much different from the insights we have already gained in more basic systems. Moreover, the overly broad title of the manuscript doesn’t seem to point toward those kinds of questions (specifically, things like “how do properties of the data manifold manifest in the weight tensors?”), and I’m not sure one should equate those things with “learning”.
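In the linear case, how the data’s properties enter the least-squares solution is completely explicit in the SVD: each singular direction of A contributes with weight 1/σ. A small numpy sketch of that standard fact (my example, not from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)

# Least-squares solution via the pseudoinverse
x = np.linalg.pinv(A) @ b

# The same solution assembled from the SVD: x = sum_i (u_i'b / s_i) v_i,
# so directions with small singular values dominate the sensitivity to b.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

print(np.allclose(x, x_svd))  # True
```

The open question is whether anything qualitatively new happens when the map from data to solution becomes nonlinear.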
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
Indeed, hopefully they can be diverted from interest in LLMs towards actual science, like the neuroscience which revealed the existence of said mirror neurons.
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
Your point is a salient one. It would be useful if we could provide guarantees/bounds on generalization or representation power, or understand how brittle a model is to shifts in the data distribution. Are these the kinds of questions the authors answer, at least in part? I haven’t read the manuscript, but the title doesn’t indicate that this is the aim of the research; rather, it suggests an eye toward something much broader and vaguer (“learning”).
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
I agree with your interpretation. There is something there to be learned for sure, but I’m doubtful that whatever it is will be a breakthrough in machine learning or optimization, or that it will come by applying the tools of analysis. The idea of “emergence” is interesting, although vague and bordering on unscientific. Maybe complexity theory, graph theory, and information theory can provide some insights. But in the end, I would guess those insights’ impact will be limited to tricks for engineering marginally better architectures or marginally faster training methods.
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
I haven’t read the manuscript yet, and am not sure that I will. However, I don’t agree with the question. Gradient descent and the properties of the loss function are the “how”. It seems like you want to know how properties of the data are manifested in the network itself during/after training (what these properties are doesn’t seem to be something people know they are looking for). Maybe that’s what the authors are interested in as well. If I could bet money in Vegas on the answer, I would bet that in most cases the structures we probe in the network, the ones in which we see correlations to aspects of the problem or task that we (as humans) can recognize, will boil down to approximations of fundamental and eminently useful quantities: say, approximate singular value decompositions of regions of the data manifold, approximate eigenfunctions, etc. I can see how these kinds of empirical investigations are interesting, but what would their impact be? Another guess: they may lead to insights that help engineers design better architectures or incrementally improve training methods. But I think that’s about it; this type of research strikes me as engineering and application.
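The “approximate SVD of a region of the data manifold” guess can be made concrete with local PCA: the top singular vectors of a small patch of samples approximate the manifold’s tangent space there. A toy numpy sketch on the unit circle (my example, nothing to do with the manuscript):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, size=4000)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # samples from a 1-D manifold in R^2

# A small region of the manifold around the point (1, 0)
near = np.minimum(theta, 2 * np.pi - theta) < 0.1
R = X[near] - X[near].mean(axis=0)

# Top right singular vector of the centered region ~ the tangent direction (0, 1)
_, _, Vt = np.linalg.svd(R, full_matrices=False)
tangent = Vt[0]
print(abs(tangent @ np.array([0.0, 1.0])) > 0.99)  # True
```

If learned features turn out to encode quantities like this, they are useful but hardly mysterious.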
cfgauss2718
|
2 years ago
|
on: How do neural networks learn?
By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill-climbing approach. I’m not sure what insights there are to be gained here; it doesn’t seem more exotic or interesting to me than asking “how does the pseudoinverse of A ‘learn’ to solve Ax=b?”. Maybe this seems reductive, but once you nail down the loss functional (often MSE for regression or diffusion models, cross-entropy for classification, and many others) and perhaps the particulars of the model architecture (feed-forward vs. recurrent, fully connected bits vs. convolutions, encoders/decoders), it’s unclear to me what is left to discover about how “learning” works beyond understanding old fundamental algorithms like Newton-Krylov for minimizing nonlinear functions (which subsumes basically all of deep learning and goes well beyond it). My gut tells me the curious among you should spend more time learning the fundamentals of optimization than puzzling over some special (and probably nonexistent) alchemy inherent in deep networks.
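The Ax=b analogy can be pushed one step further: for an overdetermined full-rank linear system, plain gradient descent on the MSE loss “learns” exactly the pseudoinverse solution. A numpy sketch (step size chosen below 1/σ_max², an assumption for convergence, not a prescription from anywhere):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))   # overdetermined, full column rank (almost surely)
b = rng.normal(size=6)

# Plain gradient descent on the least-squares loss 0.5 * ||Ax - b||^2
x = np.zeros(3)
lr = 1.0 / np.linalg.norm(A, 2) ** 2   # step size below 1 / sigma_max^2
for _ in range(20000):
    x -= lr * A.T @ (A @ x - b)        # gradient of 0.5 * ||Ax - b||^2

# The "learned" x is just the pseudoinverse solution
print(np.allclose(x, np.linalg.pinv(A) @ b))  # True
```

So for this toy, “training” rediscovers a closed-form object from classical linear algebra; the question is how much of that picture survives nonlinearity.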
cfgauss2718
|
2 years ago
|
on: Tata joins hands with PSMC to build India's first 12-inch fab
Am I the only person who saw this headline and thought WOW, 12 inch transistors!?
cfgauss2718
|
2 years ago
|
on: Is Cosine-Similarity of Embeddings Really About Similarity?
Distance measures are only as good as the pseudo-Riemannian metric they (implicitly) implement. If the manifold hypothesis is to be believed, then these metrics should be local, because manifold curvature is a local property. You would be mistaken to use an ordinary dot product to compare straight lines on a map of the globe, because those lines aren’t actually straight: the dot product does not account for the rich information in the curvature tensor. Using the wrong inner product is akin to the flat-Earth fallacy.
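The map analogy can be made concrete: treating latitude/longitude as flat Euclidean coordinates can even reverse the ordering of distances relative to the true great-circle metric. A small Python sketch (points chosen purely for illustration):

```python
import math

def flat_deg(p, q):
    # Naive "flat map" distance: Euclidean in (lat, lon) degrees
    return math.hypot(p[0] - q[0], p[1] - q[1])

def great_circle_deg(p, q):
    # Spherical law of cosines; result in degrees of arc
    la1, lo1, la2, lo2 = map(math.radians, (*p, *q))
    c = (math.sin(la1) * math.sin(la2)
         + math.cos(la1) * math.cos(la2) * math.cos(lo2 - lo1))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

pair_a = ((80.0, 0.0), (80.0, 90.0))  # high latitude, 90 degrees of longitude apart
pair_b = ((0.0, 0.0), (0.0, 60.0))    # on the equator, 60 degrees apart

print(flat_deg(*pair_a) > flat_deg(*pair_b))                  # True: flat map calls A the farther pair
print(great_circle_deg(*pair_a) < great_circle_deg(*pair_b))  # True: the sphere calls A the closer pair
```

The flat inner product isn’t just imprecise here; it gets the ranking wrong, which is exactly the failure mode that matters for similarity search.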