Sometimes I think the reason human memory is so amazing is that what we lack in the storage capacity machines have, we make up for in our ability to create patterns that dramatically compress the amount of information stored; then we compress those patterns together with other patterns and can extract things from the result. It's an incredibly lossy compression, but it gets the job done.
That’s not exactly true: there doesn’t seem to be an upper bound (that we can reach) on storage capacity in the brain [0]. Instead, the brain actively works to distill knowledge that doesn’t need to be memorized verbatim into its essential components, achieving exactly this “generalized intuition and understanding” while avoiding overfitting.
For more information and the related math behind associative memories, please see Hopfield Neural Networks.
While the upper bound is technically "infinity", there is a tradeoff between the number of concepts stored and the fundamental amount of information storable per concept, similar to how other tradeoff principles, like the uncertainty principle, work.
Artificial neural networks work a lot like compression algorithms in their ability to predict the future. The trained network is a compression algorithm - it does not store compressed data.
We don’t know if the animal brain works the same way, but I suspect it is mostly compression algorithms designed to predict things, and doesn’t store much data at all.
Good example: in my math and physics classes I found it really helpful to understand the general concepts; then, instead of memorizing formulas, I could actually derive them from other known (perhaps easier-to-remember) facts.
Geometry is good for training in this way—and often very helpful for physics proofs too!
yes, and when we do this to history, it becomes filled with conspiracies. but that is merely a process of 'understanding' history by projecting intentionality onto it.
this 'compression' is what 'understanding' something really entails; at first... but then there's more.
when knowledge becomes understood it enables perception (e.g. we perceive meaning in words once we learn to read).
when we get really good at this understanding-perception we may start to 'manipulate' the abstractions we 'perceive'. an example would be to 'understand a cube' and then be able to rotate it around so as to predict what would happen without really needing the cube. but this is an overly simplistic example
It seems the take-home is that weight decay induces sparsity, which helps learn the "true" representation rather than an overfit one. It's interesting that the human brain has a comparable mechanism prevalent in development [1]. I would love to know from someone in the field whether this was the inspiration for weight decay (or, presumably, the more equivalent NN pruning [2]).
ML researcher here wanting to offer a clarification.
L1 induces sparsity. Weight decay explicitly _does not_, as it is L2. This is a common misconception.
Something a lot of people don't know is that weight decay works because when applied as regularization it causes the network to approach the MDL, which reduces regret during training.
Pruning in the brain is somewhat related, but because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression, it's basically a different motif entirely.
If you need a hint here on this one, think about the implicit biases of different representations and the downstream impacts that they can have on the learned (or learnable) representations of whatever system is in question.
The inspiration for weight decay was to reduce the model's capacity to memorize until it perfectly fits the complexity of the task, no more, no less. A model more complex than the task over-fits; a simpler one under-fits. You've got to balance them out.
But the best cure for over-fitting is to make the dataset larger and ensure data diversity. LLMs have datasets so large they usually train one epoch.
The human brain has synaptic pruning. The exact purpose of it is theorized but not actually understood, and it's a gigantic leap to assume some sort of analogous mechanism between LLMs and the human brain.
Afaik weight decay is inspired by L2 regularisation, which goes back to linear regression, where L2 regularisation is equivalent to having a Gaussian prior on the weights with zero mean.
Note that L1 regularisation produces much more sparsity but it doesn't perform as well.
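The L1-vs-L2 difference above is easy to see in a toy linear model. Here's a minimal numpy sketch (data and hyperparameters are made up for illustration): the L1 proximal step pins irrelevant weights at exactly zero, while plain weight decay only shrinks them toward zero without ever reaching it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

def fit(penalty, lam=0.5, lr=0.01, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n           # gradient step on squared error
        if penalty == "l2":                       # weight decay: multiplicative shrink
            w -= lr * lam * w
        else:                                     # L1 proximal step: soft-threshold
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w_l2, w_l1 = fit("l2"), fit("l1")
print("exact zeros with L2:", int((w_l2 == 0).sum()))   # 0: shrunk, never zeroed
print("exact zeros with L1:", int((w_l1 == 0).sum()))   # ~17: irrelevant features pruned
```

The soft-threshold step is what creates exact sparsity; the L2 update multiplies each weight by (1 - lr * lam) per step, so weights get small but stay nonzero.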
"Grok" in AI doesn't quite describe generalization; it's more specific than that. It's more like "delayed and fairly sudden generalization" or something like that. There was some discussion of this in the comments of this post[1], which proposes calling the phenomenon "eventual recovery from overfitting" instead.
“Grok” was Valentine Michael Smith’s rendering for human ears and vocal cords of a Martian word with a precise denotational semantic of “to drink”. The connotational semantics range from literally or figuratively “drinking deeply” all the way up to consuming the absented carcass of a cherished one.
I highly recommend Stranger in A Strange Land (and make sure to get the unabridged re-issue, 1990 IIRC).
They're just defining grokking in a different way.
It's reasonable to me though - grokking suggests elements of intuitive understanding, and a sudden, large increase in understanding. These mirror what happens to the loss.
Same thing. To grok is to fully incorporate the new into your intuitive view of the world - changing your view of both in the process. An AI is training their model with the new data, incorporating it into their existing world view in such a way that may even subtly change every variable they know. A human is doing the same. We integrate it deeper the more we can connect it to existing metaphor and understanding - and it becomes one less thing we need to "remember" precisely because we can then recreate it from "base principles" because we fully understand it. We've grokked it.
I have heard grok used tremendously more frequently in the past year or two and I find it annoying because they're using it as a replacement for the word "understand" for reasons I don't "grok"
I'm not sure if I'm remembering it right, but I think it was on a Raphaël Millière interview on Mindscape, where Raphaël said something along the lines of when there are many dimensions in a machine learning model, the distinction between interpolation and extrapolation is not clear like it is in our usual areas of reasoning. I can't work out if this could be something similar to what the article is talking about.
Does anyone know how those charts are created?
I bet it's half generated by some sort of library and then manually improved, but the generated animated SVGs are beautiful.
PSA: if you’re interested in the details of this topic, it’s probably best to view TFA on a computer as there is data in the visualizations that you can’t explore on mobile.
First of all, great blog post with great examples. Reminds me of what distill.pub used to be.
Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity in combination with longer training?
I wonder whether deep learning models that only use sparse fourier features rather than dense linear layers would work better...
Short answer: if the inputs can be represented well on the Fourier basis, yes. I have a patent in process on this, fingers crossed.
Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs; if the inputs are well-represented (read that as: can be sparsely represented) in some basis known a-priori, it usually helps to just put them in that basis, e.g., by FFT’ing RF signals.
The challenge is that the overall-optimal basis might not be the same as those of any local minima, so you’ve got to do some tricks to nudge the network closer.
I'm curious how representative the target function is? I get that it is common for you to want a model to learn the important pieces of an input, but a string of bits, and only caring about the first three, feels particularly contrived. Literally a truth table on relevant parameters of size 8? And trained with 4.8 million samples? Or am I misunderstanding something there? (I fully expect I'm misunderstanding something.)
I have observed this pattern before in computer vision tasks (train accuracy flatlining for a while before test acc starts to go up). The point of the simple tasks is to be able to interpret what could be going on behind the scenes when this happens.
There were no auto-discovery RSS/Atom feeds in the HTML, no links to the RSS feed anywhere, but by guessing at possible feed names and locations I was able to find the "Explorables" RSS feed at: https://pair.withgoogle.com/explorables/rss.xml
If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation + scale.
Also you could make a base 67 adding machine by chaining these together.
I also can't help the gut feeling that the relationship between W_in-proj's neurons compared to the relationship between W_out-proj's neurons looks like the same mapping as the one between the semitone circle and the circle of fifths
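For what it's worth, the trig-identity circuit this kind of network reportedly learns can be sketched directly. This toy version (the frequencies here are an arbitrary illustrative choice, not ones a trained network would necessarily pick) scores every candidate answer with a sum of cosines that peaks exactly at (a + b) mod p:

```python
import numpy as np

p = 67                       # the modulus (the "base 67" above)
ks = [3, 11, 24]             # a few arbitrary frequencies, purely for illustration

def mod_add(a, b):
    # Score every candidate c by sum_k cos(2*pi*k*(a + b - c) / p).
    # Each cosine equals 1 exactly when (a + b - c) is a multiple of p,
    # so the sum is maximized at c = (a + b) mod p.
    c = np.arange(p)
    scores = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in ks)
    return int(np.argmax(scores))

print(mod_add(40, 50))       # 23, i.e. (40 + 50) % 67
```

Because p is prime, no candidate other than (a + b) mod p can make all the cosines hit 1 simultaneously, which is why a handful of frequencies suffices.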
I don't think I have seen an answer here that actually challenges this question - from my experience, I have yet to see a neural network actually learn representations outside the range in which it was trained. Some papers have tried to use things like sinusoidal activation functions that can force a neural network to fit a repeating function, but on its own I would call it pure coincidence.
On generalization - it's still memorization. I think there has been some proof that ChatGPT does 'try' to perform some higher-level thinking but still has problems due to the dictionary-type lookup table it uses. The higher-level thinking, or AGI, that people are excited about is a form of generalization so impressive that we don't really think of it as memorization. But I actually question whether our vaunted ability to generate original thought is truly separate from what we are currently seeing.
> I have yet to see a neural network actually learn representations outside the range in which it was trained
Generalization doesn't require learning representations outside of the training set. It requires learning reusable representations that compose in ways that enable solving unseen problems.
> On generalization - it's still memorization
Not sure what you mean by this. This statement sounds self contradictory to me. Generalization requires abstraction / compression. Not sure if that's what you mean by memorization.
Overparameterized models are able to generalize (and tend to, when trained appropriately) because there are far more parameterizations that minimize loss by compressing knowledge than there are parameterizations that minimize loss without compression.
This is fairly easy to see. Imagine a dataset and model such that the model has barely enough capacity to learn the dataset without compression. The only degrees of freedom would be through changes in basis. In contrast, if the model uses compression, that would increase the degrees of freedom. The more compression, the more degrees of freedom, and the more parameterizations that would minimize the loss.
If stochastic gradient descent is roughly as likely to find any given compressed minimum as any given uncompressed one, then the fact that there are exponentially more compressed minima than uncompressed ones means it will tend to find a compressed minimum.
Of course this is only a probabilistic argument, and doesn't guarantee compression / generalization. And in fact we know that there are ways to train a model such that it will not generalize, such as training for many epochs on a small dataset without augmentation.
The issue is that we are prone to inflate the complexity of our own processing logic. Ultimately we are pattern recognition machines in combination with abstract representation. This allows us to connect the dots between events in the world and apply principles in one domain to another.
But, like all complexity, it is reducible to component parts.
(In fact, we know this because we evolved to have this ability. )
Statistical learning can typically be phrased in terms of k nearest neighbours
In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.
I'd call both of these memorising, but the latter is a kind of weighted recall.
Generalisation as a property of statistical models (ie., models of conditional freqs) is not the same property as generalisation in the case of scientific models.
In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.
Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.
So in automated stats (ie., ML,AI) it's really just whether the model uses a mean.
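The modal-vs-mean distinction can be made concrete with a tiny 1-D kNN (toy data, purely illustrative): "modal" recall returns the most common neighbouring label, while "mean" recall averages the neighbours:

```python
from collections import Counter
import numpy as np

# Toy 1-D training set
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

def knn(x, k=3, reduce="mean"):
    idx = np.argsort(np.abs(X - x))[:k]            # indices of the k nearest points
    if reduce == "mode":                           # "modal" kNN: most common label
        return float(Counter(y[idx]).most_common(1)[0][0])
    return float(np.mean(y[idx]))                  # "mean" kNN: averaged recall

print(knn(1.8, reduce="mode"))   # 1.0  (nearest labels: 1, 0, 1)
print(knn(1.8, reduce="mean"))   # 0.666...
```

Both are lookups over stored data; the "mean" variant is the weighted-recall flavour described above.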
I disagree; it feels like you are just fussing over words and not what's happening in the real world. If you were right, a human wouldn't learn anything either; they'd just memorize.
You can look at it by results: I give these models inputs they've never seen before, and they give me outputs that are correct/acceptable.
You can look at it in terms of data: we took petabytes of data, and with an 8 GB model (Stable Diffusion) we can output an image of anything. That's an unheard-of compression ratio, only possible if it's generalizing, not memorizing.
What they demonstrate is a neural network learning an algorithm that approximates modular addition. The exact workings of this algorithm are explained in the footnotes. The learned algorithm is general -- it is just as valid on unseen inputs as seen inputs.
There's no memorization going on in this case. It's actually approximating the process used to generate the data, which just isn't possible using k nearest neighbors.
It's been shown that every model learned by gradient descent is approximately a kernel machine. Interpolation isn't generalization: if there's a new input sufficiently different from the training data, the behaviour is unknown.
I haven't read the latest literature but my understanding is that "grokking" is the phase transition that occurs during the coalescing of islands of understanding (increasingly abstract features) that eventually form a pathway to generalization. And that this is something associated with over-parameterized models, which have the potential to learn multiple paths (explanations).
A bit of both, but it certainly does generalize. Just look into the sentiment neuron from OpenAI in 2017, or come up with a unique question for ChatGPT.
From what I gather they're talking about double descent which afaik is the consequence of overparameterization leading to a smooth interpolation between the training data as opposed to what happens in traditional overfitting. Imagine a polynomial fit with the same degree as the number of data points (swinging up and down wildly away from the data) compared with a much higher degree fit that could smoothly interpolate between the points while still landing right on them.
None of this is what I would call generalization, it's good interpolation, which is what deep learning does in a very high dimensional space. It's notoriously awful at extrapolating, ie generalizing to anything without support in the training data.
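The polynomial picture above can be sketched in a few lines of numpy (toy data; degrees chosen for illustration): with exactly as many parameters as points you get the unique, often wild, interpolant, while an overparameterized fit solved with `lstsq` returns the minimum-norm interpolant, which is the usual cartoon of double descent:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 10)
y = np.sin(2.5 * x) + 0.1 * rng.normal(size=10)

def design(pts, degree):      # Chebyshev features, well-conditioned on [-1, 1]
    return np.polynomial.chebyshev.chebvander(pts, degree)

# Degree 9: exactly as many parameters as points -> the unique interpolant
w9 = np.linalg.lstsq(design(x, 9), y, rcond=None)[0]
# Degree 50: overparameterized; lstsq picks the *minimum-norm* interpolant
w50 = np.linalg.lstsq(design(x, 50), y, rcond=None)[0]

# Both pass through the training points exactly...
assert np.allclose(design(x, 9) @ w9, y)
assert np.allclose(design(x, 50) @ w50, y)

# ...but they can behave very differently between the points
grid = np.linspace(-1, 1, 400)
print("max |fit|, degree 9: ", np.abs(design(grid, 9) @ w9).max())
print("max |fit|, degree 50:", np.abs(design(grid, 50) @ w50).max())
```

The point is that "fits every data point" no longer pins down the function once the model is overparameterized; which interpolant you get depends on the implicit bias of the solver.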
"hierarchize" only describes your own mental model of how knowledge organization and reasoning may work in the model, not the actual phenomenon being observed here.
"generalize" means going from specific examples to general cases not seen before, which is a perfectly good description of the phenomenon. Why try to invent a new word?
If you omit the training data points where the baseball hits the ground, what will a machine learning model predict?
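A minimal sketch of that failure mode (numbers made up): fit a quadratic to ballistic data truncated before impact, and the model cheerfully predicts the ball continuing below ground:

```python
import numpy as np

# Toy ballistic data: h(t) = 20 t - 5 t^2, so the ball lands at t = 4 s,
# but the training data stops at t = 3.5 s, *before* impact.
t = np.linspace(0, 3.5, 50)
h = 20 * t - 5 * t ** 2

coef = np.polyfit(t, h, deg=2)     # a perfect fit on the training range
print(np.polyval(coef, 5.0))       # approximately -25: the model predicts h < 0
```

Nothing in the training data tells the model that heights below zero are impossible, so extrapolation happily violates the constraint.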
You can train a classical ML model on the known orbits of the planets in the past, but it can presumably never predict orbits given unseen n-body gravity events like another dense mass moving through the solar system because of classical insufficiency to model quantum problems, for example.
Church-Turing-Deutsch doesn't say there could not exist a Classical / Quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (e.g. Quantum Discord says that there are entanglement and non-entanglement nonlocal relations in the data.)
Regardless of whether they sufficiently generalize,
[LLMs, ML Models, and AutoMLs] don't yet Critically Think and it's dangerous to take action without critical thought.
Well they memorize points and lines (or tanh) between different parts of the space right? So it depends on whether a useful generalization can be extracted from the line estimation and how dense the points on the landscape are no?
Anyone who has so much as taken a class on this knows that even the simplest perceptron networks, decision trees, or any other machine-learning model generalizes. That's why we use them. When a model doesn't, it's called overfitting[1]: it is so accurate on the training data that its inferential ability on new data suffers.
I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.
No, really: what part of their base argument is novel?
The interesting part is the sudden generalization.
Simple models predicting simple things will generally slowly overfit, and regularization keeps that overfitting in check.
This "grokking" phenomenon is when a model first starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both solves the training data and minimizes weights).
Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization; e.g. models that will tend to aggressively overfit, but over time might converge to a better solution by gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better, but due to its simplicity might not be able to learn the underlying mechanism and reach higher accuracy).
There's so many idiots in the AI space that are completely ignorant of how Machine Learning works. The worst are the grifters that fearmonger about AI safety by regurgitating singularity memes.
It's because you over-generalized your simple understanding. There is a lot more nuance to the thing you are calling overfitting (and underfitting). We do not know why or when it happens in all cases. We know cases where it does happen and why, but that doesn't mean there aren't others. There is still a lot of interpretation needed: How much was overfit? How much underfit? Can both happen at the same time? (Yes.) Which layers do this, what causes it, and how can we avoid it? Reading the article shows that this is far from a trivial task. And that's all before we even introduce the concept of sudden generalization; once we do, all these questions start again under a completely different context that is even more surprising. We also need to talk about new aspects like the rate of generalization, the rate of memorization, and what affects them.
tldr: don't oversimplify things: you underfit
P.S. please don't fucking review. Your complaints aren't critiques.
Memorise, because there is no decision component. It attempts to brute-force a pattern rather than thinking through the information and drawing a conclusion.
[0]: https://www.scientificamerican.com/article/new-estimate-boos...
https://youtu.be/hpTCZ-hO6iI
[1] https://en.wikipedia.org/wiki/Synaptic_pruning [2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_net...
It means roughly 'to understand completely, fully'.
To use the same term to describe generalization... just shows you didn't grok grokking.
[1] https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-gro...
And what is the indicator for a machine understanding something?
So the AI folks are just borrowing something that had already been co-opted 30+ years ago.
I also have a couple of little libraries for things like annotations, interleaving svg/canvas and making d3 a bit less verbose.
- https://github.com/PAIR-code/ai-explorables/tree/master/sour...
- https://1wheel.github.io/swoopy-drag/
- https://github.com/gka/d3-jetpack
- https://roadtolarissa.com/hot-reload/
https://en.wikipedia.org/wiki/Grid_cell
https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pi...
We have suspected that neural nets are a kind of kNN. Here's a paper:
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
https://arxiv.org/abs/2012.00152
https://en.wikipedia.org/wiki/Percolation_theory
A relevant, recent paper I found from a quick search: The semantic landscape paradigm for neural networks (https://arxiv.org/abs/2307.09550)
It generalized splendidly - its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)
Critical Thinking; Logic, Rationality: https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_ra...
1: https://en.wikipedia.org/wiki/Overfitting