top | item 39256432

(no title)

tysam_and | 2 years ago

Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we do not have read the foundations of the field that we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!

That said, the experiments seem very thorough, on a first pass/initial cursory examination, I appreciate the amount of detail that seemed to go into them.

The tradeoff between learning existing theory, and attempting to re-derive it from scratch, I think, is a hard tradeoff, as not having the traditional foundation allows for the discovery of new things, but having it allows for a deeper understanding of certain phenomena. There is a tradeoff either way.

I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run in inference. It's a density estimation model, do you want it to magically recite Shakespeare from the void?

Please! Let's stick to the basics, it will help experiments like this make much more sense as there already is a very clear mathematical foundation which clearly explains it (and said emergent phenomena).

If you want more specifics, there are several layers, Shannon's treatment of ergodic systems is a good start (though there is some minor deviation from that here, but it likely is a 'close enough' match to what's happening here to be properly instructive to the reader about the general dynamics of what is going on, overall.)

discuss

jackblemming|2 years ago

> the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... !

> which clearly explains it (and said emergent phenomena)

Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.

FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775

Nevermark|2 years ago

I think there is more here than a backward look.

The article introduced a discrete algorithm method for approximating the gradient optimization model.

It would be interesting to optimize the discrete algorithm for both design and inference times, and see if any space or time advantages over gradient learning could be found. Or if new ideas popped as a result of optimization successes or failures.

It also might have an advantage in terms of algorithm adjustments. For instance, given the most likely responses at each step, discard the most likely whenever follow ups are not too far below - and see if that reliably avoided copyright issues.

A lot easier to poke around a discrete algorithm, with zero uncertainty as to what is happening, vs. vast tensor models.

randomNumber7|2 years ago

> It's all try stuff and see what works, and then retroactively make up some crud on why it worked

People have done this in earlier days too. The theory around control systems was developed after PID controllers had been succesfully used in praxis.

rrr_oh_man|2 years ago

> It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

Reminds me of how my ex-client's data scientists would develop ML models.

patcon|2 years ago

I appreciate what you're saying, but convergence (via alternative paths, of various depths) is its own signal. Repeated rediscovery perhaps isn't necessarily wastefulness, but affirmation and validation of deep truth for which there are multiple paths of arrival :)

tysam_and|2 years ago

I wish that this worked out in the long run! However, watching the field spin its wheels in the mud over and over with silly pet theories and local results makes it pretty clear that a lot of people are just chasing the butterfly, then after a few years grow disenchanted and sort of just give up.

The bridge comes when people connect concepts to those that are well known and well understood, and that is good. It is all well and good to say in theory that rediscovering things is bad -- it is not necessarily! But when it becomes groundhog day for years on end without significant theoretical change, then that is an indicator that something is amiss in general in how we learn and interpret information in the field.

Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take that that with a grain of salt and all that.

supriyo-biswas|2 years ago

In another adjacent thread, people are talking about the implications of a neural network conforming to the training data with some error margin with regards to copyright.

Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they’re even used in applications like compression due to this purpose[2][3], and therefore it’s no surprise that the NYT prompting OpenAI models with a few paragraphs of their articles reproduced them nearly verbatim.

[1] https://www.inference.org.uk/itprnn/book.pdf

[2] https://bellard.org/nncp/

[3] https://pub.towardsai.net/stable-diffusion-based-image-compr...

tysam_and|2 years ago

Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not necessarily go all that well, I think! :'D

uptownfunk|2 years ago

Ok but why didn’t Shannon get us gpt

david_draco|2 years ago

He was busy getting us towards wifi first.

tysam_and|2 years ago

I get the feeling you may not have read the paper as closely as you could have! Section 8 followed by Section 2 may look a tiny bit different if you consider it from this particular perspective.... ;)

varelse|2 years ago

[deleted]

3abiton|2 years ago

Kudos for pluggimg shannomçs masterpiece