tysam_and | 2 years ago
That said, the experiments seem very thorough on a first pass; I appreciate the amount of detail that went into them.
The tradeoff between learning existing theory and attempting to re-derive it from scratch is, I think, a hard one: lacking the traditional foundation allows for the discovery of new things, while having it allows for a deeper understanding of certain phenomena. There is a cost either way.
I've seen several people here in the comments seemingly shocked that a model trained to maximize the log likelihood of sequences under the data somehow does not magically deviate from that behavior at inference time. It's a density estimation model; do you want it to recite Shakespeare from the void?
Please! Let's stick to the basics; it will help experiments like this make much more sense, as there already is a very clear mathematical foundation which explains them (and the emergent phenomena people keep pointing at).
If you want more specifics, there are several layers to this; Shannon's treatment of ergodic sources is a good start. (What's happening here deviates from that setup slightly, but it is a close enough match to be properly instructive about the general dynamics at play.)
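To make "maximizes the log likelihood of a sequence given the data" concrete, here is a minimal sketch of the next-token objective; the lookup-table "model" and the tokens are purely illustrative stand-ins for a trained network:

```python
import math

# Toy next-token model: conditional probabilities p(token | context).
# A lookup table stands in for the neural network here.
model = {
    ("to", "be"): {"or": 0.9, "and": 0.1},
    ("be", "or"): {"not": 0.95, "maybe": 0.05},
}

def sequence_log_likelihood(tokens):
    """Sum of log p(next token | previous two tokens) over the sequence."""
    total = 0.0
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        total += math.log(model[context][tokens[i]])
    return total

# Training pushes this number up for sequences in the data, so sampling
# from the trained model reproduces data-like statistics by construction.
ll = sequence_log_likelihood(["to", "be", "or", "not"])
```

Nothing in that objective asks the model to do anything at inference time other than reproduce the conditional statistics it was fit to.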
jackblemming | 2 years ago
> which clearly explains it (and said emergent phenomena)
Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.
FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).
There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775
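The kernel-smoother view in that paper can be sketched in a few lines: an attention head is a Nadaraya-Watson-style estimator whose kernel is a softmax over query-key similarity. The dimensions and data below are illustrative, not from the paper:

```python
import numpy as np

def attention_as_kernel_smoother(query, keys, values):
    """Output = kernel-weighted average of values, with the
    similarity kernel k(q, k_j) = exp(q . k_j / sqrt(d))."""
    d = query.shape[-1]
    sims = keys @ query / np.sqrt(d)     # similarity to each key
    weights = np.exp(sims - sims.max())  # unnormalized kernel weights
    weights /= weights.sum()             # normalize (softmax)
    return weights @ values              # smoothed value estimate

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[2.0], [4.0]])
out = attention_as_kernel_smoother(q, K, V)  # pulled toward 2.0
```

The query matches the first key more closely, so the output is a weighted average that sits nearer that key's value.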
Nevermark | 2 years ago
The article introduced a discrete algorithm that approximates the gradient-trained model.
It would be interesting to optimize the discrete algorithm for both design and inference time, and see if any space or time advantages over gradient learning could be found. Or if new ideas popped up as a result of optimization successes or failures.
It also might have an advantage in terms of algorithm adjustments. For instance: given the most likely responses at each step, discard the most likely whenever the follow-ups are not too far below, and see if that reliably avoids copyright issues.
A lot easier to poke around a discrete algorithm, with zero uncertainty as to what is happening, vs. vast tensor models.
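The adjustment sketched above (drop the top candidate whenever a runner-up is close) is easy to state as a decoding rule; the threshold and the example tokens are illustrative assumptions:

```python
def pick_token(probs, closeness=0.8):
    """Rank candidates by probability; if the runner-up is within
    `closeness` of the top choice, discard the top choice. The idea
    is to steer away from the single most likely (possibly memorized
    verbatim) continuation only when a near-equivalent exists."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[1][1] >= closeness * ranked[0][1]:
        return ranked[1][0]  # runner-up is close enough: skip the top
    return ranked[0][0]

# Near-tie: the second choice wins. Clear winner: the top choice stays.
near_tie = pick_token({"the": 0.40, "a": 0.35, "an": 0.25})
clear = pick_token({"the": 0.90, "a": 0.05, "an": 0.05})
```

Because the rule is a few lines of discrete logic, its effect on outputs can be inspected exactly, which is the point being made about poking around discrete algorithms.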
randomNumber7 | 2 years ago
People have done this in earlier days too. The theory around control systems was developed after PID controllers had been successfully used in practice.
rrr_oh_man | 2 years ago
Reminds me of how my ex-client's data scientists would develop ML models.
tysam_and | 2 years ago
The bridge comes when people connect concepts to those that are well known and well understood, and that is good. Rediscovering things is not necessarily bad in itself. But when it becomes Groundhog Day for years on end without significant theoretical change, that is an indicator that something is amiss in how we learn and interpret information in the field.
Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take that with a grain of salt and all that.
supriyo-biswas | 2 years ago
Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they’re even used in applications like compression for this reason[2][3], so it’s no surprise that when the NYT prompted OpenAI models with a few paragraphs of their articles, the models reproduced them nearly verbatim.
[1] https://www.inference.org.uk/itprnn/book.pdf
[2] https://bellard.org/nncp/
[3] https://pub.towardsai.net/stable-diffusion-based-image-compr...
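The content-addressable retrieval those references describe shows up even in a tiny Hopfield-style network: probe it with a corrupted fragment of a stored pattern and the dynamics complete the pattern. This is a classic textbook construction, with sizes and the pattern chosen purely for illustration:

```python
import numpy as np

# Store one pattern of +/-1 bits via the Hebbian outer-product rule.
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)  # no self-connections

def recall(probe, steps=5):
    """Synchronous sign updates: each step moves the state toward a
    stored pattern, i.e. retrieval by content rather than by address."""
    state = probe.astype(float)
    for _ in range(steps):
        state = np.sign(W @ state)
    return state.astype(int)

# Corrupt two bits; the network restores the stored pattern.
probe = pattern.copy()
probe[0] *= -1
probe[3] *= -1
restored = recall(probe)
```

Prompting an LLM with a few paragraphs of an article and getting the rest back is, loosely, the same retrieval-by-content phenomenon at a much larger scale.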