nocobot | 1 year ago
totally agree that you can be a great engineer and not be familiar with it, but seems weird for an expert in the field to confidently make wrong statements about this.
YeGoblynQueenne | 1 year ago
That stuff is still absolutely relevant, btw. Some DL people like to dismiss it as irrelevant but that's just because they lack the background to appreciate why it matters. Also: the arrogance of youth (hey I've already been a postdoc for a year, I'm ancient). Here's a recent paper on Neural Networks and the Chomsky Hierarchy that tests RNNs and Transformers on formal languages (I think it doesn't test on a^nb^n directly but tests similar a-b based CF languages):
https://arxiv.org/abs/2207.02098
And btw that's a good paper. Probably one of the most satisfying DL papers I've read in recent years. You know when you read a paper and you get this feeling of satiation, like "aaah, that hit the spot"? That's the kind of paper.
GistNoesis | 1 year ago
A transformer (with relative, invariant positional embedding) has full context, so it can see the whole sequence. It just has to count and compare.
To convince yourself, construct the weights manually.
First layer:
zero out every character that is equal to the previous character.
Second layer:
build a feature that detects and extracts the positional embedding of the first a; a second feature that does the same for the last a; a third for the first b; and a fourth for the last b.
Third layer:
on top of that, check whether (second feature - first feature) == (fourth feature - third feature).
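To see what the three hand-built layers would compute, here is a minimal sketch in plain Python. It is not an actual transformer with weights, just an emulation of the same logic on a raw string: find run boundaries, extract the four positions, and compare the differences.

```python
def accepts_anbn(seq: str) -> bool:
    """Emulate the hand-constructed three-layer check for a^n b^n."""
    # Layer 1 analogue: keep only positions where the character differs
    # from the previous one, i.e. the start of each run.
    changes = [i for i, c in enumerate(seq) if i == 0 or c != seq[i - 1]]
    # A valid a^n b^n string has exactly two runs: a's then b's.
    if len(changes) != 2 or seq[0] != 'a' or seq[-1] != 'b':
        return False
    # Layer 2 analogue: positions of first/last 'a' and first/last 'b'.
    first_a, last_a = 0, changes[1] - 1
    first_b, last_b = changes[1], len(seq) - 1
    # Layer 3 analogue: compare run lengths via position differences.
    return (last_a - first_a) == (last_b - first_b)
```

For example, `accepts_anbn("aaabbb")` is True while `accepts_anbn("aabbb")` is False. In a real transformer the "positions" would be positional-embedding vectors extracted by attention heads rather than integer indices, but the arithmetic being checked is the same.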
The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, i.e. the training procedure.
If you train by only showing examples with varying n, there probably isn't enough inductive bias to make it converge naturally towards the optimal solution you can construct by hand. But you could probably train on multiple formal languages simultaneously, to make the counting feature emerge from the data.
You can't deduce much from negative results in research, besides that they require more work.
aubanel | 1 year ago