I'm the original author of this blog post. I've had discussions ranging from "we need to fix this" to "maybe human thought isn't Turing complete...", and I'm interested in hearing what y'all think about the topic.
P.S. I'm graduating soon and looking for research scientist roles.
I know nothing about ML, but as far as "maybe human thought isn't Turing complete" goes:
If you teach a human how a Turing machine works, they are quite capable of simulating it with paper and pencil with a high probability of each step being correct.
If you give several humans the same problem and have them work through it, keeping track of their work, then you can find points of divergence, go with the majority solution, and recalculate starting from the divergence.
Prior to electronic computers, groups of humans using pencil, paper, reference-tables (that themselves had been hand-calculated by humans), and slide-rules worked this way to get accurate answers to arbitrarily difficult computations.
That is to say that while individual humans might arguably not be Turing complete, groups of humans, with the right manipulatives, are limited only by storage, so are as Turing complete as any general-purpose electronic computer.
Given that these recent language models are both non-deterministic, and significantly more human-like in the mistakes they make, I see no reason why similar techniques could not be used, given proper training. It does seem rather inefficient, since for problems in which a well-defined algorithm exists we can already program a computer to do it far better than I suspect any LLM would do.
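The error-correction scheme described above is easy to simulate. Here is a toy sketch (the worker count, the error rate, and the `noisy_step` error model are all invented for illustration): several unreliable workers each perform the same deterministic step, and the majority result is carried forward as the state everyone recalculates from.

```python
import random
from collections import Counter

def true_step(state, bit):
    # The intended deterministic computation: one step of PARITY.
    return state ^ bit

def noisy_step(state, bit, error_rate, rng):
    # A worker occasionally records the wrong result, like a tired human.
    result = true_step(state, bit)
    return result ^ 1 if rng.random() < error_rate else result

def majority_compute(bits, n_workers=5, error_rate=0.05, seed=0):
    # Several unreliable workers run each step; at every step we take the
    # majority value and everyone continues from that agreed-on state.
    rng = random.Random(seed)
    state = 0
    for bit in bits:
        votes = [noisy_step(state, bit, error_rate, rng) for _ in range(n_workers)]
        state = Counter(votes).most_common(1)[0][0]
    return state

print(majority_compute([1, 0, 1, 1, 0, 1], error_rate=0.0))  # error-free workers: prints 0
```

With enough workers and a low per-step error rate, the majority answer tracks the true computation with high probability, which is the pre-electronic-computer workflow being described.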
Thank you for this excellent article, which I almost passed by until I remembered what the original XOR problem was (so, for me at least, the title is inverted click-bait!)
With regard to the question of whether human thought is Turing complete: on the one hand, we clearly can solve Turing-hard problems up to the limits of our memory (a constraint which is considerably expanded through written language), while on the other hand, we can and do make mistakes while attempting such problems. Personally, I feel the former is sufficient to answer the question in the affirmative, despite the latter.
The issue of Turing completeness is central to the Penrose-Lucas argument against AGI, as discussed at length in Roger Penrose's book "The Emperor's New Mind". Here, however, the claim is that human thought goes beyond Turing completeness: Gödel-unprovable results are provable by humans. While I feel that the complete argument against AGI fails on account of it equivocating over what it means to prove something, the above claim about the scope of human thought seems to be an empirical fact.
I have no idea how this observation could help in AI development, but we know that one can prompt a current-generation LLM to produce sentences that humans would recognize as expressing the sort of thoughts that would be useful in solving the given problem (though whether the LLM will go on to produce the solution that is obvious to the humans, and consistent with its earlier productions, is hit-or-miss.) This suggests to me that some sort of self-prompting capability might be necessary (but not sufficient) for a general problem-solving ability.
My apologies, I've only skimmed, but maybe this encapsulates the main idea of your post?
"""
How MLPs could be trained was an open problem at the time, I think, but the field could easily have chugged along and we’d have been where we are now, just 30 years earlier.
"""
The main idea here is that we got an AI winter because expectations were too high and the reality didn't meet them?
I think this is fundamentally missing why we're making rapid progress now. In my opinion, the AI winter wasn't primarily because of a backlash against overhyped promises, it was because computation just wasn't there. We're just now starting to get access to the compute and storage needed, at a price point that's reasonable, to make progress in these areas.
In other words, the AI winter came about because the ground wasn't fertile. Now that we have massive, economical computational resources, we're in an AI spring.
As a back-of-the-envelope calculation, the human brain is estimated at around 2.5 PB of storage [0]. At today's prices, that works out to about $50k. The cost of 2.5 PB of storage in 1969 was around $675B [1].
To paraphrase The Graduate, "Recurrence, my child—recurrence!"
I'd take the bet that in the medium term (whatever that means today), the Transformer architecture as implemented today will be replaced by designs which revive recurrent, non-directed architectures, with a few specific properties:
• cycles will be allowed
• equilibrium (over time) will serve as state ("output")
• computationally this is a couple of orders of magnitude more expensive than current approaches, both in training and in use
• investigating emergent internal encoding and computation will similarly be orders of magnitude more expensive, and will require tools nearly as expensive as the subjects themselves
This is The Way.
NB non-coincidentally this looks much more like an animal brain.
At least for short sequences it works, but I guess the trick is to make it work for long sequences!
Prompt and output from GPT-4:
Write a Python function to compute PARITY incrementally on a list of 0 and 1. For example, the output for [0, 1, 1] should be 0, 1, 0. Print out all the intermediate values.
Here is a Python function to compute the parity incrementally on a list of 0s and 1s. The function will print out all intermediate values as well.
def incremental_parity(lst):
    parity = 0
    result = []
    for num in lst:
        if num not in (0, 1):
            raise ValueError("The list should contain only 0s and 1s.")
        parity ^= num  # XOR operation
        result.append(parity)
        print(f'Current list: {lst[:len(result)]}, Intermediate result: {result}')
    return result

# Example usage:
lst = [0, 1, 1]
result = incremental_parity(lst)
print(f'Final result: {result}')
This function computes the parity incrementally by XORing each element of the list. The intermediate and final results are printed as requested.
What is the output for this function on the list [0, 1, 1, 0, 0, 1, 0, 1, 1]
To find the output of the function on the given list, you can use the provided incremental_parity function. Here's what the output would look like:
Output: The final result for the list [0, 1, 1, 0, 0, 1, 0, 1, 1] is [0, 1, 0, 0, 0, 1, 1, 0, 1].
The thought I had while reading, which you also touched on towards the end, was:
You might indeed not need a model powerful enough to compute PARITY in 'one shot', if what you are aiming for is 'only' modeling human thought. Humans would be terrible at a task like computing PARITY for a long binary sequence if you asked them to do it _instantly_, without taking notes and without going through the sequence iteratively. But that's essentially what we are asking the LLMs to do at the moment.
Interestingly, the LLM trained on human language doesn't _know_ that it has this limitation, whereas the humans it is trying to imitate did not have it, so it is trying its best to one-shot everything (just like the examples it has seen seem to do), though 'talking to itself' about the problem might make it a lot easier.
So, perhaps similar to the 'show your work' approach and to your UT approach, if we could offer some 'internal scratch space' to the LLMs and train them to use it in order to reach an answer, they might get a lot better at all the kinds of tasks that a human might prefer to reason through first as well -- though training and inference would become a lot more expensive, I suppose.
I read a paper which pointed out we have no reason to believe the Chomsky hierarchy is the only formal hierarchy of languages that could exist, and the author asked why aren't we looking for other language complexity classification systems that might better match our intuitions about natural languages?
Physical computers aren't Turing Complete either, because they have finite memory. That hasn't stopped us from making practical use of them.
> A large issue was in not managing expectations when it came to AI, and we’ve had 2 cycles to learn that hype trains can maglev themselves into a 10-year ditch.
With all the hype around language models I don't get the impression that this lesson has been learned particularly well. All those proponents of “AI” remind me more and more of used-car salesmen.
The difference is that people are actually paying for ChatGPT because it is immediately useful for day-to-day work tasks.
As a random example, ChatGPT 4 is noticeably better at grammar checking than Microsoft Word. If I feed it a couple of paragraphs of text, it'll spot several small errors that Word can't. I don't bother sending documents for internal review any more, I just get the AI to do it in seconds!
For the PARITY problem mentioned here, I recall from undergrad CS theory that there are some pretty neat provable lower bounds on circuit complexity, e.g. parity is not in AC0 (bounded-depth circuits with unlimited fan-in logic gates). Even providing more computational power via mod-p gates doesn't buy you much: the Razborov-Smolensky theorem shows that even with mod-p gates you still can't compute mod q, for p and q distinct primes.
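The depth requirement can be seen concretely from the constructive side: with 2-input XOR gates, the natural parity circuit is a balanced tree whose depth grows like ceil(log2 n). A small sketch (this only illustrates the bounded-fan-in upper bound, not the AC0 lower-bound proof):

```python
def parity_tree(bits):
    # Balanced tree of 2-input XOR gates. The depth grows as ceil(log2 n),
    # the standard bounded-fan-in construction; the AC0 result says no
    # constant-depth, polynomial-size circuit can do the same job.
    layer = list(bits)
    depth = 0
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:  # odd element passes through to the next level
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], depth

value, depth = parity_tree([0, 1, 1, 0, 0, 1, 0, 1, 1])
print(value, depth)  # 1 4
```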
While there is some mention of chain-of-thought prompting, the idea of chain-of-thought can go further, and can actually solve the problem exactly.
The LLM can produce working code (or broken code that it can fix from reading the errors), which, when paired with a sandbox, can give the result of the computation back to the LLM.
The direction things are going is one where some tokens output by the LLM are hidden, just like human thoughts. That is what Bing Chat is doing, for instance.
Yeah, this is a good solution so long as the task that needs to be outsourced to, say, a Python interpreter is explicitly known and can be prompt-engineered.
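The code-plus-sandbox loop being discussed might look like the following sketch. Note that `ask_llm` and `run_in_sandbox` are hypothetical stand-ins invented for illustration: the first fakes a model API call with a canned answer, and the second uses a bare `exec`, which is NOT a real sandbox.

```python
def ask_llm(prompt):
    # Hypothetical stand-in for a model API call; returns a canned
    # PARITY solution here just to make the loop runnable.
    return "result = 0\nfor b in bits:\n    result ^= b"

def run_in_sandbox(code, inputs):
    # Stand-in for an isolated interpreter. A bare exec() is NOT a real
    # sandbox; a production system would use an actual isolated runtime.
    namespace = dict(inputs)
    exec(code, {}, namespace)
    return namespace["result"]

def solve(task, inputs, max_attempts=3):
    prompt = task
    for _ in range(max_attempts):
        code = ask_llm(prompt)
        try:
            return run_in_sandbox(code, inputs)
        except Exception as err:
            # Broken code: feed the error back so the model can fix it.
            prompt = f"{task}\nYour last attempt failed with: {err!r}"
    raise RuntimeError("no working solution produced")

print(solve("Compute PARITY of the list `bits`.", {"bits": [0, 1, 1, 0, 1]}))  # 1
```

The retry branch is where the "broken code that it can fix from reading the errors" idea lives: the exception text goes back into the prompt for the next attempt.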
>> The key takeaway? Depth is important for certain functions.
I thought the key takeaway is that nonlinearity is important. Multilayer perceptrons can be collapsed into an equivalent single layer if there isn't nonlinearity thrown in. Multi-input XOR can be solved using a form of sin(x) function after summing in a single layer (sin squared if you don't like negative numbers).
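The sin trick checks out numerically: if s is the sum of the input bits, sin^2(pi*s/2) is 1 exactly when s is odd and 0 when it is even, so a single summation followed by a periodic nonlinearity computes multi-input XOR. A quick check:

```python
import math

def parity_sin(bits):
    # One summation, then a periodic nonlinearity:
    # sin^2(pi*s/2) is 1 when the sum s is odd and 0 when it is even.
    s = sum(bits)
    return round(math.sin(math.pi * s / 2) ** 2)

for bits in ([0, 1], [1, 1], [1, 0, 1], [1, 1, 1, 0, 1]):
    assert parity_sin(bits) == sum(bits) % 2
print("sin trick matches XOR parity")
```

This is the point about nonlinearity in miniature: no stack of purely linear layers can produce this function, but one periodic activation does.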
Upon reading this I also kept thinking "why not use the same transformer weights on each layer?" and then the author goes there with the Universal Transformer concept. How many iterations should one use? Why should that be a question at all? As humans we might need to "think about it for a while" and get back to you when we figure it out. One thing all this AI research keeps ignoring is that real biological neurons are running in real time, not some batch process. We might also be doing training and inference at the same time, but I'm not certain how much - see dreams and other forms of background or delayed processing.
It occurs to me that LLMs have the exact opposite strengths and weaknesses of traditional computing: where traditional computing is good at hard calculation and storage and bad at soft interpretation, LLMs are good at soft interpretation but bad at anything that requires runtime state storage.
It seems like we're going to plateau and see incremental improvements in AI, and it will take a novel technique to bridge that gap.
This isn't a new idea of course, but as someone with only light AI knowledge, this article helped me understand why that is.
A similar idea has occurred to me. In a lot of ways, 'traditional' rules-based AI has more 'understanding' than an LLM can have, because those rules are fixed by humans to the machine's inputs, and they create a relationship between the software's model and reality, usually in human-understandable terms (a database entry on a specific person with their facts listed, for example). That doesn't really exist in an LLM. It's entirely self-referential. Which is why it's not possible for it to explain its reasoning, or give a (real) source for a fact. There is no fixed anchor, no reference point, to validate the LLM's output with.
Recurrency will help, see the empirical results in this paper: https://arxiv.org/abs/2207.02098. LSTMs can solve the PARITY problem, generalising to longer lengths of the problem without issue.
The issue with re-introducing recurrency is that no one wants to bear the O(N) growth in computation time. I wonder if we are settling into a local optimum of bounded-depth architectures for a while due to the obsession with rapid scaling.
Reasonable point - in the case of parity, one bit of recurrent state avoids the combinatorial issues mentioned in OP (as OP more or less says in writing the finite state automaton).
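The finite state automaton in question is tiny: a single bit of carried state, updated once per input symbol, handles sequences of any length. A minimal sketch:

```python
def parity_fsa(bits):
    # The full automaton: states 0 (even so far) and 1 (odd so far),
    # i.e. one bit of recurrent state updated once per input symbol.
    transition = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    state = 0
    for b in bits:
        state = transition[(state, b)]
    return state

print(parity_fsa([0, 1, 1, 0, 0, 1, 0, 1, 1]))  # 1
```

The loop runs once per input token, which is exactly the O(N) recurrence cost mentioned above; in exchange, length generalization is trivial.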
It's interesting to think about parity in relation to the image classification tasks that sparked the ANN renaissance. In image classification, the class does not strongly depend on a single feature - you don't change from cat to dog by altering one pixel, and then back to cat by altering a different pixel.
But parity does depend strongly on each bit in the input.
I think it's important to keep this special aspect of the parity problem in mind when thinking about good architectures for various applications.
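The contrast with image classification can be made concrete: for parity, flipping any single input bit always flips the label, i.e. the function has maximal sensitivity to its inputs. A quick check:

```python
def parity(bits):
    out = 0
    for b in bits:
        out ^= b
    return out

bits = [0, 1, 1, 0, 1, 0, 0, 1]
base = parity(bits)
# Unlike flipping one pixel of a cat photo, flipping any single bit
# always flips the label: every position matters, every time.
flipped = [parity(bits[:i] + [1 - bits[i]] + bits[i + 1:]) for i in range(len(bits))]
print(all(f != base for f in flipped))  # True
```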
I've skimmed up and down and read some sections a few times - I'm still not 100% sure what the main conclusions are. A summary section would make the article more accessible and valuable to others.
My understanding is that the main point here is:
- some problems require reasoning through many steps of computation,
- transformers have limited depth, so cannot solve problems that require many steps.
After reading this article, my reaction is that yes, transformers are not the optimal model for NLP and reasoning, but transformers are good enough and scale really well on current GPU hardware. Going forward, I think hardware is the constraint that new architectures will need to accommodate.
The article quotes a paper by Hao et al., which concludes:
"Finally, given the success that Transformers have had as models of natural language, it is perhaps surprising that these models’ expressive power seems to be best characterized (or at least bounded) in terms of circuit complexity. Mathematical explorations of natural language have most commonly employed the approach to language complexity afforded by the Chomsky hierarchy and its refinements, which is based on automata and formal grammars. The apparent incomparability of these approaches suggests that the exploration of different types of Transformer models might offer a new approach to the study of the formal properties of natural language."
A few years ago I found a paper that made the argument that Chomsky's hierarchy of formal languages is not necessarily the only mathematical hierarchy that can exist over languages. The author advocated that other ways of systematizing the classification of languages could lead to both a different hierarchy and new insights. He lamented that the strength of Chomsky's work and reputation might have inadvertently discouraged us from questioning it and looking for alternatives - especially ones which do more to capture what we intuitively mean by natural language parsing ability.
Edit: The OP article author interprets the first quote as, "maybe human language is simpler than we thought, and doesn't require higher levels of the Chomsky hierarchy... We might be Transformers!" However in light of that other paper I mentioned, it's more like there might be another hierarchy of languages different from and parallel to Chomsky's hierarchy, with different boundaries for the levels that cover the space differently than Chomsky's.
As for how such a thing is possible in detail, remember that a lot of the Chomsky language hierarchy requires looking at behavior "in the limit" or "unbounded", in the sense that no finite parser of a lower level can parse all (unbounded) sentences of the higher-complexity language. (But an arbitrarily sized lesser-complexity parser can parse sentences of a bounded subset of the higher-complexity language.)
And it may be that for parsing actual human language pertaining to a non-infinite world, such arguments as the Chomsky hierarchy deals with are no more relevant than arguing that our physical computers aren't truly universal Turing machines because they have finite memory. Technically correct, yet also irrelevant to most of the practical uses we have found for them!
>> All this to say: I think there are fundamentals we need to view all of this progress through. Lower-bound complexities of problems that we learn in Algorithms 101 shouldn’t get thrown out the window because linear algebra, ReLUs and attention mechanisms are involved. Sorting, for example, has a provable lower-bound complexity of Ω(NlogN). When someone tells you there is a way to do it in O(logN) time, it’s best to squint and ask a few questions. When someone tells you you can gain a speed-up in computing primes, on a naive algorithm, it’s best to ask if that method also has correctness covered.
Keep this thought please. There is a very obvious feeling that "now we have LLMs we don't need CS anymore", or don't need maths, science, knowledge...
In fact, this goes rather further back than Transformers. I was just reading this old interview of Vladimir Vapnik [1], where he's taking the position that only the "easy" questions in science have explanatory answers, and that for the "hard" questions we can only hope to build predictive models. Vapnik is one of my heroes of AI, but on that I think he's not just wrong, but pointing the way to a new dark ages, where we lose the little knowledge we have. Similar "intuitions" are expressed in the "Bitter Lesson" blogpost by Rich Sutton [2].
Vapnik is right, of course, in that there are "hard" questions for which we don't know any good answers, and our current framework may not be enough to give any simple, explanatory answer anyway. Certainly, CS is full of bottomless pits of despair, undecidabilities, incompletenesses and intractabilities. I think that just says we're shit at creating formal frameworks. All the ones we've created have some flaw. Arithmetic is incomplete. Logic (even first-order logic) is undecidable. You can't divide by 0, there are tons of calculations that cannot be performed because they sink into uncountable infinities after two steps, etc etc and so on. We just suck at making universal abstractions, that are useful in the first instance, but don't end up painting ourselves into a corner, somewhere down the line.
Well, so what? So far we've done pretty well in terms of science, in the sense of understanding the world and how it works. We're not going to give up now! If our tools are not good enough to take our knowledge to the next level, well, then, we just need to make better tools. Better maths. Better Computer Science. Better scientific methods. That has happened in the past at least once, if not many many times. Why can we not do it again? Why do we have to abandon our fate to unknowing, all-predicting machines? That doesn't make sense.
Fully agreed. Plus, at a fundamental level, this new anti-intellectual trend feels wrong.
Do we truly want to become the Eloi to our new GPT Morlocks?
Every time we wrote fiction about this it was in very dystopic terms. Have we suddenly decided it's OK to simply give up and have a chatbot understand the world for us?
> I’ve attempted this prompt several different ways on ChatGPT without a working solution, but I imagine this could work.
I've tried a couple of things, it seems OK with a basic approach but starts falling apart when you get close to 50 bits. A more interesting merge approach looks OK at 50 with some quick testing. What sort of scale do you mean when you talk about a working solution?
Edit - stats may not be holding up the merge approach at 50. It's close, sometimes getting it right or being one out.
Yeah it is. I only learned it while looking up material for this post. I figured people know NAND is NOT(AND(.)), so NXOR might be a more familiar counterpart to XOR.
In principle we think iteratively: do one small computation, memorize the result, then do the next step based on the input and the result from the previous step. If things get too complicated, we grab some paper and write down intermediate results on external storage. But fundamentally we memorize some raw facts and otherwise learn algorithms that allow us to do new things by combining existing abilities. Ignoring memory limitations, humans are Turing-complete.
As we repeatedly solve similar tasks, we develop optimizations, the brain learns to perform a couple of steps in one go instead of going through all the individual steps. When children learn to read, to write or to do math, they learn to do it in small steps, they draw and recognize letters and numbers by looking at and drawing individual lines and circles. When they learn to draw lines or circles they first make very deliberate movements with the pen - up, down, left, right - before drawing a circle becomes a primitive. When they learn to add, they break numbers down into digits and only memorize how to add those, larger numbers are added iteratively digit by digit.
With time and practice the brain learns to do many of those things in one step, you read and write letters or words, not deliberately draw individual strokes. You will still fall back to this if you have to deal with characters from an unknown alphabet. You will no longer perform additions digit by digit for small or commonly used numbers but you will fall back to this for big numbers. When you learn to drive, you first very deliberately look at traffic signs, think about turning the steering wheel this way or that way while parking or think about switching on the turn signals. Again with practice many of those things become automatic and you no longer have to think about them.
So overall I think the brain is just following sequential algorithms, combining primitives into more complex results. Then with practice common tasks get reimplemented in some kind of fixed function unit that can do a specific task - maybe under some limited circumstances and maybe only approximately - in one step and this can then be used as a new primitive where appropriate. Actually those fixed function units could internally use iteration but this iterative process would be hidden from the primary loop.
Language models like ChatGPT seem to be more on the fixed function side, they see a set of tokens and spit out a response. The attention mechanism has some similarity with a loop executing an algorithm step by step, it just has a fixed number of iterations and has been unrolled. When prompted to reason in steps, this adds another loop with additional storage for intermediate results. The difference is that this process is externally visible and that the model must be instructed to do so instead of the model deciding on its own to solve the problem in steps and hiding those steps from the outside.
For questions like adding or multiplying numbers or finding the parity of some long bit string, the model has some neurons that could directly provide an answer, and the answer will probably be correct for simple tasks, but the longer the input the more likely it becomes that those neurons will not produce the correct answer. And the model does not know in which region those neurons can be trusted. But the model also knows the underlying algorithm to solve the task, say computing the parity of a long bit string or adding large numbers, and where the steps only involve computations that the model can perform correctly, i.e. adding two digits or XORing two bits. So you can get the correct result out of a model like ChatGPT if you can make it follow the iterative algorithm instead of it trying to give the sum of two 20 digit numbers directly. And I would guess you could - at least to some extent - use ChatGPT itself to rewrite and iteratively solve harder problems. The primary obstacle is probably that it is hard to tell what tasks the model can reliably perform correctly and what tasks must be broken down further.
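The grade-school procedure described here is easy to write down: each step only requires adding two digits plus a carry, the kind of primitive that stays reliable no matter how long the numbers get. A sketch (non-negative integers only):

```python
def add_digit_by_digit(a, b):
    # Grade-school addition: each step only needs the sum of two digits
    # plus a carry bit, iterated from the least significant digit up.
    da, db = str(a)[::-1], str(b)[::-1]
    carry, digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        digits.append(str(s % 10))
        carry = s // 10
    if carry:
        digits.append(str(carry))
    return int("".join(reversed(digits)))

print(add_digit_by_digit(98765432109876543210, 12345678901234567890))
```

Steering a model through this loop, one reliable digit-addition at a time, is exactly the "follow the iterative algorithm" strategy the comment proposes, as opposed to asking for the 20-digit sum in one shot.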
The first layer consists of one node per input pattern, with activation and weights such that only the node matching the pattern is active, and a second layer of weights that provides the output for each input pattern.
True but not very useful. That's like writing an "isPrime" function with a switch statement for every possible input number, rather than coming up with a general algorithm. It can't give you any answer you don't already know.
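For reference, the one-node-per-pattern construction can be written out explicitly. This is a sketch assuming threshold activations; note the 2^n hidden units, which is exactly the "switch statement" objection:

```python
from itertools import product

def lookup_parity_net(n):
    # First layer: one threshold unit per possible n-bit pattern (2^n units).
    # The unit for pattern p uses +1/-1 weights and fires only when x == p.
    patterns = list(product([0, 1], repeat=n))
    answers = [sum(p) % 2 for p in patterns]  # second layer: memorized outputs

    def forward(x):
        hidden = [
            1 if sum((2 * pi - 1) * xi for pi, xi in zip(p, x)) > sum(p) - 0.5 else 0
            for p in patterns
        ]
        # Exactly one hidden unit fires, so this just reads off its answer.
        return sum(w * h for w, h in zip(answers, hidden))

    return forward

net = lookup_parity_net(3)
print([net(x) for x in product([0, 1], repeat=3)])  # [0, 1, 1, 0, 1, 0, 0, 1]
```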
[0] https://www.scientificamerican.com/article/what-is-the-memor...
[1] https://ourworldindata.org/grapher/historical-cost-of-comput...
I also spotted a typo: "cumptation".
[+] [-] phaedrus|2 years ago|reply
I read a paper which pointed out we have no reason to believe the Chomsky hierarchy is the only formal hierarchy of languages that could exist, and the author asked why aren't we looking for other language complexity classification systems that might better match our intuitions about natural languages?
Physical computers aren't Turing Complete either, because they have finite memory. That hasn't stopped us from making practical use of them.
[+] [-] tempodox|2 years ago|reply
With all the hype around language models I don't get the impression that this lesson has been learned particularly well. All those proponents of “AI” remind me more and more of used-car salesmen.
[+] [-] jiggawatts|2 years ago|reply
As a random example, ChatGPT 4 is noticeably better at grammar checking than Microsoft Word. If I feed it a couple of paragraphs of text, it'll spot several small errors that Word can't. I don't bother sending documents for internal review any more, I just get the AI to do it in seconds!
krackers | 2 years ago
See also https://blog.computationalcomplexity.org/2023/02/why-cant-li... for more discussion of LLMs from a circuit-complexity perspective.
shawntan | 2 years ago
There are a few references also in the blog post.
espadrine | 2 years ago
The LLM can produce working code (or broken code that it can fix from reading the errors), which, when paired with a sandbox, can give the result of the computation back to the LLM.
The direction things are going is one where some tokens output by the LLM are hidden, just like human thoughts. That is what Bing Chat is doing, for instance.
shawntan | 2 years ago
rain1 | 2 years ago
phkahler | 2 years ago
I thought the key takeaway is that nonlinearity is important. Multilayer perceptrons can be collapsed into an equivalent single layer if there is no nonlinearity between them. Multi-input XOR can be solved in a single layer using a form of sin(x) after summing the inputs (sin squared, if you don't like negative numbers).
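Both claims are easy to check numerically; a quick sketch of mine, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked *linear* layers collapse into one linear layer:
# W2 @ (W1 @ x) == (W2 @ W1) @ x for every x.
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(4, 5))
x = rng.normal(size=3)
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With a nonlinearity, one layer handles n-input XOR:
# sin^2((pi/2) * sum(bits)) is 0 for even bit-sums and 1 for odd ones.
def parity_sin(bits):
    return round(float(np.sin(np.pi / 2 * np.sum(bits)) ** 2))

for bits in ([0, 1, 1, 0], [1, 1], [1, 1, 1], [1, 0, 1, 1, 1]):
    assert parity_sin(bits) == sum(bits) % 2
```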
Upon reading this I also kept thinking "why not use the same transformer weights on each layer?", and then the author goes there with the Universal Transformer concept. How many iterations should one use? Why should that be a question at all? As humans we might need to "think about it for a while" and get back to you when we figure it out. One thing all this AI research keeps ignoring is that real biological neurons run in real time, not as some batch process. We might also be doing training and inference at the same time, but I'm not certain how much - see dreams and other forms of background or delayed processing.
klik99 | 2 years ago
It seems like we're going to plateau, seeing only incremental improvements in AI, and it will take a novel technique to bridge that gap.
This isn't a new idea, of course, but as someone with only light AI knowledge, this article helped me understand why that is.
retrac | 2 years ago
exabrial | 2 years ago
What a beautiful blog. Clean design. Great graphics.
More importantly: The lack of Javascript is refreshing. Loads instantly. Works with js disabled too (other than the top-left graphic)
shawntan | 2 years ago
bheadmaster | 2 years ago
Folks sometimes forget that raw HTML is by itself responsive :)
yarg | 2 years ago
shawntan | 2 years ago
The issue with re-introducing recurrence is that no one wants to bear the O(N) growth in computation time. I wonder if we are settling into a local optimum of bounded-depth architectures for a while, given the obsession with rapid scaling.
mturmon | 2 years ago
It's interesting to think about parity in relation to the image classification tasks that sparked the ANN renaissance. In image classification, the class does not strongly depend on a single feature - you don't change from cat to dog by altering one pixel, and then back to cat by altering a different pixel.
But parity does depend strongly on each bit in the input.
I think it's important to keep this special aspect of the parity problem in mind when thinking about good architectures for various applications.
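Concretely, flipping any single bit always flips the class; a toy sketch:

```python
def parity(bits):
    """XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

bits = [0, 1, 1, 0, 1, 0, 0, 1]
base = parity(bits)
# Unlike a cat/dog classifier, parity has no "don't care" inputs:
# every single-bit flip changes the label.
for i in range(len(bits)):
    flipped = bits[:]
    flipped[i] ^= 1
    assert parity(flipped) == 1 - base
```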
adg33 | 2 years ago
My understanding is that the main point here is:
- some problems require reasoning through many steps of computation,
- transformers have limited depth, so they cannot solve problems that require many steps.
Am I missing anything else?
rain1 | 2 years ago
jostmey | 2 years ago
phaedrus | 2 years ago
"Finally, given the success that Transformers have had as models of natural language, it is perhaps surprising that these models’ expressive power seems to be best characterized (or at least bounded) in terms of circuit complexity. Mathematical explorations of natural language have most commonly employed the approach to language complexity afforded by the Chomsky hierarchy and its refinements, which is based on automata and formal grammars. The apparent incomparability of these approaches suggests that the exploration of different types of Transformer models might offer a new approach to the study of the formal properties of natural language."
A few years ago I found a paper that made the argument that Chomsky's hierarchy of formal languages is not necessarily the only mathematical hierarchy that can exist over languages. The author advocated that other ways of systematizing the classification of languages could lead to both a different hierarchy and new insights. He lamented that the strength of Chomsky's work and reputation might have inadvertently discouraged us from questioning it and looking for alternatives - especially ones which do more to capture what we intuitively mean by natural language parsing ability.
Edit: The OP article author interprets the first quote as, "maybe human language is simpler than we thought, and doesn't require higher levels of the Chomsky hierarchy... We might be Transformers!" However in light of that other paper I mentioned, it's more like there might be another hierarchy of languages different from and parallel to Chomsky's hierarchy, with different boundaries for the levels that cover the space differently than Chomsky's.
As for how such a thing is possible in detail, remember that many of the distinctions in the Chomsky hierarchy require looking at behavior "in the limit", in the sense that no finite parser of a lower level can parse all (unboundedly many) sentences of a higher-complexity language. (But an arbitrarily sized lower-complexity parser can parse sentences of a bounded subset of the higher-complexity language.)
And it may be that for parsing actual human language pertaining to a non-infinite world, such arguments as the Chomsky hierarchy deals with are no more relevant than arguing that our physical computers aren't truly universal Turing machines because they have finite memory. Technically correct, yet also irrelevant to most of the practical uses we have found for them!
YeGoblynQueenne | 2 years ago
Keep this thought, please. There is a very palpable feeling that "now that we have LLMs we don't need CS anymore", or maths, or science, or knowledge...
In fact, this goes rather further back than Transformers. I was just reading an old interview with Vladimir Vapnik [1], where he takes the position that only the "easy" questions in science have explanatory answers, and that for the "hard" questions we can only hope to build predictive models. Vapnik is one of my heroes of AI, but on this I think he's not just wrong, but pointing the way to a new dark age, where we lose the little knowledge we have. Similar "intuitions" are expressed in the "Bitter Lesson" blog post by Rich Sutton [2].
Vapnik is right, of course, that there are "hard" questions for which we don't know any good answers, and our current frameworks may not be enough to give any simple, explanatory answer anyway. Certainly, CS is full of bottomless pits of despair: undecidabilities, incompletenesses, intractabilities. I think that just says we're shit at creating formal frameworks. All the ones we've created have some flaw. Arithmetic is incomplete. Logic (even first-order logic) is undecidable. You can't divide by 0, there are tons of calculations that cannot be performed because they sink into uncountable infinities after two steps, etc. etc. and so on. We just suck at making universal abstractions that are useful in the first instance but don't end up painting ourselves into a corner somewhere down the line.
Well, so what? So far we've done pretty well in terms of science, in the sense of understanding the world and how it works. We're not going to give up now! If our tools are not good enough to take our knowledge to the next level, well, then, we just need to make better tools. Better maths. Better Computer Science. Better scientific methods. That has happened in the past at least once, if not many many times. Why can we not do it again? Why do we have to abandon our fate to unknowing, all-predicting machines? That doesn't make sense.
______________
[1] http://www.learningtheory.org/learning-has-just-started-an-i...
[2] http://incompleteideas.net/IncIdeas/BitterLesson.html
the_af | 2 years ago
Do we truly want to become the Eloi to our new GPT Morlocks?
Every time we wrote fiction about this it was in very dystopic terms. Have we suddenly decided it's OK to simply give up and have a chatbot understand the world for us?
IanCal | 2 years ago
I've tried a couple of things; it seems OK with a basic approach but starts falling apart as you get close to 50 bits. A more interesting merge approach looks OK at 50 with some quick testing. What sort of scale do you mean when you talk about a working solution?
Edit - the stats may not be holding up for the merge approach at 50. It's close, sometimes getting it right or being one out.
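For reference, the kind of harness I'm scoring approaches with, where solve stands in for whatever prompt or merge strategy is under test (a sketch, details simplified):

```python
import random

def true_parity(bits):
    return sum(bits) % 2

def score(solve, n_bits=50, trials=100, seed=0):
    """Fraction of random n-bit strings for which `solve` returns
    the correct parity."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        bits = [rng.randint(0, 1) for _ in range(n_bits)]
        correct += (solve(bits) == true_parity(bits))
    return correct / trials

# Sanity check with an exact solver: should score 1.0.
assert score(true_parity) == 1.0
```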
SeanLuke | 2 years ago
shawntan | 2 years ago
danbruc | 2 years ago
In principle we think iteratively: do one small computation, memorize the result, then do the next step based on the input and the result of the previous step. If things get too complicated, we grab some paper and write intermediate results down on external storage. But fundamentally we memorize some raw facts and otherwise learn algorithms that let us do new things by combining existing abilities. Ignoring memory limitations, humans are Turing-complete.
As we repeatedly solve similar tasks, we develop optimizations, the brain learns to perform a couple of steps in one go instead of going through all the individual steps. When children learn to read, to write or to do math, they learn to do it in small steps, they draw and recognize letters and numbers by looking at and drawing individual lines and circles. When they learn to draw lines or circles they first make very deliberate movements with the pen - up, down, left, right - before drawing a circle becomes a primitive. When they learn to add, they break numbers down into digits and only memorize how to add those, larger numbers are added iteratively digit by digit.
With time and practice the brain learns to do many of those things in one step: you read and write letters or words, not deliberately drawn individual strokes. You will still fall back to strokes if you have to deal with characters from an unknown alphabet. You will no longer perform additions digit by digit for small or commonly used numbers, but you will fall back to that for big numbers. When you learn to drive, you first very deliberately look at traffic signs, think about turning the steering wheel this way or that while parking, or think about switching on the turn signal. Again, with practice many of those things become automatic and you no longer have to think about them.
So overall I think the brain is just following sequential algorithms, combining primitives into more complex results. Then with practice, common tasks get reimplemented in some kind of fixed-function unit that can do a specific task - maybe under limited circumstances and maybe only approximately - in one step, and this can then be used as a new primitive where appropriate. Those fixed-function units could internally use iteration, but that iterative process would be hidden from the primary loop.
Language models like ChatGPT seem to sit more on the fixed-function side: they see a set of tokens and spit out a response. The attention mechanism has some similarity to a loop executing an algorithm step by step; it just has a fixed number of iterations and has been unrolled. When prompted to reason in steps, this adds another loop with additional storage for intermediate results. The difference is that this process is externally visible, and that the model must be instructed to do it instead of deciding on its own to solve the problem in steps and hiding those steps from the outside.
For questions like adding or multiplying numbers or finding the parity of a long bit string, the model has some neurons that could directly provide an answer. That answer will probably be correct for simple instances, but the longer the input, the more likely it becomes that those neurons produce a wrong answer, and the model does not know in which region those neurons can be trusted. But the model also knows the underlying algorithm for the task, say computing the parity of a long bit string or adding large numbers, where the individual steps only involve computations the model can perform correctly, i.e. adding two digits or XORing two bits. So you can get the correct result out of a model like ChatGPT if you can make it follow the iterative algorithm instead of trying to give the sum of two 20-digit numbers directly. And I would guess you could, at least to some extent, use ChatGPT itself to rewrite and iteratively solve harder problems. The primary obstacle is probably that it is hard to tell which tasks the model can reliably perform directly and which must be broken down further.
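The digit-by-digit decomposition described above, written out explicitly (a sketch; the point is that each step only needs a single-digit addition plus a carry):

```python
def add_digit_by_digit(a, b):
    """Add two non-negative integers given as decimal digit lists
    (most significant digit first), using only single-digit additions -
    the primitive steps a model can plausibly perform reliably."""
    a, b = a[::-1], b[::-1]          # work from the least significant digit
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        s = da + db + carry           # at most 9 + 9 + 1
        result.append(s % 10)
        carry = s // 10
    if carry:
        result.append(carry)
    return result[::-1]

assert add_digit_by_digit([9, 9, 9], [1]) == [1, 0, 0, 0]
```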
smeagull | 2 years ago
The first layer consists of one node per input pattern, with weights and activations such that only the node matching the input pattern is active. A second layer of weights then provides the desired output for each pattern.
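Concretely, for n-bit parity that construction looks like the following (a sketch under the assumptions of threshold activations and one hidden unit per pattern):

```python
import itertools
import numpy as np

def build_lookup_net(n):
    """First layer: 2^n units, each firing only on its own input pattern
    (weights +1 where the pattern has a 1, -1 where it has a 0, threshold
    just below the number of ones). Second layer: weights equal to the
    desired output per pattern - here, parity."""
    patterns = list(itertools.product([0, 1], repeat=n))
    W1 = np.array([[1 if p else -1 for p in pat] for pat in patterns])  # (2^n, n)
    b1 = np.array([sum(pat) - 0.5 for pat in patterns])                 # thresholds
    w2 = np.array([sum(pat) % 2 for pat in patterns])                   # target outputs

    def forward(x):
        hidden = (W1 @ np.array(x) > b1).astype(int)  # exactly one unit active
        return int(w2 @ hidden)

    return forward

net = build_lookup_net(3)
for x in itertools.product([0, 1], repeat=3):
    assert net(x) == sum(x) % 2
```

This is a pure lookup table, so it needs 2^n hidden units; it solves XOR exactly but illustrates why it says nothing about generalization.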
aezart | 2 years ago