I've seen this idea that "LLMs are just guessing the next token" repeated everywhere. It's true that accuracy on that task is what the training objective optimizes. That is not, however, what the model's output represents in use, in my opinion. I suspect the process is better understood as predicting the next concept, not the next token. As the computation passes from one layer to the next, this concept morphs from a simple token into an ever more abstract representation of an idea. That representation and all the others being created elsewhere from the text interact to form the next, even more abstract concept. In this way ideas "close" to each other get combined and can fuse into one another, until an "intelligent" final output is generated.

It is true that the present configuration doesn't give the LLM a very good way to look back at what its own output has been doing, and I suspect that kind of feedback will be necessary for big improvements in performance. Clearly, there is an integration of information occurring, and it is interesting to contemplate how that plays into Giulio Tononi's definition of consciousness in his integrated information theory.
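To make that layer-by-layer intuition concrete, here is a toy Python/numpy sketch, not anything from a real model: a single attention head with random, untrained weights, and all the names and sizes are made up for illustration. The point is just the mechanism: each layer mixes information across positions, so a position's vector typically drifts away from its raw token embedding toward a contextual blend.

    # Toy sketch of the "next concept" intuition: each self-attention
    # layer lets every position mix in information from every other
    # position, so the vector at a position drifts from "this token"
    # toward "this idea in context". Random weights, single head,
    # nothing trained -- purely illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 16          # embedding width (arbitrary)
    n_tokens = 5    # length of the toy sequence
    n_layers = 4    # depth of the toy stack

    def attention_layer(x, rng):
        # Project to queries/keys/values, then mix positions by
        # softmax-weighted averaging (scaled dot-product attention).
        Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d)
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return x + weights @ v   # residual connection: refine, don't replace

    x = rng.normal(size=(n_tokens, d))   # stand-in token embeddings
    start = x.copy()
    for layer in range(n_layers):
        x = attention_layer(x, rng)
        # Cosine similarity to the original embedding typically falls
        # with depth: each position encodes more context, less raw token.
        sim = (x * start).sum(-1) / (np.linalg.norm(x, axis=-1) *
                                     np.linalg.norm(start, axis=-1))
        print(f"layer {layer + 1}: mean similarity to raw token = {sim.mean():.2f}")

In a trained model the mixing weights are learned rather than random, but the structural point is the same: by the top of the stack, what gets predicted is a function of these blended, increasingly abstract vectors, not of the last token alone.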
8crazyideas|1 year ago