Yes, as I said, the prompt (the entire history of the conversation, including vendor prompting that the user can't see) entirely determines the internal state, given the LLM's weights. But the fact that the prediction starts from scratch at each new token doesn't mean that the new internal state is very different from the previous one. A state that represents the general meaning of the conversation and where the sentence is going will not be influenced much by a new token appended to the end. So the internal state "persists" and transitions smoothly even though it is destroyed and recreated from scratch at each prediction.
giantrobot|4 days ago
Nothing is persisted in the LLM itself (weights, layers, etc.) nor in the hardware (modulo token caching or other scaling mechanisms). In fact, with the big inference providers, two sessions of a chat will rarely (if ever) execute on the same hardware.
throw310822|4 days ago
Maybe it's not clear what I mean by "state". I mean a pattern of activations in the deep layers of the network that encodes some high-level semantics. Not something that is persisted. Something that doesn't need to be persisted, precisely because it is fully determined by the context, and the context stays roughly the same.
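
To make that concrete, here is a toy sketch of the idea, not of how a real transformer works. It assumes a made-up "state" that is just the mean of hash-derived token vectors; `toy_embedding`, `toy_state` and `DIM` are hypothetical names invented for the illustration. The only point is that a state recomputed from scratch from the context is reproducible anywhere and barely moves when one token is appended.

```python
import hashlib
import math

DIM = 16  # size of the toy vectors (arbitrary, for illustration only)


def toy_embedding(token):
    """Deterministic stand-in for an embedding: derived from the token's hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def toy_state(tokens):
    """Recompute the 'state' from scratch: the mean of all token vectors.

    Nothing is cached or carried over between calls; the result depends
    only on the tokens passed in.
    """
    total = [0.0] * DIM
    for tok in tokens:
        for i, x in enumerate(toy_embedding(tok)):
            total[i] += x
    return [x / len(tokens) for x in total]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A 75-token context, then the same context with one token appended.
context = ("the cat sat on the mat and looked out of the window at the rain " * 5).split()
before = toy_state(context)
after = toy_state(context + ["today"])

# "Another machine" recomputing from the same context gets the same state.
rebuilt = toy_state(list(context))

print("identical when rebuilt:", rebuilt == before)
print("similarity after one new token:", cosine(before, after))
```

Real activations are obviously far richer than an average, so take this only as an illustration of "fully determined by the context, and the context stays roughly the same".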