(no title)
miven | 11 months ago
I guess what I'm trying to convey is that the latent representations within a transformer are conditioned on all previous latents through attention. So while the old cache entries of course never change, the cache grows with each new token, and at least in principle the "state" can be brought up to date by being re-encoded in updated form in the latents of subsequent tokens.
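To make that concrete, here is a rough toy sketch in plain Python. It assumes a single attention head with identity projections (query = key = value = the token vector), which no real transformer uses, and the names `attend` and `KVCache` are made up for illustration. It only shows the two properties described above: cached entries are never rewritten, yet each new token's output is computed from the whole history.

```python
import math

def attend(query, keys, values):
    """Single-head dot-product attention of one query over all cached entries."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical append-only cache: old entries are never modified."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Toy assumption: identity projections, so key = value = query = x.
        self.keys.append(tuple(x))
        self.values.append(tuple(x))
        # The new token attends over everything cached so far, itself included.
        return attend(x, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0])
snapshot = list(cache.keys)        # copy of the cache after the first token
out2 = cache.step([0.0, 1.0])

# The entry written for token 1 is untouched by processing token 2 ...
old_entries_unchanged = cache.keys[:1] == snapshot
# ... but token 2's output carries a contribution from token 1.
new_output_mixes_history = out2[0] > 0.0
```

In a real multi-layer model the output for the new token would be projected into that token's own keys and values at the next layer, which is where an "updated" summary of earlier state could end up in the cache without any old entry being touched.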