The active bits may change with each token. You need the whole model in memory, even though any single token only uses a subset of it in computation. The memory efficiency comes when you run multiple sessions in parallel, because different tokens activate different parts of the model and more of the resident memory gets used at each step.
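
A rough sketch of the idea, assuming this is a mixture-of-experts style model. The expert count, the top-k value, and the random "router" are all made up for illustration; a real router is learned and is not uniform.

```python
import random

NUM_EXPERTS = 8  # hypothetical number of experts in one layer
TOP_K = 2        # hypothetical number of experts activated per token


def route_token(rng):
    """Stand-in for a learned router: choose TOP_K distinct experts for one token."""
    return set(rng.sample(range(NUM_EXPERTS), TOP_K))


def experts_touched(num_sessions, rng):
    """Experts used in one decoding step when num_sessions tokens run together."""
    used = set()
    for _ in range(num_sessions):
        used |= route_token(rng)
    return used


rng = random.Random(0)

# One session: each step touches only TOP_K experts, and which ones can change.
single = [experts_touched(1, rng) for _ in range(5)]
for step, used in enumerate(single):
    print(f"step {step}: experts {sorted(used)}")

# Many sessions in the same step: far more of the loaded experts do work.
batched = experts_touched(32, rng)
print(f"32 sessions, one step: {len(batched)} of {NUM_EXPERTS} experts used")
```

All `NUM_EXPERTS` experts have to be loaded in both cases, since you can't know in advance which ones the next token will pick. The difference is only in how much of that loaded memory does useful work per step.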
rahimnathwani|1 year ago