item 44376558

rar00 | 8 months ago
This argument works better for state space models. A transformer would still step through its context one token at a time, not maintain an internal 1e18 state.

    mgraczyk | 8 months ago
    That doesn't matter. Are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not.
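A minimal toy sketch of the distinction rar00 is drawing (all names and matrices here are hypothetical illustrations, not any real model or library): a state space model carries a fixed-size recurrent state forward, while a transformer's per-step "memory" is a key/value cache that grows with the context.

```python
import numpy as np

def ssm_scan(xs, d=4):
    # An SSM keeps a fixed-size recurrent state h regardless of
    # input length: h_t = A @ h_{t-1} + B * x_t.
    rng = np.random.default_rng(0)
    A = 0.9 * np.eye(d)               # toy state-transition matrix
    B = rng.standard_normal((d, 1))   # toy input projection
    h = np.zeros((d, 1))
    for x in xs:
        h = A @ h + B * x             # state stays d-dimensional forever
    return h

def transformer_cache(xs, d=4):
    # A transformer instead appends one key/value pair per token;
    # the cache it attends over grows linearly with context length.
    rng = np.random.default_rng(0)
    Wk = rng.standard_normal((d, 1))
    Wv = rng.standard_normal((d, 1))
    cache = []
    for x in xs:
        cache.append((Wk * x, Wv * x))
    return cache

xs = np.linspace(0.0, 1.0, 1000)
print(ssm_scan(xs).size)           # fixed state size: 4
print(len(transformer_cache(xs)))  # cache entries: 1000, one per token
```

The point of contention above is whether this difference matters in theory: the SSM compresses arbitrarily long history into a bounded state, while the transformer re-reads the whole (growing) cache at every step.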