korbip | 1 year ago
As a clarification: training speed will be on par with FlashAttention-2 when fully optimized and using only the mLSTM. For decoding/inference, both are very close to Mamba, since xLSTM is a recurrent architecture. The sLSTM has memory mixing, i.e. state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
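A minimal sketch of the mLSTM's recurrent (decoding-time) update may help here, following the matrix-memory formulation in the paper; gate values and variable names are illustrative, and the scalar gates are assumed to be already computed from their pre-activations:

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One recurrent mLSTM step (simplified sketch).

    C: (d, d) matrix memory, n: (d,) normalizer state,
    q, k, v: (d,) query/key/value, i_gate/f_gate: scalar gates.
    """
    d = k.shape[0]
    k = k / np.sqrt(d)                         # key scaling
    C = f_gate * C + i_gate * np.outer(v, k)   # matrix memory update
    n = f_gate * n + i_gate * k                # normalizer update
    denom = max(abs(n @ q), 1.0)               # stabilized normalization
    h = (C @ q) / denom                        # readout
    return C, n, h
```

Iterating this step over time gives the recurrent inference mode; for training, the same computation can be reorganized into a parallel, FlashAttention-like form, which is what makes the mLSTM sequence-parallelizable while the sLSTM (with memory mixing) is not.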
brookst | 1 year ago
Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2], will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it’s hard to compete on general purpose hardware?
1. https://www.etched.com/
2. https://www.embedded.com/ai-chip-features-hardware-support-f...
deepnet | 1 year ago
Can you summarise how the model in your paper differs from this implementation of xLSTM?
https://github.com/huggingface/transformers/issues/27011
thomasahle | 1 year ago
Unless you give them chain of thought. In which case they do great.
albertzeyer | 1 year ago
But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case, specifically when scaling up?
SpaceManNabs | 1 year ago
Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or does the xLSTM architecture slow it down so that its inference is as slow as Mamba's?
hh1 | 1 year ago
So in the mLSTM, each unit of the vector c is now a matrix (making the state a 3D tensor)? And we refer to each matrix as a head?
I'm having a bit of trouble understanding this fundamental part.
korbip | 1 year ago
For the matrix 'C' state there are also heads/cells, in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view the state as a 3D tensor, and here the matrix is the fundamental building block / concept.
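To illustrate the shape point above, a toy sketch (shapes and gate values are hypothetical, not from the paper): stacking each head's (d_head, d_head) matrix state gives a 3D tensor, and the update touches each head independently, with no cross-head terms.

```python
import numpy as np

num_heads, d_head = 4, 8
# 3D tensor of matrix states: one (d_head, d_head) matrix per head
C = np.zeros((num_heads, d_head, d_head))

rng = np.random.default_rng(0)
k = rng.normal(size=(num_heads, d_head))   # per-head keys
v = rng.normal(size=(num_heads, d_head))   # per-head values
i_gate = np.ones(num_heads)                # toy input-gate values
f_gate = np.full(num_heads, 0.9)           # toy forget-gate values

# einsum forms a separate outer product v_h k_h^T per head h,
# so each head's matrix is updated without seeing the others
C = (f_gate[:, None, None] * C
     + i_gate[:, None, None] * np.einsum('hd,he->hde', v, k))
```

Since `C` starts at zero and `i_gate` is 1 here, head h's matrix is exactly `np.outer(v[h], k[h])`, which makes the independence of the heads easy to check.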