top | item 40295657

(no title)

korbip | 1 year ago

Disclaimer: I'm shared first author of this paper.

As a clarification: The speed for training will be on par with FlashAttention-2, when fully optimized and only including the mLSTM. For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is state tracking capabilities, for problems Transformers and State Space Models (and any other sequence-parallelizable architecture) cannot solve fundamentally.

discuss

brookst|1 year ago

Congrats on the paper, very interesting.

Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2], will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it’s hard to compete on general purpose hardware?

1. https://www.etched.com/

2. https://www.embedded.com/ai-chip-features-hardware-support-f...

deepnet|1 year ago

Fascinating work, very promising.

Can you summarise how the model in your paper differs from this implementation of xLSTM ?

https://github.com/huggingface/transformers/issues/27011

korbip|1 year ago

Thanks! I don't see any implementation there. In any case, we are planning a code release soon.

WithinReason|1 year ago

Can you expand on the "cannot solve fundamentally" part?

lucidrains|1 year ago

https://arxiv.org/abs/2404.08819

thomasahle|1 year ago

Transformers and SSMs can't do long computations that are inherently sequential.

Unless you give them chain of thought. In which case they do great.

albertzeyer|1 year ago

Congratulations on the paper. That's some very interesting work!

But you would want to include sLSTM as well to get the best performance, right? How does the speed compares in that case? Specifically when scaling up.

korbip|1 year ago

Thank you! I can say that it is not really a diminishing factor at the scales reported in the paper. So, xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed. We show that it is helpful on toy tasks, and it shows even better sequence extrapolation performance, so yes.

goldemerald|1 year ago

Great work! I'd love to start using the language model variant of your work. Do you know when/if it will be open sourced? I'd start using it today if it were that soon.

SpaceManNabs|1 year ago

> For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture

Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or the xLSTM architecture slows it down so that its inference is as slow as mamba?

hh1|1 year ago

When you talk about "c" or "scalar memory" in the paper, does that refer to a single unit in the vector usually referred to as c?

So in mLSTM, each unit of the vector c is now a matrix (so a 3d tensor)? And we refer to each matrix as a head?

Having a bit of issue understanding this fundamental part

korbip|1 year ago

You mainly got it right. Usually one does have many scalar 'c' cells, that talk to each other via memory mixing. For the sLSTM, you group them into heads, talking only to cells within the same head. The reason that we referred to scalar cells here is that these are that fundamental building block. Many of them can and are usually combined and vector notation is useful in this case.

For the matrix 'C' state, there are also heads/cells in that sense that you have multiple, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block / concept.

logicchains|1 year ago

To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The [1/0] model in the paper seemed to do quite well.

korbip|1 year ago

For language in general it seems fine. But there might be specific tasks where it is necessary indeed.