In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
Transformers are great because they can keep scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still outperform the equivalent Transformer, especially at longer sequence lengths. For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752
https://openreview.net/forum?id=TKIFuQHHECj#
I believe there is a lot of herding going on due to the influence of people who had compute to play around with, rather than deeply insightful or principled exploration of networks.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state of any prompt prefix (i.e. a long system prompt, or statically retrieved context), and subsequent generations cost the same regardless of the prefix length.
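As a toy sketch of why prefix caching falls out for free (a generic linear recurrence with made-up dimensions, not the actual Mamba update): the hidden state is a fixed-size summary of everything seen so far, so the state after the prefix can be computed once and reused.

```python
import numpy as np

def step(h, x, A, B):
    # One recurrent update: the entire history is folded into h.
    return A @ h + B @ x

def run(tokens, A, B, h=None):
    # Fold a sequence of input vectors into a single hidden state.
    if h is None:
        h = np.zeros(A.shape[0])
    for x in tokens:
        h = step(h, x, A, B)
    return h

rng = np.random.default_rng(0)
d, n = 4, 8                                   # input dim, state dim
A = 0.1 * rng.normal(size=(n, n))             # kept small for stability
B = rng.normal(size=(n, d))

prefix = [rng.normal(size=d) for _ in range(100)]   # e.g. a long system prompt
suffix = [rng.normal(size=d) for _ in range(3)]     # e.g. the user's turn

h_prefix = run(prefix, A, B)            # precomputed once, cached
h_full = run(suffix, A, B, h=h_prefix)  # resume from the cached state

# Same result as processing everything from scratch, at 3 steps' cost:
assert np.allclose(h_full, run(prefix + suffix, A, B))
```

A Transformer can't do this trick as cheaply: attention still has to look back at every prefix token on each new step, whereas here the prefix costs nothing after the one-time fold.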
1. The Mamba co-author was also the FlashAttention lead author.
2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
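For the curious, the HiPPO-LegS matrix has a simple closed form. A sketch based on my reading of the HiPPO paper (sign and scaling conventions differ across implementations, so treat the details as illustrative):

```python
import numpy as np

def hippo_legs(n):
    # HiPPO-LegS transition matrix (scaled Legendre measure):
    #   A[i, j] = -sqrt(2i+1) * sqrt(2j+1)  for i > j
    #   A[i, i] = -(i + 1)                  on the diagonal
    #   A[i, j] = 0                         above the diagonal
    # S4-style models initialize their state matrix from this
    # instead of from random values.
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i > j:
                A[i, j] = -np.sqrt(2 * i + 1) * np.sqrt(2 * j + 1)
            elif i == j:
                A[i, j] = -(i + 1)
    return A

A = hippo_legs(4)
assert np.allclose(A, np.tril(A))   # lower triangular
assert np.all(np.diag(A) < 0)       # stable (decaying) diagonal
```

The point of the structure is that the resulting state optimally compresses the input history onto a Legendre-polynomial basis, which is what a random initialization can't give you.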
Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.
The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is crystal clear to me, and the included references and context are a treasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article.
One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training.
Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
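To make the excerpt concrete: with a diagonal A (as in Mamba) the zero-order-hold discretization is elementwise, and an input-dependent ∆ alone already makes the effective transition input-dependent. A small numerical sketch (the ZOH formulas follow the paper; the numbers are made up):

```python
import numpy as np

def discretize(A_diag, B, delta):
    # Zero-order hold for diagonal A:
    #   A_bar = exp(delta * A),  B_bar = (A_bar - 1) / A * B
    # A itself is a fixed parameter; delta varies per input token.
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

A = np.array([-1.0, -0.5])    # constant continuous-time parameter
B = np.array([1.0, 1.0])

# Two tokens with different (input-dependent) step sizes:
A_keep, _ = discretize(A, B, delta=0.1)   # small delta: A_bar ~ 1, state retained
A_drop, _ = discretize(A, B, delta=5.0)   # large delta: A_bar ~ 0, state reset

# Same A, very different effective transitions: selectivity via delta alone.
assert np.all(A_keep > 0.9) and np.all(A_drop < 0.1)
```

So a token can learn to emit a large ∆ ("forget the state, focus on me") or a tiny one ("pass the state through"), without A ever being input-dependent.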
If I'm not mistaken, the largest Mamba model right now is 2.8B, and it's undertrained on low-quality data (the Pile only). The main problem is that it's new and unproven.
Should become very interesting once someone with both data and significant financial backing takes the plunge and trains something of notable size. Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).
[1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj
[2]: https://huggingface.co/stabilityai/stablelm-3b-4e1t
> low quality data (the Pile only)
The Pile is pretty good quality-wise. It's mostly the size (300B tokens) that's limiting.
This was really helpful, but it only discusses linear operations, which obviously can't be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular, the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?
EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer.
Yes you're spot on, the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity/to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
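For anyone who wants the shape of that gating, here is a very rough sketch of a Mamba block (the argument names and the `ssm` stand-in are hypothetical; the real block also has a conv1d and the selection machinery, which are omitted here):

```python
import numpy as np

def silu(x):
    # SiLU/Swish: the nonlinearity wrapped around the linear SSM core.
    return x / (1.0 + np.exp(-x))

def mamba_block_sketch(x, W_in, W_gate, W_out, ssm):
    # Rough shape of a Mamba block: two projections of the input; one
    # is fed through the (linear) SSM scan, the other becomes a
    # multiplicative SiLU gate -- the "X" in Figure 3 of the paper.
    u = silu(x @ W_in)      # branch that goes through the SSM
    y = ssm(u)              # linear selective scan (stubbed out here)
    g = silu(x @ W_gate)    # gate branch: a source of nonlinearity
    return (y * g) @ W_out  # gated merge, projected back to model dim

rng = np.random.default_rng(0)
L, d, e = 16, 32, 64        # sequence length, model dim, expanded dim
x = rng.normal(size=(L, d))
W_in, W_gate = rng.normal(size=(d, e)), rng.normal(size=(d, e))
W_out = rng.normal(size=(e, d))

out = mamba_block_sketch(x, W_in, W_gate, W_out, ssm=lambda u: u)
assert out.shape == (L, d)
```

Stacks of these blocks with residual connections and layer norm play the role that alternating attention/MLP layers play in a Transformer.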
From what I can tell, all the large players in the space are continuing to develop on transformers, right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
Too new is definitely one thing. Someone is going to have to gamble on paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.
There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.
> All of the sources I see referred to as derivations of it have a discretization of the form h_t = A h_{t-1} + B x_{t-1} instead of the given h_t = A h_{t-1} + B x_t. Does anyone know why this is?

Not sure how much detail you need, but generally there exist implicit and explicit integrators for numerically solving (integrating) ODEs. The implicit ones, like the one used here, tend to be more stable. The ideas behind SSMs come from control theory, which used integrators with stability guarantees so that the rest of the neural network can focus on other aspects of the problem.
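A tiny illustration of that stability difference on a toy ODE, dh/dt = a·h, with a deliberately large step size (explicit updates use the old value, implicit ones the new):

```python
a, dt, steps = -10.0, 0.5, 60   # stiff decay; true solution -> 0

h_exp = h_imp = 1.0
for _ in range(steps):
    h_exp = h_exp + dt * a * h_exp   # explicit Euler: uses the old value
    h_imp = h_imp / (1.0 - dt * a)   # implicit Euler: h_new = h_old + dt*a*h_new

# |1 + dt*a| = 4 > 1, so explicit Euler oscillates and blows up, while
# implicit Euler (factor 1/6 per step) decays, like the true solution.
assert abs(h_imp) < 1e-6
assert abs(h_exp) > 1e10
```

Nothing Mamba-specific here, but it shows why you'd pick the implicit form when the network, not you, gets to choose the effective step sizes.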
Very annoying namespace conflict, since a package called "mamba" (a faster reimplementation of the conda package manager: https://github.com/mamba-org/mamba, https://anaconda.org/conda-forge/mamba) already existed for a while before this architecture was even dreamed up.
Beyond that, I'll care about an alternative to transformers when it shows superior performance in an open-source 7B-34B model compared to its transformer competitors. This has not happened yet.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I use the former and have been experimenting with the latter. Fortunately, the contexts are separate enough that they never come up in the same sentence.
> Importantly, these recurrent and convolutional forms, which I like to call “RNN mode” and “CNN mode,” are mathematically equivalent. This allows S4 to shape-shift depending on what you need it to do, with no difference in its outputs.
Is this really true? It seems to ignore hardware and data type precision entirely. Computing the same mathematical result in a different way with floating point often leads to different outputs.
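You're right that bit-for-bit equality isn't guaranteed; the equivalence is mathematical, and different summation orders give slightly different floats. A scalar toy version of the two modes (illustrating the identity only, not how S4 actually materializes the kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 0.9, 0.5, 2.0    # scalar SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t
x = rng.normal(size=256)
L = len(x)

# "RNN mode": sequential recurrence
h, y_rnn = 0.0, np.empty(L)
for t in range(L):
    h = a * h + b * x[t]
    y_rnn[t] = c * h

# "CNN mode": one convolution with the unrolled kernel (c*b, c*a*b, c*a^2*b, ...)
K = c * b * a ** np.arange(L)
y_cnn = np.convolve(x, K)[:L]

# Equal up to floating-point rounding, which is exactly the caveat above.
assert np.allclose(y_rnn, y_cnn)
```

In practice the outputs agree to within normal floating-point tolerance, which is the same situation you're already in when a GPU reorders reductions inside a single mode.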
I'm fairly positive I could actually understand the terminology used in discussing machine learning models if it were presented in a way that built up from first principles a bit better, instead of diving directly into high-level abstract equations and symbols.
I'd like a way to learn this stuff as a computer engineer, in the same spirit as "big scary math symbols are just for-loops"
Ironically, you can probably just ask a Transformer model to explain it to you.
I'm the same as you: I have no problem grasping complex concepts, I just always struggled with the mathematical notation. I did pass linear algebra in university, but was glad I could go back to programming after that. Even then, I mostly passed linear algebra because I wrote functions that solve linear algebra equations until I fully grasped the concept.
I've found that GPT-4 is very good at taking a math-notation-rich document and just describing it in terms a math-notation-averse engineer would understand.
I was a data engineer for about 6-7 years at various companies, always working together with data scientists who insist that `x_` or `_phi` are proper variable names. Man am I glad to be working with engineers now.
That's a heuristic that's usually true. You can definitely understand convolution or attention better with a "big scary math symbols are just for-loops" explanation, but there are also things like dopri5 or elliptic curve cryptography where we just have to accept that Weird Math Shit is happening and the symbols are inevitable. It looks to me like Mamba has dragged a part of LLM research into the latter camp.
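In that spirit, single-head attention genuinely is just for-loops; a pure-Python sketch (no batching, masking, or learned projections):

```python
import math

def attention(Q, K, V):
    # out[i] = sum_j softmax_j(Q[i]·K[j] / sqrt(d)) * V[j]
    d = len(Q[0])
    out = []
    for q in Q:                               # for every query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]                 # ...score every key,
        m = max(scores)                       # subtract max for stability,
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]  # softmax the scores,
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of values.
    return out

# Two indistinguishable keys get weight 0.5 each: output is the mean of V.
out = attention(Q=[[1.0, 0.0]], K=[[0.0, 0.0], [0.0, 0.0]],
                V=[[1.0, 0.0], [3.0, 0.0]])
assert out == [[2.0, 0.0]]
```

The HiPPO/Legendre side of Mamba is harder to reduce to loops like this, which is the "latter camp" point.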
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 is already doing up to 1 million tokens. I have not seen any large-scale Mamba model, so I'm not aware of its shortcomings, but I am sure there are tradeoffs.
It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?
MoE lets you scale model size up without scaling compute, which hopefully leads to more intelligent models. It is orthogonal to context size, however: the ability to process a lot of tokens/text.
Is there any reason why it wouldn't scale to 7B or more? Have they tried it?
https://github.com/jzhang38/LongMamba