In case people are wondering why Mamba is exciting:
There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
Transformers are great because they can keep scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still outperform the equivalent Transformer, especially at longer sequence lengths. For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752
https://openreview.net/forum?id=TKIFuQHHECj#
I believe there is a lot of herding going on due to the influence of people who had compute to play around with, rather than deeply insightful or principled exploration of networks.
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state of any prompt prefix (i.e. a long system prompt, or statically retrieved context), and subsequent generations cost the same regardless of the prefix length.
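As a toy sketch of why prefix caching falls out for free (a generic linear recurrence with made-up dimensions, not the actual Mamba update): the hidden state is a fixed-size summary of everything seen so far, so the state after the prefix can be computed once and reused.

```python
import numpy as np

def step(h, x, A, B):
    # One recurrent update: the entire history is folded into h.
    return A @ h + B @ x

def run(tokens, A, B, h=None):
    # Fold a sequence of input vectors into a single hidden state.
    if h is None:
        h = np.zeros(A.shape[0])
    for x in tokens:
        h = step(h, x, A, B)
    return h

rng = np.random.default_rng(0)
d, n = 4, 8                                   # input dim, state dim
A = 0.1 * rng.normal(size=(n, n))             # kept small for stability
B = rng.normal(size=(n, d))

prefix = [rng.normal(size=d) for _ in range(100)]   # e.g. a long system prompt
suffix = [rng.normal(size=d) for _ in range(3)]     # e.g. the user's turn

h_prefix = run(prefix, A, B)            # precomputed once, cached
h_full = run(suffix, A, B, h=h_prefix)  # resume from the cached state

# Same result as processing everything from scratch, at 3 steps' cost:
assert np.allclose(h_full, run(prefix + suffix, A, B))
```

A Transformer can't do this trick as cheaply: attention still has to look back at every prefix token on each new step, whereas here the prefix costs nothing after the one-time fold.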
1. The Mamba co-author was also the FlashAttention lead author.
2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
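For the curious, the HiPPO-LegS matrix has a simple closed form. A sketch based on my reading of the HiPPO paper (sign and scaling conventions differ across implementations, so treat the details as illustrative):

```python
import numpy as np

def hippo_legs(n):
    # HiPPO-LegS transition matrix (scaled Legendre measure):
    #   A[i, j] = -sqrt(2i+1) * sqrt(2j+1)  for i > j
    #   A[i, i] = -(i + 1)                  on the diagonal
    #   A[i, j] = 0                         above the diagonal
    # S4-style models initialize their state matrix from this
    # instead of from random values.
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i > j:
                A[i, j] = -np.sqrt(2 * i + 1) * np.sqrt(2 * j + 1)
            elif i == j:
                A[i, j] = -(i + 1)
    return A

A = hippo_legs(4)
assert np.allclose(A, np.tril(A))   # lower triangular
assert np.all(np.diag(A) < 0)       # stable (decaying) diagonal
```

The point of the structure is that the resulting state optimally compresses the input history onto a Legendre-polynomial basis, which is what a random initialization can't give you.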
Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.
The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is crystal clear to me, and the included references and context are a treasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article.
One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training.
Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
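To make the excerpt concrete: with a diagonal A (as in Mamba) the zero-order-hold discretization is elementwise, and an input-dependent ∆ alone already makes the effective transition input-dependent. A small numerical sketch (the ZOH formulas follow the paper; the numbers are made up):

```python
import numpy as np

def discretize(A_diag, B, delta):
    # Zero-order hold for diagonal A:
    #   A_bar = exp(delta * A),  B_bar = (A_bar - 1) / A * B
    # A itself is a fixed parameter; delta varies per input token.
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

A = np.array([-1.0, -0.5])    # constant continuous-time parameter
B = np.array([1.0, 1.0])

# Two tokens with different (input-dependent) step sizes:
A_keep, _ = discretize(A, B, delta=0.1)   # small delta: A_bar ~ 1, state retained
A_drop, _ = discretize(A, B, delta=5.0)   # large delta: A_bar ~ 0, state reset

# Same A, very different effective transitions: selectivity via delta alone.
assert np.all(A_keep > 0.9) and np.all(A_drop < 0.1)
```

So a token can learn to emit a large ∆ ("forget the state, focus on me") or a tiny one ("pass the state through"), without A ever being input-dependent.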
If I'm not mistaken, the largest Mamba model right now is 2.8B, and it's undertrained on low-quality data (the Pile only). The main problem is that it's new and unproven.
Should become very interesting once someone with both data and significant financial backing takes the plunge and trains something of notable size. Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).
[1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj
[2]: https://huggingface.co/stabilityai/stablelm-3b-4e1t
> low quality data (the Pile only)
The Pile is pretty good quality-wise. It's mostly the size (300B tokens) that's limiting.
This was really helpful, but it only discusses linear operations, which obviously can't be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular, the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?
EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer.
Yes you're spot on, the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity/to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
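For anyone who wants the shape of that gating, here is a very rough sketch of a Mamba block (the argument names and the `ssm` stand-in are hypothetical; the real block also has a conv1d and the selection machinery, which are omitted here):

```python
import numpy as np

def silu(x):
    # SiLU/Swish: the nonlinearity wrapped around the linear SSM core.
    return x / (1.0 + np.exp(-x))

def mamba_block_sketch(x, W_in, W_gate, W_out, ssm):
    # Rough shape of a Mamba block: two projections of the input; one
    # is fed through the (linear) SSM scan, the other becomes a
    # multiplicative SiLU gate -- the "X" in Figure 3 of the paper.
    u = silu(x @ W_in)      # branch that goes through the SSM
    y = ssm(u)              # linear selective scan (stubbed out here)
    g = silu(x @ W_gate)    # gate branch: a source of nonlinearity
    return (y * g) @ W_out  # gated merge, projected back to model dim

rng = np.random.default_rng(0)
L, d, e = 16, 32, 64        # sequence length, model dim, expanded dim
x = rng.normal(size=(L, d))
W_in, W_gate = rng.normal(size=(d, e)), rng.normal(size=(d, e))
W_out = rng.normal(size=(e, d))

out = mamba_block_sketch(x, W_in, W_gate, W_out, ssm=lambda u: u)
assert out.shape == (L, d)
```

Stacks of these blocks with residual connections and layer norm play the role that alternating attention/MLP layers play in a Transformer.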
From what I can tell, all the large players in the space are continuing to develop on transformers, right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
Too new is definitely one thing. Someone is going to have to gamble on paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.
There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.
> All of the sources I see referred to as derivations of it have a discretization of the form h_t = A h_{t-1} + B x_{t-1} instead of the given h_t = A h_{t-1} + B x_t. Does anyone know why this is?

Not sure how much detail you need, but generally there exist implicit and explicit integrators for numerically solving (integrating) ODEs. The implicit ones, like the one used here, tend to be more stable. The ideas behind SSMs come from control theory, which used integrators with stability guarantees so that the rest of the neural network can focus on other aspects of the problem.
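A tiny illustration of that stability difference on a toy ODE, dh/dt = a·h, with a deliberately large step size (explicit updates use the old value, implicit ones the new):

```python
a, dt, steps = -10.0, 0.5, 60   # stiff decay; true solution -> 0

h_exp = h_imp = 1.0
for _ in range(steps):
    h_exp = h_exp + dt * a * h_exp   # explicit Euler: uses the old value
    h_imp = h_imp / (1.0 - dt * a)   # implicit Euler: h_new = h_old + dt*a*h_new

# |1 + dt*a| = 4 > 1, so explicit Euler oscillates and blows up, while
# implicit Euler (factor 1/6 per step) decays, like the true solution.
assert abs(h_imp) < 1e-6
assert abs(h_exp) > 1e10
```

Nothing Mamba-specific here, but it shows why you'd pick the implicit form when the network, not you, gets to choose the effective step sizes.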
Very annoying namespace conflict, since a package called "mamba" (a faster reimplementation of the conda package manager: https://github.com/mamba-org/mamba, https://anaconda.org/conda-forge/mamba) already existed for a while before this architecture was even dreamed up.
Beyond that, I'll care about an alternative to transformers when it shows superior performance in an open-source 7B-34B model compared to its transformer competitors. This has not happened yet.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I use the former and have been experimenting with the latter. Fortunately, the contexts are separate enough that they never come up in the same sentence.
> Importantly, these recurrent and convolutional forms, which I like to call “RNN mode” and “CNN mode,” are mathematically equivalent. This allows S4 to shape-shift depending on what you need it to do, with no difference in its outputs.
Is this really true? It seems to ignore hardware and data type precision entirely. Computing the same mathematical result in a different way with floating point often leads to different outputs.
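You're right that bit-for-bit equality isn't guaranteed; the equivalence is mathematical, and different summation orders give slightly different floats. A scalar toy version of the two modes (illustrating the identity only, not how S4 actually materializes the kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 0.9, 0.5, 2.0    # scalar SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t
x = rng.normal(size=256)
L = len(x)

# "RNN mode": sequential recurrence
h, y_rnn = 0.0, np.empty(L)
for t in range(L):
    h = a * h + b * x[t]
    y_rnn[t] = c * h

# "CNN mode": one convolution with the unrolled kernel (c*b, c*a*b, c*a^2*b, ...)
K = c * b * a ** np.arange(L)
y_cnn = np.convolve(x, K)[:L]

# Equal up to floating-point rounding, which is exactly the caveat above.
assert np.allclose(y_rnn, y_cnn)
```

In practice the outputs agree to within normal floating-point tolerance, which is the same situation you're already in when a GPU reorders reductions inside a single mode.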
I'm fairly positive I could actually understand the terminology used in discussing machine learning models if it were presented in a way that built up from first principles a bit better, instead of diving directly into high-level abstract equations and symbols.
I'd like a way to learn this stuff as a computer engineer, in the same spirit as "big scary math symbols are just for-loops"
Ironically, you can probably just ask a Transformer model to explain it to you.
I'm the same as you: I have no problem grasping complex concepts, I just always struggled with the mathematical notation. I did pass linear algebra in university, but was glad I could go back to programming after that. Even then, I mostly passed linear algebra because I wrote functions that solve linear algebra equations until I fully grasped the concept.
I've found that GPT-4 is very good at taking a math-notation-rich document and just describing it in terms a math-notation-averse engineer would understand.
I was a data engineer for about 6-7 years at various companies, always working together with data scientists who insist that `x_` or `_phi` are proper variable names. Man am I glad to be working with engineers now.
That's a heuristic that's usually true. You can definitely understand convolution or attention better with a "big scary math symbols are just for-loops" explanation, but there are also things like dopri5 or elliptic curve cryptography where we just have to accept that Weird Math Shit is happening and the symbols are inevitable. It looks to me like Mamba has dragged a part of LLM research into the latter camp.
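In that spirit, single-head attention genuinely is just for-loops; a pure-Python sketch (no batching, masking, or learned projections):

```python
import math

def attention(Q, K, V):
    # out[i] = sum_j softmax_j(Q[i]·K[j] / sqrt(d)) * V[j]
    d = len(Q[0])
    out = []
    for q in Q:                               # for every query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]                 # ...score every key,
        m = max(scores)                       # subtract max for stability,
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]  # softmax the scores,
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of values.
    return out

# Two indistinguishable keys get weight 0.5 each: output is the mean of V.
out = attention(Q=[[1.0, 0.0]], K=[[0.0, 0.0], [0.0, 0.0]],
                V=[[1.0, 0.0], [3.0, 0.0]])
assert out == [[2.0, 0.0]]
```

The HiPPO/Legendre side of Mamba is harder to reduce to loops like this, which is the "latter camp" point.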
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 is already doing up to 1 million tokens. I have not seen any large-scale Mamba model, so I'm not aware of its shortcomings, but I am sure there are tradeoffs.
It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?
MoE lets you scale model size up without scaling compute, which hopefully leads to more intelligent models. It is orthogonal to context size, however: the ability to process a lot of tokens/text.
Is there any reason why it wouldn't scale to 7B or more? Have they tried it?
https://github.com/jzhang38/LongMamba