
jackcook | 2 years ago

Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:

“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
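The quoted point can be seen numerically: with a fixed A, a per-token (selective) step size ∆ already makes the discretized Ā = exp(∆A) input-dependent. A minimal NumPy sketch (shapes and values are illustrative, not from the paper's implementation):

```python
import numpy as np

np.random.seed(0)
d_state = 4    # state dimension N (assumed for illustration)
seq_len = 3    # number of tokens

# Fixed, non-selective diagonal A (negative entries for a stable decay)
A = -np.exp(np.random.randn(d_state))

# Selective delta: one positive step size per token
delta = np.abs(np.random.randn(seq_len, 1))

# Discretization from the paper: A_bar = exp(delta * A)
A_bar = np.exp(delta * A)

# Each token gets a different effective decay, so selectivity in delta
# induces selectivity in A_bar even though A itself is shared.
print(A_bar.shape)                      # (3, 4)
print(np.allclose(A_bar[0], A_bar[1]))  # False: per-token decay differs
```

So making A itself input-dependent would be redundant with selecting ∆, which is consistent with the authors leaving it out for simplicity.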


nlrk | 2 years ago

When I read the paper, I thought the idea was that making \Delta selective lets the model learn different things over different time scales. As you quoted, it's "the main source of improvement".

I don’t have an LLM background, just controls, so I might be wrong.