top | item 39964794

maxrumpf | 1 year ago

The abstract and the rest of the paper don't really match imo. It's not really allocating more to some sequences, but just introducing something like dropout. They might be two sides of the same coin, but it was still a weird read.

adamsantoro | 1 year ago

We spent a fair bit of effort ensuring we were accurate with the language and claims, so we're happy to take any feedback and make updates in subsequent versions. However, I don't see where we claim that MoD allocates more to some sequences and not others (specifically, the abstract says "transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence").

That said, it's a pretty simple change to make the approach work in the way you describe (allocating more to some sequences and not others) by changing the group across which the top-k operates. In the paper we use the time (sequence) dimension, but one could also use the batch * time dimension, which would result in asymmetric allocation across sequences.
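A minimal sketch of the two groupings, assuming hypothetical router scores (one scalar per token) and made-up sizes; per-sequence top-k gives every sequence the same token budget, while flattening over batch * time lets the budget shift between sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time, k = 4, 16, 4  # hypothetical sizes; k tokens per sequence get full compute

# Hypothetical router scores: one scalar per token.
scores = rng.standard_normal((batch, time))

# Per-sequence top-k (the paper's setting): exactly k tokens selected per sequence.
per_seq = np.argsort(scores, axis=-1)[:, -k:]  # indices of the k highest scores per row

# Batch-wide top-k (the variant described above): k * batch tokens total,
# selected over the flattened batch * time axis, so allocation across
# sequences can be asymmetric while the total budget stays the same.
flat_idx = np.argsort(scores.ravel())[-k * batch:]
per_batch_counts = np.bincount(flat_idx // time, minlength=batch)

print(per_seq.shape)           # (4, 4): fixed budget per sequence
print(per_batch_counts)        # counts may differ across sequences
print(per_batch_counts.sum())  # total budget is unchanged: 16
```

The total compute is identical in both cases; only the grouping over which the top-k competition runs changes.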

hackerlight|1 year ago

Dropout applies at train time; this applies at inference time. Dropout is random; this is deterministic. You can't compare them.