harles | 1 year ago

That could explain compute efficiency, but it has nothing to do with the parameter efficiency pointed to in the paper.

vlovich123 | 1 year ago

Haven’t read the paper, but my guess is that it’s for the same reason that sparse attention networks (where many weights are zeroed out) still end up with larger sparse tensors.

mayukhdeb | 1 year ago

In this paper, we don't zero out the weights. We remove them.
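A minimal sketch of that distinction, assuming PyTorch and made-up layer sizes (a 512 -> 512 linear layer, pruned to 256 rows; the norm-based selection is a stand-in criterion, not the paper's method). Zeroing out weights leaves the tensor, and thus the parameter count, unchanged; removing them produces a structurally smaller tensor:

    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)

    # "Zeroing out": mask half the weights at random. The tensor keeps its
    # full shape, so the parameter count (and memory) is unchanged.
    mask = torch.rand_like(layer.weight) > 0.5
    masked = layer.weight.data * mask
    print(masked.numel())  # still 512 * 512 = 262144 parameters

    # "Removing": build a structurally smaller layer, here by keeping only
    # the 256 output rows with the largest L2 norm (a stand-in criterion,
    # not the paper's method). The tensor itself shrinks.
    keep = layer.weight.norm(dim=1).topk(256).indices
    pruned = nn.Linear(512, 256)
    pruned.weight.data = layer.weight.data[keep].clone()
    pruned.bias.data = layer.bias.data[keep].clone()
    print(pruned.weight.numel())  # 512 * 256 = 131072 parameters

The masked tensor still stores all 262,144 values, which is why zeroing alone can help compute efficiency but not parameter efficiency; the pruned layer stores half as many.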