harles | 1 year ago

That could explain compute efficiency, but it has nothing to do with the parameter efficiency pointed to in the paper.

vlovich123 | 1 year ago

Haven’t read the paper, but my guess is that it’s for the same reason that sparse attention networks (where many weights are zeroed out) still end up with larger sparse tensors.

mayukhdeb | 1 year ago

In this paper, we don't zero out the weights. We remove them.
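A minimal sketch of that distinction, assuming PyTorch and made-up layer sizes (a 512 -> 512 linear layer, pruned to 256 rows; the norm-based selection is a stand-in criterion, not the paper's method). Zeroing out weights leaves the tensor, and thus the parameter count, unchanged; removing them produces a structurally smaller tensor:

    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)

    # "Zeroing out": mask half the weights at random. The tensor keeps its
    # full shape, so the parameter count (and memory) is unchanged.
    mask = torch.rand_like(layer.weight) > 0.5
    masked = layer.weight.data * mask
    print(masked.numel())  # still 512 * 512 = 262144 parameters

    # "Removing": build a structurally smaller layer, here by keeping only
    # the 256 output rows with the largest L2 norm (a stand-in criterion,
    # not the paper's method). The tensor itself shrinks.
    keep = layer.weight.norm(dim=1).topk(256).indices
    pruned = nn.Linear(512, 256)
    pruned.weight.data = layer.weight.data[keep].clone()
    pruned.bias.data = layer.bias.data[keep].clone()
    print(pruned.weight.numel())  # 512 * 256 = 131072 parameters

The masked tensor still stores all 262,144 values, which is why zeroing alone can help compute efficiency but not parameter efficiency; the pruned layer stores half as many.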