harles | 1 year ago
That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

vlovich123 | 1 year ago
Haven’t read the paper, but my guess is that it’s for the same reason that sparse attention networks (where many weights are zeroed out) still end up with large sparse tensors.

mayukhdeb | 1 year ago
In this paper, we don't zero out the weights. We remove them.
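A minimal sketch of the distinction mayukhdeb is drawing, assuming a PyTorch-style linear layer (the sizes, threshold, and variable names below are illustrative, not from the paper): zeroing weights leaves the tensor shape, and hence the parameter count, unchanged, while removing them means dropping whole rows so the layer itself gets smaller.

    # Sketch only: contrasts zero-masking with structural removal.
    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)
    print(layer.weight.numel())  # 262144 parameters

    # Zeroing (unstructured sparsity): mask small weights in place.
    # The tensor shape, and so the parameter count, is unchanged.
    with torch.no_grad():
        mask = layer.weight.abs() > 0.01  # threshold is arbitrary here
        layer.weight.mul_(mask)
    print(layer.weight.numel())  # still 262144

    # Removing (structured pruning): keep only the 256 output rows
    # with the largest L1 norm, producing a genuinely smaller layer.
    keep = layer.weight.abs().sum(dim=1).topk(256).indices
    smaller = nn.Linear(512, 256)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[keep])
        smaller.bias.copy_(layer.bias[keep])
    print(smaller.weight.numel())  # 131072, half the parameters

The first approach can help compute efficiency on hardware that exploits sparsity, but every zeroed weight is still stored; only the second actually reduces the parameter count the thread is discussing.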