tripplyons | 3 months ago
For weight sparsity, I know the BitNet 1.58 paper claims improved performance from restricting weights to -1, 0, or 1: multiplication by the weights is eliminated, and weights with a value of 0 can be skipped entirely.
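A minimal sketch of that idea in NumPy (shapes and names are hypothetical, not from the paper): with ternary weights, a matrix-vector product reduces to additions, subtractions, and skips.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)            # input activations
W = rng.integers(-1, 2, size=(4, 8))  # ternary weights in {-1, 0, 1}

def ternary_matvec(W, x):
    # No multiplications: +1 adds the input, -1 subtracts it,
    # and zero-valued weights are ignored entirely.
    out = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

# Matches the ordinary dense matvec.
assert np.allclose(ternary_matvec(W, x), W @ x)
```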
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
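One common trick along those lines (a sketch of the general idea, not necessarily that paper's exact method) is a thresholded ReLU that zeroes small activations, not just negative ones:

```python
import numpy as np

def thresholded_relu(a, t=0.5):
    # Zero out anything at or below the cutoff t, so more
    # activations are exactly 0 than with plain ReLU.
    return np.where(a > t, a, 0.0)

a = np.random.default_rng(1).standard_normal(1000)
print((np.maximum(a, 0) == 0).mean())     # sparsity of plain ReLU (~0.5)
print((thresholded_relu(a) == 0).mean())  # higher with the threshold
```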
p1esk | 3 months ago
Based on all the examples I've seen so far in this thread, it's clear there's no evidence that sparse models actually work better than dense models.