leogao|3 months ago
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
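Roughly, in numpy (a toy sketch; the shapes and the top-1 router are just illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

# Mixture of experts: every expert's weights are dense (all nonzero),
# but a router selects only one expert per input.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_forward(x):
    # Top-1 routing: only the chosen expert's weights touch this input.
    expert_idx = np.argmax(x @ router)
    return x @ experts[expert_idx]

# Weight sparsity: most weights are exactly zero, but the sparse matrix
# participates in every forward pass.
mask = rng.random(size=(d, d)) < 0.1  # keep ~10% of weights
sparse_w = rng.normal(size=(d, d)) * mask

def sparse_forward(x):
    # Every nonzero weight is used on every input.
    return x @ sparse_w

x = rng.normal(size=d)
print(moe_forward(x))
print(sparse_forward(x))
```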
tripplyons|3 months ago
For weight sparsity, I know the BitNet b1.58 paper claims performance benefits from restricting weights to -1, 0, or 1, since that eliminates the need to multiply by the weights and lets weights with a value of 0 be ignored entirely.
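As a rough illustration (my own sketch, not code from the paper), a matrix-vector product with ternary weights needs only additions and subtractions, and the zero weights are skipped:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 8

# Ternary weights: each entry is -1, 0, or +1.
w = rng.choice([-1, 0, 1], size=(d_out, d_in), p=[0.25, 0.5, 0.25])
x = rng.normal(size=d_in)

def ternary_matvec(w, x):
    # No multiplications: +1 weights add the input, -1 weights subtract it,
    # and 0 weights contribute nothing at all.
    y = np.zeros(w.shape[0])
    for i in range(w.shape[0]):
        y[i] = x[w[i] == 1].sum() - x[w[i] == -1].sum()
    return y

# Matches the ordinary dense matvec.
assert np.allclose(ternary_matvec(w, x), w @ x)
```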
While on the topic, another kind of sparsity is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
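One generic way to do that (a sketch of the idea; I don't know the exact variant that paper used) is to also zero out small positive activations, not just negative ones:

```python
import numpy as np

def thresholded_relu(x, theta=0.5):
    # Like ReLU, but activations below the threshold are zeroed too,
    # increasing the fraction of exact zeros.
    return np.where(x > theta, x, 0.0)

x = np.random.default_rng(0).normal(size=1000)
print("ReLU sparsity:       ", np.mean(np.maximum(x, 0) == 0))   # roughly 0.5
print("thresholded sparsity:", np.mean(thresholded_relu(x) == 0))  # higher
```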
p1esk|3 months ago
Based on all the examples I've seen so far in this thread, it's clear there's no evidence that sparse models actually work better than dense models.
yorwba|3 months ago
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
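Concretely, with a top-k constraint the decoding cost per input scales with k rather than the total number of dimensions (a sketch assuming a TopK-style bottleneck; the shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k = 64, 8

def topk_sparse_code(h, k):
    # Keep only the k largest coefficients; the rest are exactly zero.
    z = np.zeros_like(h)
    idx = np.argpartition(h, -k)[-k:]
    z[idx] = h[idx]
    return z, idx

for d_hidden in (256, 4096):  # grow the dictionary, hold k fixed
    encoder = rng.normal(size=(d_model, d_hidden))
    decoder = rng.normal(size=(d_hidden, d_model))
    x = rng.normal(size=d_model)
    z, idx = topk_sparse_code(x @ encoder, k)
    # Decoding only needs the k active rows of the decoder, so this
    # step stays O(k * d_model) no matter how large d_hidden grows.
    x_hat = z[idx] @ decoder[idx]
    print(d_hidden, x_hat.shape)
```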