(no title)
alevskaya | 2 years ago
I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
alevskaya | 2 years ago
I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
ggerganov|2 years ago
How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having less outliers as a result is very beneficial for more accurate quantization of the data
danielmarkbruce|2 years ago
The "how" is pretty straightforward.