top | item 36854613

alevskaya | 2 years ago

Yeah we used to use this in our older models years ago... I don't recall the details exactly, but I don't think it ever did very much.

I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
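
For readers who haven't seen the trick, here is a minimal sketch of Q/K layernorm in plain Python. It assumes a single head and no learned scale or bias, and the helper names (`layer_norm`, `attention_logits`, `attention_weights`) are made up for illustration; it is not the linked paper's code. The idea is to normalize queries and keys before the dot product so the logits fed to the softmax cannot grow without bound.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (roughly) unit variance.
    Assumption: no learned scale/bias, unlike a real LayerNorm layer."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    """Softmax with the usual max-subtraction for numerical safety."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_logits(q, keys, qk_norm=True):
    """Scaled dot-product logits of one query against a list of keys."""
    if qk_norm:
        # The trick: normalize q and every k before taking dot products.
        q = layer_norm(q)
        keys = [layer_norm(k) for k in keys]
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_weights(q, keys, qk_norm=True):
    """Attention weights (a probability distribution over the keys)."""
    return softmax(attention_logits(q, keys, qk_norm))
```

With normalization on, each normalized vector has norm at most sqrt(d), so every logit is bounded in magnitude by sqrt(d) no matter how large the raw activations get. Without it, large-norm queries and keys produce huge logits and a nearly one-hot softmax.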

ggerganov|2 years ago

> I don't recall the details exactly, but I don't think it ever did very much.

How would you have known whether the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data.

danielmarkbruce|2 years ago

Are you asking "why would you have bothered to look at it"?

The "how" is pretty straightforward.
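
One possible version of the "how", as a rough sketch: flatten a weight tensor into a list of floats and compute a couple of outlier statistics, plus the round-trip error of a simple quantizer. The function names and the choice of metrics (kurtosis, max-abs over standard deviation, symmetric absmax int8) are assumptions for illustration, not something anyone in this thread specified.

```python
import math

def outlier_stats(weights):
    """Simple outlier indicators for a flat list of weights.
    Kurtosis is about 3 for Gaussian data and grows with heavy tails."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    fourth = sum((w - mean) ** 4 for w in weights) / n
    return {
        "kurtosis": fourth / (var ** 2),
        "max_abs_over_std": max(abs(w) for w in weights) / math.sqrt(var),
    }

def absmax_int8_error(weights):
    """Mean absolute round-trip error under symmetric absmax int8
    quantization: one scale for the whole list, set by the largest value."""
    scale = max(abs(w) for w in weights) / 127.0
    dequantized = [round(w / scale) * scale for w in weights]
    return sum(abs(w - d) for w, d in zip(weights, dequantized)) / len(weights)
```

Comparing these numbers for the same layer across two training runs (with and without the trick) would show whether outliers shrank. A single large weight stretches the quantization scale for the whole list, which is why fewer outliers tends to mean lower round-trip error.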