miven | 1 year ago

Is there an intuitive reason why this works so well compared to, say, thresholding out the attention activations that fall below the average for a given head, which should filter out that same attention noise?
