miven | 1 year ago
Is there an intuitive reason why this ends up working so well compared to, say, thresholding out the attention activations that fall below the average for a given head, which would filter out that same attention noise?