item 41779579

machinelearning | 1 year ago

This is a good problem to solve but the approach is wrong imo.

It has to be done in a hierarchical way, so the model knows what it attended to plus the full context.

If the differential vector is being computed from the same input as the attention vector, how do you know how to modify the attention vector correctly?
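For context, the setup being questioned (in the style of differential attention) computes two softmax attention maps from the same input and subtracts one from the other. A minimal sketch of that structure, with hypothetical shapes, projection names (`Wq1`, `Wk1`, etc.), and a fixed scaling scalar `lam` all chosen here for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                      # toy sequence length and head dim
x = rng.standard_normal((n, 2 * d))

# Two query/key projection pairs computed from the SAME input x --
# exactly the coupling the comment above is pointing at.
Wq1, Wk1 = rng.standard_normal((2, 2 * d, d))
Wq2, Wk2 = rng.standard_normal((2, 2 * d, d))
Wv = rng.standard_normal((2 * d, d))
lam = 0.5                        # learnable scalar in practice; fixed here

a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
attn = a1 - lam * a2             # differential attention map
out = attn @ (x @ Wv)            # shape (n, d)
```

Note that because both maps are functions of the same `x`, gradients through `attn` update both projection pairs jointly, which is the dependence the comment is asking about.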


quantadev | 1 year ago

Doesn't everything just get tweaked in whatever direction the back-propagation derivative says, and proportionally to that "slope"? In other words, simply by having a back-propagation system in effect, there's never any question about which way to adjust the weights, right?
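The mechanism this reply describes can be shown with a one-parameter toy model (the model `y = 2w`, target, and learning rate are all made up for illustration): the gradient's sign picks the direction and its magnitude sets the step size, with no separate decision required.

```python
w = 3.0          # toy weight, arbitrary starting point
lr = 0.1         # learning rate
target = 1.0

for _ in range(50):
    pred = w * 2.0                       # toy model: y = 2w
    loss = (pred - target) ** 2          # squared error
    grad = 2 * (pred - target) * 2.0     # dLoss/dw via the chain rule
    w -= lr * grad                       # step opposite the slope,
                                         # proportional to its magnitude

# w converges toward 0.5, where 2w == target and the gradient is zero
```

The same rule applies unchanged to every weight in an attention layer: each parameter moves against its own partial derivative, whatever the architecture around it looks like.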