item 41779579

machinelearning | 1 year ago

This is a good problem to solve but the approach is wrong imo.

It has to be done in a hierarchical way, so the model knows what it attended to plus the full context.

If the differential vector is being computed from the same input as the attention vector, how do you know how to modify the attention vector correctly?
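For context, the setup being questioned (in the style of differential attention) computes two softmax attention maps from the same input and subtracts one from the other. A minimal sketch of that structure, with hypothetical shapes, projection names (`Wq1`, `Wk1`, etc.), and a fixed scaling scalar `lam` all chosen here for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                      # toy sequence length and head dim
x = rng.standard_normal((n, 2 * d))

# Two query/key projection pairs computed from the SAME input x --
# exactly the coupling the comment above is pointing at.
Wq1, Wk1 = rng.standard_normal((2, 2 * d, d))
Wq2, Wk2 = rng.standard_normal((2, 2 * d, d))
Wv = rng.standard_normal((2 * d, d))
lam = 0.5                        # learnable scalar in practice; fixed here

a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
attn = a1 - lam * a2             # differential attention map
out = attn @ (x @ Wv)            # shape (n, d)
```

Note that because both maps are functions of the same `x`, gradients through `attn` update both projection pairs jointly, which is the dependence the comment is asking about.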


quantadev | 1 year ago

Doesn't everything just get tweaked in whatever direction the back-propagation derivative says, and proportionally to that "slope"? In other words, simply by having a back-propagation system in effect, there's never any question about which way to adjust the weights, right?
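The mechanism this reply describes can be shown with a one-parameter toy model (the model `y = 2w`, target, and learning rate are all made up for illustration): the gradient's sign picks the direction and its magnitude sets the step size, with no separate decision required.

```python
w = 3.0          # toy weight, arbitrary starting point
lr = 0.1         # learning rate
target = 1.0

for _ in range(50):
    pred = w * 2.0                       # toy model: y = 2w
    loss = (pred - target) ** 2          # squared error
    grad = 2 * (pred - target) * 2.0     # dLoss/dw via the chain rule
    w -= lr * grad                       # step opposite the slope,
                                         # proportional to its magnitude

# w converges toward 0.5, where 2w == target and the gradient is zero
```

The same rule applies unchanged to every weight in an attention layer: each parameter moves against its own partial derivative, whatever the architecture around it looks like.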