rdlecler1 | 1 year ago

This is a great question, and I don't yet have an answer. I'm going to butcher this description, so please be charitable, but functionally, the attention mechanism projects the input into lower-dimensional query (Q) and key (K) spaces, uses the agreement (dot products) between them to pick out a relevant subset of the input, and then the softmax amplifies that signal.
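
Roughly, in code, that looks something like the following. This is a minimal sketch of scaled dot-product attention, not anything specific to the network the thread is about; the names, shapes, and projection matrices are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model); Wq/Wk/Wv project down to a smaller d_k / d_v
        Q = X @ Wq                              # queries, one per position
        K = X @ Wk                              # keys
        V = X @ Wv                              # values
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)         # Q/K agreement ("coincidence")
        weights = softmax(scores, axis=-1)      # softmax sharpens the strongest matches
        return weights @ V                      # a weighted subset of the input

The softmax is where the "amplification" happens: positions whose queries and keys line up dominate the weighted sum, and the rest are suppressed.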

One unsatisfying answer is that this may just be an implementation detail of this particular class of networks. Another prediction is that an attention mechanism is an essential element that will keep appearing in other networks of this class. A third is that attention is a decent approximation but has limitations, and we'll eventually figure out how the brain does it and replace attention with that.
