djsamseng | 3 years ago
keys = key_weights * what_to_look_at
query = query_weights * x
values = value_weights * what_to_look_at
For self-attention, what_to_look_at = x, so the keys and values come from the same sequence as the queries.
For regular (cross-)attention, what_to_look_at could be a database, memory, or anything else.
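The lines above can be sketched end-to-end. This is a minimal NumPy illustration of scaled dot-product self-attention; the weight matrices, shapes, and variable names are made up for the example, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

x = rng.normal(size=(seq_len, d_model))   # input token vectors
what_to_look_at = x                       # self-attention: attend over x itself

# Learned projection matrices (random stand-ins here)
key_weights = rng.normal(size=(d_model, d_model))
query_weights = rng.normal(size=(d_model, d_model))
value_weights = rng.normal(size=(d_model, d_model))

keys = what_to_look_at @ key_weights
queries = x @ query_weights
values = what_to_look_at @ value_weights

# Scaled dot-product attention: each query scores every key,
# softmax turns scores into weights, weights mix the values.
scores = queries @ keys.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ values

print(output.shape)  # (4, 8): one mixed value vector per input position
```

Swapping `what_to_look_at` for some other sequence (an encoder's output, say) turns the same code into cross-attention.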
So in this example, if we’re trying to predict the second “apple”, the first “apples” is very helpful. If we’re predicting “juice”, we’d use one head of self-attention to look at the first “apples” and a second head to look at the second “apple”.
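A toy illustration of that multi-head idea, with made-up (not learned) attention weights mirroring the “apple juice” example:

```python
import numpy as np

tokens = ["apples", "make", "apple", "juice"]

# Hypothetical attention weights for the position predicting "juice":
# each head distributes its attention differently over the tokens.
head1 = np.array([0.90, 0.02, 0.05, 0.03])  # head 1 focuses on the first "apples"
head2 = np.array([0.05, 0.02, 0.90, 0.03])  # head 2 focuses on the second "apple"

for name, w in [("head 1", head1), ("head 2", head2)]:
    print(name, "attends most to:", tokens[int(w.argmax())])
```

In a real model each head would compute its own keys, queries, and values, and the heads’ outputs would be concatenated and projected back down.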
That’s my understanding, at least.