djsamseng | 3 years ago
keys = key_weights * what_to_look_at
query = query_weights * x
values = value_weights * what_to_look_at
For self-attention, what_to_look_at = x, so the keys and values come from the same sequence as the queries.
For regular (cross-)attention, what_to_look_at could be a database, memory, or anything else.
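The lines above can be sketched end-to-end. This is a minimal NumPy illustration of scaled dot-product self-attention; the weight matrices, shapes, and variable names are made up for the example, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

x = rng.normal(size=(seq_len, d_model))   # input token vectors
what_to_look_at = x                       # self-attention: attend over x itself

# Learned projection matrices (random stand-ins here)
key_weights = rng.normal(size=(d_model, d_model))
query_weights = rng.normal(size=(d_model, d_model))
value_weights = rng.normal(size=(d_model, d_model))

keys = what_to_look_at @ key_weights
queries = x @ query_weights
values = what_to_look_at @ value_weights

# Scaled dot-product attention: each query scores every key,
# softmax turns scores into weights, weights mix the values.
scores = queries @ keys.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ values

print(output.shape)  # (4, 8): one mixed value vector per input position
```

Swapping `what_to_look_at` for some other sequence (an encoder's output, say) turns the same code into cross-attention.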
So in this example, if we’re trying to predict the second “apple”, the first “apples” is very helpful. If we’re predicting “juice”, we’d use one head of self-attention to look at the first “apples” and a second head to look at the second “apple”.
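A toy illustration of that multi-head idea, with made-up (not learned) attention weights mirroring the “apple juice” example:

```python
import numpy as np

tokens = ["apples", "make", "apple", "juice"]

# Hypothetical attention weights for the position predicting "juice":
# each head distributes its attention differently over the tokens.
head1 = np.array([0.90, 0.02, 0.05, 0.03])  # head 1 focuses on the first "apples"
head2 = np.array([0.05, 0.02, 0.90, 0.03])  # head 2 focuses on the second "apple"

for name, w in [("head 1", head1), ("head 2", head2)]:
    print(name, "attends most to:", tokens[int(w.argmax())])
```

In a real model each head would compute its own keys, queries, and values, and the heads’ outputs would be concatenated and projected back down.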
That’s my understanding, at least.