I do not think graphs are where we're heading. I think flat vectors are fine, and I would argue multi-head attention is not THAT different from gated RNNs like LSTM. Multiplying the values by weights that come out of a softmaxed dot-product is similar to the input gate of an LSTM: both are learned, input-dependent multiplicative weights in [0, 1] that decide how much content passes through.
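A minimal numpy sketch of the analogy, with hypothetical shapes and randomly initialized parameters (none of this is from a real model; it just shows where the multiplicative weights come from in each case):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8   # feature dimension (hypothetical)
T = 5   # sequence length (hypothetical)

# --- Attention head: weights are a softmaxed dot-product ---
q = rng.normal(size=(d,))      # one query
K = rng.normal(size=(T, d))    # keys
V = rng.normal(size=(T, d))    # values

attn_weights = softmax(K @ q / np.sqrt(d))  # (T,), each in [0, 1], sums to 1
attn_out = attn_weights @ V                 # weighted sum of value vectors

# --- LSTM input gate: weights are a sigmoid of an affine map ---
x_t = rng.normal(size=(d,))        # current input
h_prev = rng.normal(size=(d,))     # previous hidden state
W_i = rng.normal(size=(d, 2 * d))  # hypothetical gate parameters
b_i = np.zeros(d)

i_t = sigmoid(W_i @ np.concatenate([x_t, h_prev]) + b_i)  # (d,), each in [0, 1]
candidate = np.tanh(rng.normal(size=(d,)))  # stand-in for the cell candidate
gated = i_t * candidate                     # elementwise gating

# In both cases a data-dependent multiplicative weight scales a content
# vector; the main differences are softmax-over-positions vs. elementwise
# sigmoid, and attending over a set vs. gating a single recurrent state.
```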