I do not think graphs are where we're heading. I think flat vectors are fine, and I would argue multi-head attention is not THAT different from gated RNNs like LSTM. Multiplying the values by weights that come out of a softmaxed dot-product is similar to the input gate of an LSTM: both are learned, input-dependent multiplicative weights in [0, 1] that decide how much content passes through.
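A minimal numpy sketch of the analogy, with hypothetical shapes and randomly initialized parameters (none of this is from a real model; it just shows where the multiplicative weights come from in each case):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8   # feature dimension (hypothetical)
T = 5   # sequence length (hypothetical)

# --- Attention head: weights are a softmaxed dot-product ---
q = rng.normal(size=(d,))      # one query
K = rng.normal(size=(T, d))    # keys
V = rng.normal(size=(T, d))    # values

attn_weights = softmax(K @ q / np.sqrt(d))  # (T,), each in [0, 1], sums to 1
attn_out = attn_weights @ V                 # weighted sum of value vectors

# --- LSTM input gate: weights are a sigmoid of an affine map ---
x_t = rng.normal(size=(d,))        # current input
h_prev = rng.normal(size=(d,))     # previous hidden state
W_i = rng.normal(size=(d, 2 * d))  # hypothetical gate parameters
b_i = np.zeros(d)

i_t = sigmoid(W_i @ np.concatenate([x_t, h_prev]) + b_i)  # (d,), each in [0, 1]
candidate = np.tanh(rng.normal(size=(d,)))  # stand-in for the cell candidate
gated = i_t * candidate                     # elementwise gating

# In both cases a data-dependent multiplicative weight scales a content
# vector; the main differences are softmax-over-positions vs. elementwise
# sigmoid, and attending over a set vs. gating a single recurrent state.
```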