
cleancoder0 | 4 years ago

An extra 5 BLEU points with 4x fewer parameters is massive.

The layers themselves being narrower also means training and evaluating the network is much faster.


oofbey | 4 years ago

Why would evaluation be faster with narrower layers? If there were fewer tokens it would definitely be faster, because attention in transformers scales with tokens^2, but here "narrow" means the number of channels, for presumably the same number of tokens.

chabons | 4 years ago

If I remember correctly, the fully connected layers after the attention block are a matmul of shape [?, a*h] x [a*h, b*h] (for some scalars a, b and hidden size h), which means transformers also scale with h^2 in the feed-forward part. I don't know what fraction of the total FLOPs that section of the model takes at practical model sizes, but it would indicate that making the model narrower for the same number of params would reduce compute.
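A quick back-of-the-envelope sketch of that h^2 scaling (the expansion factor 4 and the widths below are my assumptions for illustration, not from the paper):

```python
# Rough multiply-accumulate count for one transformer feed-forward block,
# per token, per layer, assuming the common two-matrix FFN:
# [h -> e*h] followed by [e*h -> h], with expansion factor e = 4.
# These are back-of-the-envelope numbers, not measurements.

def ffn_flops(h, expansion=4):
    """MACs for one FFN block on a single token."""
    inner = expansion * h
    return h * inner + inner * h  # two matmuls: h x inner, then inner x h

wide = ffn_flops(1024)
narrow = ffn_flops(512)
print(wide, narrow, wide / narrow)  # halving h cuts FFN compute ~4x
```

So in the FFN, halving the width quarters both the params and the FLOPs, which matches the intuition above.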

londons_explore | 4 years ago

It is notable that they got only 2.5 extra BLEU when translating into English, but 6.2 extra when translating from English into another language.

Since the network will have seen far more English text than text in other languages, this suggests the improvement is largest where training data is limited.

modeless | 4 years ago

That's exciting. Getting better performance with more data is trivial; the hard part is getting better performance with less data.

yoquan | 4 years ago

Actually no. Each layer requires the output of the previous one, which forces sequential computation, while wider layers can make better use of GPU parallelism. It's a trade-off between less memory (fewer parameters) and longer wall-clock time.