_hl_ | 1 year ago
Here's the bit from the paper:
> We set the number of heads h = d_model / (2d), where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
In other words, since a standard Transformer uses h = d_model / d heads, they make up for it by having only half as many attention heads per layer.
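A quick sketch of the arithmetic (the sizes here are hypothetical, not from the paper):

```python
# Hypothetical sizes for illustration: d_model = 512, head dimension d = 64.
d_model = 512
d = 64  # head dimension of a standard Transformer

# Standard Transformer: h = d_model / d heads per layer.
standard_heads = d_model // d          # 8
# Paper's rule: h = d_model / (2d), i.e. half as many heads.
paper_heads = d_model // (2 * d)       # 4

print(standard_heads, paper_heads)     # 8 4
```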