
The pragmatic tradeoff of tied embeddings

1 point | SilenN | 1 month ago | blog.silennai.com

1 comment


SilenN | 1 month ago

Simply put: you make the output (unembedding) matrix the transpose of the input embedding matrix, so one set of weights is shared between the two.
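
A minimal sketch of weight tying in PyTorch (class name and dims are illustrative, not from the post):

    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size=50257, d_model=768):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            # Tie the weights: the unembedding now shares storage with the
            # input embedding (both have shape [vocab_size, d_model]).
            self.lm_head.weight = self.embed.weight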

You save vocab_dim * model_dim params (e.g. ~617M for GPT-3, with vocab 50257 and d_model 12288).
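
Quick arithmetic using GPT-3's published vocab size and d_model:

    vocab_dim, model_dim = 50257, 12288
    print(f"{vocab_dim * model_dim:,}")  # 617,558,016 -> ~617M params saved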

But because the residual stream connects input to output, the direct path with tied weights is roughly a single matmul, W_E @ W_E.T, and that product is a symmetric matrix. So the direct path struggles to encode bigrams: score(a -> b) is forced to equal score(b -> a), even though real bigram statistics are asymmetric ("Barack Obama" is far more likely than "Obama Barack").
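
A toy numpy check of the symmetry claim (dims are made up):

    import numpy as np

    E = np.random.default_rng(0).standard_normal((50, 8))  # vocab=50, d_model=8

    # Direct path with tied weights: embed token a with E, unembed with E.T,
    # so the token-to-token score matrix is E @ E.T.
    scores = E @ E.T
    print(np.allclose(scores, scores.T))  # True: score(a->b) == score(b->a)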

Attention and MLP layers add nonlinearity that can break the symmetry, but you still end up with less expressivity.

Which is why tied embeddings aren't used in the biggest SOTA models, but they remain useful in smaller models, where the embedding matrix is a much larger fraction of the total parameter count.