(no title)
taykolasinski | 1 month ago
However, transformer-based (which their technical reports confirm they are) implies the standard pre-norm/post-nnorm residual block structure. Without those additive residual connections, training networks of that depth (100+ layers) becomes difficult due to the vanishing gradient problem.
If they had solved deep signal propagation without residual streams, that would likely be a bigger architectural breakthrough than the model itself (akin to Mamba/SSMs). It’s a very high-confidence assumption, but you are right that it is still an assumption.
No comments yet.