Yes, there are very good theoretical reasons for skip connections. If your initial matrix M is noise centered at 0, then 1+M is a noisy identity operation, while 0+M is a noisy deletion... It's better to do nothing if you don't know what to do, and avoid destroying information.
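To make that concrete, here's a rough sketch in plain Python (no ML framework). The dimension, noise scale, and seed are arbitrary choices for illustration, not anything from a real model. It compares how far the output lands from the input when you apply 1+M versus just M, with M drawn as small zero-mean Gaussian noise.

```python
import math
import random


def matvec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(vi * vi for vi in v))


def relative_error(y, x):
    """Distance between y and x, relative to the size of x."""
    return norm([yi - xi for yi, xi in zip(y, x)]) / norm(x)


def compare(dim=64, scale=0.01, seed=0):
    """Return (error with skip, error without skip) for a random M.

    dim, scale and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # M: zero-mean noise, standing in for a freshly initialized layer.
    m = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    mx = matvec(m, x)
    with_skip = [xi + mi for xi, mi in zip(x, mx)]  # (1 + M) x
    without_skip = mx                               # (0 + M) x
    return relative_error(with_skip, x), relative_error(without_skip, x)


if __name__ == "__main__":
    err_skip, err_plain = compare()
    print(f"with skip:    {err_skip:.3f}")
    print(f"without skip: {err_plain:.3f}")
```

With the skip, the output stays close to the input (small relative error); without it, nearly all of the input is gone and the error is close to 1.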
I appreciate the sibling comment's perspective that memory pressure is a problem, but that can be mitigated by using fewer, longer skip connections across blocks of layers.
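As a sketch of what "fewer/longer" could look like: the hypothetical helper below (the name and the list-of-callables layer representation are my own, purely for illustration) adds one skip per block of layers instead of one per layer. It only shows the wiring and counts the skips; it does not measure memory.

```python
def apply_with_block_skips(x, layers, block_size):
    """Apply layers in order, adding one skip connection per block.

    x is a list of floats; each layer is a callable mapping a list
    to a list of the same length. Returns (output, number_of_skips).
    """
    skips = 0
    for start in range(0, len(layers), block_size):
        residual = x  # saved once per block, not once per layer
        for layer in layers[start:start + block_size]:
            x = layer(x)
        x = [r + xi for r, xi in zip(residual, x)]
        skips += 1
    return x, skips
```

With 6 layers, `block_size=1` gives the usual per-layer residual (6 skips), while `block_size=3` gives 2 longer skips.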
sdenton4|2 years ago