jksk61 | 2 years ago

Also, removing skip connections leads to a rougher loss landscape, so it should be harder to find the optimal weights.

sdenton4|2 years ago

Yes, there are very good theoretical reasons for skip connections. If your initial matrix M is noise centered at 0, then 1+M is a noisy identity operation, while 0+M is a noisy deletion... It's better to do nothing if you don't know what to do, and avoid destroying information.
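To make the 1+M vs 0+M point concrete, here's a rough toy sketch in plain Python. Everything in it is an assumption for illustration: the helper names (`noise_matrix`, `direction_kept`), the dimension, depth, and noise scale are made up, and the layers are purely linear with no nonlinearity or training. It pushes a random vector through a stack of freshly initialized noise matrices and reports the cosine similarity between input and output, with and without the skip.

```python
import math
import random


def matvec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def noise_matrix(dim, scale, rng):
    """A dim x dim matrix of zero-centered Gaussian noise (the 'M' above)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def direction_kept(depth, dim, scale, skip, seed=0):
    """Cosine similarity between the input and the output of `depth` layers.

    skip=True  applies x -> x + M x  (noisy identity, i.e. 1+M)
    skip=False applies x -> M x      (noisy deletion, i.e. 0+M)
    """
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = list(x0)
    for _ in range(depth):
        mx = matvec(noise_matrix(dim, scale, rng), x)
        x = [a + b for a, b in zip(x, mx)] if skip else mx
    return cosine(x0, x)


if __name__ == "__main__":
    print("with skip:   ", direction_kept(depth=8, dim=16, scale=0.01, skip=True))
    print("without skip:", direction_kept(depth=8, dim=16, scale=0.01, skip=False))
```

With the skip, the output should stay close to the input direction; without it, the output is essentially a random direction with a tiny norm, so whatever the input carried is gone before training has done anything.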

I appreciate the sibling comment's perspective that memory pressure is a problem, but that can be mitigated by using fewer/longer skip connections across blocks of layers.
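For what "fewer/longer" might look like structurally, here's a minimal sketch, again in plain Python with hypothetical names (`forward`, `block_size`). It only shows the wiring, not an actual memory measurement: one activation is saved at the start of each block and added back at the end, so a larger `block_size` means fewer skip connections overall.

```python
def forward(x, layers, block_size):
    """Apply `layers` in order, with one skip connection per block.

    `layers` is a list of functions mapping a vector (list of floats) to a
    vector of the same length. Returns the output and the number of skip
    connections that were used.
    """
    skips = 0
    for start in range(0, len(layers), block_size):
        saved = x  # the one activation held for this block's skip connection
        for layer in layers[start:start + block_size]:
            x = layer(x)
        x = [s + v for s, v in zip(saved, x)]  # block output = input + f(input)
        skips += 1
    return x, skips
```

With `block_size=1` this is the usual one-skip-per-layer setup; with `block_size=3` over six layers there are only two skips. Either way, if every layer outputs zeros the whole stack is still the identity, which is the "do nothing if you don't know what to do" property from above.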