top | item 40345045

Tarrosion | 1 year ago

How do modern foundation models avoid multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the transformers?

elcomet | 1 year ago

They rely heavily on what we call residual or skip connections. This means each layer computes something like x = x + f(x). This helps training a lot, ensuring the gradient can flow nicely through the whole network.
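A minimal sketch of why the skip connection helps (my own toy illustration, not from the comment): for a scalar chain, the derivative of x + f(x) is 1 + f'(x), so backpropagating through many residual layers multiplies factors of (1 + f'), which stay away from zero even when f' itself is tiny. Without the skip, the product of bare f' terms collapses.

```python
# Toy scalar model of gradient flow through a deep stack of layers.
# f_prime stands in for each layer's local derivative; the "residual"
# flag toggles whether the layer is y = x + f(x) or plain y = f(x).

def chain_gradient(n_layers, f_prime, residual):
    g = 1.0
    for _ in range(n_layers):
        # residual layer: dy/dx = 1 + f'(x); plain layer: dy/dx = f'(x)
        g *= (1.0 + f_prime) if residual else f_prime
    return g

plain = chain_gradient(50, 0.1, residual=False)  # product of 0.1s: vanishes
skip = chain_gradient(50, 0.1, residual=True)    # product of 1.1s: survives

assert plain < 1e-40  # gradient has effectively vanished
assert skip > 1.0     # identity path keeps the gradient alive
```

The identity term gives the gradient a direct path from the loss back to every layer, which is exactly the "gradient can flow nicely" behavior described above.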

This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.
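A hedged sketch of what such a residual feed-forward block looks like in a transformer layer (illustrative names and shapes, NumPy instead of a deep-learning framework): the MLP sub-layer computes a correction that is added back onto its input, so at initialization with small weights the block is close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 32  # model width and feed-forward expansion width (illustrative)

# Small random weights, as at initialization.
W1 = rng.normal(scale=0.02, size=(d_ff, d))
W2 = rng.normal(scale=0.02, size=(d, d_ff))

def ffn(x):
    # f(x): the two-layer feed-forward component of a transformer block.
    return W2 @ np.maximum(W1 @ x, 0.0)

def residual_block(x):
    # The skip connection: output is the input plus the block's correction.
    return x + ffn(x)

x = rng.normal(size=d)
y = residual_block(x)

# With tiny weights, f(x) is near zero, so the block starts near-identity:
assert y.shape == x.shape
assert float(np.max(np.abs(y - x))) < 0.1
```

Stacking many such blocks (alternating with attention sub-layers, which carry their own skip connections) is what lets these networks scale to great depth without the gradient dying out.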

heavenlyblue|1 year ago

They don't do global optimisation of all layers at the same time; instead they train the layers independently of each other.