2sk21|1 year ago
I'm surprised that the article doesn't mention that one of the key factors enabling deep learning was the adoption of ReLU as the activation function in the early 2010s. ReLU behaves much better than the logistic sigmoid we had used until then.
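A quick numeric sketch of the point above (plain NumPy; the function names are my own illustration): the logistic sigmoid's derivative is at most 0.25, so stacking many sigmoid layers shrinks the backpropagated gradient geometrically, while ReLU passes the gradient through unchanged for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s * (1 - s), peaking at 0.25.
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

print(sigmoid_grad(0.0))  # 0.25, the sigmoid's maximum gradient
print(relu_grad(3.0))     # 1.0, regardless of how large x is
print(0.25 ** 10)         # best case after 10 sigmoid layers: ~9.5e-07
```

Even in the best case, ten sigmoid layers attenuate the gradient by roughly a factor of a million, which is one common explanation for why deep sigmoid networks were so hard to train.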
HarHarVeryFunny|1 year ago
- nets too small (not enough layers)
- gradients not flowing (fixed by residual connections)
- layer outputs not normalized (batch/layer norm)
- training algorithms and procedures not optimal (Adam, warm-up, etc.)
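Two of the fixes listed above, residual connections and output normalization, can be sketched in a few lines of NumPy (a minimal illustration with names and shapes of my own choosing, not anyone's actual model code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample's features to zero mean and unit variance,
    # addressing the "layer outputs not normalized" problem.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, W):
    # y = x + f(x): the identity path lets gradients flow straight
    # through the block, which is what residual connections fix.
    return x + np.maximum(0.0, layer_norm(x) @ W)

x = np.random.randn(4, 8)   # batch of 4 samples, 8 features each
W = np.random.randn(8, 8) * 0.1
y = residual_block(x, W)    # same shape as x, so blocks stack freely
```

Because the block's output has the same shape as its input, dozens of such blocks can be stacked while the skip path still carries the gradient directly back to the early layers.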