newrotik | 1 year ago
You might think it doesn't matter because ReLU is non-differentiable "only at one point".
Gradient-based methods (what you find in PyTorch) generally rely on the idea that gradients taper to 0 in the proximity of a local optimum. This is not the case for non-differentiable functions: the gradient can stay large, or even be made arbitrarily large, arbitrarily close to the optimum.
As you may imagine, it is not hard to construct examples where simple gradient methods that do not properly account for this fail to converge, and such examples are not exotic.
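A minimal sketch of one such non-exotic failure, using plain Python rather than PyTorch (the function, starting point, and step size are illustrative choices): fixed-step subgradient descent on f(x) = |x|, whose gradient has magnitude 1 everywhere except at 0, so it never tapers as the iterate approaches the optimum.

```python
def grad_abs(x):
    # Subgradient of f(x) = |x|: +/-1 away from 0, and 0 at exactly x = 0
    # (the convention PyTorch also uses for the kink in ReLU).
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x = 0.25       # starting point (illustrative)
lr = 0.1       # fixed step size, as in vanilla SGD without decay
trajectory = [x]
for _ in range(10):
    # The update always moves a full lr toward 0, because the gradient
    # magnitude stays 1 no matter how close x gets to the optimum...
    x = x - lr * grad_abs(x)
    trajectory.append(x)

# ...so the iterate overshoots and oscillates around 0 instead of converging.
print(trajectory)
```

With a smooth loss the shrinking gradient would slow the steps down automatically; here a decaying step size (or a method that handles subgradients properly) is needed to actually converge.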