czr | 6 years ago
The problem is most likely bad initialization (see e.g. https://openreview.net/pdf?id=H1gsz30cKX for an explanation). Make sure to use variance-scaling initialization that takes the activation function into account: ReLU cuts the variance roughly in half, so you probably need to multiply the variance of all initial kernel weights by 2 (i.e. He initialization). Also check that the model's prediction before any training is the same order of magnitude as a typical target, not saturated at zero or some other extreme value. Batchnorm and skip connections can also ease the problems caused by bad initialization, so they are worth trying.
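A quick numpy sanity check of the "ReLU halves the variance" point (my own sketch, not from the post; `fan_in` and the batch size are arbitrary illustrative choices). It compares the mean squared activation after one ReLU layer under a naive 1/fan_in weight variance versus the 2/fan_in He scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 1024

# A batch of unit-variance inputs.
x = rng.standard_normal((4096, fan_in))

# Naive init: weight variance 1/fan_in keeps the pre-activation
# variance near 1, but ReLU then zeroes the negative half.
w_naive = rng.standard_normal((fan_in, fan_in)) * np.sqrt(1.0 / fan_in)

# He init: doubling the weight variance (2/fan_in) compensates
# for the halving caused by ReLU.
w_he = rng.standard_normal((fan_in, fan_in)) * np.sqrt(2.0 / fan_in)

relu = lambda z: np.maximum(z, 0.0)

# Second moment (mean squared activation) after one ReLU layer.
m2_naive = float((relu(x @ w_naive) ** 2).mean())  # ~0.5: signal shrinks each layer
m2_he = float((relu(x @ w_he) ** 2).mean())        # ~1.0: signal scale preserved
```

Stacked over many layers, the naive scaling shrinks activations by ~2x per layer, which is exactly the kind of saturation-to-zero the answer warns about.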