Yep, and I'm not saying it's a bad approach! Just trying to answer "why is that any worse than, say, starting with randomly initialized weights in general?" with respect to gradient passing.
I'm not sure I'd agree with the "noisy" characterization, which to me implies stochasticity, whereas this is just blocking off the flow of gradient information to save memory.
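For concreteness, the kind of deterministic gradient blocking being described can be sketched like this (PyTorch and `.detach()` are my assumption here; the thread doesn't name a framework):

```python
import torch

# Two branches of the same computation; one has its gradient flow blocked.
x = torch.tensor([1.0, 2.0], requires_grad=True)

blocked = (x ** 2).detach()  # treated as a constant: no gradient flows back through it
normal = x ** 2              # gradients flow as usual

loss = (blocked + normal).sum()
loss.backward()

# Only the `normal` branch contributes to the gradient: d/dx x^2 = 2x
print(x.grad)  # tensor([2., 4.])
```

Nothing stochastic happens here: the `blocked` branch's contribution to the gradient is dropped deterministically, and the backward graph for that branch never has to be stored, which is where the memory saving comes from.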
MiroF|6 years ago
> I'm not sure I'd agree with the "noisy" characterization, which to me implies stochasticity, whereas this is just blocking off the flow of gradient information to save memory.