top | item 34432985


mota7 | 3 years ago

> Gradient accumulation doesn't work with batch norms so you really need that memory.

Last I looked, very few SOTA models are trained with batch normalization. Most of the LLMs use layer norm, whose gradients can be accumulated (precisely because of the need to avoid the memory blowup).
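A toy sketch of why accumulation works when no layer couples samples across the batch (my illustration, not from the thread): for a per-sample loss like MSE, the full-batch gradient equals the size-weighted sum of micro-batch gradients, so you can process micro-batches sequentially and add up gradients instead of holding the whole batch in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch: 8 samples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad_mse(Xb, yb, w):
    # gradient of (1/n) * sum_i (x_i . w - y_i)^2 with respect to w
    n = len(yb)
    return (2.0 / n) * Xb.T @ (Xb @ w - yb)

full = grad_mse(X, y, w)      # gradient over the whole batch at once

# gradient accumulation: micro-batches of 2, weighted by their share
acc = np.zeros_like(w)
for i in range(0, 8, 2):
    acc += grad_mse(X[i:i+2], y[i:i+2], w) * (2 / 8)

print(np.allclose(full, acc))  # True: accumulated grads match the full batch
```

BatchNorm breaks this equivalence because its mean/variance are computed over whatever batch it sees, so each micro-batch would be normalized with different statistics than the full batch.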

Note also that batch normalization can be done in a memory-efficient way: it just requires aggregating the batch statistics outside the gradient accumulation.


fxtentacle | 3 years ago

wav2vec2, whisper, HifiGAN, Stable Diffusion, and Imagen all use BatchNorm.