mota7 | 3 years ago
Last I looked, very few SOTA models are trained with batch normalization. Most LLMs use layer norm, whose statistics are computed per sample rather than per batch, so gradient accumulation across micro-batches is exact (precisely because that avoids the batch-statistics memory blowup).
Note also that batch normalization can be done in a memory-efficient way: it just requires aggregating the batch statistics outside the gradient accumulation.
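A minimal NumPy sketch (my own illustration, not from the thread) of why the two differ under micro-batching: layer norm's statistics are per sample, so normalizing micro-batches separately gives the same result as normalizing the full batch, while batch norm's statistics depend on which samples share the batch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis, per sample: independent of batch composition.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Statistics over the batch axis: they change when the batch is split.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))

# Layer norm on the full batch equals layer norm on each micro-batch,
# so accumulating gradients over micro-batches is exact.
full_ln = layer_norm(x)
micro_ln = np.concatenate([layer_norm(x[:4]), layer_norm(x[4:])])
assert np.allclose(full_ln, micro_ln)

# Batch norm does not split this way: micro-batch statistics differ,
# which is why it needs the batch statistics aggregated separately.
full_bn = batch_norm(x)
micro_bn = np.concatenate([batch_norm(x[:4]), batch_norm(x[4:])])
assert not np.allclose(full_bn, micro_bn)
```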