
huac | 1 year ago

in particular it appears that they only implement data parallel (DP) - at 1.2B you can fit a full copy of the model into memory, but larger models require splitting the weights across multiple machines (different techniques, e.g. fully sharded data parallel FSDP, tensor parallel TP, pipeline parallel PP, ...)

without more details it's unclear if the proposed technique keeps its speedups in that case
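
to put rough numbers on the "fits in memory" point, here is a back-of-the-envelope sketch. the assumptions are mine, not from the paper: mixed-precision Adam at about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and the two Adam moments), an 80 GB accelerator, and activations ignored entirely. the function names are made up for illustration.

```python
import math

# Assumption: mixed-precision Adam training state per parameter.
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights, momentum, variance)
BYTES_PER_PARAM = 16

# Assumption: one 80 GB accelerator; activation memory is ignored.
DEVICE_GB = 80.0


def training_state_gb(n_params: float) -> float:
    """Approximate memory for weights, gradients and optimizer state, in GB."""
    return n_params * BYTES_PER_PARAM / 1e9


def fits_on_one_device(n_params: float, device_gb: float = DEVICE_GB) -> bool:
    """True if a full replica fits on one device, so plain DP is enough."""
    return training_state_gb(n_params) <= device_gb


def min_shards(n_params: float, device_gb: float = DEVICE_GB) -> int:
    """Lower bound on devices the training state must be split across."""
    return max(1, math.ceil(training_state_gb(n_params) / device_gb))


if __name__ == "__main__":
    for name, n in [("1.2B", 1.2e9), ("7B", 7e9), ("70B", 70e9)]:
        print(
            f"{name}: ~{training_state_gb(n):.1f} GB of training state, "
            f"full replica fits: {fits_on_one_device(n)}, "
            f"min shards: {min_shards(n)}"
        )
```

under these assumptions a 1.2B model needs roughly 19 GB and fits as a full replica, while anything much past about 5B parameters already has to be sharded somehow, which is the regime the results don't cover.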
