top | item 45491279

(no title)

alexeldeib | 4 months ago

Larger memory, weaker comms. You can optimize for this by doing things like increasing batch size/data parallelism vs sharding schemes with more comms.

At scale training won’t be able to avoid comms entirely, while many models can fit in a single MI300 for serving.

discuss

No comments yet.