top | item 42311574

potlee | 1 year ago

> The Nova family of models were trained on Amazon’s custom Trainium1 (TRN1) chips, NVidia A100 (P4d instances), and H100 (P5 instances) accelerators. Working with AWS SageMaker, we stood up NVidia GPU and TRN1 clusters and ran parallel trainings to ensure model performance parity

Does this mean they trained multiple copies of the models?

glomgril | 1 year ago

Models like this are experimentally pretrained or tuned hundreds of times over many months to optimize the data mix, hyperparams, architecture, etc. When they say "ran parallel trainings" they are probably referring to parity tests performed along the way (possibly also for the final training runs). Different hardware means different lower-level libraries and numerics, which can introduce unanticipated differences. Good to know what those differences are so they can be ironed out.
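A parity test of this kind can be as simple as running the same training config on both hardware stacks and flagging steps where the loss curves drift apart. A minimal sketch (the function name, tolerance, and loss values are illustrative, not from Amazon's report):

```python
import numpy as np

def check_parity(losses_a, losses_b, rel_tol=0.02):
    """Compare per-step training losses from two runs (e.g. GPU vs. Trainium)
    and return the indices of steps that diverge beyond a relative tolerance."""
    a = np.asarray(losses_a, dtype=np.float64)
    b = np.asarray(losses_b, dtype=np.float64)
    # Relative difference, guarded against division by zero
    rel_diff = np.abs(a - b) / np.maximum(np.abs(a), 1e-12)
    return np.flatnonzero(rel_diff > rel_tol)

# Hypothetical loss curves from two parallel runs
gpu_losses = [4.10, 3.52, 3.01, 2.75]
trn_losses = [4.11, 3.50, 3.05, 2.60]  # last step drifts by ~5%
print(check_parity(gpu_losses, trn_losses))  # → [3]
```

In practice you'd compare much more than scalar losses (gradient norms, eval metrics, even layer activations), but the idea is the same: run in parallel, diff the signals, investigate anything outside tolerance.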

Part of it could also be that they'd prefer to move all operations to the in-house trn chips, but don't have full confidence in the hardware yet.

Def ambiguous though. In general, reporting of infra characteristics for LLM training is left pretty vague in most reports I've seen.