apl | 4 years ago
For instance, never train a model in end-to-end FP16. Use mixed precision, either via native TF/PyTorch or as a freebie when using TF32 on A100s. This’ll ensure that only suitable ops are run with lower precision; no need to fiddle with anything. Also, PyTorch DDP in multi-node regimes hasn’t been slower or less efficient than Horovod in ages.
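To make the "use mixed precision, not end-to-end FP16" point concrete, here's a minimal sketch of the native PyTorch route (`torch.cuda.amp`). The model and shapes are made up for illustration; the point is that `autocast` picks which ops run in lower precision while the master weights and the loss scaler keep training numerically stable:

```python
import torch

# Fall back to plain FP32 on CPU; autocast/GradScaler are CUDA features.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

# Toy model and data, purely illustrative.
model = torch.nn.Linear(32, 2).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 32, device=device)
y = torch.randint(0, 2, (8,), device=device)

for _ in range(3):
    opt.zero_grad()
    # autocast runs only precision-safe ops (e.g. matmuls) in FP16;
    # reductions and other sensitive ops stay in FP32 automatically.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    # The scaler multiplies the loss to avoid FP16 gradient underflow,
    # then unscales before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```

Note the master weights stay FP32 throughout; on an A100, the FP32 matmuls additionally run on TF32 tensor cores by default with no code changes at all.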
Finally, buying a local cluster of TITAN Xs is a downright odd recommendation for massive models. VRAM limitations alone make it a losing proposition.
dylanbfox | 4 years ago
This blog is more of an intro to a few high-level concepts (multi-GPU and multi-node training, FP32 vs FP16, buying hardware and dedicated machines vs AWS/GCP, etc.) for startups that are early in their deep learning journey and might need a nudge in the right direction.
If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc.), the link in the comment below gives a pretty good overview.
PS - I can send you some benchmarks we did that show (at least for us) Horovod is ~10% faster than DDP for multi-node training FWIW. Email is in my profile!
jxcole | 4 years ago
Do you have an alternative recommendation?
sabalaba | 4 years ago
It provides some modern, real-life deep learning benchmarks using the mixed precision (TF32) that GP was referring to.