The most striking thing about the architecture is how heterogeneous and complex it appears. Considering the vast amount of software and machine-learning engineering behind model/data/pipeline parallelism schemes like Megatron-LM and ZeRO (which target hardware topologies that seem almost simple by comparison), I'm curious what abstractions are in place to make this beast of an architecture friendly to programmers. Can you program a small tile the same way you would a large tile, the way you would in CUDA for a small or large GPU? Are there dedicated kernel teams that implement common blocks like multi-headed attention with the topology in mind, so researchers and engineers doing modeling don't have to worry about scaling the model architecture in a hardware-friendly way? Do they have a monstrous fork of PyTorch with "Dojo" supported natively?
pclmulqdq|3 years ago
dekhn|3 years ago
It still requires many experts: compiler engineers to write the XLA-to-hardware translation, and ML engineers who know how to write TF Python that executes quickly.
(Note: Google has transitioned many projects to JAX, which also targets XLA, as TF ended up being a bit of a pig with wings.)
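Since the comment above mentions JAX compiling down to XLA, here is a minimal sketch of what that path looks like from the user's side. The `scaled_dot` function is a made-up example, not anything from the thread; the point is only that `jax.jit` traces ordinary Python/NumPy-style code and hands it to XLA, which compiles it for whichever backend is available (CPU here, TPU/GPU in production):

```python
import jax
import jax.numpy as jnp

# Illustrative only: jax.jit traces this function once per input shape/dtype
# and compiles the trace with XLA; later calls run the compiled program.
@jax.jit
def scaled_dot(a, b):
    return jnp.dot(a, b) / (a.shape[-1] ** 0.5)

x = jnp.ones((4, 8), dtype=jnp.float32)
y = jnp.ones((8, 4), dtype=jnp.float32)
out = scaled_dot(x, y)

# The lowered XLA program can be inspected before execution:
print(jax.jit(scaled_dot).lower(x, y).as_text()[:200])
```

This is the division of labor dekhn describes: the modeler writes the Python above, while the XLA backend team owns the translation from the lowered program to the actual hardware.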