The most striking thing about the architecture is how heterogeneous and complex it appears. Considering the vast amount of software and machine-learning engineering behind model/data/pipeline parallelism schemes like Megatron-LM and ZeRO (which target hardware topologies that seem almost simple by comparison), I'm curious what abstractions are in place to make this beast of an architecture friendly to programmers. Can you program a small tile the same way you would a large tile, the way you would in CUDA for a small or large GPU? Are there dedicated kernel teams that implement common blocks like multi-headed attention with the topology in mind, so researchers and engineers doing modeling don't have to worry about scaling the model architecture in a hardware-friendly way? Do they have a monstrous fork of PyTorch with "Dojo" supported natively?
pclmulqdq|3 years ago
dekhn|3 years ago
It still requires many experts: compiler engineers to write the XLA-to-hardware translation, and ML engineers who know how to write TF Python that executes quickly.
(Note: Google has transitioned many projects to JAX, which also targets XLA, as TF ended up being a bit of a pig with wings.)
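Since the comment above mentions JAX compiling down to XLA, here is a minimal sketch of what that path looks like from the user's side. The `scaled_dot` function is a made-up example, not anything from the thread; the point is only that `jax.jit` traces ordinary Python/NumPy-style code and hands it to XLA, which compiles it for whichever backend is available (CPU here, TPU/GPU in production):

```python
import jax
import jax.numpy as jnp

# Illustrative only: jax.jit traces this function once per input shape/dtype
# and compiles the trace with XLA; later calls run the compiled program.
@jax.jit
def scaled_dot(a, b):
    return jnp.dot(a, b) / (a.shape[-1] ** 0.5)

x = jnp.ones((4, 8), dtype=jnp.float32)
y = jnp.ones((8, 4), dtype=jnp.float32)
out = scaled_dot(x, y)

# The lowered XLA program can be inspected before execution:
print(jax.jit(scaled_dot).lower(x, y).as_text()[:200])
```

This is the division of labor dekhn describes: the modeler writes the Python above, while the XLA backend team owns the translation from the lowered program to the actual hardware.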