benchess | 5 years ago
We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.
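For readers unfamiliar with how a PyTorch + NCCL + MPI stack is typically wired together, here is a minimal sketch, not necessarily the setup described above. It assumes the MPI launcher only starts the processes and that PyTorch's `env://` rendezvous is used for NCCL. The helper name `torch_env_from_mpi` and the `master_addr` value are hypothetical. The variable names are the ones Open MPI's `mpirun` and MPICH's Hydra launcher export to each rank.

```python
from typing import Dict, Mapping

# (rank variable, world-size variable) pairs exported by common MPI launchers.
_LAUNCHER_VARS = (
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI (mpirun)
    ("PMI_RANK", "PMI_SIZE"),                          # MPICH (Hydra)
)


def torch_env_from_mpi(
    env: Mapping[str, str],
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Translate MPI launcher variables into the ones env:// init reads.

    Hypothetical helper: master_addr must be supplied by the caller
    (for example, the hostname of rank 0).
    """
    for rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return {
                "RANK": env[rank_var],
                "WORLD_SIZE": env[size_var],
                "MASTER_ADDR": master_addr,
                "MASTER_PORT": str(master_port),
            }
    raise RuntimeError("no known MPI launcher variables found in environment")


# Typical use inside each rank (requires torch, so left as a comment):
#   import os, torch.distributed as dist
#   os.environ.update(torch_env_from_mpi(os.environ, master_addr="node-0"))
#   dist.init_process_group(backend="nccl", init_method="env://")
```

With this split, MPI is only responsible for process launch and rank assignment, while the collective traffic itself goes through NCCL.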
Kubeflow is interesting, but it solves a slightly different problem: scheduling and coordinating ML workflows on top of Kube. It doesn't get involved in how the processes of an ML job communicate with each other across nodes.
mmq|5 years ago
One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
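To make the headless-service-per-replica idea concrete, here is a rough sketch of how a worker could derive its peers' addresses from DNS alone. The `<service>.<namespace>.svc.<cluster-domain>` form is standard Kubernetes service DNS. The per-replica service naming (`<job>-<role>-<index>`) and both function names are assumptions for illustration and vary between operators.

```python
from typing import Dict


def replica_hostname(
    job: str,
    role: str,
    index: int,
    namespace: str = "default",
    cluster_domain: str = "cluster.local",
) -> str:
    """DNS name of one replica's headless service.

    Assumes (hypothetically) that the operator names each service
    "<job>-<role>-<index>"; check your operator's actual convention.
    """
    service = f"{job}-{role}-{index}"
    return f"{service}.{namespace}.svc.{cluster_domain}"


def rendezvous_env(job: str, namespace: str, port: int = 29500) -> Dict[str, str]:
    """Point every replica at worker 0 as the rendezvous host."""
    return {
        "MASTER_ADDR": replica_hostname(job, "worker", 0, namespace),
        "MASTER_PORT": str(port),
    }


if __name__ == "__main__":
    print(replica_hostname("train", "worker", 2, namespace="ml"))
    print(rendezvous_env("train", "ml"))
```

Because each replica has a stable DNS name, no SSH access or hostfile is needed; the trade-off is one service object per replica.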
sandGorgon|5 years ago
But more generally, MPI over SSH on a large Kubernetes deployment is not a very common pattern. Any reason you chose that?
Have you looked at Ray or TorchElastic (which seems to be officially supported by AWS, etc., as well)? https://github.com/pytorch/elastic
alculquicondor|5 years ago
We started a discussion in https://github.com/kubeflow/mpi-operator/issues/315