benchess | 5 years ago

Hi, co-author here!

We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.
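For readers less familiar with how these pieces fit together, here is a minimal sketch (an illustration, not the code described in this comment) of how a process started by an MPI launcher can discover its rank and world size. It assumes the environment variables that OpenMPI's `mpirun` and MPICH's Hydra launcher export; the helper name `rank_and_world_size` is hypothetical.

```python
import os
from typing import Mapping, Tuple

# (rank variable, world-size variable) pairs exported by common MPI launchers.
_LAUNCHER_VARS = (
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # OpenMPI mpirun
    ("PMI_RANK", "PMI_SIZE"),                          # MPICH (Hydra)
)


def rank_and_world_size(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (rank, world_size) for this process (hypothetical helper)."""
    for rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    # Assumption: no MPI launcher detected, so treat this as a single process.
    return 0, 1


if __name__ == "__main__":
    rank, world_size = rank_and_world_size()
    print(f"rank {rank} of {world_size}")
```

With PyTorch installed, and `MASTER_ADDR` and `MASTER_PORT` set, these two values could then be passed to `torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size)` so that NCCL handles the GPU collectives while MPI only launches the processes.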

Kubeflow is interesting, but it solves a slightly different problem: scheduling and coordinating ML workflows on top of Kubernetes. It doesn't get involved in how an ML job communicates internally across nodes.

mmq | 5 years ago

The OP was probably referring to the MPIOperator, TFOperator, PytorchOperator, etc. They live under the Kubeflow org but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling and cross-node communication.

One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
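As a rough illustration of what that looks like from inside a job, the sketch below builds the in-cluster DNS names a worker would use to reach its peers. The `<job>-<role>-<index>` service naming, the helper name `replica_hosts`, and the default `cluster.local` domain are assumptions for the example; check what your operator version actually creates.

```python
from typing import List


def replica_hosts(
    job: str,
    role: str,
    replicas: int,
    namespace: str = "default",
    cluster_domain: str = "cluster.local",
) -> List[str]:
    """Build one service DNS name per replica (hypothetical naming scheme).

    Kubernetes resolves a Service at <service>.<namespace>.svc.<cluster-domain>;
    the <job>-<role>-<index> service name used here is an assumption.
    """
    return [
        f"{job}-{role}-{i}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


if __name__ == "__main__":
    for host in replica_hosts("train", "worker", 2):
        print(host)
```

Because a headless service has no cluster IP, each name resolves directly to the pod's address, so peers connect to each other without going through a load balancer.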

sandGorgon | 5 years ago

@benchess - yes, this is what I meant: using the operator framework.

But more generally, MPI over SSH on a large Kubernetes deployment is not a very common pattern. Any reason you chose that?

Have you looked at Ray or Torch-Elastic (https://github.com/pytorch/elastic), which seems to be officially supported by AWS and others as well?