top | item 45632819

throwdbaaway | 4 months ago

This looks impressive. As someone who is not familiar with ML, I do have a question -- surely in 2025 there must be a way to schedule a large PyTorch job across multiple k8s clusters? EKS and GKE already provide a VPC-native flat network by default.


sailingparrot | 4 months ago

The issue isn’t so much scheduling as it is stability.

More clusters means one more layer of things that can crash your (very expensive) training run.

You also still need to write tooling to manage cross-cluster training correctly, such as starting and stopping at roughly the same time, resuming from checkpoints, node health monitoring, etc.

Nothing deal-breaking, but if it could just work in a single cluster, that would be nicer.
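To illustrate the checkpoint-resume part of that tooling, here is a minimal sketch. The file layout (`step_<N>.pt` files in a directory every cluster can read) and the function names are assumptions for illustration, not any real framework's convention:

```python
import os
import re
import tempfile


def save_checkpoint_atomically(ckpt_dir: str, step: int, data: bytes) -> str:
    """Write to a temp file in the same directory, then rename into place.

    The atomic rename means a crash mid-write never leaves a partial
    checkpoint that another cluster might try to resume from.
    """
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(ckpt_dir, f"step_{step}.pt")
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path


def latest_checkpoint(ckpt_dir: str):
    """Return (step, path) of the highest-numbered checkpoint, or (-1, None).

    All clusters scan the same shared directory, so they agree on which
    checkpoint to resume from after a restart.
    """
    best_step, best_path = -1, None
    for name in os.listdir(ckpt_dir):
        m = re.fullmatch(r"step_(\d+)\.pt", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best_path = os.path.join(ckpt_dir, name)
    return best_step, best_path
```

In a real setup the `data` bytes would be a serialized model/optimizer state, and the shared directory would typically live on object storage rather than a local filesystem.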