The author doesn't seem to know even the basics of SLURM?

"gang scheduling" according to the official docs: https://slurm.schedmd.com/gang_scheduling.html

-- maybe I've read the docs wrong the last decade using SLURM.

Author here. I've seen the docs you linked to: Slurm uses "gang scheduling" to mean something specific (timesliced oversubscription where jobs alternate on shared resources).

I'm using the term in its broader CS sense: all-or-nothing co-scheduling of related processes across multiple processors [1]. This is the definition used across the K8s ecosystem: Volcano [2], Kueue [3], and the Coscheduling plugin all define gang scheduling as "all-or-nothing" allocation.

I still stand by the original claim:

Slurm allocates multi-node jobs atomically, while vanilla K8s doesn't. Its default scheduler places pods as resources become available, which can lead to partial allocations and deadlocks for distributed training. It's just a terminology clash. Thanks for the comment anyway.
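
To make the difference concrete, here is a toy sketch. It is not how Slurm or the K8s scheduler is actually implemented: the `Job` class, both scheduler functions, the job names, and the assumption that pods from two jobs arrive interleaved are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers_needed: int
    workers_placed: int = 0

    @property
    def can_run(self) -> bool:
        # Distributed training only starts once every worker is placed.
        return self.workers_placed == self.workers_needed


def schedule_incremental(jobs, free_slots):
    """Place one worker at a time, interleaved across jobs, while slots remain."""
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for job in jobs:
            if free_slots > 0 and job.workers_placed < job.workers_needed:
                job.workers_placed += 1
                free_slots -= 1
                progress = True
    return free_slots


def schedule_gang(jobs, free_slots):
    """Admit a job only if all of its workers fit at once (all or nothing)."""
    for job in jobs:
        if job.workers_needed <= free_slots:
            job.workers_placed = job.workers_needed
            free_slots -= job.workers_needed
    return free_slots


def demo(scheduler, slots=4):
    # Two hypothetical 3-worker jobs competing for 4 slots.
    jobs = [Job("train-a", 3), Job("train-b", 3)]
    leftover = scheduler(jobs, slots)
    return [j.name for j in jobs if j.can_run], leftover


if __name__ == "__main__":
    print("incremental:", demo(schedule_incremental))
    print("gang:       ", demo(schedule_gang))
```

In this toy setup the incremental scheduler hands two slots to each job, so every slot is occupied and neither job can start. The all-or-nothing scheduler admits one job fully and leaves the other waiting with nothing held.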

[1] https://en.wikipedia.org/wiki/Gang_scheduling

[2] https://volcano.sh/en/docs/plugins/

[3] https://www.coreweave.com/blog/kueue-a-kubernetes-native-sys...

GuestFAUniverse|18 days ago

Thanks for the clarification!

covi|18 days ago

The post says Slurm supports gang scheduling, k8s doesn't (out of the box).