alex000kim | 19 days ago

Author here. I've seen the docs you linked to: Slurm uses "gang scheduling" to mean something specific (timesliced oversubscription where jobs alternate on shared resources).

I'm using the term in its broader CS sense: all-or-nothing co-scheduling of related processes across multiple processors [1]. This is also the definition used across the K8s ecosystem: Volcano [2], Kueue [3], and the Coscheduling plugin all define gang scheduling as "all or nothing" allocation.

I still stand by the original claim:

Slurm allocates multi-node jobs atomically, while vanilla K8s doesn't: its default scheduler places pods one at a time as resources become available, which can lead to partial allocations and deadlocks for distributed training. So it's just a terminology clash. Thanks for the comment anyway.
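To make the deadlock concrete, here's a toy sketch of the two placement policies. It is not how Slurm or kube-scheduler are actually implemented; the function names, the round-robin arrival order, and the job sizes are all assumptions made up for illustration.

```python
# Toy model: a cluster with a fixed number of identical slots (e.g. GPUs)
# and jobs that can only start once ALL of their workers are placed.
# Everything here (names, sizes, round-robin order) is hypothetical.

def schedule_pod_by_pod(capacity, jobs):
    """Place one worker at a time, interleaving jobs, while slots remain."""
    placed = {name: 0 for name in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, need in jobs.items():
            if free > 0 and placed[name] < need:
                placed[name] += 1
                free -= 1
                progress = True
    return placed

def schedule_gang(capacity, jobs):
    """All-or-nothing: admit a job only if every worker fits right now."""
    placed = {}
    free = capacity
    for name, need in jobs.items():
        if need <= free:
            placed[name] = need
            free -= need
        else:
            placed[name] = 0  # job waits and holds no slots
    return placed

def runnable(placed, jobs):
    """Jobs whose full set of workers has been placed."""
    return [name for name, need in jobs.items() if placed[name] == need]

jobs = {"job-a": 4, "job-b": 4}  # each job needs 4 workers
capacity = 6                     # only 6 slots in the cluster

greedy = schedule_pod_by_pod(capacity, jobs)
gang = schedule_gang(capacity, jobs)

print("pod-by-pod:", greedy, "runnable:", runnable(greedy, jobs))
print("gang:      ", gang, "runnable:", runnable(gang, jobs))
```

With pod-by-pod placement each job ends up holding 3 of the 6 slots, neither reaches 4, and both wait on each other indefinitely. With all-or-nothing admission, job-a gets its 4 slots and runs while job-b waits without holding anything.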

[1] https://en.wikipedia.org/wiki/Gang_scheduling

[2] https://volcano.sh/en/docs/plugins/

[3] https://www.coreweave.com/blog/kueue-a-kubernetes-native-sys...
