GuestFAUniverse | 18 days ago
"gang scheduling" according to the official docs: https://slurm.schedmd.com/gang_scheduling.html
-- maybe I've read the docs wrong the last decade using SLURM.
alex000kim|18 days ago
I'm using the term in its broader CS sense: all-or-nothing co-scheduling of related processes across multiple processors [1]. This is the definition used across the K8s ecosystem: Volcano [2], Kueue [3], and the Coscheduling plugin all define gang scheduling as "all or nothing" allocation.
I still stand by the original claim:
Slurm allocates multi-node jobs atomically, while vanilla K8s doesn't. Its default scheduler places pods one by one as resources become available, which can leave several distributed training jobs each holding a partial allocation and none able to start, i.e. a deadlock. It's just a terminology clash. Thanks for the comment anyway.
[1] https://en.wikipedia.org/wiki/Gang_scheduling [2] https://volcano.sh/en/docs/plugins/ [3] https://www.coreweave.com/blog/kueue-a-kubernetes-native-sys...
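To make the partial-allocation point concrete, here is a toy sketch, not how either scheduler is actually implemented. The function names, the job names, and the round-robin pod ordering are all made up for illustration; the only thing it models is "place pods one at a time" versus "place a job only if all of its pods fit".

```python
def place_incrementally(free_slots, jobs):
    """Hypothetical per-pod placement: hand out slots one pod at a time,
    round-robin across jobs, until the cluster is full."""
    placed = {name: 0 for name in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_slots > 0 and placed[name] < needed:
                placed[name] += 1
                free_slots -= 1
                progress = True
    return placed


def place_all_or_nothing(free_slots, jobs):
    """Hypothetical gang placement: a job gets all of its slots or none."""
    placed = {}
    for name, needed in jobs.items():
        if needed <= free_slots:
            placed[name] = needed
            free_slots -= needed
        else:
            placed[name] = 0
    return placed


def runnable(placed, jobs):
    """A distributed job can start only when every one of its pods is placed."""
    return [name for name, needed in jobs.items() if placed[name] == needed]


# Assumed scenario: 6 free GPU slots, two training jobs needing 4 each.
jobs = {"job_a": 4, "job_b": 4}
incremental = place_incrementally(6, jobs)
gang = place_all_or_nothing(6, jobs)

print(incremental, runnable(incremental, jobs))
print(gang, runnable(gang, jobs))
```

With per-pod placement each job ends up holding 3 of the 6 slots and neither can start; with all-or-nothing placement job_a runs and job_b waits without holding anything.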
GuestFAUniverse|18 days ago
covi|18 days ago