Show HN: Slurmq – GPU quota enforcement for Slurm

3 points | windsor | 2 months ago | dedalus-labs.github.io

When I was a student at Princeton, we had this neat Slurm quota management tool (https://github.com/klieret/pli-slurm-tool) that prevented people from hogging all the GPUs.

It was really specific to Princeton’s clusters, though, so I decided to make a generalized version for everyone to use: slurmq

Slurm's built-in fairshare only deprioritizes heavy users. Sometimes you need a hard cap. slurmq tracks GPU-hours per user over a rolling window and cancels jobs once a user exceeds the quota set by the sysadmin.
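
To make the rolling-window idea concrete, here is a minimal sketch in Python. It is not slurmq's actual code: the `JobRecord` type, the function names, and the choice to count only the part of a job that falls inside the window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRecord:
    """Hypothetical job record; real data would come from Slurm accounting."""
    user: str
    start: datetime
    end: datetime  # for a still-running job, pass the current time
    gpus: int


def gpu_hours_in_window(jobs, now, window):
    """Sum GPU-hours per user, counting only the part of each job inside the window."""
    window_start = now - window
    usage = {}
    for job in jobs:
        # Clip the job to [window_start, now] so old usage ages out gradually.
        start = max(job.start, window_start)
        end = min(job.end, now)
        if end <= start:
            continue  # job lies entirely outside the window
        hours = (end - start).total_seconds() / 3600
        usage[job.user] = usage.get(job.user, 0.0) + hours * job.gpus
    return usage


def over_quota(usage, quota_hours):
    """Return the users whose GPU-hours exceed the quota, sorted by name."""
    return sorted(user for user, hours in usage.items() if hours > quota_hours)
```

With a 30-day window, a job that ended 31 days ago contributes nothing, and a job that straddles the window boundary only counts for the hours inside it. Whether a real tool clips jobs this way or counts them whole is a policy choice.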

Some quick commands:

  $ pip install slurmq
  $ slurmq check                     # check your quota
  $ slurmq report                    # admin: see who's over limit
  $ slurmq monitor --once --enforce  # cron: warn, then cancel

Docs: https://dedalus-labs.github.io/slurmq

Source: https://github.com/dedalus-labs/slurmq

Hoping this helps other HPC sysadmins. We're using it internally and would love to hear how others handle GPU quota enforcement.
