GPU Memory Snapshots: fast container cold boots

9 points | luiscape | 7 months ago | modal.com

1 comment

luiscape | 7 months ago

Modal eng here.

We have been using the new CUDA Checkpoint API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CH...) in combination with gVisor's checkpoint/restore API and our custom file system to greatly reduce container cold boot times. This is particularly impactful if you need to warm up GPUs, for example with torch.compile: on a cold boot restored from a snapshot, you skip torch.compile entirely.
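
For readers unfamiliar with the flow, here is a rough sketch of the ordering a GPU checkpoint typically follows: lock the process, checkpoint GPU state into host memory, snapshot the container, then restore and unlock on the next boot. This is a toy model in plain Python and not Modal's implementation; `FakeCudaProcess` and its methods are hypothetical names, and nothing here calls the CUDA driver or gVisor. The state names are assumed to mirror the running / locked / checkpointed process states described in the CUDA driver documentation.

```python
from enum import Enum


class ProcessState(Enum):
    RUNNING = "running"
    LOCKED = "locked"
    CHECKPOINTED = "checkpointed"


class FakeCudaProcess:
    """Hypothetical stand-in for a CUDA process; makes no driver calls."""

    def __init__(self):
        self.state = ProcessState.RUNNING
        self.device_memory = {}  # pretend GPU-resident state (e.g. compiled kernels)
        self.host_copy = None    # where GPU state is parked while checkpointed

    def _expect(self, wanted):
        # Each step is only valid from one specific state.
        if self.state is not wanted:
            raise RuntimeError(f"expected {wanted.value}, got {self.state.value}")

    def lock(self):
        # Stop new GPU work from being submitted.
        self._expect(ProcessState.RUNNING)
        self.state = ProcessState.LOCKED

    def checkpoint(self):
        # Move GPU state into host memory so a container snapshot captures it.
        self._expect(ProcessState.LOCKED)
        self.host_copy = dict(self.device_memory)
        self.device_memory = {}
        self.state = ProcessState.CHECKPOINTED

    def restore(self):
        # Move the saved state back onto the GPU; the process stays locked.
        self._expect(ProcessState.CHECKPOINTED)
        self.device_memory = dict(self.host_copy)
        self.host_copy = None
        self.state = ProcessState.LOCKED

    def unlock(self):
        # Allow GPU work to resume.
        self._expect(ProcessState.LOCKED)
        self.state = ProcessState.RUNNING


proc = FakeCudaProcess()
proc.device_memory["compiled_model"] = "output of an expensive warm-up"

proc.lock()
proc.checkpoint()
# A container-level snapshot would be taken here, while no state lives on the GPU.
proc.restore()
proc.unlock()
# The warm-up result is available again without being recomputed.
```

The point of the model is the ordering constraint: the container snapshot happens only while GPU state has been parked in host memory, which is why the expensive warm-up work survives the restore.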