(no title)
eadwu | 4 months ago
I haven't exactly bisected the issue but I'm pretty sure convolutions are broken on sm_121 after a certain size, getting 20x memory blowup from a convolution from a 2x batch size increase _only_ on the DGX Spark.
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for openai-oss last time I checked and on release, not sure if something broke along the way.
I don't exactly know if memory fragmentation is something fixable on the driver side - this might just be the problem with kernel's policy and GPL, it prevents them from automatically interfering with the memory subsystem to the granularity they'd like - see zfs and their page table antics - or so my thoughts on it is.
If you've done stuff on WSL, you have similar issues and you can fix it by running a service that normally compacts and clean memory, I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not have any issue with long training runs with it (24hr+, assuming that is the issue, I have never tried without it and put that service in place since getting it due to my experience on WSL).
No comments yet.