
eadwu | 4 months ago

There are bleeding-edge issues; everyone dials into transformers, so that path is generally pain-free.

I haven't exactly bisected the issue, but I'm pretty sure convolutions are broken on sm_121 past a certain size: I'm seeing a 20x memory blowup from a convolution after just a 2x batch-size increase, and _only_ on the DGX Spark.

I haven't had any problems with inference, but I also don't use the transformers library that much.

llama.cpp was working for openai-oss at release and the last time I checked; not sure if something has broken along the way.

I don't know whether memory fragmentation is something fixable on the driver side. This may just be a problem of kernel policy and the GPL: it prevents them from interfering with the memory subsystem at the granularity they'd like (see ZFS and its page-table antics). Or so my thinking goes.

If you've done work on WSL, you'll have seen similar issues, and you can work around them by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does hurt at least CPU performance and memory-allocation speed, but I haven't had any issues with long training runs (24hr+) while using it. (Assuming fragmentation is even the issue: I've never tried without the service, since I put it in place as soon as I got the machine based on my WSL experience.)
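The compact-and-clean service itself isn't shown above; a minimal sketch of the idea on Linux, assuming it works by poking the standard `vm` sysctls (the author's actual mechanism is not specified), might look like:

```python
# Hypothetical memory-maintenance step, a sketch of the "compact and clean"
# service described above. Both writes require root and a Linux kernel built
# with CONFIG_COMPACTION.
import os


def compact_and_drop_caches():
    os.sync()  # flush dirty pages before dropping caches
    # Ask the kernel to compact free memory into contiguous blocks.
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1")
    # 3 = drop page cache plus reclaimable dentries and inodes.
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")
```

Scheduled via cron or a systemd timer, this would match the hourly cadence described above; the CPU and allocation-speed cost mentioned comes from the kernel rebuilding the caches and moving pages afterwards.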
