top | item 42559063

(no title)

Yes! I think this is a great area of research. If you think of the gradient values as a blame score for why you got the answer wrong, then you can have a lot of fun exploring which weights light up for different problems. One note: in Ring All-Reduce, nodes never actually share the FULL gradient, only blocks of it. So to put this into practice you’d have to show that the thresholding works on a block of gradients rather than on the full gradient, which you may never be able to fit in VRAM. Will the results still hold? I don’t know. I believe they would, but that’s for the next paper.
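To make the block-vs-full question concrete, here is a minimal sketch in plain Python. It assumes "thresholding" means keeping the top fraction of entries by absolute value, which may not match the paper's actual rule. The gradient is random toy data, and every function name here is hypothetical. It only measures how much a per-block selection overlaps with a full-gradient selection; it says nothing about whether training results would hold.

```python
import random


def top_fraction_indices(values, fraction, offset=0):
    """Indices (shifted by offset) of the largest-magnitude entries.

    Assumption: thresholding == keep the top `fraction` of entries by |value|.
    """
    k = max(1, int(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return {offset + i for i in order[:k]}


def split_into_blocks(values, num_blocks):
    """Split into contiguous chunks, similar to the chunks a ring all-reduce passes around."""
    size = -(-len(values) // num_blocks)  # ceiling division
    return [(start, values[start:start + size]) for start in range(0, len(values), size)]


def blockwise_selection(values, fraction, num_blocks):
    """Apply the same top-fraction rule inside each block, never looking at the full vector."""
    selected = set()
    for start, block in split_into_blocks(values, num_blocks):
        selected |= top_fraction_indices(block, fraction, offset=start)
    return selected


def jaccard(a, b):
    """Overlap between two index sets: 1.0 means identical selections."""
    return len(a & b) / len(a | b)


random.seed(0)
grad = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # toy stand-in for a real gradient

full = top_fraction_indices(grad, 0.05)
blocked = blockwise_selection(grad, 0.05, num_blocks=8)
print(f"overlap between full and block-wise selection: {jaccard(full, blocked):.2f}")
```

On i.i.d. toy data the two selections should mostly agree. Real gradients are not uniform across layers, so the per-block rule could pick noticeably different weights, which is exactly the open question.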


szvsw|1 year ago

Very cool! Glad to hear my intuition is on the right track… I’m very much on the applied ML for engineering design side as opposed to the bleeding edge research side, so in terms of multi-node training I haven’t done much more than spin up a few GPUs and let PyTorch Lightning handle the parallelism, but cool to try to keep up with this stuff.

Thanks for the response and good luck with this!