Yes! I think this a great area of research. If you think of the gradient values as a blame score for why you got the answer wrong, then you can have a lot of fun with exploring which weights light up for different problems. A note, in Ring All Reduce they actually don’t ever share the FULL gradient but instead blocks. So to put this into practice you’d have to show that you can do the thresholding on the block of gradients vs the full gradient which you may never be able to fit in VRAM. Will results still hold? I don’t know. I believe it would but that’s for the next paper.
szvsw|1 year ago
Thanks for the response and good luck with this!