I just skimmed over LoRA+ and DoRA and I see no reason why these improvements could not go hand in hand. Actually, LoRA+ seems to be about efficient training while DoRA seems about improving the ability to actually learn, making it significantly more robust. Although I still have my questions on how the improvements of LoRA+ would be applied to the magnitude vector.
The two methods seem to be independent, wonder if you can combine them for even better performance.
Interestingly both seem to indirectly modify the optimisation process, in my opinion effectively trying to fix a bad optimiser. Seems like we still have a long way to go after Adam...
> Seems like we still have a long way to go after Adam...
A preprint in arxiv suggests that Adam works better than SGD for training LLMs due to the issue of class-imbalance [0]. It appears that scaling the gradient step helps with the training, for example, see another approach suggested in [1].
josalhor|1 year ago
WithinReason|1 year ago
Interestingly both seem to indirectly modify the optimisation process, in my opinion effectively trying to fix a bad optimiser. Seems like we still have a long way to go after Adam...
neodypsis|1 year ago
A preprint in arxiv suggests that Adam works better than SGD for training LLMs due to the issue of class-imbalance [0]. It appears that scaling the gradient step helps with the training, for example, see another approach suggested in [1].
0. https://arxiv.org/pdf/2402.19449 1. https://arxiv.org/pdf/2402.02347
Ger_Onimo|1 year ago
allpaca|1 year ago
cooljoseph|1 year ago