cabidaher | 1 year ago

This paper [1] does attempt that and reports performance similar to conventional pre-training. However, the authors start with a normal full-rank training phase and argue that it is needed to 'warm start' the low-rank training process.

[1] https://arxiv.org/abs/2307.05695

danielhanchen | 1 year ago

Oh yes, this paper! The main issue is the scaling of the LoRA A and B matrices. Some papers show that training the B matrix with a larger learning rate (LoRA+) can be beneficial. DoRA, for example, learns a magnitude vector that automatically rescales the update, which tries to alleviate these issues.
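To make the scaling point concrete, here is a minimal NumPy sketch of a single LoRA layer with a LoRA+-style learning-rate split. All names, the toy dimensions, and the 16x ratio are illustrative assumptions, not from any of the cited papers' code:

```python
import numpy as np

# LoRA replaces a frozen weight W with W + (alpha / r) * B @ A, where
# A is (r, d_in) and B is (d_out, r). LoRA+ trains B with a larger
# learning rate than A; a fixed ratio is one simple way to do that.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4.0
scale = alpha / r                           # the LoRA scaling factor

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: delta starts at 0

def forward(x):
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
y_target = rng.normal(size=d_out)
loss_before = 0.5 * np.sum((forward(x) - y_target) ** 2)

# One SGD step with LoRA+-style per-matrix learning rates.
lr_A = 1e-2
lr_B = lr_A * 16.0                          # B gets the larger step (LoRA+)
g = forward(x) - y_target                   # dL/dy for the squared loss
grad_B = scale * np.outer(g, A @ x)         # dL/dB
grad_A = scale * np.outer(B.T @ g, x)       # dL/dA
A -= lr_A * grad_A
B -= lr_B * grad_B

loss_after = 0.5 * np.sum((forward(x) - y_target) ** 2)
```

Note how `scale = alpha / r` multiplies both the forward delta and both gradients, which is why the choice of alpha and r interacts with the learning rates in the first place.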

GaLore might be closer to full pretraining, with only the gradients being low rank.
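The GaLore idea can be sketched in a few lines; this is a toy illustration of the gradient-projection step (not the official implementation, and real usage keeps optimizer state in the low-rank space and refreshes the projection periodically):

```python
import numpy as np

# Keep the full-rank weight W, but project the *gradient* onto the top-r
# left singular vectors of itself, so optimizer state lives at size (r, n)
# instead of (m, n), then project the step back up.
rng = np.random.default_rng(1)
m, n, r, lr = 16, 16, 4, 0.1

W = rng.normal(size=(m, n))
G = rng.normal(size=(m, n))     # stand-in for a full gradient dL/dW

U, S, Vt = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                    # orthonormal basis for the top-r subspace

G_low = P.T @ G                 # (r, n) compressed gradient
# ... Adam moments etc. would be stored at this (r, n) size ...
step = P @ G_low                # project back to (m, n)
W -= lr * step                  # full-rank weight, low-rank update
```

So the weights stay full rank throughout; only the update (and the optimizer memory) is rank-r, which is the sense in which it resembles full pretraining more than LoRA does.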