This paper [1] does attempt that and reports performance comparable to conventional pre-training. However, the authors start with a normal full-rank training phase and argue it is needed to 'warm start' the training process.

[1] https://arxiv.org/abs/2307.05695
danielhanchen|1 year ago
GaLore might be closer to full pre-training, with only the gradients being low rank.
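To make the distinction concrete, here is a rough NumPy sketch (not the actual GaLore implementation; the names and shapes are illustrative) of the idea: the weight matrix stays full rank, and only the gradient is compressed into a low-rank subspace before the optimizer step.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_project(grad, rank):
    # Project a full gradient matrix onto its top-`rank` left singular
    # subspace, in the spirit of GaLore-style low-rank gradient updates.
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]           # (m, rank) projection basis
    return P, P.T @ grad      # compressed gradient lives in rank-dim space

# Toy example: weights stay full rank, only the gradient is compressed.
W = rng.normal(size=(8, 6))
G = rng.normal(size=(8, 6))           # stand-in for dL/dW
P, G_low = low_rank_project(G, rank=2)
W -= 0.1 * (P @ G_low)                # project back up before the update

print(G_low.shape)  # (2, 6): optimizer state shrinks from 8x6 to 2x6
```

Because the optimizer only ever sees the (rank x n) compressed gradient, its state (e.g. Adam moments) shrinks accordingly, while the weight update itself is still applied to the full-rank `W`; this is the sense in which the method stays close to ordinary full pre-training.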