To be more exact, LoRA adds two matrices `A` and `B` alongside any layer that contains trainable weights. The original weight matrix `W_0` has shape `d × k` and is frozen. Matrix `A` has shape `d × rank` (`rank` is configurable) and matrix `B` has shape `rank × k`. The product `A·B` (again `d × k`) is added to `W_0` to get the altered weights. The benefit is that the two extra matrices are tiny compared to `W_0`: only `rank × (d + k)` parameters need to be optimized instead of `d × k`, so far fewer gradients and optimizer states need to be kept in memory.
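A minimal NumPy sketch of the construction described above; the dimensions, rank, and random init here are illustrative, with `W0` standing in for a frozen pretrained weight matrix:

```python
import numpy as np

d, k, rank = 64, 32, 4
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))             # frozen pretrained weights, d x k
A = rng.normal(size=(d, rank)) * 0.01    # trainable, d x rank
B = np.zeros((rank, k))                  # trainable, rank x k (zero init,
                                         # so training starts from W0 exactly)

W = W0 + A @ B                           # effective weights, still d x k

print("full params:", d * k)             # 2048
print("LoRA params:", rank * (d + k))    # 384
```

With `B` initialized to zero, `W` equals `W_0` at the start of training, and only the 384 adapter parameters receive updates.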
twic|2 years ago
Why is fine-tuning done with separate alterations, rather than by mutating the original weights?
arugulum|2 years ago
The goal of most parameter-efficient methods is to store one gold copy of the original model, and learn minor modifications/additions to the model. The easiest way to think about this is in some kind of deployment setting, where you have 1 capable model and you learn different sets of LoRA weights for different tasks and applications.
The original intent of parameter-efficient methods is to reduce the amount of storage space needed for models (do you really want to keep a whole additional copy of LLaMA for each different task?). A secondary benefit is that because you are fine-tuning a smaller number of parameters, the optimizer states (which can take up to 2x the size of your model) also shrink heavily, which makes parameter-efficient fine-tuning much more economical memory-wise.
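A back-of-the-envelope version of the storage argument above. The model numbers are assumptions (a LLaMA-7B-like model: ~7e9 parameters, 32 layers, hidden size 4096, stored in fp16), and the adapter configuration (rank 8 on the four attention projections) is a hypothetical example, not a measurement:

```python
# Rough storage arithmetic for the "one gold copy + per-task adapters" setup.
BYTES_PER_PARAM = 2                                   # fp16

full_copy = 7_000_000_000 * BYTES_PER_PARAM           # ~14 GB per task
                                                      # if you copy the model

# Hypothetical adapter: rank-8 LoRA on 4 attention projections per layer,
# each projection a 4096 x 4096 matrix, across 32 layers.
d = k = 4096
rank, projections, layers = 8, 4, 32
adapter_params = rank * (d + k) * projections * layers
adapter_bytes = adapter_params * BYTES_PER_PARAM      # ~17 MB per task

print(round(full_copy / adapter_bytes))               # hundreds of times smaller
```

Under these assumptions, a per-task adapter is roughly three orders of magnitude smaller than a per-task copy of the full model.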
stu2b50|2 years ago
It's actually larger. If you just had two equally large matrices of the same dimensions, one original and one of "alterations"... then you could just add them together.
> Why is fine-tuning done with separate alterations, rather than by mutating the original weights?
Then you'd have to compute and store gradients (plus optimizer states) for the whole network, which is very expensive when the model has 7B, 65B, or 175B parameters. The intent is to make training cheaper by only optimizing a low-rank representation of the change in each weight matrix.
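A toy NumPy illustration of that point: only the low-rank factors `A` and `B` receive gradient updates, while `W0` is never touched. The problem (linear regression with squared error), the shapes, and the learning rate are all synthetic choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, rank = 16, 8, 2

W0 = rng.normal(size=(d, k))             # frozen pretrained weights
A = rng.normal(size=(d, rank)) * 0.01    # trainable low-rank factor
B = np.zeros((rank, k))                  # trainable, zero init => start at W0

x = rng.normal(size=(32, d))             # toy inputs
y = rng.normal(size=(32, k))             # toy targets

def loss(A, B):
    return np.mean((x @ (W0 + A @ B) - y) ** 2)

initial = loss(A, B)
lr = 1e-3
for _ in range(200):
    W = W0 + A @ B
    err = x @ W - y                      # (32, k)
    grad_W = 2 * x.T @ err / len(x)      # gradient w.r.t. the effective W
    grad_A = grad_W @ B.T                # chain rule through W = W0 + A @ B
    grad_B = A.T @ grad_W
    A -= lr * grad_A                     # only A and B are ever updated;
    B -= lr * grad_B                     # W0 stays frozen throughout

print(initial, loss(A, B))               # loss goes down, W0 untouched
```

Note the optimizer only ever sees `rank × (d + k)` parameters, which is where the memory savings on gradients and optimizer states come from.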
stu2b50|2 years ago