By the way, when I last checked there were two cool applications of LoRa: (1) a mesh, for (hopefully) truly decentralized and harder-to-disrupt communication, and (2) a gateway, so that you can get data from your sensors in remote places via standard internet protocols.
Both are very cool, but I wonder if I missed something else?
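For the gateway case, a minimal sketch of the idea in Python, assuming a serial-attached LoRa module that emits one reading per line and forwarding to a hypothetical HTTP endpoint (real modules differ in framing and commands):

```python
# Toy LoRa-to-internet gateway: read packets from a serial-attached LoRa
# module and forward them over HTTP. Uses pyserial and requests; the port,
# baud rate, and endpoint below are placeholders, not real values.
import serial
import requests

PORT = "/dev/ttyUSB0"                     # hypothetical serial port of the LoRa module
ENDPOINT = "https://example.com/ingest"   # hypothetical ingestion endpoint

with serial.Serial(PORT, 9600, timeout=5) as radio:
    while True:
        line = radio.readline().decode("utf-8", errors="ignore").strip()
        if not line:
            continue  # read timed out with no packet
        # Forward the raw reading; a real gateway would parse and validate it first.
        requests.post(ENDPOINT, json={"payload": line}, timeout=10)
```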
I think the literature is clear on that? "LoRA vs Full Fine-tuning: An Illusion of Equivalence" -- https://arxiv.org/abs/2410.21228v1
Quoting from the conclusions:
> The paper describes the finding that LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution. We found that LoRA and full fine-tuning yield models with significant differences in the spectral properties of their weight matrices: LoRA models often contain “intruder dimensions”, high-ranking singular vectors approximately orthogonal to the singular vectors of the pre-trained weight matrices. The existence of intruder dimensions correlates with the fine-tuned model forgetting more of the pre-training distribution, as well as forgetting more when trained on tasks sequentially in a continual learning setup.
I'm surprised they didn't cite this; it's a well-known paper.
I'm surprised you copied and pasted all of that without explaining what it means.
Does LoRA perform worse, better, or statistically insignificantly different from FullFT?
You aren't able to tell from what you pasted, are you?
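For anyone wondering what "intruder dimensions" look like in practice, here is a rough sketch of the kind of check the paper describes: compare the top singular vectors of a fine-tuned weight matrix against those of the pre-trained one and count the ones with no close match. The rank cutoff and threshold below are illustrative choices, not the paper's exact settings.

```python
import torch

def intruder_dimensions(W_pre, W_ft, k=10, threshold=0.3):
    """Count top-k singular vectors of the fine-tuned matrix that are
    nearly orthogonal to every singular vector of the pre-trained matrix."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    # |cosine similarity| between each top-k fine-tuned singular vector
    # and every pre-trained singular vector (all columns are unit vectors).
    sims = (U_ft[:, :k].T @ U_pre).abs()
    best_match = sims.max(dim=1).values
    return int((best_match < threshold).sum())   # "intruders" have no close match

# Toy example: a random base matrix vs. the same matrix plus a large rank-1 update.
W0 = torch.randn(512, 512)
W1 = W0 + 5.0 * torch.outer(torch.randn(512), torch.randn(512))
print(intruder_dimensions(W0, W1))   # typically flags the injected direction
```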
> LoRA works well when not capacity constrained, i.e., the number of trainable parameters exceeds the amount of information to be learned, which can be estimated in terms of dataset size
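A back-of-the-envelope version of that capacity estimate, assuming the standard LoRA parameter count and two rules of thumb that are my assumptions rather than the post's exact numbers (roughly one bit of information to learn per training token, roughly two bits storable per trainable parameter):

```python
# Rough check: is a LoRA adapter "capacity constrained" for a given dataset?
# Assumptions (mine, for illustration): ~1 bit to learn per training token,
# ~2 bits storable per trainable parameter, and every adapted matrix treated
# as d_model x d_model for simplicity.

def lora_params(d_model, rank, n_layers, matrices_per_layer=4):
    # Each adapted d x d matrix gets A (r x d) and B (d x r): 2 * r * d params.
    return n_layers * matrices_per_layer * 2 * rank * d_model

def capacity_ok(dataset_tokens, d_model, rank, n_layers,
                bits_per_token=1.0, bits_per_param=2.0):
    needed = dataset_tokens * bits_per_token
    available = lora_params(d_model, rank, n_layers) * bits_per_param
    return available >= needed

# Example: rank-32 adapters on a ~7B-scale model (d_model=4096, 32 layers)
# against a 10M-token fine-tuning set.
print(lora_params(4096, 32, 32))              # ~33.5M trainable parameters
print(capacity_ok(10_000_000, 4096, 32, 32))  # True under these assumptions
```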
I’m shocked they didn’t look at progressive merging of LoRAs. Research shows that’s the best way of improving LoRA’s ability to model higher-level features.
Seems like a massive miss, not to mention there is other research that contradicts a lot of their findings. This feels a bit like a researcher’s first pass at learning LoRA.
I'm not sure why progressive LoRA merging needs to be addressed here. They show there is a regime of problems where LoRA performs equivalently to full fine-tuning.
Progressive merging of LoRAs is somewhere in between, and categorically more complex than plain LoRA, so it would be dominated by standard LoRA in that regime.
While progressive merging could train faster, since fewer parameters are trainable at any given time, it results in much larger adapter diffs, on the order of the size of the original model, and I don't think it retains the benefit of being able to deploy multiple adapters over the same base model.
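For readers who haven't seen the term, "progressive merging" usually means periodically folding the current adapter into the base weights and starting a fresh low-rank adapter (ReLoRA-style), so that successive adapters can add up to an update of higher total rank. A minimal sketch of that merge step on a single linear layer, with made-up initialization details:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))   # zero, so the adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

    @torch.no_grad()
    def merge_and_reset(self):
        # Fold the current adapter into the frozen base weight, then restart a
        # fresh adapter. Repeating this lets the accumulated update exceed rank r.
        self.weight += self.scale * self.B @ self.A
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)

layer = LoRALinear(64, 64)
x = torch.randn(2, 64)
y_before = layer(x)
layer.merge_and_reset()
print(torch.allclose(y_before, layer(x), atol=1e-5))  # output unchanged right after a merge
```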
Question for folks building modern NNs... what's the thinking on estimating structural capacity for a real-world problem? How should I estimate how many parameters to choose for the model?
Can someone explain the bit counting argument in the reinforcement learning part?
I don’t get why a trajectory would provide only one bit of information.
Each step of the trajectory gives at least some information about which state transitions are possible.
An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
I believe it's because of the way you measure things in RL: each episode only tells you whether it was good (say, reward +1) or bad (say, 0 or a negative reward); it does not tell you anything about the trace that was produced to get that outcome. This reward is the only thing used to produce your gradients, hence the amount of information in it is O(1).
This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. a cross-entropy loss) and where, as a consequence, you get O(number of tokens) of information into your gradients.
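A toy way to see the contrast in code: with a REINFORCE-style objective, the whole trajectory's log-probability is multiplied by a single scalar reward, whereas token-level cross-entropy supplies a separate target at every position. (The shapes are the point here, not the training loop.)

```python
import torch
import torch.nn.functional as F

T, V = 128, 50_000                       # trajectory length, vocab size
logits = torch.randn(T, V, requires_grad=True)
tokens = torch.randint(0, V, (T,))       # the sampled trajectory

# Policy gradient (REINFORCE-style): the only supervision is one scalar
# reward for the whole trajectory -- O(1) bits per episode.
reward = 1.0
logp = F.log_softmax(logits, dim=-1)[torch.arange(T), tokens].sum()
pg_loss = -reward * logp

# Supervised fine-tuning: a cross-entropy target at every position --
# O(T) terms, each saying which token was "right" at that step.
sft_loss = F.cross_entropy(logits, tokens)

print(pg_loss.item(), sft_loss.item())   # both scalars, built from very different amounts of feedback
```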
A fair amount of research has shown that RL doesn’t add knowledge to the base model; it just optimizes paths that already exist.
Now ProRL from NVIDIA showed there are ways of adding knowledge, mostly through progressive merging.
I’m still not fully convinced of the one-bit claim; they made other mistakes in the blog post.
I've been curious about LoRA and find a lot of these articles interesting. But I've been unable to find a good "LoRA for idiots" kind of starting point that gets me started actually doing some training with my data. Anybody know of a more practical guide I could use for that?
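Not a full guide, but the shortest practical path I know of is Hugging Face's peft on top of transformers. A minimal sketch of the setup, where the base model name, rank, and target modules are just illustrative and the actual training loop is omitted:

```python
# Minimal LoRA setup with Hugging Face peft; the base model, rank, and target
# module names below are examples, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-1B"         # any causal LM you can load locally
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,                       # scaling factor
    target_modules=["q_proj", "v_proj"], # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # sanity check: only the adapters are trainable
# From here, train with your usual loop or transformers.Trainer on your own data.
```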
Be sure to validate everything you're reading, though; as of late I've come across more and more things in their docs that don't seem 100% accurate. It seems to depend heavily on which section.
Thinking Machines have put out a string of incredibly high-quality posts lately. It's hard to overstate how much cred this is buying them with the AI research community. Keep up the great work, folks!
It might be useful to use this thread in a dataset to train a LoRa so that LLM agents can more easily disambiguate the great LoRa acronym collision of ‘25. No longer will future generations suffer the indignity of either/or/both confusions.
Stumbled on this today... https://hackerpager.net/
I really want something like this with a flip-out keyboard that could do Signal over LTE/Wi-Fi.