mendigou | 1 year ago
Most modern large models cannot be trained on one instance of anything (GPU, accelerator, whatever), so there's no alternative to distributed training. They wouldn't even fit in the memory of a single GPU/accelerator, so there are even more complex schemes to split the model itself across instances.
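A back-of-the-envelope sketch of why a large model can't fit on one accelerator. The per-parameter byte counts below assume mixed-precision training with Adam; they're my illustrative numbers, not from the comment:

```python
# Rough training-memory estimate (assumption: fp16 weights/grads + fp32 Adam state).
def training_memory_gb(n_params: float) -> float:
    # Per parameter: 2 bytes fp16 weights + 2 bytes fp16 gradients
    # + 12 bytes fp32 optimizer state (master weights, Adam m and v).
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / 1e9

# A 70B-parameter model needs on the order of 1120 GB just for training state,
# far beyond a single 80 GB accelerator -- hence model sharding.
print(training_memory_gb(70e9))  # → 1120.0
```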
mirekrusin | 1 year ago
You can't scale averaging of parallel runs very far; you need to churn through iterations fast.
You can't, for example, start from a random state, schedule parallel training runs, average them all out, and expect to end up with a well-trained network in one step.
Every step invalidates the input state for everything else, and that state is gigantic.
Training is dominated by huge transfers at high frequency.
You can't, say, connect two GPUs with a network cable and expect a speedup. You need to put them on the same motherboard (or a comparably fast interconnect) to see any gains.
SETI@home, for example, is unlike that; it can be easily distributed: a partial read-only snapshot, intense local computation, then a thin result submission.
mendigou | 1 year ago