You could absolutely use DisTrO for federated learning. The DeMo optimizer on its own doesn't address the adversarial aspects of training on local-only data, nor does it provide tensor parallelism across devices, so you're still limited to what fits on your local GPUs. What it does enable is distributed data parallelism over the internet at a bandwidth orders of magnitude lower than before.
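To make the bandwidth claim concrete, here is a toy sketch of the general idea behind communication-compressed data parallelism: each worker transmits only a small summary of its local gradient rather than the full tensor. This is a hypothetical top-k scheme for illustration only; the actual DeMo optimizer works differently (it decouples momentum components via a DCT-style transform), but the bandwidth arithmetic is the same in spirit.

```python
import numpy as np

def compress_topk(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of the gradient.
    (Illustrative compression, not DeMo's actual transform.)"""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    """Rebuild a sparse gradient tensor from the transmitted entries."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

rng = np.random.default_rng(0)
full_grad = rng.standard_normal((1000, 1000))  # ~8 MB at float64

k = 10_000  # transmit roughly 1% of the entries
idx, vals = compress_topk(full_grad, k)
sparse_grad = decompress(idx, vals, full_grad.shape)

# Bytes on the wire: indices + values vs. the full dense tensor.
full_bytes = full_grad.nbytes
sent_bytes = idx.nbytes + vals.nbytes
print(f"sent {sent_bytes / full_bytes:.1%} of the full gradient")
```

Each node still holds a full model replica (hence the "fits on your local GPUs" limit), but the per-step traffic shrinks by roughly the compression ratio, which is what makes training over ordinary internet links plausible.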