Heh. This problem reminds me of back in 2019, when I was working with Shawn Presser on finetuning GPT-2 on Google Colab. It would randomly error out every once in a while, but redownloading the last saved checkpoint from our server took something like 10 minutes IIRC, and saving the current checkpoint took minutes too. So the question was: how often should we save to minimize the total time spent restoring and saving? I did a bit of algebra and I think we wound up with an answer like '40 minutes'!
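(For anyone curious, the textbook version of that bit of algebra is the Young/Daly approximation: with checkpoint cost C and mean time between failures M, the interval that roughly minimizes expected overhead is about sqrt(2·C·M). A quick sketch, with made-up numbers chosen in the spirit of the anecdote:)

```python
import math

def optimal_checkpoint_interval(checkpoint_cost, mtbf):
    """Young/Daly approximation: the interval between checkpoints that
    roughly minimizes expected time lost to saving plus redoing work,
    given the cost of one checkpoint and the mean time between failures
    (both in the same time unit)."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

# Made-up numbers for illustration: a ~4-minute save and a crash
# every ~3.5 hours on average.
print(round(optimal_checkpoint_interval(4, 210)))  # -> 41 (minutes)
```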
DL infrastructure & training practices have gotten better since then...
Interesting work! This is really an engineering achievement, and I wish there were usable code. Real-time checkpointing seems like obviously the future to me, but it's going to take an easy-to-use, high-performance implementation to make that a reality.
One of the things I would like to have seen in the paper is a better analysis of simply checkpointing more often. It's briefly touched on:
> It is infeasible to arbitrarily increase the checkpoint frequency because checkpoint frequency is bottlenecked by the bandwidth of the remote persistent storage [28]. For example, it takes 42 minutes to checkpoint the model states of MT-NLG [68] to the remote persistent storage when the bandwidth is 20Gbps.
and
> Both baselines, Strawman and HighFreq, have the same checkpoint time and it stays almost the same as the number of machines increases from 4 to 16 because the aggregated bandwidth of the remote persistent storage is fixed
But that smells a bit off to me. That's a 530B model (unrealistically large given current trends IMO) where each model replica has 280 A100s and then there is data parallelism on top. Where exactly are you storing your (sharded?) checkpoint where the read/write bandwidth isn't also scaling horizontally beyond 20Gbps?
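FWIW, the paper's 42-minute figure itself checks out as back-of-the-envelope arithmetic, if you assume roughly 12 bytes per checkpointed parameter (e.g. fp32 master weights plus Adam momentum and variance; the exact layout varies by framework, so treat that as an assumption):

```python
params = 530e9          # MT-NLG parameter count
bytes_per_param = 12    # assumption: fp32 weights + Adam momentum + variance
bandwidth_gbps = 20     # remote-storage bandwidth quoted in the paper

checkpoint_bits = params * bytes_per_param * 8   # ~6.4 TB of model states
minutes = checkpoint_bits / (bandwidth_gbps * 1e9) / 60
print(round(minutes, 1))  # -> 42.4

# The complaint above is about the fixed denominator: if each
# data-parallel rank wrote its own shard to horizontally scaled storage,
# aggregate bandwidth would grow with machine count, e.g. 16 machines at
# 20 Gbps each would cut this to ~2.7 minutes.
print(round(minutes / 16, 1))
```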
It's strange because supercomputing centers have long built out compute and storage in parallel to address exactly this problem. Older companies like SGI put storage directly on the high-speed, low-latency interconnect. Others built clusters with separate node types for compute and storage.
Companies that can train models this big should hire people with HPC experience. They'd point out the need for storage clusters with high-speed interconnects. If they lack storage capabilities, I wonder why they're doing HPC like that. They clearly need the storage.
For example, the cluster BLOOM was trained on lists 100+ GB of RAM per node and PBs of storage:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.htm...

For anyone else who was confused to see a paper use the same name as a commercial product: it looks like Google Gemini was announced in May, whereas this was submitted to SOSP, which had an April submission deadline.

It's not a good name to give to anything. Unless you're a corporate giant, naming creativity is really important for making your work findable and re-findable.

I think this points more to how slow the paper submission process is compared to product-creation velocity. No wonder arxiv has been such a hit with the ML community.
Maybe ask ChatGPT for ideas.