Full title: "Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips"
Summary from Bard: "This article is about training large language models (LLMs) on Google Cloud TPUs. It discusses the challenges of training LLMs at scale, and how Google Cloud TPU Multislice Training addresses these challenges. The article also details the results of a recent experiment in which Google trained a 128B parameter LLM on 50,944 TPU v5e chips. This experiment is the largest publicly disclosed LLM distributed training job to date."
Great questions! A slice is a set of TPU chips that share a fast, private inter-chip interconnect (ICI). Unlike the current GPU generation in clouds, TPUs on different machines can communicate over this private network. Multislice means we're using a hierarchical network, with both the inter-chip interconnect within a slice and normal data-center networking between slices.
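A toy model of why the hierarchy helps (all numbers below are illustrative assumptions, not TPU v5e specs): in a hierarchical all-reduce, each slice first reduces gradients internally over fast ICI, so only one reduced copy per slice has to cross the slower data-center network (DCN), instead of every chip participating over the DCN:

```python
# Toy model: aggregate data-center-network (DCN) traffic for one
# gradient all-reduce. Numbers are illustrative, not TPU v5e specs.

GRAD_BYTES = 16e9 * 2   # e.g. a 16B-param model's gradients in bf16
CHIPS_PER_SLICE = 256
N_SLICES = 199

def ring_allreduce_traffic(bytes_grad, participants):
    # In a ring all-reduce each participant sends ~2x the gradient
    # across its links, so aggregate traffic scales with participants.
    return 2 * bytes_grad * participants

# Hierarchical (multislice): only one reduced copy per slice touches the DCN.
hier = ring_allreduce_traffic(GRAD_BYTES, N_SLICES)
# Flat: every chip would participate in the all-reduce over the DCN.
flat = ring_allreduce_traffic(GRAD_BYTES, CHIPS_PER_SLICE * N_SLICES)

print(f"DCN traffic reduction from the hierarchy: {flat / hier:.0f}x")
```

Under this model the DCN sees 256x less traffic, i.e. exactly the chips-per-slice factor.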
Unlikely. One reason Google Cloud is so terrible is that nobody in Google actually uses Google Cloud. It used to be that every time I mentioned this, somebody would jump in and say, "Well actually, Google Domains runs on Google Cloud," and we'd discuss whether Google Domains was a business critical part of Google. https://support.google.com/domains/answer/13689670?hl=en
Question for rwitten or anyone else involved in this project:
I see a per-device batch size of 6 for the 16B model. With 256x199 = 50944 TPUs and a sequence length of 2048, this works out to 104M tokens per batch. This is much larger than typical for training runs of dense LMs of this size, which are usually closer to ~4M tokens per batch.
Was your critical batch size really this large? In other words, did you really see a benefit as compared to a much smaller batch size (and probably many fewer TPUs)? Did you use some special learning rate schedule or optimizer to achieve this?
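Spelling out the arithmetic in the parent comment (the figures are the commenter's, not from the blog post; the ~104M number corresponds to one 2048-token sequence per chip per step):

```python
# Reproducing the parent comment's batch-size arithmetic.
chips = 256 * 199            # slices x chips-per-slice, per the comment
seq_len = 2048

tokens_per_batch = chips * seq_len   # one sequence per chip per step
print(chips)                          # 50944
print(f"{tokens_per_batch / 1e6:.0f}M tokens/batch")

typical = 4e6                         # ~4M tokens, per the comment
print(f"~{tokens_per_batch / typical:.0f}x a typical batch for this size")
```

That ratio (~26x over a typical ~4M-token batch) is what makes the critical-batch-size question interesting.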
Ok so they claim in the article that 50,000 TPUs are equivalent to 10 exaflops of floating point computation. That's equivalent to ~2,512 NVIDIA H100s, which is really small. Just shows the difference between TPUs and GPUs, I guess. Inflection, a new LLM company, created a 20,000-H100 cluster, and I'm positive OpenAI, Tesla, Meta, etc. have orchestrated a job on more than 2,500 H100 GPUs.
Hey! I'm a contributor on this (Rafi Witten), all opinions my own.
You're asking the right question, but I think the math is off by a bit. The equivalent number on the H100 is 989 TFLOP/s/chip, so the equivalent job is ~10K H100s = (10 * 10^18) / (989 * 10^12). (Both chips also have 8-bit acceleration!)
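The correction, spelled out (989 TFLOP/s is the H100's dense bf16 tensor-core figure; the parent's ~2,512 number implicitly used a much higher per-chip rate):

```python
# Cluster-equivalence arithmetic from the reply above.
cluster_flops = 10e18    # ~10 exaflop/s claimed for the 50k-chip job
h100_flops = 989e12      # H100 dense bf16 tensor-core peak, per chip

equivalent_h100s = cluster_flops / h100_flops
print(f"~{equivalent_h100s:,.0f} H100s")
```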
I believe this is the largest ML job, both by exaflops and by number of chips, ever demonstrated. Other companies own more chips or exaflops than we show in this job, but getting all the hardware working at once on a single job is a different matter! :-)
It's worth noting that just because an H100 has a higher flops number doesn't mean your program is actually hitting that number of flops. Modern TPUs are surprisingly competitive with Nvidia on a perf/$ metric, if you're doing cloud ML they are absolutely worth a look. We have been keeping costs down by racking our own GPUs but TPUs are so cost effective that we need to do some thinking about changing our approach.
I'm not certain but I think part of this is that XLA (for example) is a mountain of chip-specific optimizations between your code and the actual operations. So comparing your throughput between GPU and TPU is not just flops-to-flops.
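One way to make "peak flops isn't achieved flops" concrete is model FLOPs utilization (MFU): the useful FLOP/s your training step sustains divided by hardware peak. A sketch using the standard ~6 FLOPs/parameter/token approximation (the throughput and peak numbers below are hypothetical, not from the article):

```python
# Model FLOPs utilization (MFU): peak flops only matter multiplied by
# the fraction your program actually sustains. Inputs are hypothetical.

def mfu(params, tokens_per_sec, peak_flops):
    # Standard approximation: ~6 FLOPs per parameter per trained token.
    return 6 * params * tokens_per_sec / peak_flops

peak = 197e12    # e.g. per-chip bf16 peak (assumed)
achieved = mfu(params=16e9, tokens_per_sec=1200, peak_flops=peak)
print(f"MFU: {achieved:.0%}")
```

Two chips with identical peak flops can land at very different MFU depending on how well the compiler (XLA here) maps the model onto the hardware, which is why flops-to-flops comparisons mislead.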
This is a blog post from Google Cloud marketing. It's saying that you, too, could train an LLM on Google Cloud if you hand them enough money. You can't do that on Inflection's or Tesla's clusters. Similar marketing blog post from last year: https://cloud.google.com/blog/products/compute/calculating-1...
The PaLM paper linked in the blog post is about how to get something actually useful out of that compute.
Something that doesn't seem worth bragging about is that the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?
Disclaimer: work associated with this team, didn't write or review the blog post
The article stated that startup time was dominated by the throughput of scheduling pods onto the clusters (from unrelated benchmarks, that's usually ~300 pods/sec for the kube scheduler today), plus doing XLA compilation at pod launch rather than amortizing it once across all jobs.
Optimizing kube scheduler throughput is a good general opportunity and something I believe we'd like to see.
I believe AOT compilation was just not a critical optimization for this test; for large, long-running training jobs we would recommend AOT compiling to keep pod start latency low across hardware failures and job restarts (from checkpoints).
> The start times we observed were impressive, but we believe we can improve these even further. We are working on areas such as optimizing scheduling in GKE to increase throughput and enabling ahead-of-time compilation in MaxText to avoid just-in-time compilations on the full cluster.
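A back-of-the-envelope for why startup grows linearly with cluster size under a fixed-throughput scheduler (the ~300 pods/sec figure is from the comment above; one pod per host and 4 chips per host are assumptions here, not stated specs):

```python
# Fixed-rate scheduler placing one pod per host: scheduling time is
# linear in cluster size. 300 pods/sec is the rough kube-scheduler
# figure quoted above; 4 chips/host is an assumption.

SCHED_PODS_PER_SEC = 300
CHIPS_PER_HOST = 4

def schedule_seconds(chips):
    pods = chips / CHIPS_PER_HOST
    return pods / SCHED_PODS_PER_SEC

for chips in (256, 12_288, 50_944):
    print(f"{chips:>6} chips -> ~{schedule_seconds(chips):5.1f}s to schedule")
```

Doubling the cluster doubles the pods to place, hence the linear growth; per-pod JIT compilation at launch adds a further roughly constant term per pod, which AOT compilation would remove.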
(Contributor on the blog post, all opinions my own)
Agreed with you and we definitely weren't trying to brag! This is fast compared to people's expectations in the space but slow compared to what we should be able to accomplish and will accomplish in the future.
rwitten|2 years ago
More details: https://cloud.google.com/tpu/docs/multislice-introduction
(P.S. - contributor on blog post, Google employee, all thoughts my own)
latchkey|2 years ago
https://inflection.ai/inflection-ai-announces-1-3-billion-of...
https://inflection.ai/nvidia-coreweave-mlperf
marmaduke|2 years ago
Based on what FLOPS number, and for which op, on the H100?
filterfiber|2 years ago
They just showed they could indeed make 50k TPUs do some flops.
With no paper, this is just a marketing press release; the only takeaway is that existing tech stacks can probably utilize it.