TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning

166 points | mfiguiere | 3 years ago | arxiv.org

53 comments

[+] nighthawk454|3 years ago|reply
some interesting tidbits:

> stretched our ML supercomputer scale .. to 4096 TPU v4 nodes

> The Google tradition is to write retrospective papers ... TPU v4s and A100s deployed in 2020 and both use 7nm technology

> The appropriate H100 match would be a successor to TPU v4 deployed in a similar time frame and technology (e.g., in 2023 and 4 nm).

> TPU v4 supercomputers [are] the workhorses of large language models (LLMs) like LaMDA, MUM, and PaLM. These features allowed the 540B parameter PaLM model to sustain a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers

> Google has deployed dozens of TPU v4 supercomputers for both internal use and for external use via Google Cloud

> Moreover, the large size of the TPU v4 supercomputer and its reliance on OCSes looks prescient given that the design began two years before the paper was published that has stoked the enthusiasm for LLMs

[+] ttul|3 years ago|reply
Brought to you by the future: "Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired."
[+] sanxiyn|3 years ago|reply
As the paper explains, optical circuit switches are not new in TPU v4 and are not the main topic of this paper. Google was already using them for networking and published a paper about that last year. For details, see https://arxiv.org/abs/2208.10041.
[+] pclmulqdq|3 years ago|reply
The future of the '90s. Optical matrix switches like this have been around for a long time. These aren't doing packet switching (which honestly would be the future if done optically); it's more of a layer 1 thing: the switch replaces you plugging and unplugging a cable. Bell Labs was building these kinds of switches back in the day.
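To make the "plugging and unplugging a cable" point concrete, here is a minimal sketch of a circuit switch as a reconfigurable one-to-one port mapping. The `CircuitSwitch` class and its method names are hypothetical, made up for illustration; this is a toy model, not how Google's OCS hardware or its control software works.

```python
from typing import Dict, Optional, Tuple


class CircuitSwitch:
    """Toy layer-1 circuit switch (hypothetical, for illustration only).

    It never inspects what it carries. It only remembers which port is
    patched to which, like a programmable patch panel.
    """

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self._peer: Dict[int, int] = {}  # port -> port it is patched to

    def disconnect(self, port: int) -> None:
        """Unplug the 'cable' on this port, if there is one."""
        peer = self._peer.pop(port, None)
        if peer is not None:
            self._peer.pop(peer, None)

    def connect(self, a: int, b: int) -> None:
        """Patch port a to port b, replacing any existing circuits on either."""
        for port in (a, b):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} out of range")
        if a == b:
            raise ValueError("cannot patch a port to itself")
        self.disconnect(a)
        self.disconnect(b)
        self._peer[a] = b
        self._peer[b] = a

    def forward(self, in_port: int, signal: bytes) -> Optional[Tuple[int, bytes]]:
        """Pass the signal through untouched: no headers, no routing decision."""
        peer = self._peer.get(in_port)
        if peer is None:
            return None  # nothing plugged in, the signal goes nowhere
        return peer, signal


switch = CircuitSwitch(num_ports=4)
switch.connect(0, 1)
print(switch.forward(0, b"anything"))  # leaves on port 1, unchanged
switch.connect(0, 2)                   # "re-plug" port 0 into port 2
print(switch.forward(1, b"anything"))  # port 1 is now unplugged
```

A packet switch would instead look inside each packet and pick an output port per packet; here the mapping only changes when someone reconfigures it.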
[+] abcdabcd987|3 years ago|reply
On a related note, Google also uses optical circuit switches in their datacenter network. See the paper from SIGCOMM'22 [1].

[1] Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking. https://research.google/pubs/pub51587/

[+] rpcope1|3 years ago|reply
So something like an optical FPGA?
[+] CGamesPlay|3 years ago|reply
Is there any way to purchase anything like a TPU? I guess the Cerebras Andromeda product is one, but I don't know if those are sold or leased. Any others?

https://www.cerebras.net/andromeda/

[+] amrb|3 years ago|reply
Cerebras systems are whole racks with high cost and cooling requirements; PCI cards are more sensible for home gamers:

https://tenstorrent.com/

[+] wiz21c|3 years ago|reply
From the page: "Andromeda, a 13.5 Million Core AI Supercomputer". Blown away by the number of cores (I considered myself lucky to have two GPUs with 10,000+ cores each in my workstation), I then realized that the word "core" is singular in the sentence. Is it just a mistake or does it mean something else? (Genuine question, English is not my first language.)

EDIT: Ahhh a bit below on the page it is written "13.5 million AI-optimized cores" and there it's plural. So it was probably just a mistake.

[+] einpoklum|3 years ago|reply
I hate the enormous waste of human ability, ingenuity and effort in the creation of proprietary technologies like this. You've made a chip? Offer it for everyone to use. Same goes for Amazon and Apple. It's not as though it's a chip that's only usable for Google-specific work.
[+] alfor|3 years ago|reply
It’s because Google is a monopoly; it doesn’t operate on normal economic incentives.

Its goal is only to keep the monopoly: appear benign, keep tech advances in house, and share its spying network with the government so the government doesn’t regulate them. Win-win.

[+] omeysalvi|3 years ago|reply
If people are not allowed to monetize their innovations, there is no incentive to innovate. While this needs to have its limits, sharing it with everyone immediately upon creation is not the answer.