querez | 1 month ago
You are wrong: Gemini was definitely trained entirely on TPUs. Your point that you need to count failed experiments, too, is of course correct. But you seem to have misconceptions about how DeepMind operates and what infra it possesses. DeepMind (like nearly all Google-internal workloads) runs on Borg, an internal cluster-management system that is completely separate from, and different from, GCP. DeepMind does not have access to any meaningful GCP resources, and Borg barely has any GPUs. At the time I left DeepMind, the amount of TPU compute available was probably 1000x to 10000x larger than the amount of GPU compute. You would never seriously consider using GPUs for neural-net training: they are too limited (in terms of available compute), too expensive (in terms of internal resource-allocation units), and frankly less well supported by internal tooling than TPUs. Even for small, short experiments, you would always use TPUs.
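For illustration (a minimal sketch of my own, not actual DeepMind code): with JAX the accelerator backend is nearly transparent, so there was no workflow reason to prefer GPUs either. The same training step compiles for TPU, GPU, or CPU; only the device list differs.

    import jax
    import jax.numpy as jnp

    # Lists whatever accelerators the runtime exposes,
    # e.g. TpuDevice entries on a TPU host.
    print(jax.devices())

    def loss_fn(w, x, y):
        # Simple least-squares loss, just for illustration.
        return jnp.mean((x @ w - y) ** 2)

    @jax.jit  # compiled via XLA for whichever backend is available
    def train_step(w, x, y, lr=0.1):
        grads = jax.grad(loss_fn)(w, x, y)
        return w - lr * grads

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (128, 8))
    w_true = jnp.arange(8.0)
    y = x @ w_true
    w = jnp.zeros(8)
    for _ in range(100):
        w = train_step(w, x, y)
    print(w)  # approaches w_true regardless of backend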
YetAnotherNick | 1 month ago
A big segment of the market just uses GPUs/TPUs to train LLMs, so they don't exactly need flexibility as long as the tooling is well supported.