item 39149625

jchonphoenix | 2 years ago

None of this matters if they can't get the hardware stack to work correctly.

The media keeps missing the real lock-in Nvidia has: CUDA. It's not the hardware. It's the ability for someone to use it painlessly.


nl|2 years ago

TPUs have the second-best software stack after CUDA though. JAX and TensorFlow support them before CUDA in some cases, and it's the only non-CUDA PyTorch environment that comes close for support.

pjmlp|2 years ago

TPUs target a single use case, unlike CUDA.

bootsmann|2 years ago

Google has historically been weak at breaking into markets that someone else has already established, and I think the TPUs are suffering the same fate. There is not enough investment in making the chips compatible with anything other than Google's preferred stack (which happens not to be the established industry stack). Committing to getting torch to switch from device = "cuda" to device = "tpu" (or whatever) without breaking the models would go a long way imo.
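A minimal sketch of the kind of device-string portability the comment is asking for. Everything here is hypothetical illustration, not a real PyTorch API: the probe functions stand in for real checks like torch.cuda.is_available() or a successful import of the separate torch_xla package.

```python
# Hypothetical sketch: backend-agnostic device selection with fallback.
# The probe functions are stand-ins for real availability checks.

def tpu_available() -> bool:
    # Stand-in: in reality you'd try importing torch_xla and probing for a TPU.
    return False

def cuda_available() -> bool:
    # Stand-in for torch.cuda.is_available().
    return False

def pick_device() -> str:
    """Return the best available device string, falling back to CPU."""
    if tpu_available():
        return "tpu"
    if cuda_available():
        return "cuda"
    return "cpu"

device = pick_device()
print(device)  # prints "cpu" when no accelerator is found
```

The point of the comment is that model code should only ever touch the device string, so swapping accelerators doesn't mean rewriting the model.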

ugh123|2 years ago

I always thought Google was actually pretty good at taking over established, or rising markets, depending on the opportunity or threat they see from a competitor. Either by timely acquisition and/or ability to scale faster due to their own infrastructure capabilities.

- Google search (vs previous entrenched search engines in the early '00s)

- Adsense/doubleclick (vs early ad networks at the time)

- Gmail (vs aol, hotmail, etc)

- Android (vs iOS, palm, etc)

- Chrome (vs all other browsers)

Sure, I'm picking the obvious winners, but these are all market leaders now (Android by global share) where earlier incumbents were big, but not Google-big.

Even if Google's use of TPUs is purely self-serving, it will have a noticeable effect on their ability to scale their consumer AI usage at diminishing costs. Their ability to scale AI inference to meet "Google scale" demand, and do it cheaply (at least by industry standards), will make them formidable in the "AI race". This is why Altman/Microsoft and others are investing heavily in AI chips.

But I don't think their TPU will be only self-serving; rather, they'll scale its use through GCP for enterprise customers to run AI. Microsoft is already tapping their enterprise customers for this new "product". But those kinds of customers will care more about cost than anything else.

The long-term game here is a cost game, and Google is very, very good at that and has a headstart on the chip side.

dekhn|2 years ago

TPUs were originally intended to just be for internal use (to keep google from being dependent on Intel and nvidia). Making them an external product through cloud was a mistake (in my opinion). It was a huge drain on internal resources in many ways and few customers were truly using them in the optimal way. They also competed with google's own nvidia GPU offering in cloud.

The TPU hardware is great in a lot of ways and it allowed google to move quickly in ML research and product deployments, but I don't think it was ever a money-maker for cloud.

amelius|2 years ago

> The media keeps missing the real lock in Nvidia has: CUDA. It's not the hardware. It's the ability for someone to use it painlessly.

Really? What if someone writes a new back-end to PyTorch, TensorFlow and perhaps a few other popular libraries? Then will CUDA still matter that much?
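To make concrete what "writing a new back-end" actually entails, here is a toy sketch of framework-style op dispatch: each backend registers one implementation per operation, and the frontend routes calls by device. All names are illustrative, not real PyTorch internals; it also hints at why the next comment's skepticism is warranted, since every single op needs coverage before existing models run.

```python
# Toy op-dispatch registry, the conceptual core of a framework backend.

BACKENDS = {}  # device name -> {op name -> implementation}

def register(backend, op):
    """Decorator that records an op implementation for a backend."""
    def deco(fn):
        BACKENDS.setdefault(backend, {})[op] = fn
        return fn
    return deco

@register("cpu", "add")
def cpu_add(a, b):
    return [x + y for x, y in zip(a, b)]

@register("newchip", "add")
def newchip_add(a, b):
    # A real backend would lower this to the vendor's kernel or compiler.
    return [x + y for x, y in zip(a, b)]

def dispatch(op, device, *args):
    """Route an op to the backend's implementation, like a framework would."""
    try:
        impl = BACKENDS[device][op]
    except KeyError:
        raise NotImplementedError(f"{op} not implemented for {device}")
    return impl(*args)

print(dispatch("add", "newchip", [1, 2], [3, 4]))  # prints [4, 6]
```

Any op a model uses that the new backend hasn't registered raises NotImplementedError, which is exactly the "runs on GPU but crashes on TPU" experience described further down the thread.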

p1esk|2 years ago

> if someone writes a new back-end to PyTorch

If that was easy to do surely AMD would have done it by now? After many years of trying?

fritzo|2 years ago

PyTorch has had an XLA backend for years. I don't know how performant it is though. https://pytorch.org/xla

pjmlp|2 years ago

Can you do Unreal Engine's Nanite, or OTOY's ray tracing, in PyTorch?

kjkjhgkjyj|2 years ago

TensorFlow and PyTorch support TPUs. It's pretty painless.

Mehdi2277|2 years ago

Having used it heavily, it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support: Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any TF version you like. For TPU you need different nodes depending on TF version. Which TF versions are supported per GCP service is inconsistent: some versions are supported on Cloud AI Platform but not Vertex AI, and vice versa. I have had a lot of difficulty trying to upgrade to recent TF versions and discovering the inconsistent service support.

Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there's a bunch of models that will crash on TPU and require refactoring. Sometimes pretty heavy, thousands-of-lines refactoring.

edit: PyTorch is even worse. PyTorch does not implement efficient TPU device data loading and generally has poor performance, nowhere near comparable to TensorFlow/JAX numbers. I'm unaware of any PyTorch benchmarks where TPU actually wins. For TensorFlow/JAX, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost effective. For PyTorch, even simple cases tend to lose.

htrp|2 years ago

> TensorFlow and PyTorch support TPUs. It's pretty painless.

Unless you physically work next to the TPU hardware team, the Torch support for TPUs is pretty brittle.

dkarras|2 years ago

mojo language joins the chat: https://www.modular.com/max/mojo

ipsum2|2 years ago

Mojo is a closed source language that will never reach mainstream adoption among ML engineers and scientists.

moffkalast|2 years ago

And Nvidia does actually sell their hardware. Nobody will ever get their hands on one of these outside Google Cloud. It might as well not exist.

sidibe|2 years ago

Doesn't really matter. Google's infra is all the client you need to continue pouring tens of billions into a project like this; bonus if others start using it more in the cloud. But they have so much use for accelerators across their own projects that they aren't going to stop.