top | item 39149854

(no title)

Mehdi2277 | 2 years ago

Having used it heavily it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support, Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any tf version you like. For tpu you need different nodes depending on tf version. Which tf versions are supported per GCP service is inconsistent. Some versions are supported on Cloud AI Platform but not Vertex AI and vice versa. I have had a lot of difficulty trying to upgrade to recent tf versions and discovering the inconsistent service support.

Additionally many operations that run on GPU but are just unsupported for TPU. Sparse tensors have pretty limited support and there's bunch of models that will crash on TPU and require refactoring. Sometimes pretty heavy thousands of lines refactoring.

edit: Pytorch is even worse. Pytorch does not implement efficient tpu device data loading and generally has poor performance no where comparable to tensorflow/jax numbers. I'm unaware of any pytorch benchmarks where tpu actually wins. For tensorflow/jax if you can get it running and your model suits tpu assumptions (so basic CNN) then yes it can be cost effective. For pytorch even simple cases tend to lose.

discuss

No comments yet.