item 16931394

Comparing Google’s TPUv2 against Nvidia’s V100 on ResNet-50

171 points | henningpeters | 8 years ago | blog.riseml.com

127 comments

[+] jacksmith21006|8 years ago|reply
Thanks for sharing and very insightful. Guess the TPUs are the real deal. About 1/2 the cost for similar performance.

Would assume Google is able to do that because of the lower power required.

I am actually more curious to get a paper on the new speech NN Google is using. It's supposed to push 16k samples a second through a NN; it's hard to imagine how they did that and were able to roll it out, since you would think the cost would be prohibitive.

You are ultimately competing with a much less compute-heavy solution.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

Suspect this was only possible because of the TPUs.

Can't think of anything else where controlling the entire stack including the silicon would be more important than AI applications.

[+] nojvek|8 years ago|reply
Half the cost? Where are you reading that? Yes, on-demand rental on AWS is expensive, but both long-term reservations and buying V100s yourself are significantly cheaper. Cloud companies have pretty fat margins on on-demand rentals.

You can't buy a TPU, it's a cloud-only thing. They also show it's not a huge difference in either perf or time to converge (albeit for only one architecture).

I would say kudos to V100 and this benchmark that breaks the TPU hype.

[+] eanzenberg|8 years ago|reply
The impression I got was the opposite: the TPU is not the hot shit that Google claims it is. Pricing is kind of irrelevant, since they can subsidize it to create that story.
[+] rprenger|8 years ago|reply
Full disclosure, I currently work at Nvidia on speech synthesis.

You can definitely do this on a GPU. We use the older auto-regressive WaveNets (not Parallel Wavenet) for inference on GPUs, with the newly released nv-wavenet code. Here's a link to a blog post about it:

https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis

That code will generate audio samples at 48kHz, or, if you're worried about throughput, it'll do a batch of 320 parallel utterances at 16kHz.
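For intuition, the auto-regressive pattern being accelerated can be sketched in a few lines. This is a toy stand-in, not the actual nv-wavenet kernel: each output sample feeds back as the next input, so generation is inherently sequential per utterance, which is why batching independent utterances (the 320-utterance mode) is how throughput is recovered.

```python
def generate(model, seed, n_samples):
    """Generate n_samples one at a time, feeding each output back in.

    The loop carries a strict data dependency: sample t+1 cannot be
    computed before sample t, so per-utterance latency is bounded by
    the sequential step time, not by total compute.
    """
    samples = [seed]
    for _ in range(n_samples):
        samples.append(model(samples[-1]))  # next sample depends on the last
    return samples[1:]

# Stand-in "model": a damped echo of the previous sample (not WaveNet).
audio = generate(lambda x: 0.5 * x, seed=1.0, n_samples=4)
print(audio)  # [0.5, 0.25, 0.125, 0.0625]
```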

[+] smallnamespace|8 years ago|reply
> About 1/2 the cost for similar performance.

I would expect a dedicated accelerator to need at least a 5-10X advantage to outweigh all the other infrastructure and ecosystem costs.

GPUs are more useful for a wide variety of data-parallel tasks, and many more NN frameworks work on top of CUDA than work on the TPU.

In terms of horizontal scalability, Nvidia has been rapidly iterating on increasing both memory and interconnect bandwidth (including NVSwitch [1]), while each 'TPU' is actually 4 interconnected chips, so it likely has less headroom to scale up.

Also note that the tensor cores on a V100 take up roughly 25-30% of the die area. If Nvidia wanted to, they could probably easily make a pure tensor chip that beat the TPU in performance, could be produced in volume on their existing process, and also had full compatibility with their entire stack.

All in all, a 2x price/performance advantage for a hyper-specialized accelerator is basically a loss, just like how nobody installs a Sound Blaster card anymore, or how most consumer desktops don't run discrete GPUs even though integrated graphics are a few times slower.

[1] https://www.nextplatform.com/2018/04/04/inside-nvidias-nvswi...

[+] elmarhaussmann|8 years ago|reply
Hi, author here. The motivation for this article came out of the HN discussion on a previous post (https://news.ycombinator.com/item?id=16447096). There was a lot of valuable feedback - thanks for that.

Happy to answer questions!

[+] puzzle|8 years ago|reply
Don't TPUs get sustained use discounts? I know they're not preemptible. That would be comparable to AWS reserved instances.

EDIT: you don't get sustained use discounts, either, at the moment. You can get either for GCP GPUs, though. Perhaps that will change once TPUs are out of beta?

[+] sdenton4|8 years ago|reply
"As shown above, the top-1 accuracy after 90 epochs for the TPU implementation is 0.7% better. This may seem minor, but making improvements at this already very high level is extremely difficult and, depending on the application, such small improvements may make a big difference in the end."

Any idea of how much variation in accuracy you get on different training runs of the same model on the same hardware? My understanding is that model quality can and does vary from one run to the next on these kinds of large datasets - from a single observation, it's hard to know if the difference is real or noise.
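A back-of-envelope version of this check: repeat each training run a few times and compare the gap between setups to the per-setup spread. The accuracies below are made-up illustrative numbers, not measurements from the article.

```python
from statistics import mean, stdev

# Hypothetical repeated-run top-1 accuracies (%) for two setups.
runs_a = [76.3, 76.1, 76.4, 76.2, 76.3]
runs_b = [75.6, 75.5, 75.7, 75.4, 75.6]

gap = mean(runs_a) - mean(runs_b)           # difference between setups
noise = max(stdev(runs_a), stdev(runs_b))   # run-to-run spread within a setup

print(round(gap, 2), round(noise, 2))
# A gap several times larger than the per-setup spread is likely real;
# a gap comparable to the spread could easily be noise.
```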

[+] TheLoneAdmin|8 years ago|reply
AMD - Where does their hardware stand in the race for ML? What changes would AMD need to make to be competitive?
[+] shaklee3|8 years ago|reply
Nice work. I've only seen anecdotal stories about how TPU is faster, but never something as detailed as this.
[+] kbob|8 years ago|reply
I am not an ML guy, so I'm asking from a position of ignorance. (-:

But what's going on when some of the implementations of a standard algorithm don't converge, and different hardware has different accuracy rates on the same algorithm? Are DNNs really that flaky? And does it really make sense to be doing performance comparisons when the accuracy performance doesn't match?

Is the root problem that ResNet-50 works best with a smaller batch size?

And how do you do meaningful research into new DNNs if there's always a "maybe if I ran it again over there I'd get better results" factor?

Thank you.

[+] MrBuddyCasino|8 years ago|reply
I found it interesting that they are so close together in performance - I mean what are the odds that they end up within 2% of each other?
[+] Jabbles|8 years ago|reply
Do you have more information about this bit?

> the TPU implementation applies very compute-intensive image pre-processing steps and actually sacrifices raw throughput

Thanks

[+] pakl|8 years ago|reply
What about your LSTM-based model that didn’t converge in your earlier TPU benchmarks in February?
[+] zmarty|8 years ago|reply
Slower alternative: "fastai with @pytorch on @awscloud is currently the fastest to train Imagenet on GPU, fastest on a single machine (faster than Intel-caffe on 64 machines!), and fastest on public infrastructure (faster than @TensorFlow on a TPU!) Big thanks to our students that helped with this." - https://twitter.com/jeremyphoward/status/988852083796291584
[+] jorgemf|8 years ago|reply
One machine with 8 V100 GPUs. If you consider one TPU pod a single machine, the TPU is faster. Those numbers also show that 8 GPUs are slower than 8 TPUs (so the same conclusion as the article).
[+] dimitry12|8 years ago|reply
An important hidden cost here is coding a model which can take advantage of mixed-precision training. It is not trivial: you have to empirically discover scaling factors for loss functions, at the very least.

It's great that there is now wider choice of (pre-trained?) models formulated for mixed-precision training.

When I was comparing the Titan V (~V100) and the 1080 Ti 5 months ago, I was only able to get a 90% increase in forward-pass speed for the Titan V (same batch size), even with mixed precision. And that was for an attention-heavy model, where I expected the Titan V to show its best. Admittedly, I was able to use almost double the batch size on the Titan V when doing mixed precision. And the Titan V draws half the power of the 1080 Ti, too :)

In the end, my conclusion was: I am not a researcher, I am a practitioner - I want to do transfer learning or just use existing pre-trained models, without tweaking them. For that, tensor cores give no benefit.
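The loss-scaling trick mentioned above can be demonstrated in a few lines without any framework: in fp16, tiny gradients underflow to zero, so the loss (and hence every gradient) is multiplied by a constant before the fp16 cast and divided by it again afterwards. The scale value 1024 here is just an illustrative power of two, not a recommendation.

```python
import struct

LOSS_SCALE = 1024.0  # illustrative; in practice found empirically

def to_fp16(x):
    """Round-trip a float through IEEE half precision (struct 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def store_gradient(grad, scale=1.0):
    # Multiply before the fp16 cast, divide after: the core of loss scaling.
    return to_fp16(grad * scale) / scale

tiny_grad = 1e-8  # far below fp16's smallest subnormal (~6e-8)
print(store_gradient(tiny_grad))              # underflows to 0.0
print(store_gradient(tiny_grad, LOSS_SCALE))  # ~1e-8, information preserved
```

Picking the scale is the non-trivial part the comment refers to: too small and gradients still underflow, too large and they overflow to infinity.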

[+] elmarhaussmann|8 years ago|reply
Author here.

Yes, thanks for mentioning that! That's what the article is alluding to at the end. There's also something like a "cost-to-model" and that's influenced by how easy it is to make efficient use of the performance and how much tweaking it needs. It's also influenced by the framework you use... However, that's difficult to compare and almost impossible to measure.

[+] bitL|8 years ago|reply
How did you get your hands on Titan V 5 months ago? I still can't find it anywhere in retail in EU...
[+] Nokinside|8 years ago|reply
Nvidia is currently in a cashing-out phase. They have a monopoly, and money flows in effortlessly. The cost/performance ratio reflects this.

AMD will enter the game soon once they get their software working; Intel will follow.

I suspect that Nvidia will respond with its own specialized machine-learning and inference chips to match the cost/performance ratio. As long as Nvidia can maintain high manufacturing volumes and a small performance edge, they can still make good profits.

[+] jacksmith21006|8 years ago|reply
"The cost performance ratio reflects this."

But the TPUs are half the cost, per this article?

Plus, Google does the entire stack and can optimize the hardware better than Nvidia can, so it seems Google can improve faster, I would think.

If there ever was a huge advantage to doing the entire stack, it is with neural networks.

A perfect example is Google's new speech service doing 16k samples a second with a NN.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

I do not think Google could offer this service at a competitive cost without the TPUs.

This new method replaces one that was far less compute-intensive, so offering it at a competitive price requires lowering compute cost, which I suspect is only possible with the TPUs.

[+] samfisher83|8 years ago|reply
>For GPUs, there are further interesting options to consider next to buying. For example, Cirrascale offers monthly rentals of a server with four V100 GPUs for around $7.5k (~$10.3 per hour). However, further benchmarks are required to allow a direct comparison since the hardware differs from that on AWS (type of CPU, memory, NVLink support etc.).

Can't you just buy some 1080s for cheaper than this? I understand there are electricity and hosting costs, but cloud computing seems expensive compared to buying equipment.

[+] dgacmu|8 years ago|reply
Yes, you can. The problem starts when "you" are a large company -- Nvidia restricts "datacenter" use of consumer GPUs (see the previous HN discussion of that: https://news.ycombinator.com/item?id=15983587). A single Titan V is somewhere in the 90% range of a V100 at less than 1/3 the cost, and a 1080 Ti, if you can find one, likely offers a slightly better price/performance spot. 4-GPU training may suffer due to the lack of NVLink, but not enough for it to matter too much. As you scale, though, the lack of NVLink will hurt more. And, of course, all of these things come with a capex vs. opex tradeoff, and a sysadmin vs. cloud tradeoff, that will appeal differently in different situations.
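To make the "90% of a V100 at less than 1/3 the cost" claim concrete: the prices and relative-throughput figures below are rough assumptions based on the comment, not quoted benchmarks.

```python
# Rough price/performance arithmetic (all figures assumed for illustration).
v100_price, v100_perf = 9000.0, 1.00        # ~list price, baseline throughput
titan_v_price, titan_v_perf = 3000.0, 0.90  # ~1/3 the cost, ~90% the perf

v100_ppd = v100_perf / v100_price           # performance per dollar
titan_ppd = titan_v_perf / titan_v_price

print(round(titan_ppd / v100_ppd, 1))  # Titan V ~2.7x perf per dollar
```

The same arithmetic flips as soon as interconnect bottlenecks cut the Titan V's effective throughput at scale, which is the tradeoff the comment describes.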
[+] elmarhaussmann|8 years ago|reply
Probably not the best phrasing in the post ("next to buying"). It's only comparing cloud pricing (since the TPUv2 is only available there). If you consider buying hardware the situation is different as you correctly point out.
[+] modeless|8 years ago|reply
1080s don't have the "tensor cores" of V100, or NVLink, so they will not get anywhere near the same performance on this benchmark.
[+] bitL|8 years ago|reply
Excellent! Thanks for these numbers, I wanted to see exactly this kind of benchmarks! Do you plan to try different benchmarks with the same setup for different problems, like semantic segmentation, DenseNet, LSTM training performance etc. as well?
[+] elmarhaussmann|8 years ago|reply
Happy to hear the benchmark is useful to you! We'd love to try different setups and further models/networks. On the other hand, such benchmarks are a LOT of effort (which we initially underestimated), so we'll have to see.
[+] kyloon|8 years ago|reply
Excellent work. Do you have plans to open source the scripts/implementation details used to reproduce the results? Would be great if others can also validate and repeat the experiment for future software updates (e.g. TensorFlow 1.8) as I expect there will be some performance gain for both TPU and GPU by CUDA and TensorFlow optimizations.

Sidenote: Love the illustrations that accompany most of your blog posts, are they drawn by an in-house artist/designer?

[+] elmarhaussmann|8 years ago|reply
Happy you like the post! The implementations we used are open source (we reference the specific revisions), so reproducing results is possible right now. We haven't thought about publishing our small scripts around that (there's not much to it), but it's a good idea. There's also work towards benchmarking suites like DAWNBench (https://dawn.cs.stanford.edu/benchmark/).

The illustrations are from an artist/designer we contract from time to time. I agree, his work is awesome!

[+] scottlegrand2|8 years ago|reply
What they're not saying is that one can't use all the NVLink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs, because NVLink is composed of two 8-node rings. And given the data-parallel nature of this benchmark, I'm very interested in where time was spent on each architecture.

That said, they fixed this with NVSwitch, so it's just another HW hiccup, like int8 was on Pascal.

[+] elmarhaussmann|8 years ago|reply
For this benchmark, NVLink and gradient reduction isn't the bottleneck. The performance scales almost perfectly linearly from one GPU to four.
[+] drej|8 years ago|reply
Thanks for this, just a minor thing:

You have price per hour and performance in images per second, so their ratio isn't meaningful as-is; you need to scale by seconds per hour. Also, the resulting metric is not "images per second per $", but just "images per $".
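The conversion in question, with hypothetical numbers (neither figure is from the article):

```python
# Hypothetical inputs: throughput in images/sec and an hourly price.
throughput = 2000.0      # images per second (assumed figure)
price_per_hour = 12.24   # USD per hour (assumed figure)

# images/sec divided by $/hour has units images*hour/(sec*$);
# multiplying by 3600 sec/hour yields plain images per dollar.
images_per_dollar = throughput / price_per_hour * 3600
print(round(images_per_dollar))  # 588235
```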

[+] wyldfire|8 years ago|reply
How much detail do we know about the TPUs' design? Does Google disclose a block-diagram level? ISA details? Do they release a toolchain for low-level programming or only higher-level functions like TensorFlow?

EDIT: I found [1] which describes "tensor cores", "vector/matrix units" and HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't have or need interpolation hw or other GPU features?

[1] https://cloud.google.com/tpu/docs/system-architecture

[+] jacksmith21006|8 years ago|reply
Great paper on the generation 1 TPU, but Google has not shared many details on gen 2 and in some ways has kind of hidden information.

Suspect we will need a gen 3 to get a paper on gen 2.

Here is the gen 1 paper; highly recommended. Pretty interesting: it uses 65,536 very simple cores.

https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

[+] ndesaulniers|8 years ago|reply
> Maybe they don't have or need interpolation hw or other GPU features?

Definitely, no need to do any kind of rasterization here.

[+] twtw|8 years ago|reply
Great work, RiseML. This benchmark is sincerely appreciated.

I wonder whether NVLink would make any difference for Resnet-50. Does anyone know whether these implementations require any inter-GPU communication?

[+] elmarhaussmann|8 years ago|reply
They don't require it but some of the ResNet-50 implementations can make use of it (e.g., the ones in the Docker containers on the Nvidia GPU Cloud). But even the ones without seem to scale to 4 GPUs pretty well. This may be a different story for 8 GPUs and larger/deeper networks, e.g., ResNet-152.
[+] threeseed|8 years ago|reply
Was this running the AWS Deep Learning AMI, or did you build your own?

I ask because Intel was involved in its development and made a number of tweaks to improve performance.

I'd be curious whether that actually made a significant difference or not.

[+] elmarhaussmann|8 years ago|reply
On AWS this was using nvidia-docker with the TensorFlow Docker images. The AWS Deep Learning AMI probably gives very similar performance (with the same versions of CUDA, TensorFlow, etc.). There's only so much you can tweak if the GPU itself is the bottleneck...
[+] Tenoke|8 years ago|reply
>For the V100 experiments, we used a p3.8xlarge instance (Xeon E5–[email protected] 16 cores, 244 GB memory, Ubuntu 16.04) on AWS with four V100 GPUs (16 GB of memory each). For the TPU experiments, we used a small n1-standard-4 instance as host ([email protected] two cores, 15 GB memory, Debian 9) for which we provisioned a Cloud TPU (v2–8) consisting of four TPUv2 chips (16 GB of memory each).

A bit odd that the TPUs are provisioned on such a much weaker machine than the V100s, especially for the comparisons that included augmentation and other processing outside of the TPU.

[+] elmarhaussmann|8 years ago|reply
All of the computation, including pre-processing, is offloaded to the TPU. The weak host machine is really just idling; a bigger one would only cost more money and have no measurable effect on performance.
[+] puzzle|8 years ago|reply
The TPU is not really just the chip. It has an actual machine that is provisioned behind the scenes and accepts RPC calls. Good luck finding out its specs. All you're supposed to care about are the address and port it answers at.