That second table is a good example of why always including units (or even just a "higher is better" note) is a good idea... I have no clue what I'm looking at.
Edit: It's been edited, thx Evolution :) (or I totally glossed over it the first time around... but I don't think so)
Even after being edited, it's still wrong. It shows the significantly lower Inception v4 performance as a "40% speedup" instead of 40% of baseline images/sec.
A "higher is better" note might still be interesting, although redundant.
This is a poor comparison of performance. All of these networks are CNNs, and very old architectures at that. They are all probably memory-bandwidth bound, which is why you see the consistent 50% improvement in FP32 performance.
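To make the memory-bound claim concrete, here is a back-of-envelope roofline check. The spec numbers below are approximate published figures for each card, and the whole thing is an illustration of the reasoning, not a benchmark:

```python
# A kernel is memory-bound when its arithmetic intensity (FLOPs per byte
# moved) falls below the machine balance (peak FLOP/s divided by memory
# bandwidth). Spec numbers are approximate published figures.

def machine_balance(peak_tflops, bandwidth_gbs):
    """FLOPs the card can execute per byte of memory traffic."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)

# RTX 3090: ~35.6 TFLOP/s FP32, ~936 GB/s GDDR6X
# Titan RTX: ~16.3 TFLOP/s FP32, ~672 GB/s GDDR6
balance_3090 = machine_balance(35.6, 936)
balance_titan = machine_balance(16.3, 672)

# Small CNN layers can sit well below these balance points, i.e. they are
# memory-bound on both cards, so extra compute doesn't show up.
print(f"3090 needs ~{balance_3090:.0f} FLOPs/byte to be compute-bound")
print(f"Titan RTX needs ~{balance_titan:.0f} FLOPs/byte")
```

On these numbers the 3090 actually needs a *higher* arithmetic intensity than the Titan to leave the memory-bound regime, which is consistent with old CNNs showing only the bandwidth-sized gain.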
It is also not clear what batch sizes are being used for any of the tests. If you switch to FP16 training, you must increase the batch size to properly utilize the Tensor Cores.
If you compare these cards at FP16 on large language models (think GPT-style, with a large model dimension), I am confident you will see the Titan RTX outperform the 3090. The former has 130 TF/s of FP16 Tensor Core throughput (with FP32 accumulate), while the latter has only 70 TF/s.
Link: https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/a...
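A quick sanity check of what those quoted peak rates imply for a single large matmul. The layer dimensions below are hypothetical GPT-style numbers, and real kernels reach only a fraction of peak, so treat these as relative figures, not wall-clock predictions:

```python
# A matmul of (m x k) by (k x n) costs about 2*m*k*n FLOPs; dividing by
# a sustained FLOP/s rate gives a lower bound on runtime.

def matmul_seconds(m, k, n, tflops):
    flops = 2 * m * k * n
    return flops / (tflops * 1e12)

# Hypothetical GPT-style FFN matmul: 8192 tokens, model dim 4096, FFN dim 16384
t_titan = matmul_seconds(8192, 4096, 16384, 130)  # Titan RTX FP16 (FP32 acc)
t_3090  = matmul_seconds(8192, 4096, 16384, 70)   # RTX 3090 FP16 (FP32 acc)

print(f"Titan RTX: {t_titan * 1e3:.1f} ms, 3090: {t_3090 * 1e3:.1f} ms")
print(f"Titan is ~{t_3090 / t_titan:.2f}x faster at peak rates")
```

At the quoted peaks the Titan comes out roughly 1.86x faster on tensor-core-bound work, which is the commenter's point.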
The RTX 3090 is also $1000 cheaper than the Titan, so there's that. It would be nice if there were a good way to express value per dollar, perhaps in terms of GLUE accuracy and training time.
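A naive throughput-per-dollar comparison is easy to sketch. The prices below are the launch MSRPs ($2,499 Titan RTX, $1,499 RTX 3090) and the throughput figures are the peak numbers quoted in this thread; a rough sanity check, not a benchmark:

```python
# Peak throughput per dollar for each card, at FP32 and at FP16 tensor
# rates (with FP32 accumulate). All numbers are peak spec figures.

cards = {
    "Titan RTX": {"fp32_tflops": 16.3, "fp16_tensor_tflops": 130, "price": 2499},
    "RTX 3090":  {"fp32_tflops": 35.6, "fp16_tensor_tflops": 70,  "price": 1499},
}

for name, c in cards.items():
    fp32_per_dollar = c["fp32_tflops"] / c["price"] * 1000   # GFLOP/s per $
    fp16_per_dollar = c["fp16_tensor_tflops"] / c["price"] * 1000
    print(f"{name}: {fp32_per_dollar:.1f} FP32 GFLOP/s per $, "
          f"{fp16_per_dollar:.1f} FP16 tensor GFLOP/s per $")
```

On these numbers the 3090 wins decisively on FP32 per dollar while FP16 tensor throughput per dollar is roughly a wash, which mirrors the disagreement in this thread about which workload matters.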
For many of us, the Inception-style CNN workloads (especially at FP32) are much more realistic than large language models, which may be better suited to taking advantage of the Tensor Cores. If I'm going to be memory bottlenecked either way, I probably don't want to spend an extra $1000 on 400 Tensor Cores I can't take full advantage of.
Nit: don't use gradients for discrete categories in a graph. Use a discrete color palette that spaces colors as far apart perceptually as possible, using a tool like this: https://medialab.github.io/iwanthue/
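A minimal stand-in for a tool like iwanthue, using only the standard library: spread N hues evenly around the color wheel so each discrete category gets a clearly separated color instead of a point on a gradient. (Evenly spaced hue is only a crude approximation of what iwanthue does in a perceptual color space, but it illustrates the idea.)

```python
import colorsys

def discrete_palette(n, lightness=0.5, saturation=0.65):
    """Return n hex colors with evenly spaced hues."""
    colors = []
    for i in range(n):
        # colorsys takes (hue, lightness, saturation), all in [0, 1]
        r, g, b = colorsys.hls_to_rgb(i / n, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

print(discrete_palette(5))
```

Pass the resulting list to your plotting library's per-category color argument instead of sampling a continuous colormap.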
Seems like a good speedup relative to the Titan, especially for the money. I'd be interested to see the performance relative to the 3080, though. There are obviously VRAM limitations with the 3080, but it would still be interesting to see the difference in raw compute performance.
In games the 3090 only gives a 15% performance bump relative to the 3080. If that pattern holds for machine learning tasks there is probably a scenario where it makes sense to buy two 3080s rather than one 3090.
If you are VRAM constrained then obviously the 3090 is the way to go.
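The two-3080s-vs-one-3090 trade-off can be sketched with a simple scaling model. The 90% multi-GPU scaling efficiency and the 1.15x relative throughput figure are assumptions taken from the gaming gap mentioned above, not measurements:

```python
# Data-parallel training with an efficiency penalty for each added GPU
# (gradient all-reduce overhead). All numbers are hypothetical.

def effective_throughput(per_gpu, n_gpus, scaling_eff=0.9):
    """Relative throughput of n_gpus cards with imperfect scaling."""
    return per_gpu * (1 + (n_gpus - 1) * scaling_eff)

# Relative units: let a 3080 = 1.0 and a 3090 = 1.15 (the gaming gap)
one_3090 = effective_throughput(1.15, 1)
two_3080 = effective_throughput(1.00, 2)

print(f"one 3090: {one_3090:.2f}, two 3080s: {two_3080:.2f}")
```

Under these assumptions two 3080s win comfortably on raw throughput, but only if the model and batch fit in 10 GB per card; the 3090's 24 GB is the real differentiator.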
Could you kindly advise what kind of computer would make sense to purchase to begin learning about ML? I was assuming I'd get a 3080. Should I get a case that could potentially house 2 x 3080s? Does the case require any special cooling considerations, or just whatever will fit the cards? And what CPU would you get?
I think for high throughput scenarios the 3090 probably has more headroom due to its higher TDP and better (larger) cooling solution, which might really matter here if you're driving the tensor cores at max the whole time.
Most video games probably aren't going to make the most of all the extra CUDA cores on the 3090. I'm assuming that helps a lot with machine learning; can someone who knows for sure confirm?
Honestly, I had an RTX Titan for home use for a while. Eventually I moved to just using a 2080 Super, and it performed nearly as well for my models. If you don't need ALL the extra memory and have the space for a triple-slot card, then by far the better value proposition for last gen seemed to be a good Super.
See also Tim Dettmers' fantastic post on GPU performance (which doesn't use benchmarks for the latest cards but instead estimates performance with a model):
https://timdettmers.com/2020/09/07/which-gpu-for-deep-learni...
HN Discussion:
https://news.ycombinator.com/item?id=24400603
Seems to be a good speedup overall relative to the 2080 Ti, including at FP16 (for the relative 2080 Ti vs Titan numbers, see https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...). Does this suggest we should see another, even more expensive Titan card in the pipeline, given the FP16 performance? Or maybe TF32 performance is what NVIDIA will promote this generation (only if they have better numbers there than at FP16?)?
Here's hoping for an A100 Titan with un-nerfed FP64. The 3090 is twice as nerfed as previous generations, which were already bad at a 1:32 FP64:FP32 ratio. Now it's 1:64 :(
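What those ratios mean in absolute terms, using approximate published FP32 peak figures for each card:

```python
# Consumer GeForce parts run FP64 at a fixed fraction of FP32 throughput.

def fp64_tflops(fp32_tflops, ratio):
    """ratio is the FP64:FP32 fraction, e.g. 1/64 for the 3090."""
    return fp32_tflops * ratio

titan_rtx = fp64_tflops(16.3, 1 / 32)  # previous gen, 1:32
rtx_3090  = fp64_tflops(35.6, 1 / 64)  # Ampere GeForce, 1:64

print(f"Titan RTX FP64: ~{titan_rtx:.2f} TFLOP/s")
print(f"RTX 3090 FP64:  ~{rtx_3090:.2f} TFLOP/s")
```

The deeper 1:64 nerf roughly cancels the larger FP32 peak, so absolute FP64 throughput barely moves between generations; either way, both are far from a compute-class card.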
Can someone explain the difference between FP16 and FP32 in these benchmarks? The difference is pretty dramatic. I assume it's floating point precision(?), but why would lower precision show a smaller relative speedup on the 3090? For training jobs, how does the precision impact the accuracy of the model?
Edit: clarified that I am referring to slower relative performance
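On the accuracy half of the question: yes, it's floating point precision, and the effect is easy to demonstrate. float16 has an 11-bit significand, so integers above 2048 are no longer exactly representable and small addends can vanish during a long reduction:

```python
import numpy as np

a = np.float16(2048)
b = np.float16(1)
print(a + b)  # the +1 is lost to rounding: 2049 is not representable

# Summing 10000 ones in float16 stalls once the accumulator saturates:
acc = np.float16(0)
for _ in range(10000):
    acc = np.float16(acc + np.float16(1))
print(acc)  # far less than 10000

# Widening only the accumulator fixes it, which is exactly why tensor
# cores take FP16 products but accumulate them into FP32.
print(np.float32(a) + np.float32(b))
```

This is why mixed-precision training keeps an FP32 copy of the weights and accumulates in FP32: the individual multiplies tolerate FP16, but long sums do not.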
NVidia drivers borked rebooting on my box for a long time.
A couple of months ago I removed all the old references in my apt sources and followed the newer instructions (several times, to get the right driver/CUDA/TensorFlow match), and now my reboots are fine, with only one GPU lock-up so far (probably due to overheating; I've had to replace a couple of components flagged as failing during the summer heatwave).
JupyterHub is just great. I'd like to implement better diagnostics, though; I have yet to find a good tutorial for that.
+ tf-nightly and other Python libraries installed through pipenv.
It helps keeping to Ubuntu LTS versions though; that's what they support best.
If I have a really remote location and I need to do on-premises inference, am I better off buying one of the gaming GPUs or are they far behind the T4, etc.?
NVIDIA has nerfed FP64 performance on consumer GeForce cards for years now. FP64 is critical for scientific computing but not needed for ML. They have also banned running GeForce cards in datacenters.
No, the 3090 has nerfed Tensor Cores, and in some apps the Titan RTX is 5x faster (Siemens NX). FP32 accumulate runs at a 0.5x rate, as on the 2080 Ti, while the Titan's runs at the full 1x rate.