Wow. It has 15.3 billion transistors. It's amazing we can buy something with that many engineered parts. Even if the transistors are the result of duplication and lithography, it's an astonishing number. Creating the mask must have taken a while.
[+] [-] tomkinstinch|10 years ago|reply
Does anyone know what the failure rate is for the transistors (or transistors of a similar production process)? Do they all have to function to spec for a GPU, or are malfunctioning transistors disabled or corrected? What does the QC process look like?
[+] [-] djcapelis|10 years ago|reply
Exact failure and bin rates at most semiconductor companies are considered deeply guarded internal trade secrets. Other than pure scale, yield rates are one of the biggest factors in semiconductor cost and profit margin.
And the answer is: it depends. If you lose transistors in critical, non-redundant logic, you can lose the entire chip. But the vast majority of the transistors on each chip belong to caches or to the many, many duplicated GPU cores, which, if they fail tests, can be disabled or downclocked, and the chip is then binned into the appropriate product line.
With GPUs this is much easier than with other types of chips, because the level of functional duplication allows a lot of flexibility. If a core is bad, you use a different one, and GPU cores are small enough that they'd be stupid not to put some spares on each chip. Same with memories.
Generally one can safely assume:
* Most chips that come off the line are binned into a lower category and do not function at max spec across the board, which is why the price jump is so steep at the extreme upper end of a hardware series.
* With ASIC lithography, most transistor malfunctions aren't correctable; you mostly have to either downclock (for some types of faults) or disable (for the rest) the affected piece.
* Rates of transistor malfunction are still incredibly fucking amazingly phenomenally low. With 15B transistors on a chip, you can barely afford a failure rate of even one in a billion.
So your line has to be, as the kids say: on fleek.
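To put rough numbers on the binning story above, here's a toy sketch of the classic Poisson yield model plus the "one in a billion" arithmetic from the last bullet. The defect density, die area, and core counts are made-up illustrative values, not anyone's real figures:

```python
import math

def poisson_yield(defects_per_cm2, die_area_cm2):
    """Classic Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-defects_per_cm2 * die_area_cm2)

def expected_bad_transistors(n_transistors, per_transistor_failure_rate):
    """Expected number of malfunctioning transistors on one die."""
    return n_transistors * per_transistor_failure_rate

def bin_chip(good_cores, full_spec=3840, salvage_spec=3584):
    """Toy binning rule: ship full spec, salvage with spares, or scrap."""
    if good_cores >= full_spec:
        return "flagship"
    if good_cores >= salvage_spec:
        return "cut-down SKU (bad cores disabled)"
    return "scrap / lower product line"

# At a hypothetical defect density of 0.1/cm^2, a big ~6 cm^2 die still
# loses almost half its dies to a defect landing somewhere on the chip:
yield_frac = poisson_yield(0.1, 6.0)            # ~0.55

# And a per-transistor failure rate of one in a billion would put ~15
# bad transistors on every 15.3B-transistor die, hence the redundancy:
bad = expected_bad_transistors(15.3e9, 1e-9)    # ~15.3
```

This is why duplicated cores matter so much: a handful of random defects almost never kills a salvage-capable die, only shifts which bin it lands in.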
[+] [-] tcas|10 years ago|reply
I don't have the answers to your questions (and I don't think anyone can share actual failure rates), but I'd direct you to this video, which goes over a lot of modern chip fabrication techniques, circa 2009: https://www.youtube.com/watch?v=NGFhc8R_uO4
It's crazy stuff.
There are wafer test machines that interface with the wafer directly and do some testing (these are $$$$), JTAG-type tests, which access parts of the chip out of band, and functional testing. Some products, like SD cards, actually have a microcontroller on board that provides the test routines and error correction without the need for an expensive machine. Design for test is extremely important.
I'm by no means an expert however, I mostly deal with JTAG and functional tests.
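At its simplest, a functional test just replays vectors through the device and compares the responses against a golden software model. A toy sketch (the device, the model, and the stuck-at fault are all hypothetical):

```python
def golden_model(x):
    """Software reference for what the device under test should output."""
    return (x * 3) & 0xFF

def functional_test(dut, vectors):
    """Apply each test vector; return the vectors where the DUT disagrees."""
    return [v for v in vectors if dut(v) != golden_model(v)]  # [] == pass

# Simulated devices: one healthy, one with output bit 0 stuck at 1.
healthy = lambda x: (x * 3) & 0xFF
stuck_at_1 = lambda x: ((x * 3) & 0xFF) | 0x01
```

A healthy part returns an empty failure list; the stuck-at part fails on every vector whose correct output has bit 0 clear.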
Hasn't half precision (16-bit float) been in Nvidia GPUs forever? I could swear it was already available back in the very first shader-capable GeForce FX days.
[+] [-] mtgx|10 years ago|reply
IBM's POWER9 and future Power ISA 3.0 CPUs, which should increasingly focus on deep-learning/big-data optimization, combined with Nvidia's GPUs, which will increasingly optimize for the same, should make an interesting match over the next 5+ years.
On the gaming side, I do hope they continue to optimize for VR. I think AMD is even slightly ahead of them on that.
[+] [-] dr_zoidberg|10 years ago|reply
* ~5.5 TFLOPs on FP64
* "About 2x" performance on FP32 (so about 11 TFLOPs)
* "Up to 2x" performance on FP16 (compared to FP32, so about 22 TFLOPs)
* FP16 is also aimed at neural-net training: when the weights of the net are FP16, the representation is more compact.
* 3840 general-purpose processors.
* More/better texture units, memory units, etc. So it's not just about raw power, but also about a better design.
Guess that's about it for the important stuff. I just skimmed the article, reading a bit here and there, but that seemed to be the most remarkable stuff.
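The arithmetic behind those bullets is roughly cores x clock x 2 (one fused multiply-add counts as two FLOPs). A sketch, where the clock is my guess, picked only so the output lines up with the quoted numbers:

```python
def peak_tflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Peak throughput = cores x clock x FLOPs/cycle (2 for an FMA)."""
    return cores * clock_ghz * flops_per_core_per_cycle / 1000.0

# ~1.43 GHz is a hypothetical clock chosen to reproduce the article's figures:
fp32 = peak_tflops(3840, 1.43)   # ~11 TFLOPs
fp16 = 2 * fp32                  # packed pairs: two FP16 ops per FP32 lane, ~22
fp64 = fp32 / 2                  # 1:2 FP64:FP32 ratio on this part, ~5.5

# FP16 also halves the memory footprint of a network's weights vs FP32:
n_params = 60_000_000
fp32_bytes = n_params * 4
fp16_bytes = n_params * 2
```

The "up to 2x" hedge on FP16 exists because the doubling only applies when operations can actually be issued in packed pairs.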
[+] [-] drewm1980|10 years ago|reply
Comparison to CPUs is also important IMHO, and for that you need to be aware that the terminology is very different.
What Nvidia calls a "core" is more like one lane of a SIMD unit on a CPU.
What Nvidia calls an "SM" is closer to a CPU core.
There is more to it than that: GPU cores are more independent than lanes in a CPU vector unit, but GPU "SM"s are less independent than CPU cores.
It's also worth keeping in mind that mediocre CPU code will run circles around mediocre GPU code. To get the GPU magic, you have to invest a lot of effort in tuning for the architecture.
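The core-vs-SIMD-lane analogy is easy to see in code. A minimal sketch using NumPy arrays as a stand-in for wide lanes (SAXPY chosen just as a familiar example):

```python
import numpy as np

# CPU-style scalar loop: one core walks the elements one at a time.
def saxpy_scalar(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

# SIMD/SIMT-style: every lane (each GPU "core") applies the same operation
# to its own element, advancing in lockstep like one very wide instruction.
def saxpy_vector(a, x, y):
    return a * x + y

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
```

The tuning effort mentioned above is largely about keeping all those lanes busy and fed with memory; code that serializes or diverges per element throws the width away.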
[+] [-] gnoway|10 years ago|reply
https://rosettacode.org/wiki/Call_a_function_in_a_shared_lib...
[+] [-] analognoise|10 years ago|reply
I was just looking at controlling NGSpice from FreePascal - one of the examples of running a shared instance of NGSpice is done in a Pascal dialect:
http://ngspice.sourceforge.net/shared.html
I like Pascal much better than C++ and think the portable Lazarus GUI toolkit is pretty damn trick. Check it out: http://www.lazarus-ide.org/
[+] [-] JustSomeNobody|10 years ago|reply
Toasty.
[+] [-] timeu|10 years ago|reply
But seriously, it's quite an impressive piece of hardware.