jhj | 1 month ago
A modern Nvidia GB200 only does about 40 TFLOP/s in fp64, for instance. You can emulate higher-precision or wider-dynamic-range arithmetic with multiple passes and manipulations of lower-precision, narrower-range arithmetic, but without an insane number of instructions it won't meet all the IEEE 754 guarantees.
Certainly, if Nvidia wanted to dedicate much more chip area to fp64 they could get a lot higher, but fp64 FMA units alone would likely be >30 times larger than their fp16 cousins and probably hundreds of times larger than fp4 versions.
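To make the emulation point concrete, here's a rough sketch of one well-known approach: carrying each value as an unevaluated (hi, lo) pair of native floats and using error-free transformations (Knuth's TwoSum, Dekker's FastTwoSum) to keep the rounding error that a single add would throw away. This is not how any particular Nvidia library does it. The function names are mine, and it uses Python's fp64 as the "native" type purely for illustration; the same trick applies one level down with pairs of fp32.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) where s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Dekker's FastTwoSum: same contract as two_sum, but assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def dd_add(x, y):
    """Add two (hi, lo) pairs. Roughly doubles the precision, but the result
    is NOT correctly rounded and the exponent range is unchanged."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]              # low-order parts are added with ordinary rounding
    return quick_two_sum(s, e)    # renormalize so that |lo| is tiny relative to |hi|


def dd_sum(values):
    """Accumulate plain floats into a (hi, lo) pair (hypothetical helper)."""
    acc = (0.0, 0.0)
    for v in values:
        acc = dd_add(acc, (v, 0.0))
    return acc


def abs_error(parts, exact):
    """Exact absolute error of the sum of `parts` against a Fraction reference."""
    return abs(sum(Fraction(p) for p in parts) - exact)


if __name__ == "__main__":
    data = [0.1] * 10
    exact = sum(Fraction(v) for v in data)   # exact rational sum of the stored inputs
    print("naive error:", float(abs_error([sum(data)], exact)))
    print("pair error: ", float(abs_error(dd_sum(data), exact)))
```

Note that a single add has turned into about ten native operations, and that's still without correct rounding, a wider exponent range, or any special-case handling, which is where the instruction count blows up if you want the full IEEE 754 behavior.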