On their announcement page, the section "How can we fit so much more FLOPS on our chip than GPUs?" gives some details. It says "only 3.3% of the transistors on an H100 GPU are used for matrix multiplication". They trade off programmability for computation density. And from the "Isn't inference bottlenecked on memory bandwidth, not compute?" section, I'm guessing they use tricks similar to Groq's. Looking forward to more architecture details and a comparison with Groq.