mota7 | 10 months ago
So GPUs have ~120 small systolic arrays, one per SM (aka a tensor core), plus passable off-chip bandwidth (aka 16 lanes of PCIe).
Whereas TPUs have one honking big systolic array, plus large amounts of off-chip bandwidth.
This roughly translates to GPUs being better if you're doing a bunch of different small-ish things in parallel, but TPUs are better if you're doing lots of large matrix multiplies.
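To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a toy model in which a matrix multiply's output is padded up to whole tiles of the array, so cells covering the padding do no useful work. The tile sizes (16 and 128) and the helper name `array_utilization` are illustrative assumptions, not specs for any particular chip.

```python
import math

def array_utilization(m: int, n: int, tile: int) -> float:
    """Fraction of useful cells when an m x n output is padded to whole tile x tile blocks.

    Toy model only: it ignores pipelining, memory bandwidth, and scheduling.
    """
    padded_rows = math.ceil(m / tile) * tile
    padded_cols = math.ceil(n / tile) * tile
    return (m * n) / (padded_rows * padded_cols)

SMALL_TILE = 16   # assumed size for a small per-SM array (hypothetical)
LARGE_TILE = 128  # assumed size for one big array (hypothetical)

for size in (16, 48, 128, 4096):
    small = array_utilization(size, size, SMALL_TILE)
    large = array_utilization(size, size, LARGE_TILE)
    print(f"{size:>5} x {size:<5} small-array: {small:.3f}  large-array: {large:.3f}")
```

Under these assumptions a 16 x 16 job fills a small array completely but uses under 2% of a 128 x 128 one, while a 4096 x 4096 multiply fills either. The small arrays then win only by running many independent jobs side by side.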