mrinterweb | 1 month ago
I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete should transformer ASICs become mainstream. GPU bitcoin mining is just an absolute waste of money (cost of electricity) now, and I believe the same will soon be true for GPU-based inference. The hundreds of billions of dollars being invested in GPU-based inference seem like an extremely risky bet that transformer ASICs won't happen, although Google has already widely deployed its own TPUs.
fpgaminer | 1 month ago
Yeah. Even for Bitcoin mining, GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power-efficient, which in the case of mining changes the equation significantly; 2) GPUs at the time had poor binary math support, which hampered their performance, whereas an FPGA is just one giant binary math machine.
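To illustrate the "giant binary math machine" point: SHA-256 (the heart of Bitcoin mining) is essentially long chains of 32-bit rotates and XORs. Here is a minimal Python sketch of one of the standard SHA-256 mixing functions; the definitions follow the published SHA-256 spec, and the function names are mine, not anything from the FPGA project itself. Ops like these map directly onto FPGA wiring and LUTs, while GPUs of that era had to emulate rotates with shift-and-OR sequences.

```python
# Sketch of the 32-bit binary math SHA-256 mining leans on.

def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def big_sigma0(x: int) -> int:
    # One of SHA-256's mixing functions: three rotates XORed together.
    # On an FPGA a rotate is free (it's just routing); early GPUs
    # needed multiple instructions per rotate.
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)
```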
teleforce | 1 month ago
I think you spoke too soon about their failure; soon they will be much easier to program [1].
Interestingly, Nvidia GPUs are now also moving to a tile-based programming model that targets portability for NVIDIA Tensor Cores [2]. There were recent discussions on the topic at HN [3].
[1] Developing a BLAS Library for the AMD AI Engine [pdf]:
https://uni.tlaan.nl/thesis/msc_thesis_tristan_laan_aieblas....
[2] NVIDIA CUDA Tile:
https://developer.nvidia.com/cuda/tile
[3] CUDA Tile Open Sourced (103 comments):
https://news.ycombinator.com/item?id=46330732
imtringued | 1 month ago
You say FPGAs won't get dedicated logic for ML, then you say they did.
Why does it matter whether the matrix multiplication units inside the AI Engine are a systolic array or not? The multipliers support 512-bit inputs, which means a 4x8 times 8x4 multiplication per cycle for bfloat16, and bigger multiplications with smaller data types. Since it is a VLIW processor, it is much easier to achieve full utilisation of the matrix multiplication units, because you can run loads, stores, and tile processing all simultaneously in the same cycle.
The only thing that might be a challenge is arranging the communication between the AI Engines, but even that should be blatantly obvious: if you are doing matrix multiplication, you should be using the entire array in exactly the pattern you think they should be using internally.
Who knows, maybe there is a way to implement flash attention like that too.
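The tiling described above can be sketched in NumPy. This is purely illustrative: the 4x8-by-8x4 tile shape comes from the comment, but the loop structure and names are mine, not the actual AI Engine ISA. Each innermost step stands in for one "instruction" that multiplies a 4x8 tile by an 8x4 tile and accumulates a 4x4 partial result.

```python
import numpy as np

TM, TK, TN = 4, 8, 4  # tile shape: (4x8) @ (8x4) -> 4x4 per step

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TM == 0 and K % TK == 0 and N % TN == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TM):
        for j in range(0, N, TN):
            acc = np.zeros((TM, TN), dtype=A.dtype)
            for k in range(0, K, TK):
                # One "instruction": 4x8 tile times 8x4 tile, accumulated.
                # On a VLIW core, the loads for the next tile pair could
                # issue in the same cycle as this multiply.
                acc += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
            C[i:i+TM, j:j+TN] = acc
    return C
```

The result is identical to a plain `A @ B`; the point is only to show how the per-cycle tile shape decomposes a full matrix multiplication.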
dnautics | 1 month ago
I mean, I worked with FPGAs that outperformed H200s on Llama3-class models a while and a half ago.
alanma|1 month ago
liuliu | 1 month ago
ASIC transformers won't happen (defined as: a chip with a single instruction to do SDPA, from anyone not broadly marketing it as a GPU, will not reach more than $3B in annualized sales). Mark my words. I am happy to take a bet on longbets.org with anyone on this for $1000, and my part will go to the PSF.
dnautics | 1 month ago
https://www.positron.ai/
zhemao | 1 month ago
A special-purpose transformer inference ASIC would be like Etched's Sohu chip.
mrinterweb | 1 month ago
https://cloud.google.com/tpu
> A TPU is an application-specific integrated circuit (ASIC) designed by Google for neural networks.