top | item 46469349

mrinterweb | 1 month ago

I've been wondering when we will see general-purpose consumer FPGAs, and eventually ASICs, for inference. This reminds me of Bitcoin mining. Bitcoin mining started with GPUs; I think I remember a brief FPGA period that transitioned to ASICs. My limited understanding of Google's tensor processing unit chips is that they are effectively a transformer ASIC. That's likely a wild over-simplification of Google's TPU, but Gemini is proof that GPUs are not needed for inference.

I suspect GPU inference will come to an end soon, as it will likely be wildly inefficient compared to purpose-built transformer chips. All those Nvidia GPU-based servers may become obsolete should transformer ASICs become mainstream. GPU Bitcoin mining is just an absolute waste of money (cost of electricity) now. I believe the same will soon be true for GPU-based inference. The hundreds of billions of dollars being invested in GPU-based inference seem like an extremely risky bet that transformer ASICs won't happen, although Google has already widely deployed its own TPUs.

fooblaster|1 month ago

FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore: 50% of the die area or more is fixed-function matrix multiplication units and associated dedicated storage. This just isn't general purpose anymore. FPGAs cannot rival this with their configurable DSP slices. They would need dedicated systolic blocks, which they aren't getting. The closest thing is the Versal ML tiles, and those are entire processors, not FPGA blocks. Those have failed by being impossible to program.

fpgaminer|1 month ago

> FPGAs will never rival gpus or TPUs for inference. The main reason is that GPUs aren't really gpus anymore.

Yeah. Even for Bitcoin mining GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly. 2) GPUs at the time had poor binary math support, which hampered their performance; whereas an FPGA is just one giant binary math machine.

teleforce|1 month ago

>Those have failed by being impossible to program.

I think you spoke too soon about their failure; soon they will be much easier to program [1].

Interestingly, Nvidia is now also moving to a tile-based GPU programming model that targets portability for NVIDIA Tensor Cores [2]. There have been recent discussions on the topic at HN [3].

[1] Developing a BLAS Library for the AMD AI Engine [pdf]:

https://uni.tlaan.nl/thesis/msc_thesis_tristan_laan_aieblas....

[2] NVIDIA CUDA Tile:

https://developer.nvidia.com/cuda/tile

[3] CUDA Tile Open Sourced (103 comments):

https://news.ycombinator.com/item?id=46330732

Lerc|1 month ago

I think it'll get to a point with quantisation that the GPUs that run models will be more FPGA-like than graphics renderers. If you quantize far enough, things begin to look more like gates than floating-point units. At that level an FPGA wouldn't run your model, it would be your model.
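To make the gate intuition concrete: at 1-bit quantization, a multiply-accumulate over {-1, +1} values collapses to XNOR plus popcount, which is pure combinational logic. A toy sketch (the bit-packing scheme and sizes here are illustrative, not from any real model):

```python
def binarized_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as bitmasks
    (bit = 1 means +1, bit = 0 means -1). Only gate-level ops needed."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 wherever signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # agreements +1, disagreements -1

# Reference check against the explicit +/-1 arithmetic:
a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
pack = lambda v: sum((x > 0) << i for i, x in enumerate(v))
assert binarized_dot(pack(a), pack(b), 4) == sum(x * y for x, y in zip(a, b))
```

Every "multiplier" in that loop is literally one XNOR gate feeding an adder tree, which is the sense in which an FPGA fabric stops running the model and starts being it.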

ithkuil|1 month ago

Turns out that a lot of interesting computation can be expressed as a matrix multiplication.
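A classic instance of this: convolution can be rewritten as a matrix multiplication via the im2col trick, which is part of why matmul hardware covers so many workloads. A minimal 1-D sketch (sizes made up for illustration):

```python
def conv1d(signal, kernel):
    """Direct sliding-window convolution (valid mode, no flip)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def conv1d_as_matmul(signal, kernel):
    """Same result, expressed as (im2col matrix) @ (kernel vector)."""
    k = len(kernel)
    rows = [signal[i:i + k] for i in range(len(signal) - k + 1)]
    return [sum(r * w for r, w in zip(row, kernel)) for row in rows]

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
assert conv1d(x, w) == conv1d_as_matmul(x, w) == [-2, -2, -2]
```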

imtringued|1 month ago

I feel like your entire comment is a self-contradicting mess.

You say FPGAs won't get dedicated logic for ML, then you say they did.

Why does it matter whether the matrix multiplication units inside the AI Engine are a systolic array or not? The multipliers support 512-bit inputs, which means a 4x8 times 8x4 product for bfloat16 with one multiplication per cycle, and bigger multiplications with smaller data types. Since it is a VLIW processor, it is much easier to achieve full utilisation of the matrix multiplication units, because you can run loads, stores, and tile processing all simultaneously in the same cycle.
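For what it's worth, the tile arithmetic above checks out: a 4x8 bfloat16 operand is 4 * 8 * 16 = 512 bits. A scalar sketch of one such tile multiply (purely illustrative, not actual AI Engine code):

```python
def tile_matmul(A, B):
    """C[4][4] = A[4][8] @ B[8][4]: one 'instruction' worth of tile work."""
    m, k, n = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

bits_per_operand = 4 * 8 * 16  # rows * cols * bfloat16 width
assert bits_per_operand == 512

A = [[1] * 8 for _ in range(4)]
B = [[2] * 4 for _ in range(8)]
C = tile_matmul(A, B)
assert C == [[16] * 4 for _ in range(4)]  # each entry: 8 products of 1 * 2
```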

The only thing that might be a challenge is arranging the communication between the AI Engines, but even that should be blatantly obvious. If you are doing matrix multiplication, you should be using the entire array in exactly the pattern you think they should be using internally.

Who knows, maybe there is a way to implement flash attention like that too.

dnautics|1 month ago

I don't think this is correct. For inference, the bottleneck is memory bandwidth, so if you can hook up an FPGA with better memory, it has an outside shot at beating GPUs, at least in the short term.

I mean, I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago.
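The bandwidth argument can be put in roofline terms: single-stream decode has to stream roughly all the weights once per generated token, so memory bandwidth sets a hard ceiling on tokens/sec regardless of compute. A back-of-envelope sketch (all numbers illustrative):

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode: bandwidth / weight bytes.
    Ignores KV-cache traffic (which lowers this) and batching
    (which raises aggregate throughput)."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb

# e.g. a 70B-parameter model with 8-bit weights on ~3300 GB/s of HBM:
print(round(decode_tokens_per_sec(70, 1, 3300)))  # prints 47
```

Any device, FPGA or otherwise, that attaches more usable bandwidth raises that ceiling, which is the sense in which an FPGA with better memory has a shot.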

alanma|1 month ago

yup, GBs are so much tensor core nowadays :)

liuliu|1 month ago

This is a common misunderstanding among industry observers (as opposed to industry practitioners). Each generation of (NVIDIA) GPU is an ASIC with a different ISA, etc. Bitcoin mining simply was not important enough (last year, only $23B of Bitcoin was mined in total, at $100,000 per coin). There is ample incentive to implement every possibly useful instruction in the GPU (without worrying about backward compatibility, thanks to PTX).

ASIC transformers won't happen (defined as a chip with a single instruction to do SDPA, from anything that is not broadly marketed as a GPU, reaching more than $3B in annualized sales). Mark my words. I am happy to take a bet on longbets.org with anyone on this for $1000, and my part will go to the PSF.

dnautics|1 month ago

I don't know if they'll reach $3B, but at least one company is using FPGA transformers (that perform well) to get revenue in before going to ASIC transformers:

https://www.positron.ai/

zhemao|1 month ago

TPUs aren't transformer ASICs. The Ironwood TPU that Gemini was trained on was designed before LLMs became popular with ChatGPT's release. The architecture was general enough that it ended up being efficient for LLM training.

A special-purpose transformer inference ASIC would be like Etched's Sohu chip.

tucnak|1 month ago

It all comes down to memory and fabric bandwidth. For example, the state-of-the-art developer-friendly (PCIe 5.0) FPGA platform is the Alveo V80, which rocks four 200G NICs. Basically, Alveo currently occupies the niche of being the only platform on the market to allow programmable in-network compute. However, what's available in terms of bandwidth lags behind even pathetic platforms like Bluefield. Those in the know are aware of the challenges in actually saturating it for inference in practical designs. I think Xilinx is super well-positioned here, but without some solid hard IP it's still a far cry from purpose silicon.
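The arithmetic behind the bandwidth gap is easy to check: four 200G NICs are 800 Gb/s, i.e. 100 GB/s aggregate, versus multiple TB/s of on-package HBM on current accelerators (the HBM figure below is a rough ballpark assumption, not a spec for any particular part):

```python
# Aggregate fabric bandwidth of four 200G NICs vs. ballpark HBM bandwidth.
nic_gbit_s = 4 * 200             # four 200 Gb/s ports
nic_gbyte_s = nic_gbit_s / 8     # 100 GB/s aggregate
hbm_gbyte_s = 3300               # rough HBM3-class figure (assumption)

assert nic_gbyte_s == 100.0
ratio = hbm_gbyte_s / nic_gbyte_s
print(f"fabric is ~{ratio:.0f}x behind HBM")  # prints: fabric is ~33x behind HBM
```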

mrinterweb|1 month ago

As far as I understand, all the purpose-built inference silicon out there is kept in-house and not sold to competitors: Google's TPU, Amazon's Inferentia (horrible name), Microsoft's Maia, Meta's MTIA. It seems that custom inference silicon is a huge part of the AI game. I doubt GPU-based inference will remain relevant/competitive for long.

seamossfet|1 month ago

The only time FPGAs/ASICs are better is if there are gains to be made by innovating on the hardware architecture itself. That's pretty hard to do, considering GPUs are already heavily optimized for this use case.

Narew|1 month ago

There were, in the past. Google had the Coral TPU, and Intel had the Neural Compute Stick (NCS). The NCS is from 2018, so it's really outdated now. Both were mainly oriented toward edge computing, so the FLOPS were not comparable to a desktop computer.

moffkalast|1 month ago

Even for edge computing, neither was really capable of keeping up with even the slowest Jetson's GPU, for not much less power draw.

bee_rider|1 month ago

There are also CPU extensions like AVX512-VNNI and AVX512-BF16. Maybe the idea of communicating out to a card that holds your model will eventually go away. Inference is not too memory bandwidth hungry, right?
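For context on what VNNI adds: VPDPBUSD fuses four unsigned-byte times signed-byte products into a single 32-bit accumulate per lane, which is exactly the inner loop of int8 inference. A scalar model of one lane (a simplification; the real AVX-512 instruction does this across 16 lanes at once):

```python
def vpdpbusd_lane(acc: int, a_u8: list, b_s8: list) -> int:
    """Model of one 32-bit lane of VPDPBUSD:
    acc += sum of four unsigned-byte * signed-byte products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= x <= 255 for x in a_u8)      # u8 operands
    assert all(-128 <= x <= 127 for x in b_s8)   # s8 operands
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))

assert vpdpbusd_lane(0, [1, 2, 3, 4], [10, -10, 10, -10]) == -20
```

With that in the ISA, an int8 dot product is one instruction per 4 elements per lane, which is why CPU inference is less hopeless than it used to be.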