top | item 43632818

_hark | 10 months ago

Can anyone comment on where efficiency gains come from these days at the arch level? I.e. not process-node improvements.

Are there a few big things, or many small things? I'm curious what low-hanging fruit is left for fast SIMD matrix multiplication.

vessenes|10 months ago

One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use f64 for everything, or did, and most training was done in f32 four years ago. As algorithm teams have shown that fp8 can be used for training and inference, hardware has been updated to accommodate it, yielding big gains.

NB: Hobbyist, take all with a grain of salt
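To get a feel for what an fp8-style format gives up, here's a crude Python model of an E4M3-like number (4 exponent bits, 3 mantissa bits). This is a simplification for illustration, not the real spec: the `quantize_e4m3` name, the exponent range, and the saturation behavior are my assumptions, and there are no subnormals, NaNs, or reserved encodings.

```python
import math

def quantize_e4m3(x, mant_bits=3, exp_min=-6, exp_max=8):
    """Toy model of an fp8 E4M3-style format (simplified: no
    subnormals, NaNs, or reserved max encoding)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))        # abs(x) == m * 2**e, 0.5 <= m < 1
    if e - 1 > exp_max:              # overflow: saturate at the format max
        return s * (2.0 - 2.0 ** -mant_bits) * 2.0 ** exp_max
    if e - 1 < exp_min:              # underflow: flush to zero
        return 0.0
    step = 2.0 ** (mant_bits + 1)    # mantissa grid for m in [0.5, 1)
    return s * round(m * step) / step * 2.0 ** e
```

With only 3 mantissa bits, `quantize_e4m3(0.1)` comes back as 0.1015625, about 1.6% relative error per value — which is exactly why it's surprising that training can tolerate it.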

jmalicki|10 months ago

Unlike a lot of supercomputer algorithms, where fp error accumulates as you go, gradient-descent-based algorithms don't need as much precision: any fp error still shows up in the next loss-function calculation, where it gets corrected, so you can make do with much lower precision.
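A quick sketch of that self-correcting property (a toy example, not anyone's actual training loop): minimize (w - 3)^2 by gradient descent while throwing away most of each gradient's precision, and it still converges, because whatever error the rounded step introduced is re-measured by the next gradient.

```python
def sgd_low_precision(w=0.0, lr=0.1, steps=100):
    """Gradient descent on (w - 3)**2 with each gradient rounded
    to ~2 significant figures, a crude stand-in for low-precision
    arithmetic (hypothetical toy, not a real fp8 pipeline)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)       # exact gradient of (w - 3)**2
        grad = float(f"{grad:.2g}")  # keep only ~2 significant figures
        w -= lr * grad               # any error is re-measured next step
    return w
```

Even with up to a few percent relative error injected into every gradient, the iterate lands within a tiny distance of the optimum at w = 3; the same error in a long serial accumulation (say, a big dot product) would just compound.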

yeahwhatever10|10 months ago

Specialization, i.e. hardware specialized for inference.