top | item 44854014

(no title)

cdavid | 6 months ago

I was surprised at previous comparison on omarchy website, because apple m* work really well for data science work that don't require GPU.

It may be explained by integer vs float performance, though I am too lazy to investigate. A weak data point, using a matrix product of N=6000 matrix by itself on numpy:

  - SER 8 8745, linux: 280 ms -> 1.53 Tflops (single prec)
  - my m2 macbook air: it is ~180ms ms -> ~2.4 Tflops (single prec)

This is 2 mins of benchmarking on the computers I have. It is not apple to orange comparison (e.g. I use the numpy default blas on each platform), but not completely irrelevant to what people will do w/o much effort. And floating point is what matters for LLM, not integer computation (which is what the ruby test suite is most likely bottlenecked by)

discuss

Tuna-Fish|6 months ago

It's all about the memory bandwidth.

Apple M chips are slower on the computation that AMD chips, but they have soldered on-package fast ram with a wide memory interface, which is very useful on workloads that handle lots of data.

Strix halo has a 256-bit LPDDR5X interface, twice as wide as the typical desktop chip, roughly equal to the M4 Pro and half of that of the M4 Max.

jychang|6 months ago

You're most likely bottlenecked by memory bandwidth for a LLM.

The AMD AI MAX 395+ gives you 256GB/sec. The M4 gives you 120GB/s, and the M4 Pro gives you 273GB/s. The M4 Max: 410GB/s (14‑core CPU/32‑core GPU) or 546GB/s (16‑core CPU/40‑core GPU).

zargon|6 months ago

It’s both. If you’re using any real amount of context, you need compute too.

cdavid|6 months ago

Yeah, memory bandwidth is often the limitation for floating point operations.