Ask HN: Resources for general purpose GPU development on Apple's M* chips?

149 points | thinking_banana | 1 year ago

While Apple's M* chips seem to have incredible unified memory access, the available learning resources seem quite restricted and often convoluted. Has anyone been able to get past this barrier? I have some familiarity with general-purpose software development with CUDA and C++. I want to figure out how to use Apple's developer resources for general-purpose programming.

82 comments

aleinin|1 year ago

If you're looking for a high-level introduction to GPU development on Apple silicon, I would recommend learning Metal. It's Apple's GPU API and shading language, similar to CUDA for Nvidia hardware. I ported a set of puzzles for CUDA called GPU-Puzzles (a collection of exercises designed to teach GPU programming fundamentals) [1] to Metal [2]. I think it's a very accessible introduction to Metal and writing GPU kernels.

[1] https://github.com/srush/GPU-Puzzles

[2] https://github.com/abeleinin/Metal-Puzzles
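
For readers who have not seen this style of exercise, here is a minimal CPU-side sketch, in plain Python rather than Metal, of the one-thread-per-element kernel pattern that GPU programming exercises typically start with. It is not taken from either repository; `add_ten_kernel` and `launch` are hypothetical names, and the loop only emulates sequentially what a GPU would run in parallel.

```python
def add_ten_kernel(thread_id: int, a: list[float], out: list[float]) -> None:
    """Body executed by one (emulated) GPU thread."""
    # Guard: a launch may contain more threads than there are elements.
    if thread_id < len(a):
        out[thread_id] = a[thread_id] + 10


def launch(kernel, num_threads: int, *buffers) -> None:
    """Emulate a 1D dispatch by calling the kernel once per thread index.

    A real GPU runs these invocations in parallel; this loop is sequential.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *buffers)


a = [0.0, 1.0, 2.0, 3.0]
out = [0.0] * len(a)
launch(add_ten_kernel, 8, a, out)  # deliberately more threads than elements
print(out)  # [10.0, 11.0, 12.0, 13.0]
```

The bounds check is the part worth noticing: both CUDA and Metal kernels commonly need it because the number of launched threads is often rounded up past the data size.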

dylan604|1 year ago

After a quick scan through the [2] link, I have added this to the list of things to look into in 2025

singlepaynews|1 year ago

Can anyone recommend a CUDA equivalent of [2]? That’s a spectacular learning resource and I’d like to use a similar one to upskill for CUDA.

morphle|1 year ago

You can help with the reverse engineering of Apple Silicon being done by a dozen people worldwide; that is how we find out the GPU and NPU instructions [1-4]. There are over 43 trillion floating-point operations per second to unlock, at 8 terabits per second of 'unified' memory bandwidth and 270 gigabits per second of networking (less on the smaller chips)....

[1] https://github.com/AsahiLinux/gpu

[2] https://github.com/dougallj/applegpu

[3] https://github.com/antgroup-skyward/ANETools/tree/main/ANEDi...

[4] https://github.com/hollance/neural-engine

You can use high-level APIs like MLX, Metal or CoreML to compute other things on the GPU and NPU.

Shadama [5] is an example programming language that translates (with Ometa) matrix calculations into WebGPU or WebGL APIs (I forget which). You can do exactly the same with the MLX, Metal or CoreML APIs and only pay around 3% overhead going through the translation stages.

[5] https://github.com/yoshikiohshima/Shadama

I estimate it will cost around $22K at my hourly rate to completely reverse engineer the latest A16 and M4 CPU (ARMv9), GPU and NPU instruction sets. I think I am halfway through the reverse engineering; the debugging part is the hardest problem. You would, however, not be able to sell software built with it on the App Store, as Apple forbids undocumented APIs and bare-metal instructions.

MuffinFlavored|1 year ago

This would get rid of needing Metal to be the blackbox and enable things like "nvptx CUDA" equivalent / https://libc.llvm.org/gpu/ right?

Very interesting. A steal for $22k but I guess very niche for now...

JackYoustra|1 year ago

Is there any place where you have your current progress written up? Any methodology I could help contribute to? I've read each of the four links you've given over the years, and it seems vague how far people have currently gotten and what the exact issues are.

dgfitz|1 year ago

It’s too bad Apple doesn’t make this easier on developers. Is there a reason I don’t see?

KeplerBoy|1 year ago

Where does the 270 gbit/s networking figure come from? Is it the aggregate bandwidth of the PCIe slots on the Mac Pro? Those could support NICs at those speeds (and above, according to my quick maths#), but there is not really any driver support for modern Intel or Mellanox/Nvidia NICs as far as I can tell.

My use case would be hooking up a device which spews out sensor data at 100 gbit/s over QSFP28 Ethernet as directly to a GPU as possible. The new Mac mini has the GPU power, but there's no way to get the data into it.

# 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877 GB/s ≈ 94.5 GB/s ≈ 756 gbit/s
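
The footnote arithmetic can be checked with a short sketch. It takes the per-slot throughput figures exactly as quoted in the footnote and assumes 8 bits per byte; the constant and function names are hypothetical.

```python
# Per-slot throughput in GB/s, copied from the footnote above.
GEN4_X16_GBPS = 31.508  # one PCIe Gen4 x16 slot
GEN3_X8_GBPS = 7.877    # one PCIe Gen3 x8 slot


def aggregate_bandwidth(gen4_x16_slots: int, gen3_x8_slots: int) -> tuple[float, float]:
    """Return the summed slot bandwidth as (GB/s, Gbit/s)."""
    gigabytes = gen4_x16_slots * GEN4_X16_GBPS + gen3_x8_slots * GEN3_X8_GBPS
    return gigabytes, gigabytes * 8  # 8 bits per byte


total_gb, total_gbit = aggregate_bandwidth(gen4_x16_slots=2, gen3_x8_slots=4)
print(f"{total_gb:.1f} GB/s = {total_gbit:.0f} Gbit/s")  # 94.5 GB/s = 756 Gbit/s
```

Either way, the slot total is well above the 270 gbit/s figure being questioned.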

barkingcat|1 year ago

There is no general purpose GPU development on Apple M series.

There is Metal development. You want to learn Apple M-series GPU and GPGPU development? Learn Metal!

https://developer.apple.com/metal/

kristianp|1 year ago

> There is no general purpose GPU

That's what GPGPU stands for. So your 2 sentences contradict each other.

rgovostes|1 year ago

It's hard to answer not knowing exactly what your aim is, or your experience level with CUDA and how easily the concepts you know will map to Metal, and what you find "restricted and convoluted" about the documentation.

<Insert your favorite LLM> helped me write some simple Metal-accelerated code by scaffolding the compute pipeline, which took most of the nuisance out of learning the API and let me focus on writing the kernel code.

Here's the code if it's helpful at all. https://github.com/rgov/thps-crack

nixpulvis|1 year ago

2024 and still finding cheat codes in Tony Hawk Pro Skater 2. Wild!

billti|1 year ago

If you know CUDA, then I assume you know a bit already about GPUs and the major concepts. There are just minor differences and different terminology for things like “warps” etc.

With that base, I’ve found their docs decent enough, especially coupled with the Metal Shading Language PDF they provide (https://developer.apple.com/metal/Metal-Shading-Language-Spe...), and quite a few code samples you can download from the docs site (e.g. https://developer.apple.com/documentation/metal/performing_c...).

I’d note a lot of their stuff is still written in Objective-C, which I’m not that familiar with. But most of that is boilerplate and the rest is largely C/C++ based (including the Metal Shading Language).

I just ported some CPU/SIMD number crunching (complex matrices) to Metal, and the speed up has been staggering. What used to take days now takes minutes. It is the hottest my M3 MacBook has ever been though! (See https://x.com/billticehurst/status/1871375773413876089 :-)
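
On the terminology differences mentioned above, a rough CUDA-to-Metal glossary may help when reading the Metal Shading Language docs. The Metal names on the right are from the Metal Shading Language; the dictionary and the `global_thread_index` helper are only an illustrative sketch, written in Python so the 1D index arithmetic can be run directly.

```python
# Rough CUDA -> Metal glossary (illustrative, not exhaustive).
CUDA_TO_METAL = {
    "thread": "thread",
    "warp": "SIMD-group",
    "thread block": "threadgroup",
    "grid": "grid",
    "shared memory": "threadgroup memory",
    "__syncthreads()": "threadgroup_barrier(mem_flags::mem_threadgroup)",
    "blockIdx": "threadgroup_position_in_grid",
    "threadIdx": "thread_position_in_threadgroup",
    "blockDim": "threads_per_threadgroup",
}


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """CUDA's blockIdx.x * blockDim.x + threadIdx.x for a 1D launch.

    A Metal kernel can read the same value directly through the
    thread_position_in_grid attribute instead of computing it.
    """
    return block_idx * block_dim + thread_idx


print(CUDA_TO_METAL["warp"])            # SIMD-group
print(global_thread_index(2, 256, 5))   # 517
```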

mkagenius|1 year ago

Check out MLX [1]. It's a bit like PyTorch/TensorFlow with the added benefit of being built for Apple Silicon.

1. https://ml-explore.github.io/mlx/build/html/index.html

thetwentyone|1 year ago

I’ve had a good time dabbling with Metal.jl: https://github.com/JuliaGPU/Metal.jl

Archit3ch|1 year ago

Same. It can even run realtime workloads (audio).

dylanowen|1 year ago

People have already mentioned Metal, but if you want cross-platform, https://github.com/gfx-rs/wgpu has a Vulkan-like API and cross-compiles to all the various GPU frameworks. I believe it uses https://github.com/KhronosGroup/MoltenVK to run on Macs. You can also see the Metal shader transpilation results for debugging.

rudedogg|1 year ago

With what the OP asked for, I don't think wgpu is the right choice. They want to push the limits of Apple Silicon, or do Apple platform specific work, so an abstraction layer like wgpu is going in the opposite direction in my opinion.

Metal and Apple's docs are the place to start.

grovesNL|1 year ago

wgpu has its own Metal backend that most people use by default (not MoltenVK).

There is also a Vulkan backend if you want to run Vulkan through MoltenVK though.

feznyng|1 year ago

Besides the official docs, you can check out llama.cpp as an example that uses Metal for accelerated inference on Apple silicon.

desideratum|1 year ago

I'd recommend checking out the CUDA mode Discord server! They also have a channel for Metal: https://discord.gg/ZqckTYcv

rowanG077|1 year ago

If you are open to running Linux, you can use standard OpenCL and Vulkan.

TriangleEdge|1 year ago

Why not OpenCL or OpenGL? You'll not be constrained by the flavor of GPU.

nox101|1 year ago

Sounds like you've never actually tried running those two APIs across platforms?

If you want portable, use WebGPU, either via wgpu for Rust or Dawn for C++. They actually do run on Windows, Linux, Mac, iOS, and Android portably.

unknown|1 year ago

[deleted]

amelius|1 year ago

Apple is known to actively discourage general purpose computing. Better try a different vendor.

saagarjha|1 year ago

idk about “known” considering they basically created OpenCL

codr7|1 year ago

Preferably one that sells computers, not fashion statements.