solaarphunk | 1 month ago
The question "why run CPU code on GPU when GPU cores are slower?" assumes you're running ONE program. But GPUs execute in SIMD wavefronts of 32 threads - and here's the trick: each of those 32 lanes can run a DIFFERENT process. Same instruction, different data. Calculator on lane 0, text editor on lane 1, file indexer on lane 2. No divergence, legal SIMD, full utilization. Suddenly you're not running "slow CPU code on GPU" - you're running 32 independent programs in parallel on hardware designed for exactly this pattern.
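To make the "32 independent programs, one instruction stream" idea concrete, here's a CPU-side Python sketch of it: 32 "lanes", each holding its own tiny process (program counter, accumulator, program), stepped in lockstep by a single shared interpreter dispatch. The opcodes and programs are made up for illustration; the point is that every lane runs the same dispatch code on different per-lane data, which is exactly the divergence-free SIMD shape described above.

```python
NUM_LANES = 32

def make_process(lane):
    # Each lane gets a DIFFERENT program: repeatedly add (lane+1)
    # until the accumulator reaches 10*(lane+1), then halt.
    prog = [
        ("add", lane + 1),            # 0: acc += lane+1
        ("jlt", 10 * (lane + 1), 0),  # 1: if acc < limit, jump back to 0
        ("halt",),                    # 2: done
    ]
    return {"pc": 0, "acc": 0, "prog": prog, "done": False}

def step(p):
    # One interpreter dispatch -- the SAME code path for every lane,
    # operating on that lane's private state (same instruction, different data).
    op = p["prog"][p["pc"]]
    if op[0] == "add":
        p["acc"] += op[1]
        p["pc"] += 1
    elif op[0] == "jlt":
        p["pc"] = op[2] if p["acc"] < op[1] else p["pc"] + 1
    elif op[0] == "halt":
        p["done"] = True

def run():
    lanes = [make_process(i) for i in range(NUM_LANES)]
    cycles = 0
    while not all(p["done"] for p in lanes):
        for p in lanes:          # lockstep: one dispatch per lane per cycle;
            if not p["done"]:    # finished lanes are masked off, SIMD-style
                step(p)
        cycles += 1
    return [p["acc"] for p in lanes], cycles

results, cycles = run()
```

All 32 processes make progress on every cycle, and none of them ever diverges from the shared dispatch.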
The win isn't throughput for compute-heavy code. It's eliminating CPU roundtrips for interactive stuff. Every kernel launch, every synchronization, every "GPU done, back to CPU, dispatch next thing" adds latency. A persistent kernel that polls for input, updates state, and renders - all without returning to CPU - changes the responsiveness equation entirely.
A few things to try at home if you're curious:
1. Write a Metal/CUDA kernel with while(true) and an atomic shutdown flag. See how long it runs. (Spoiler: indefinitely, if you do it right.)
2. Put 32 different "process states" in a buffer and have each SIMD lane execute instructions for its own process. Watch all 32 make progress simultaneously.
3. Measure the latency from "input event" to "pixel on screen" with CPU orchestration vs GPU polling an input queue directly. The difference surprised me.
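Item 1's shape can be sketched on the CPU with a thread standing in for the persistent kernel: a while-loop that polls an input queue and an atomic shutdown flag, so events get handled without ever "returning to the host" between them. The names and the event-doubling "work" are made up; the structure (poll, process, check flag, repeat) is the pattern.

```python
import threading, queue

shutdown = threading.Event()   # stands in for the atomic shutdown flag
inputs = queue.Queue()         # stands in for the GPU-visible input queue
processed = []

def persistent_kernel():
    # The while(true) loop: poll for input, update state, never exit
    # back to the "host" between events.
    while not shutdown.is_set():
        try:
            event = inputs.get(timeout=0.01)
        except queue.Empty:
            continue                   # nothing pending; keep spinning
        processed.append(event * 2)    # placeholder for "update state / render"

t = threading.Thread(target=persistent_kernel)
t.start()
for i in range(5):
    inputs.put(i)                      # producer side: push input events
while len(processed) < 5:
    pass                               # wait until the loop drains them
shutdown.set()                         # flip the flag; the loop exits cleanly
t.join()
```

The loop runs until the flag flips, which is the whole trick behind "see how long it runs: indefinitely."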
The persistent kernel thing has a nasty gotcha though - ALL 32 threads must participate in the while loop. If you do if (tid != 0) return; then while(true), it'll work for a few million iterations then hard-lock. Ask me how I know.
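A CPU analogy of the safe shape, with 32 threads standing in for SIMD lanes and threading.Barrier standing in for wavefront lockstep (all names are illustrative): every lane stays inside the loop and reaches the barrier each iteration; lane 0 samples the shutdown flag and does the work, and all lanes exit together rather than some returning early.

```python
import threading

NUM_LANES = 32
barrier = threading.Barrier(NUM_LANES)  # stand-in for wavefront lockstep
shutdown = threading.Event()            # stand-in for the atomic shutdown flag
stop = [False]                          # lane 0 broadcasts its decision here
work_done = []

def lane(tid):
    # The broken GPU shape -- `if tid != 0: return;` then `while(true)` --
    # leaves 31 lanes dead, and anything that waits on the full wavefront
    # never completes. The safe shape: everyone loops, everyone hits the
    # barrier, only lane 0 acts.
    while True:
        barrier.wait()                   # all 32 lanes reconverge at loop top
        if tid == 0:
            stop[0] = shutdown.is_set()  # one lane samples the flag...
            if not stop[0]:
                work_done.append(tid)    # ...and does the actual work
        barrier.wait()                   # everyone sees lane 0's decision
        if stop[0]:
            return                       # all lanes leave TOGETHER

threads = [threading.Thread(target=lane, args=(i,)) for i in range(NUM_LANES)]
for t in threads:
    t.start()
while len(work_done) < 3:
    pass                                 # let a few iterations run
shutdown.set()
for t in threads:
    t.join()
```

Note that lane 0 decides once per iteration and the second barrier publishes that decision, so no lane can exit while another loops; dropping that discipline is exactly what produces the few-million-iterations-then-hard-lock behavior described above.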