item 47075299

Jay_luci4 | 11 days ago

The CPU-to-GPU dispatch overhead this aims to eliminate is a real bottleneck I've measured: my multi-pass Winograd kernel on MI300X (github.com/Jayluci4/nova-wino-amd) launches 3 HIP kernels plus 1 rocBLAS GEMM per forward pass. It's 17-57% faster than MIOpen at batch=1, but at batch=8 and above the dispatch latency between stages completely dominates, and a fused single-dispatch kernel wins by 2-4x.

On the AMD wavefront question: CDNA3's 64-lane wavefront vs. NVIDIA's 32-lane warp changes the async scheduling model. You get 64-element register exchanges via __shfl in one cycle (my transforms do full 8x8 matrix multiplies through wave shuffles with zero shared memory), but 64-wide execution means coarser divergence granularity for heterogeneous coroutine paths.

An async execution model that pipelines multi-pass kernels without CPU round-trips would directly close the batch>1 gap for workloads like Winograd convolution, batched flash attention, and MoE expert dispatch. @magic_at_nodai: happy to help test AMD support when the time comes; I have working HIP kernels with wave shuffles and MFMA accumulation that would make a good real-workload stress test for the async dispatch model.
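For anyone curious how an 8x8 matrix multiply can run entirely through wave shuffles with zero shared memory, here is a minimal Python sketch of the idea (a CPU simulation, not the author's HIP code; lane mapping and helper names are my own assumptions). Each of the 64 lanes in a wavefront owns one element of the 8x8 tiles, and at each step of the k-loop a lane pulls the operands it needs directly from the registers of the lanes that own them, which is what __shfl does in hardware:

```python
# Simulate a 64-lane wavefront: lane l = r*8 + c owns element (r, c) of
# 8x8 tiles A and B, held in "registers" (one value per lane). A cross-lane
# shuffle lets every lane read another lane's register, so the k-loop of
# C = A @ B needs no shared memory: at step k, lane (r, c) reads A[r][k]
# from lane r*8 + k and B[k][c] from lane k*8 + c.

def shfl(regs, src_lanes):
    """Cross-lane read: lane l receives regs[src_lanes[l]] (models __shfl)."""
    return [regs[s] for s in src_lanes]

def wave_matmul_8x8(A, B):
    lanes = range(64)
    a_reg = [A[l // 8][l % 8] for l in lanes]   # lane l holds A[r][c]
    b_reg = [B[l // 8][l % 8] for l in lanes]   # lane l holds B[r][c]
    acc = [0.0] * 64                            # per-lane accumulator C[r][c]
    for k in range(8):
        # Every lane fetches this step's operands via shuffles.
        a_k = shfl(a_reg, [(l // 8) * 8 + k for l in lanes])  # A[r][k]
        b_k = shfl(b_reg, [k * 8 + (l % 8) for l in lanes])   # B[k][c]
        acc = [acc[l] + a_k[l] * b_k[l] for l in lanes]
    return [[acc[r * 8 + c] for c in range(8)] for r in range(8)]
```

On a real wavefront the shuffle is a single-cycle register exchange across all 64 lanes, so the whole tile multiply stays in registers; the list comprehensions above just make that data movement explicit.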
