Great writeup. The three-pass softmax is where everyone
gets stuck — subtracting the max for numerical stability
is one of those things you can't learn from a textbook.
The pain points you hit (byte alignment, dispatch dims,
strict typing) make me wonder if there's a sweet spot
between raw WGSL and "import pytorch" that keeps you
close to the metal without all the papercuts.
mr_octopus|11 days ago
The pain points you hit (byte alignment, dispatch dims, strict typing) make me wonder if there's a sweet spot between raw WGSL and "import pytorch" that keeps you close to the metal without all the papercuts.