You absolutely cannot implement stream compaction "at the speed of native": WebGPU is missing the wave/subgroup intrinsics and globally coherent memory necessary to do it as efficiently as possible.
It's possible you might not need direct access to wave/subgroup ops to implement efficient stream compaction. There's a great old Nvidia blog post on "warp-aggregated atomics"
where they show that their compiler is sometimes able to automatically convert global atomic operations into the warp local versions, and achieve the same performance as manually written intrinsics.
I was recently curious whether, 10 years later, these same optimizations had made it into other GPUs and platforms besides CUDA, so I put together a simple atomics benchmark in WebGPU.
The results seem to indicate that these optimizations are accessible through WebGPU on Chrome on both macOS and Linux (with an Nvidia GPU).
Note that I'm not directly testing stream compaction, just incrementing a single global atomic counter. So that would need to be tested to know for sure if the optimization still holds there.
If you see any issues with the benchmark or this reasoning please let me know! I am hoping to solidify my knowledge in this area :)
I think compilers should be smart enough to substitute group-shared atomics with horizontal ops. If they're not already doing it, they should be!
But anyway, Histogram Pyramids are a more efficient algorithm for implementing parallel scan. It essentially builds a series of 3D buffers, each having half the dimensions of the previous level, and each value containing the sum of the counts in each of the underlying cells, with the top cube being just a single value, the total number of cells.
Then, instead of doing the second pass where you figure out which index each thread is supposed to write to and writing it to a buffer, you simply drill down into said cubes and figure out the index at the invocation of the meshing part by looking at your thread index (let's say 1616) and looking at the 8 smaller cubes (okay, cube 1 has 516 entries, so 1100 to go; cube 2 has 1031 entries, so 69 to go; cube 3 has 225 entries, so we go into cube 3), recursively repeating until you find the index. Since all threads in a group tend to go into the same cubes, all threads tend to read the same bits of memory until getting down to the bottom levels, making it very GPU cache friendly (divergent reads kill GPGPU perf).
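The drill-down described above can be sketched on the CPU; here's a minimal 1D binary-pyramid version (the real HistoPyramid uses 3D levels with 8 children per cell, but the traversal logic is the same; all names are illustrative):

```python
# Hypothetical 1D histogram-pyramid sketch. Level 0 is the base (0/1 flags for
# active cells); each parent stores the sum of its 2 children; the top level
# is a single value, the total count of active cells.

def build_pyramid(base):
    """Build pyramid levels from base counts; each parent sums 2 children."""
    levels = [base[:]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def drill_down(levels, k):
    """Find the base-cell index of the k-th active element (0-based)."""
    node = 0
    for level in reversed(levels[:-1]):  # from just below the top, down to base
        node *= 2
        left = level[node]
        if k >= left:    # the k-th element lives in the right child
            k -= left
            node += 1
    return node

base = [0, 1, 0, 1, 1, 0, 0, 1]    # 4 active cells
levels = build_pyramid(base)
assert levels[-1][0] == 4          # top of the pyramid = total active count
print([drill_down(levels, k) for k in range(4)])  # → [1, 3, 4, 7]
```

Since neighboring output indices follow nearly the same path through the upper levels, neighboring GPU threads read the same cache lines until the final levels, which is the cache-friendliness argument above.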
Forgive me if I got the technical terminology wrong; I haven't actually worked on GPGPU in more than a decade, but it's fun to note that something I did circa 2011 as an undergrad is suddenly relevant again (I implemented HistoPyramids from a 2007-ish paper, and Marching Cubes, a 1980s algorithm). Everything old is new again.
You seem knowledgeable, and I’m possibly going back into a GPGPU project after many years out of the game, so: overall do you see a good future for filling these compute-related gaps in the WebGPU API? Really I’m wondering whether wgpu is an okay choice versus raw Vulkan for native GPGPU outside the browser.
Ah, so that's how you do it. Having a template for WebGPU projects is a good idea. I'll have to do the same so I don't waste time setting up web graphics projects.
Cool project btw! Adding this to my long list of graphics blogs to read.
I assume it's just mislabeled, it's a high-angular-momentum hydrogenic orbital, chosen because it looks cool and because it's trivial to evaluate (a spherical harmonic times a simple radial term).
WebGPU has been under development since 2017, and has been a working draft since 2021. What issues are holding the W3C from publishing the final standard? Is there a timeline?
Chrome and Chromelikes are still the only browsers shipping stable WebGPU, on Firefox it's behind a flag, and on Safari it's only on the TP branch. Then on Chrome it's not available on Linux yet, only Windows and Android, and only on a subset of Android GPUs.
WebGPU is most definitely not outdated. It's a unified interface for all things floating point, from the datacenter to the watch on your wrist. However, most folks not deep into the inner workings will never touch it. What it does do is close the door on the App Store model. Apple already knows this, which is why we have the AVP.
cshenton|1 year ago
tehsauce|1 year ago
https://developer.nvidia.com/blog/cuda-pro-tip-optimized-fil...
https://github.com/PWhiddy/webgpu-atomics-benchmark
FL33TW00D|1 year ago
torginus|1 year ago
masspro|1 year ago
dekhn|1 year ago
spintin|1 year ago
The browser is dead, the only thing you can use it for is filling out HTML forms and maybe some light inventory management.
The final app is C+Java where you put the right stuff where it is needed. Just like the browser used to be before Oracle did its magic on the applet.
SuboptimalEng|1 year ago
lukko|1 year ago
Isn't it 1s1 in the ground state, so the probability distribution would look like a sphere?
plus|1 year ago
bhouston|1 year ago
codewiz|1 year ago
pjmlp|1 year ago
dailykoder|1 year ago
worik|1 year ago
But: "Error: Your browser does not support WebGPU"
Sigh
jsheard|1 year ago
We have a way to go yet.
spxneo|1 year ago
chrysoprace|1 year ago
adfm|1 year ago
ramon156|1 year ago
superkuh|1 year ago
tkzed49|1 year ago