top | item 34407242


msbarnett | 3 years ago

> Finally, isn’t the fact that Apple has a fundamentally different rendering pipeline relevant?

Is it still all that fundamentally different? All of the RDNA parts are tile-based renderers (I think even the Vega series GCN parts made that switch?)



ribit | 3 years ago

It's pretty different alright. First, there is the tile size. For the current crop of desktop GPUs, tiling is primarily about cache locality (if you keep your processing spatially local you are also less likely to thrash caches), but they still have very fast RAM and want to keep the triangle binning overhead to a minimum. So the tile size for desktop GPUs is much larger (if I remember correctly, it was about 128x128 pixels or something like that when I last tested it on Navi). Mobile GPUs really want to keep all of the relevant processing entirely in local memory, so they use much smaller tiles (32x32 or even 16x16) at the expense of more involved and costly binning.
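To make the tile-size trade-off concrete, here is a toy sketch of triangle binning (purely illustrative, not how any real driver works): each triangle is assigned to every tile its bounding box overlaps, so smaller tiles mean more bins touched per triangle and thus more binning work.

```python
# Toy binning sketch (hypothetical): bin triangles into screen tiles by
# bounding box. Smaller tiles -> more tile entries per triangle.

def bin_triangles(triangles, screen_w, screen_h, tile_size):
    """triangles: list of ((x0,y0),(x1,y1),(x2,y2)) in pixel coordinates.
    Returns a dict mapping (tile_x, tile_y) -> list of triangle indices."""
    tiles_x = (screen_w + tile_size - 1) // tile_size
    tiles_y = (screen_h + tile_size - 1) // tile_size
    bins = {}
    for i, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen, then walk the overlapped tiles.
        tx0 = max(0, int(min(xs)) // tile_size)
        tx1 = min(tiles_x - 1, int(max(xs)) // tile_size)
        ty0 = max(0, int(min(ys)) // tile_size)
        ty1 = min(tiles_y - 1, int(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins

# A triangle spanning roughly 80x80 pixels lands in one 128x128 tile,
# but in nine 32x32 tiles:
tri = [((10, 10), (90, 10), (50, 90))]
print(len(bin_triangles(tri, 256, 256, 128)))  # 1
print(len(bin_triangles(tri, 256, 256, 32)))   # 9
```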

Apple (inherited from PowerVR) adds another twist on top: the rasterised pixels are not shaded immediately but instead collected in a buffer. Once all fragments in a tile are rasterised, you basically have an array with visible-triangle information for each pixel. Pixel shading is then simply a compute pass over this array. This can be more efficient, as you only need to shade visible pixels, and it might utilise the SIMD hardware better (you are shading 32x32 blocks containing multiple triangles at once rather than shading triangles separately). It also radically simplifies dealing with pixels: there are never any data races for a given pixel, pixel data write-out is just a block memcpy, and programmable blending is super easy and cheap to do; in fact, I don't believe that Apple even has ROPs.

There are of course disadvantages as well: it's very tricky to get right and requires specialised fixed-function hardware; you need to keep transformed primitive data around in memory until all primitives are processed (because shading is delayed); and there are tons of corner cases you need to handle which can kill your performance (transparency, primitive buffer overflows, etc.). And of course, many modern rendering techniques rely on global memory operations, and there is an increasing trend to do rasterisation in a compute shader, where this rendering architecture doesn't really help.

garaetjjte | 3 years ago

They might rasterize fragments inside tiles to reduce blending costs, but they still very much behave like immediate renderers: single-pass, with vertex shading results passed continuously into fragment shaders. The Apple GPU is a tile-based deferred renderer: the vertex stage runs first, storing results into an intermediate buffer; then each tile is processed by running the fragment shader, flushing results to the framebuffer at the end. This reduces memory bandwidth but might require multiple passes when, e.g., the intermediate vertex output buffer overflows.
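The ordering difference between the two pipelines can be shown as a toy execution trace (an illustrative sketch, not real driver behaviour):

```python
# Hypothetical traces contrasting the two pipeline orderings.

def immediate_mode(triangles):
    """Vertex results feed straight into fragment shading, triangle by triangle."""
    trace = []
    for t in triangles:
        trace.append(f"vertex({t})")
        trace.append(f"fragment({t})")  # shaded immediately
    return trace

def tile_based_deferred(triangles, tiles):
    """All vertex work runs first into an intermediate buffer; fragment work
    then runs per tile, flushing each tile to the framebuffer once."""
    trace = [f"vertex({t})" for t in triangles]  # fills the parameter buffer
    for tile in tiles:
        trace += [f"fragment({t}@{tile})" for t in triangles]
        trace.append(f"flush({tile})")
    return trace

print(immediate_mode(["A", "B"]))
# ['vertex(A)', 'fragment(A)', 'vertex(B)', 'fragment(B)']
print(tile_based_deferred(["A", "B"], ["tile0"]))
# ['vertex(A)', 'vertex(B)', 'fragment(A@tile0)', 'fragment(B@tile0)', 'flush(tile0)']
```

The overflow case mentioned above corresponds to the intermediate buffer filling up mid-frame, forcing a partial flush and an extra pass over the affected tiles.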

my123 | 3 years ago

And there are GPUs that have both operating modes: Adreno.