
obl | 3 years ago

> In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.

While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.
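
To put a number on the quad "waste", here is a minimal sketch that counts helper lanes for a set of covered pixels. It assumes only that every 2x2 quad touched by a triangle occupies four lanes; the function name `quad_lanes` and the pixel sets are made up for illustration and say nothing about how any particular rasterizer works.

```python
def quad_lanes(covered_pixels):
    """Return (active_lanes, helper_lanes) for a set of covered (x, y) pixels.

    Assumption: shading is launched per 2x2 quad, so every quad touched by
    the triangle occupies 4 lanes whether or not all 4 pixels are covered.
    """
    pixels = set(covered_pixels)
    quads = {(x // 2, y // 2) for x, y in pixels}
    total_lanes = 4 * len(quads)
    return len(pixels), total_lanes - len(pixels)


# A thin diagonal sliver: 4 covered pixels, each in a different quad.
sliver = [(0, 0), (2, 2), (4, 4), (6, 6)]
print(quad_lanes(sliver))  # (4, 12)

# A 4x4 block aligned to the quad grid: no helper lanes at all.
block = [(x, y) for x in range(4) for y in range(4)]
print(quad_lanes(block))   # (16, 0)
```

The sliver case is the bad one: three out of every four lanes exist only to feed derivatives.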

I don't think it's publicly documented how the rasterizer "packs" quads into lanes on modern GPUs. I'd guess something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.


moonchild|3 years ago

> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient

I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works, at least on nvidia.

I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad-padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping. Especially given MSAA.

obl|3 years ago

It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single hardware thread (wavefront) can be shading multiple primitives.

For example, from the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating just that:

> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask

The mask is used by the interp instructions to load the correct interpolants from local memory.
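
As a rough illustration of what that mask encodes, the sketch below expands a 15-bit value into a per-quad primitive index for a 16-quad (64-lane) wave. The bit order is an assumption (bit i is taken to describe quad i + 1), and `quad_primitive_indices` is a made-up helper, not anything from the ISA docs.

```python
def quad_primitive_indices(new_prim_mask, num_quads=16):
    """Map each quad in a wave to a wave-local primitive index.

    Assumption (not stated in the quoted text): bit i of the mask describes
    quad i + 1. Quad 0 always starts primitive 0 and has no bit of its own.
    """
    indices = [0]
    for quad in range(1, num_quads):
        starts_new = (new_prim_mask >> (quad - 1)) & 1
        # A set bit moves on to the next primitive; a clear bit reuses the
        # primitive of the previous quad.
        indices.append(indices[-1] + starts_new)
    return indices


# Bits 2 and 5 set: quads 3 and 6 each begin a new primitive.
print(quad_primitive_indices(0b100100))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

With a mask of 0 the whole wave belongs to one primitive; with all 15 bits set, every quad comes from a different one.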

In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).

That being said, of course I expect this process to be "lazy": you would not want to buffer execution of a partially filled thread forever, so depending on the workload you might measure different things.

[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...

[2] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_In...

delusional|3 years ago

The linked AMD guide seems to suggest the author is correct:

> Because the quad is rendered using two separate triangles, separate wavefronts are generated for the pixel work associated with each of those triangles. Some of the pixels near the boundary separating those triangles end up being organized into partial wavefronts
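
For a sense of how much the two readings in this thread differ, here is a toy comparison of wave counts when each triangle gets its own waves versus when quads from consecutive triangles share waves. Both policies are idealized, the 64-lane wave size and the quad counts are assumptions, and neither function is meant to model real hardware.

```python
import math

WAVE_QUADS = 16  # assumed: 64 lanes per wave / 4 lanes per quad


def waves_per_triangle(quads_per_triangle):
    """Separate waves per triangle: each triangle rounds up on its own."""
    return sum(math.ceil(q / WAVE_QUADS) for q in quads_per_triangle)


def waves_packed(quads_per_triangle):
    """Quads from consecutive triangles share waves (idealized packing)."""
    return math.ceil(sum(quads_per_triangle) / WAVE_QUADS)


small_tris = [3, 5, 2, 6]  # hypothetical quad counts for four small triangles
print(waves_per_triangle(small_tris))  # 4 waves, each mostly empty
print(waves_packed(small_tris))        # 1 full wave
```

For large triangles the two numbers converge, which may be part of why different workloads appear to show different behavior.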